Frequently Asked Questions
Everything you need to know about the platform.
How do I evaluate the cost of your Generative API?
Start with production traffic, not just average usage. Estimate monthly input and output tokens, then apply model pricing per 1M tokens. Use this baseline formula: monthly cost = (input tokens / 1,000,000 × input rate) + (output tokens / 1,000,000 × output rate). Then add operational factors: peak concurrency, retry rate, cache hit ratio, routing policy, and fallback model usage. For accurate budgeting, run a 7-14 day profiling window with real prompts and separate daytime peaks from background workload.
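As a minimal sketch, the baseline formula plus a couple of operational adjustments might look like this; the rates, token volumes, retry rate, and cache hit ratio below are illustrative placeholders, not actual pricing:

```python
# Minimal sketch of the baseline cost formula. Rates are per 1M tokens
# and are illustrative placeholders, not actual pricing.

def monthly_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> float:
    return (input_tokens / 1_000_000) * input_rate + (output_tokens / 1_000_000) * output_rate

# Example: 120M input tokens and 30M output tokens over a month,
# with hypothetical rates of $0.50 / $1.50 per 1M tokens.
baseline = monthly_cost(120_000_000, 30_000_000, 0.50, 1.50)

# Operational factors from the profiling window are applied on top:
# retries add cost, cache hits remove it.
retry_rate, cache_hit_ratio = 0.03, 0.20
estimate = baseline * (1 + retry_rate) * (1 - cache_hit_ratio)
print(f"baseline=${baseline:.2f}  adjusted=${estimate:.2f}")
```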
What does the API Gateway do in this architecture?
The API Gateway is the control plane for requests entering your AI stack. It enforces authentication and authorization, rate limits, quotas, payload validation, and request shaping before traffic reaches model services. It also centralizes observability: request IDs, latency percentiles, token usage, error classes, and per-tenant analytics. In production, it should support idempotency keys, timeout policies, and circuit breaking to protect upstream services during traffic spikes.
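A minimal, in-memory sketch of two of these policies, per-key rate limiting and idempotency-key de-duplication, is shown below; a production gateway would back both with a shared store such as Redis, and the limits here are illustrative:

```python
import time
from collections import defaultdict, deque

# Sliding-window rate limit per API key plus idempotency-key replay.
# Illustrative limits; a real gateway uses a shared store, not memory.
RATE_LIMIT, WINDOW_S = 100, 60
request_log: dict[str, deque] = defaultdict(deque)
idempotency_cache: dict[str, dict] = {}

def admit(api_key: str, idempotency_key: str | None):
    now = time.time()
    window = request_log[api_key]
    while window and now - window[0] > WINDOW_S:
        window.popleft()                      # drop requests outside the window
    if len(window) >= RATE_LIMIT:
        return {"status": 429, "error": "rate_limited"}
    if idempotency_key and idempotency_key in idempotency_cache:
        return idempotency_cache[idempotency_key]   # replay the stored response
    window.append(now)
    response = {"status": 200, "request_id": f"req-{int(now * 1000)}"}
    if idempotency_key:
        idempotency_cache[idempotency_key] = response
    return response
```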
Can you integrate a Payment Gateway with SATIM CIB?
Yes. SATIM CIB integration is handled with a secure payment flow, callback verification, and strict transaction reconciliation. Recommended implementation includes server-side signature validation, idempotent payment capture, webhook replay protection, and status polling fallback if callback delivery is delayed. Operationally, you should track authorization, capture, cancellation, and refund states with immutable audit logs for finance and compliance reviews.
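As an illustration only, callback verification and idempotent state transitions could be sketched as follows; the signature header, field names, and state names are assumptions, not the actual SATIM CIB contract:

```python
import hashlib, hmac

# Hedged sketch of callback verification and idempotent state handling.
# The shared secret, signature scheme, and state names are hypothetical
# placeholders, not the real SATIM CIB callback contract.

SHARED_SECRET = b"replace-with-merchant-secret"
VALID_TRANSITIONS = {"authorized": {"captured", "cancelled"}, "captured": {"refunded"}}

def verify_signature(raw_body: bytes, received_sig: str) -> bool:
    expected = hmac.new(SHARED_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, received_sig)   # constant-time compare

def apply_callback(current_state: str, new_state: str) -> str:
    # Idempotent: re-delivered callbacks for the same state are a no-op,
    # and invalid transitions are rejected instead of overwriting history.
    if new_state == current_state:
        return current_state
    if new_state not in VALID_TRANSITIONS.get(current_state, set()):
        raise ValueError(f"illegal transition {current_state} -> {new_state}")
    return new_state
```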
How are AI integration and APIs exposed to application teams?
AI capabilities are exposed through versioned REST endpoints and streaming interfaces, with clear separation between synchronous inference and asynchronous jobs. Teams typically get environment-scoped API keys, model access policies, and endpoint contracts with backward-compatible versioning. For enterprise integration, include request schemas, webhook contracts, correlation IDs, and deterministic error envelopes so backend and frontend systems can handle failures consistently.
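A deterministic error envelope, for example, might be sketched like this; the field names are an assumption rather than the platform's published schema:

```python
import uuid
from dataclasses import dataclass, asdict

# Illustrative error envelope; field names are assumptions, not the
# platform's published contract.

@dataclass
class ErrorEnvelope:
    code: str            # stable, machine-readable error class
    message: str         # human-readable summary
    correlation_id: str  # propagated from the gateway for tracing
    retryable: bool      # lets clients decide between retry and surfacing

def make_error(code: str, message: str, retryable: bool,
               correlation_id: str | None = None) -> dict:
    return asdict(ErrorEnvelope(
        code=code,
        message=message,
        correlation_id=correlation_id or str(uuid.uuid4()),
        retryable=retryable,
    ))

# Example: saturation at the model backend surfaces as a retryable envelope.
print(make_error("model_rate_limited", "Upstream model is saturated", True))
```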
What is an LLM Router or Model Gateway, and why is it critical?
An LLM Router is a policy engine that selects the best model for each request based on cost, latency, quality targets, and task type. It enables smart routing, fallback chains, cost optimization, and prompt caching. Example policy: route standard chat to a lower-cost model, escalate complex reasoning to a premium model, then fall back to a resilient model during saturation events. This architecture reduces spend while maintaining SLA and response quality under variable load.
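A minimal sketch of such a policy is shown below; the model names, task types, and fallback order are illustrative placeholders:

```python
# Minimal routing-policy sketch. Model names and task types are
# illustrative placeholders, not the platform's actual catalogue.

ROUTES = {
    "standard_chat": ["small-chat-model", "fallback-model"],
    "complex_reasoning": ["premium-reasoning-model", "small-chat-model", "fallback-model"],
}

def pick_model(task_type: str, saturated: set[str]) -> str:
    # Walk the fallback chain and return the first model that is not
    # currently saturated; the last entry acts as the resilient default.
    chain = ROUTES.get(task_type, ROUTES["standard_chat"])
    for model in chain:
        if model not in saturated:
            return model
    return chain[-1]

print(pick_model("complex_reasoning", saturated={"premium-reasoning-model"}))
# -> "small-chat-model": escalation degrades gracefully under saturation
```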
What does the Agent Runtime or Execution Engine handle?
The Agent Runtime orchestrates tool calls, state transitions, and multi-step task execution. It manages session memory, policy guardrails, retry logic, and deterministic execution boundaries so agents do not drift or loop under failure conditions. In mature deployments, the runtime also enforces tool permissions, timeout budgets, and compensation steps for partial failures, making long-running workflows predictable and auditable.
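A simplified sketch of these guardrails, with a hypothetical call_tool adapter standing in for real tool integrations, could look like this:

```python
import time

# Conceptual sketch of runtime guardrails: a per-task timeout budget,
# a bounded retry loop, and a step limit so agents cannot loop forever.
# `call_tool` is a hypothetical stand-in for the runtime's tool adapter.

MAX_STEPS, MAX_RETRIES, TIMEOUT_BUDGET_S = 10, 2, 30.0

def run_agent_task(steps, call_tool):
    deadline = time.monotonic() + TIMEOUT_BUDGET_S
    results = []
    for i, step in enumerate(steps):
        if i >= MAX_STEPS or time.monotonic() > deadline:
            raise TimeoutError("step or time budget exhausted")
        for attempt in range(MAX_RETRIES + 1):
            try:
                results.append(call_tool(step))
                break
            except Exception:
                if attempt == MAX_RETRIES:
                    raise      # surface the failure so compensation steps can run
    return results
```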
How does Data Ingestion and Processing work for AI workloads?
Production ingestion pipelines should support both streaming and batch paths. Data is validated against a schema, deduplicated, normalized, and optionally anonymized before indexing or feature extraction. For RAG and search pipelines, high-quality chunking, metadata enrichment, and an embedding refresh strategy are mandatory to keep retrieval relevant. Use queue-based processing and back-pressure control to prevent ingestion bursts from degrading inference performance.
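A simplified sketch of the validate, deduplicate, and chunk path might look like the following; the chunk size and overlap are illustrative values:

```python
import hashlib

# Simplified sketch of the deduplicate -> chunk path for a RAG ingestion
# pipeline. Chunk size and overlap are illustrative values.

CHUNK_SIZE, OVERLAP = 800, 100
seen_hashes: set[str] = set()

def ingest(doc_id: str, text: str, metadata: dict) -> list[dict]:
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return []                      # exact duplicate, skip re-indexing
    seen_hashes.add(digest)
    chunks = []
    step = CHUNK_SIZE - OVERLAP
    for start in range(0, len(text), step):
        chunks.append({
            "doc_id": doc_id,
            "offset": start,
            "text": text[start:start + CHUNK_SIZE],
            **metadata,                # enrichment travels with every chunk
        })
    return chunks                      # downstream: embed and index
```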
What is the AI Optimization Engine responsible for?
The AI Optimization Engine continuously improves quality and efficiency by evaluating prompt patterns, model routing outcomes, and token economics. It monitors win-rate between model/prompt variants, detects regression in latency or answer quality, and applies optimization actions such as prompt compression, cache strategy tuning, and routing threshold updates. This is the layer that converts raw AI usage into stable enterprise performance over time.
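As a toy illustration, comparing two prompt/model variants on win-rate and tail latency could be sketched like this; the thresholds are assumptions, not tuned defaults:

```python
# Toy sketch of comparing routing/prompt variants; the regression
# thresholds below are illustrative assumptions.

def summarize(variant_results: list[dict]) -> dict:
    n = len(variant_results)
    latencies = sorted(r["latency_ms"] for r in variant_results)
    return {
        "win_rate": sum(r["won"] for r in variant_results) / n,
        "p95_latency_ms": latencies[int(0.95 * (n - 1))],
    }

def detect_regression(baseline: dict, candidate: dict) -> bool:
    # Flag the candidate when quality drops or tail latency degrades materially.
    return (candidate["win_rate"] < baseline["win_rate"] - 0.05
            or candidate["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.2)
```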
Why is a CDN or Edge Network layer important in this stack?
The edge layer reduces user-perceived latency and improves resilience by terminating TLS close to users, caching static and semi-static payloads, and absorbing regional traffic spikes. It also protects origin services via shielding, bot control, and traffic filtering before requests hit core compute. For global products, edge routing combined with geo-aware failover materially improves uptime and response consistency.
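On the origin side, the cache directives that let the edge layer serve semi-static payloads might look like this sketch; the TTL values are illustrative:

```python
# Sketch of origin-side cache directives for an edge/CDN layer.
# TTL values are illustrative, not recommendations for every workload.

def cache_headers(payload_kind: str) -> dict:
    if payload_kind == "static_asset":
        # Fingerprinted, immutable assets can be cached aggressively.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if payload_kind == "semi_static":
        # Short TTL plus stale-while-revalidate keeps edge hits fast
        # while the origin refreshes in the background.
        return {"Cache-Control": "public, max-age=60, stale-while-revalidate=300"}
    return {"Cache-Control": "no-store"}      # dynamic inference responses
```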
How is Kubernetes used for container orchestration?
Kubernetes provides workload scheduling, service discovery, health probing, rolling updates, and autoscaling for AI microservices. A robust setup includes namespace isolation, HPA/VPA policies, PodDisruptionBudgets, liveness and readiness probes, and resource quotas per environment. For inference workloads, node pools should be separated by CPU/GPU profile to avoid noisy-neighbor effects and to keep scaling predictable.
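On the application side, the liveness and readiness endpoints that these probes target can be as simple as the following sketch; the paths and port are illustrative choices:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal sketch of the endpoints Kubernetes liveness/readiness probes
# would target; paths and port are illustrative choices.

model_loaded = False   # flipped to True once weights are in memory

class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":              # liveness: process is up
            self.send_response(200)
        elif self.path == "/readyz":             # readiness: safe to receive traffic
            self.send_response(200 if model_loaded else 503)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ProbeHandler).serve_forever()
```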
What is the Processing Cluster, and when do I need one?
A Processing Cluster is the execution layer for heavy background tasks such as document parsing, embedding generation, feature extraction, fine-tuning preparation, and analytics jobs. You need it when asynchronous workload volume grows beyond what your online inference tier can safely handle. In well-architected systems, it is queue-driven, autoscaled independently, and isolated from user-facing APIs so batch spikes never degrade interactive response times.
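A minimal sketch of a queue-driven worker pool with a bounded queue for back-pressure is shown below; in production the queue would be an external broker rather than an in-process structure:

```python
import queue, threading

# Minimal sketch of a queue-driven background worker pool, isolated from
# the user-facing API tier. In production the queue is an external broker,
# not an in-process structure.

jobs: queue.Queue = queue.Queue(maxsize=1000)   # bounded queue => back-pressure

def process(job: dict) -> None:
    ...                                  # heavy work: parsing, embeddings, analytics

def worker():
    while True:
        job = jobs.get()
        try:
            process(job)
        finally:
            jobs.task_done()

for _ in range(4):                       # worker count scales with the cluster
    threading.Thread(target=worker, daemon=True).start()
```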
What is an AI Inference Platform in production terms?
An AI Inference Platform is the serving layer that exposes models as reliable APIs with strict latency and availability targets. It combines model endpoints, autoscaling, request batching, GPU scheduling, admission control, and safe rollout policies such as canary and blue/green. In enterprise deployments, this layer must also enforce tenant isolation, token accounting, and policy-based model access.
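A small admission-control sketch, capping in-flight requests with a semaphore and shedding excess load, might look like this; the limit is a placeholder value:

```python
import asyncio

# Illustrative admission control for an inference endpoint: a semaphore
# caps in-flight requests and excess load is rejected quickly instead of
# queueing without bound. The limit is a placeholder value.

MAX_IN_FLIGHT = 32
slots = asyncio.Semaphore(MAX_IN_FLIGHT)

async def serve(request, run_model):
    if slots.locked():                       # every slot taken: shed load
        return {"status": 503, "error": "at_capacity"}
    async with slots:
        result = await run_model(request)    # batching happens inside the backend
        return {"status": 200, "output": result}
```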
How is the Event Bus designed with Kafka, AMQP, and Pub/Sub?
The Event Bus decouples services so producers and consumers scale independently. Kafka is typically used for high-throughput event streams and replayable logs, AMQP for command-style messaging with acknowledgments and routing semantics, and Pub/Sub for fan-out notification patterns. A mature event-driven design includes schema governance, dead-letter queues, retry policies, ordering rules, and consumer lag monitoring.
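As a sketch of the consume, retry, and dead-letter pattern, assuming the confluent-kafka client, this could look like the following; topic names and the retry limit are illustrative:

```python
from confluent_kafka import Consumer, Producer

# Consume -> retry -> dead-letter sketch, assuming the confluent-kafka
# client. Topic names and the retry limit are illustrative.

consumer = Consumer({"bootstrap.servers": "broker:9092",
                     "group.id": "ingest-workers",
                     "enable.auto.commit": False})
producer = Producer({"bootstrap.servers": "broker:9092"})
consumer.subscribe(["documents.ingested"])

MAX_ATTEMPTS = 3

def handle(payload: bytes) -> None:
    ...                                      # domain-specific processing goes here

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            handle(msg.value())
            break
        except Exception:
            if attempt == MAX_ATTEMPTS:
                # poison message: park it on the dead-letter topic
                producer.produce("documents.ingested.dlq", msg.value())
                producer.flush()
    consumer.commit(message=msg)             # commit only after handling or DLQ
```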
How do event-driven flows and webhooks work together?
Event-driven architecture handles internal asynchronous workflows, while webhooks provide external system callbacks. Best practice is to publish domain events internally first, then trigger webhook delivery through a dedicated dispatcher with signed payloads, retry backoff, and idempotency keys. This prevents tight coupling and guarantees delivery traceability even if external endpoints are temporarily unavailable.
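A dispatcher sketch with a signed payload, an idempotency key, and exponential backoff might look like this; the header names and signing key are assumptions, not a published contract:

```python
import hashlib, hmac, time, urllib.request

# Webhook dispatcher sketch: signed payload, idempotency key, exponential
# backoff. Header names and the signing key are assumptions.

SIGNING_KEY = b"replace-with-endpoint-secret"

def deliver(url: str, body: bytes, event_id: str, max_attempts: int = 5) -> bool:
    signature = hmac.new(SIGNING_KEY, body, hashlib.sha256).hexdigest()
    req = urllib.request.Request(url, data=body, headers={
        "Content-Type": "application/json",
        "X-Webhook-Signature": signature,     # receiver recomputes and compares
        "X-Idempotency-Key": event_id,        # receiver de-duplicates replays
    })
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(req, timeout=10) as resp:
                if 200 <= resp.status < 300:
                    return True
        except OSError:
            pass
        time.sleep(2 ** attempt)              # exponential backoff between retries
    return False                              # leave the event for a later sweep
```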
Why Object Storage for model weights, S3 API, and CDN integration?
Object Storage is the correct persistence layer for large model artifacts, checkpoints, and versioned bundles. S3-compatible APIs simplify tooling interoperability across CI/CD and ML pipelines. CDN in front of artifact distribution improves global fetch latency and reduces origin load, especially during scale-out events when many nodes pull model weights simultaneously.
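A start-up fetch of a versioned model bundle from S3-compatible storage, sketched with boto3, might look like this; the bucket, key, and endpoint are illustrative values:

```python
import boto3

# Sketch of pulling a versioned model bundle from S3-compatible storage
# at node start-up; bucket, key, and endpoint are illustrative values.

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.internal",  # any S3-compatible backend
)

def fetch_model(bucket: str, key: str, local_path: str) -> str:
    # In practice this sits behind the CDN/cache layer so a scale-out
    # event does not hammer the origin with identical downloads.
    s3.download_file(bucket, key, local_path)
    return local_path

fetch_model("model-artifacts", "llm/v3.2/weights.safetensors", "/models/weights.safetensors")
```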
What observability stack do you recommend: Prometheus, Grafana, and tracing?
Use Prometheus for metrics collection and alerting rules, Grafana for dashboards and SLA/SLO visibility, and distributed tracing for end-to-end latency analysis across gateway, router, runtime, and model backends. The minimum production set should include p95/p99 latency, error budget burn, queue lag, cache hit ratio, token throughput, and trace-based root-cause views for incidents.
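Instrumenting token throughput and latency with the prometheus_client library could be sketched as follows; the metric names and label values are illustrative:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Minimal instrumentation sketch using prometheus_client; metric names
# and label values are illustrative.

REQUEST_LATENCY = Histogram(
    "gateway_request_seconds", "End-to-end request latency",
    ["route", "model"],
)
TOKENS = Counter(
    "gateway_tokens_total", "Tokens processed",
    ["direction", "model"],
)

def record(route: str, model: str, seconds: float, tokens_in: int, tokens_out: int):
    REQUEST_LATENCY.labels(route=route, model=model).observe(seconds)
    TOKENS.labels(direction="input", model=model).inc(tokens_in)
    TOKENS.labels(direction="output", model=model).inc(tokens_out)

start_http_server(9100)   # Prometheus scrapes /metrics on this port
record("/v1/chat", "small-chat-model", 0.84, 1200, 350)
```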
Do you provide hosting panels such as cPanel and aaPanel?
Yes. We support common hosting control panels based on workload requirements, including cPanel and aaPanel setups. For production environments, panel choice should follow your operational model: user isolation, backup strategy, update policy, extension ecosystem, and multi-tenant security controls.
Can I change my plan if my circumstances change?
Yes. You can upgrade or downgrade as traffic, budget, and resource profiles change. The recommended process is a capacity review, migration window planning, and post-change validation of performance, email/DNS behavior, and backup integrity. This approach preserves service continuity while adapting to new business requirements.