Inference Endpoints¶
Service ownership
Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Hosted model endpoints — autoscaled, OpenAI-compatible API surface, sovereign-resident.
What it is¶
A managed inference service. Bring a model artefact (or pick from the catalogue), get an HTTPS endpoint that speaks the OpenAI Chat Completions / Responses API, scaled to your traffic. Cold-start mitigated by warm-pool minimums.
Why an OpenAI-compatible API¶
It means every existing OpenAI SDK and plugin (LangChain, LlamaIndex, OpenAI Python client, AnythingLLM, Cursor, etc.) works against Cloud Digit endpoints with only an api_base change. No code rewrite. This is the same pattern that made vLLM, Together, Groq, etc. easy to adopt.
```python
from openai import OpenAI

client = OpenAI(
    api_key="cd-...",
    base_url="https://inf.bd-dha-1.clouddigit.ai/v1",
)
resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello, Bangladesh!"}],
)
```
Models¶
The catalogue is curated — open-weight models that we operate at scale, kept current.
| Family | Versions |
|---|---|
| Meta Llama | 3.1-8B, 3.1-70B, 3.3-70B |
| Mistral | Mistral 7B, Mixtral 8x22B, Mistral Large |
| Qwen | Qwen 2.5 7B / 32B / 72B |
| Gemma | Gemma 2 9B / 27B |
| Embeddings | bge-large, e5-mistral-7b-instruct |
| Bring your own | Any HF Transformers / GGUF / vLLM-supported model |
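The embedding models are served through the same OpenAI-compatible surface as chat, so the standard embeddings call works unchanged. A minimal sketch, assuming the catalogue exposes bge-large under that model name (confirm the exact id via the model list):

```python
from openai import OpenAI

client = OpenAI(
    api_key="cd-...",
    base_url="https://inf.bd-dha-1.clouddigit.ai/v1",
)

# Embed a batch of passages, e.g. to index into a vector database for RAG.
emb = client.embeddings.create(
    model="bge-large",  # assumed catalogue id; confirm via client.models.list()
    input=[
        "Dhaka is the capital of Bangladesh.",
        "Cox's Bazar has one of the longest natural sea beaches.",
    ],
)
vectors = [d.embedding for d in emb.data]
print(len(vectors), len(vectors[0]))  # number of inputs, embedding dimension
```

Pair these vectors with the Vector Database service (see Related) for retrieval.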
Deployment modes¶
| Mode | Use case |
|---|---|
| Shared, on-demand | Pay per token; multi-tenant; lowest cost for spiky use |
| Dedicated | Reserved GPU(s); single-tenant; predictable latency |
| Serverless cold | Scale-to-zero; cold start ~30 s for small models |
Throughput and latency¶
For the shared, on-demand pool we publish per-model:
- Throughput target (tokens/s)
- p50 / p95 / p99 first-token latency
- p50 / p95 / p99 inter-token latency
See the status page for current numbers.
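You can sanity-check those numbers from your own network path with a streaming request: first-token latency is the time to the first streamed chunk, and inter-token latency is the gap between subsequent chunks (chunks approximate tokens). A rough client-side sketch, reusing the endpoint and model from the example above:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

start = time.perf_counter()
first_token, last, gaps = None, None, []

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write two sentences about Dhaka."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    now = time.perf_counter()
    if first_token is None:
        first_token = now - start      # time to first streamed chunk
    else:
        gaps.append(now - last)        # gap between consecutive chunks
    last = now

print(f"first-token latency: {first_token:.3f}s")
if gaps:
    print(f"mean inter-token latency: {sum(gaps) / len(gaps) * 1000:.1f}ms")
```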
Pricing¶
- Shared on-demand — per million input tokens + per million output tokens (BDT)
- Dedicated — per-GPU-hour (the GPU VMs rate plus a small managed-service margin)
See Pricing.
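Shared on-demand cost is linear in tokens, so a back-of-envelope is straightforward. A sketch with placeholder rates; substitute the published BDT figures from the Pricing page:

```python
# Placeholder per-million-token rates in BDT; substitute the published ones.
RATE_IN_PER_M = 40.0    # BDT per 1M input tokens (hypothetical)
RATE_OUT_PER_M = 120.0  # BDT per 1M output tokens (hypothetical)

def monthly_cost_bdt(requests_per_day, in_tokens, out_tokens, days=30):
    """Shared on-demand estimate: total tokens times the per-million rates."""
    total_in = requests_per_day * in_tokens * days
    total_out = requests_per_day * out_tokens * days
    return total_in / 1e6 * RATE_IN_PER_M + total_out / 1e6 * RATE_OUT_PER_M

# e.g. 10,000 requests/day at ~800 input + ~300 output tokens each
print(f"{monthly_cost_bdt(10_000, 800, 300):,.0f} BDT/month")
```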
Compliance¶
Prompts and completions are not logged by default beyond a redacted summary; full-prompt logging is opt-in per endpoint, and any logged data stays in Bangladesh. For regulated workloads, run dedicated endpoints with logging-off.
Related¶
- GPU VMs — what dedicated endpoints run on
- LLM-as-a-Service — the Cloud Digit-branded LLM offering on top of this
- Vector Database — pair with embedding endpoints for RAG
Operate this service¶
Managed model-serving endpoints with autoscaling, A/B testing, and integrated quota.
Endpoint types¶
| Type | Use | Scale-to-zero |
|---|---|---|
| realtime | Synchronous HTTP, low-latency | Yes (optional) |
| streaming | Token-streaming for LLM completions | No (warm pool) |
| batch | Async, queue-based, large input | N/A |
| serverless-async | Async, callback-based | Yes |
IAM¶
| Role | Can do |
|---|---|
| inference.viewer | List endpoints, view metrics |
| inference.invoker | Call endpoints (via API token) |
| inference.deployer | Create / update endpoints |
| inference.admin | Above + quota changes, model registry management |
Model registry¶
Endpoints reference a model version in the registry — not a raw image:
```bash
cd inference model register \
  --name acme-summarizer \
  --version v1.4 \
  --artifact s3://acme-models/summarizer-v1.4/ \
  --framework pytorch \
  --runtime cuda12.6
```
The model version is immutable; deploys reference acme-summarizer:v1.4. To promote, create a new version and point the endpoint at it.
Endpoint config¶
```bash
cd inference endpoint create \
  --name acme-summarize-prod \
  --model acme-summarizer:v1.4 \
  --instance-type gpu-a10-1x \
  --min-replicas 2 --max-replicas 20 \
  --target-concurrency 8 \
  --timeout 30s
```
Cost shape¶
Billed per replica-hour while warm. Scale-to-zero possible but cold-start = model load time (often 10–30 s for LLMs).
For latency-sensitive endpoints, min-replicas >= 1.
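The trade-off is a fixed warm floor versus cold starts. A sketch with a placeholder replica-hour rate (see Pricing for the real gpu-a10-1x figure):

```python
# Hypothetical gpu-a10-1x replica-hour rate in BDT; see Pricing for the real one.
REPLICA_HOUR_BDT = 350.0

def warm_floor_bdt(min_replicas, hours_per_month=730):
    """Monthly floor cost of keeping `min_replicas` warm regardless of traffic."""
    return min_replicas * hours_per_month * REPLICA_HOUR_BDT

print(warm_floor_bdt(0))  # scale-to-zero: no floor, but 10-30 s cold starts
print(warm_floor_bdt(2))  # min-replicas 2: no cold starts, fixed monthly floor
```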
Audit¶
Every inference call logged (request-id, principal, latency, status). Optional input/output logging (for accuracy debugging) — usually disabled in prod due to PII.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| inference.requests_per_sec | varies | |
| inference.latency_ms p95 | within SLO | breach |
| inference.error_rate | < 0.1% | > 1% |
| inference.gpu_utilization_pct | 60–80% | < 30% (over-provisioned) or > 90% (under-provisioned) |
| inference.queue_depth | < target | spikes (under-capacity) |
| inference.replica_count | scales with load | stuck at max |
Autoscaling tuning¶
target-concurrency is the knob. Set it to the number of concurrent requests per replica at which latency is still acceptable:
```bash
cd inference endpoint update --name acme-summarize-prod --target-concurrency 6
```
Lower → more replicas, lower latency, more cost. Higher → fewer replicas, higher latency, lower cost.
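For a starting value, size from Little's law: in-flight requests at peak ≈ peak request rate × average request duration, and replicas needed ≈ that concurrency divided by target-concurrency. A rough sizing sketch:

```python
import math

def replicas_needed(peak_rps, avg_request_seconds, target_concurrency):
    """Rough sizing: in-flight requests at peak / per-replica concurrency target."""
    peak_concurrency = peak_rps * avg_request_seconds   # Little's law
    return math.ceil(peak_concurrency / target_concurrency)

# e.g. 12 req/s at peak, ~3 s per request, target-concurrency 6
print(replicas_needed(12, 3.0, 6))   # -> 6; keep max-replicas above this
```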
A/B and canary deploys¶
```bash
# Deploy v1.5 alongside v1.4
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.4=90,v1.5=10'

# Promote if metrics look good
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.5=100'
```
A/B logs include the model-version label, so dashboards split metrics correctly.
Quantization and optimization¶
For LLMs and vision models, commonly applied optimizations:
- INT8 / FP8 quantization (TensorRT, vLLM) — 2-4× throughput, minimal accuracy loss
- Continuous batching (vLLM, TGI) — 5-10× throughput for LLMs
- Speculative decoding — 1.5-2× for some workloads
Cloud Digit ships pre-optimized servers (vLLM, Triton); use those rather than raw model.predict().
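Continuous batching only pays off when the client actually keeps requests in flight concurrently; a strictly sequential loop leaves the batcher idle. A sketch using the async OpenAI client to issue concurrent requests (the prompts are placeholders; model and endpoint as in the earlier example):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

async def summarize(text: str) -> str:
    resp = await client.chat.completions.create(
        model="llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": f"Summarize in one sentence: {text}"}],
    )
    return resp.choices[0].message.content

async def main():
    docs = ["first document ...", "second document ...", "third document ..."]
    # Concurrent requests let the server-side continuous batcher fill the GPU.
    summaries = await asyncio.gather(*(summarize(d) for d in docs))
    for s in summaries:
        print(s)

asyncio.run(main())
```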
Cold start¶
For scale-to-zero or new replica spin-up:
- Model load from S3 → GPU: 5–30 s typical (depends on model size)
- To avoid cold starts entirely, keep min-replicas >= 1
- Snapshot-based fast scale supported on H100 SKUs (sub-second from snapshot)
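If you do run scale-to-zero, one mitigation is a scheduled warm-up request shortly before expected traffic, so the replica loads the model off the critical path. A minimal sketch (the scheduler that triggers it is outside this snippet):

```python
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

def warm_up(model: str) -> None:
    """Tiny request that forces a cold replica to load the model before real traffic."""
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )

warm_up("llama-3.1-70b-instruct")  # run from a cron job ahead of peak hours
```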
Quotas¶
GPU inference shares quota with training. Plan separately if both happen in the same project:
```bash
cd inference quota set --project acme --instance-type gpu-a10-1x --max-replicas 20
```
High p95 latency¶
| Symptom | Likely cause |
|---|---|
| All requests slow | Model load misconfig (CPU instead of GPU) |
| Most fast, some slow | Queue depth spikes; under-provisioned |
| Slow during scale-up | Cold start of new replica |
| Slow only on certain inputs | Input-length-sensitive (LLM with long context) |
```bash
cd inference logs --endpoint acme-summarize-prod --filter "latency>1000ms"
```
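To confirm the input-length case from the client side, time the same request at increasing prompt lengths; latency that grows roughly with prompt size points at prefill cost rather than capacity. A rough sketch:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

filler = "the quick brown fox jumps over the lazy dog "
for repeats in (1, 50, 200, 800):   # roughly increasing prompt lengths
    prompt = filler * repeats + "\nSummarize the text above in one sentence."
    start = time.perf_counter()
    client.chat.completions.create(
        model="llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    print(f"{repeats * 9:6d} filler words: {time.perf_counter() - start:.2f}s")
```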
Errors after model deploy¶
Symptoms: 5xx rate spikes right after a new model version went live.
- Model artifact corrupted in S3 — check checksum
- Framework/runtime mismatch (PyTorch 2.5 model on PyTorch 2.4 server)
- Quantization broke (bug in the optimization pipeline)
Roll back:
```bash
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.4=100,v1.5=0'
```
GPU under-utilized despite load¶
| Symptom | Cause |
|---|---|
| GPU util 30% with full queue | Batch size too small; enable continuous batching |
| GPU util fluctuates 0-100% | Dataloader bottleneck (rare for inference) |
| GPU util 95% but throughput low | Model is compute-bound — quantize or shard |
vLLM, TGI, Triton all support dynamic batching — enable in the server config.
Replica scale-up stuck¶
inference.replica_count not climbing despite high queue:
- Max replicas reached
- GPU capacity exhausted in the project / region — request quota
- Cold start failing (model load fails) — check replica logs
```bash
cd inference replicas --endpoint acme-summarize-prod
```
Memory leak¶
Replicas die after N requests:
- Memory leak in the model server (vLLM, custom) — find via memory metrics over time
- Caching too aggressively (KV cache for LLMs) — tune gpu_memory_utilization
- Restart policy: set --max-requests-per-replica 10000 to force a periodic restart
Quantized model accuracy regression¶
You quantized a model and quality degraded:
- INT4 / FP4 quantization is more aggressive; some models can't tolerate it
- Calibration dataset wasn't representative
- Backbone of the model is quantization-sensitive
Run a quality eval suite on the quantized version before promoting.
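A minimal version of such an eval: run the same prompts through the previous and the quantized versions and compare answers (or score with a judge model). The eval cases, the exact-match check, and the model ids below are placeholders; how registry versions map to OpenAI-style model names depends on your endpoint setup:

```python
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

# Tiny placeholder eval set; a real suite is task-specific and much larger.
EVAL = [
    {"prompt": "What is the capital of Bangladesh? Answer in one word.", "expect": "Dhaka"},
    {"prompt": "What is 2 + 2? Answer with just the number.", "expect": "4"},
]

def accuracy(model: str) -> float:
    hits = 0
    for case in EVAL:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        ).choices[0].message.content or ""
        hits += case["expect"].lower() in out.lower()
    return hits / len(EVAL)

# Hypothetical model ids for the baseline and the quantized candidate.
print("baseline :", accuracy("acme-summarizer-v1.4"))
print("quantized:", accuracy("acme-summarizer-v1.5"))
```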
API token rejected¶
ERROR 401: invalid token
- Token expired (TTL ≤ 90 days)
- Token wrong scope (issued for a different project)
- Wrong endpoint URL (region mismatch — inference endpoints are regional)
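A quick way to separate a bad token from a bad URL is to list models with the same credentials: a 401 points at the token (expired or wrong project), while a connection or not-found error usually points at the regional endpoint URL. A sketch:

```python
from openai import OpenAI, AuthenticationError

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

try:
    models = client.models.list()
    print("token OK; visible models:", [m.id for m in models.data])
except AuthenticationError as e:
    print("401: token expired or scoped to another project:", e)
except Exception as e:
    # Connection or not-found errors here usually mean a wrong regional URL.
    print("endpoint URL problem:", e)
```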