Inference Endpoints¶
Service ownership
Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Hosted model endpoints — autoscaled, OpenAI-compatible API surface, sovereign-resident.
What it is¶
A managed inference service. Bring a model artefact (or pick from the catalogue), get an HTTPS endpoint that speaks the OpenAI Chat Completions / Responses API, scaled to your traffic. Cold-start mitigated by warm-pool minimums.
Why an OpenAI-compatible API¶
It means every existing OpenAI SDK and plugin (LangChain, LlamaIndex, OpenAI Python client, AnythingLLM, Cursor, etc.) works against Cloud Digit endpoints with only an api_base change. No code rewrite. This is the same pattern that made vLLM, Together, Groq, etc. easy to adopt.
```python
from openai import OpenAI

client = OpenAI(
    api_key="cd-...",
    base_url="https://inf.bd-dha-1.clouddigit.ai/v1",
)
resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello, Bangladesh!"}],
)
```
Models¶
The catalogue is curated — open-weight models that we operate at scale, kept current.
| Family | Versions |
|---|---|
| Meta Llama | 3.1-8B, 3.1-70B, 3.3-70B |
| Mistral | Mistral 7B, Mixtral 8x22B, Mistral Large |
| Qwen | Qwen 2.5 7B / 32B / 72B |
| Gemma | Gemma 2 9B / 27B |
| Embeddings | bge-large, e5-mistral-7b-instruct |
| Bring your own | Any HF Transformers / GGUF / vLLM-supported model |
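The embedding models are served through the same OpenAI-compatible surface as chat, so the standard embeddings call works unchanged. A minimal sketch, assuming the catalogue exposes bge-large under that model name (confirm the exact id via the model list):

```python
from openai import OpenAI

client = OpenAI(
    api_key="cd-...",
    base_url="https://inf.bd-dha-1.clouddigit.ai/v1",
)

# Embed a batch of passages, e.g. to index into a vector database for RAG.
emb = client.embeddings.create(
    model="bge-large",  # assumed catalogue id; confirm via client.models.list()
    input=[
        "Dhaka is the capital of Bangladesh.",
        "Cox's Bazar has one of the longest natural sea beaches.",
    ],
)
vectors = [d.embedding for d in emb.data]
print(len(vectors), len(vectors[0]))  # number of inputs, embedding dimension
```

Pair these vectors with the Vector Database service (see Related) for retrieval.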
Deployment modes¶
| Mode | Use case |
|---|---|
| Shared, on-demand | Pay per token; multi-tenant; lowest cost for spiky use |
| Dedicated | Reserved GPU(s); single-tenant; predictable latency |
| Serverless cold | Scale-to-zero; cold start ~30 s for small models |
Throughput and latency¶
For the shared, on-demand pool we publish per-model:
- Throughput target (tokens/s)
- p50 / p95 / p99 first-token latency
- p50 / p95 / p99 inter-token latency
See the status page for current numbers.
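You can sanity-check those numbers from your own network path with a streaming request: first-token latency is the time to the first streamed chunk, and inter-token latency is the gap between subsequent chunks (chunks approximate tokens). A rough client-side sketch, reusing the endpoint and model from the example above:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

start = time.perf_counter()
first_token, last, gaps = None, None, []

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write two sentences about Dhaka."}],
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    now = time.perf_counter()
    if first_token is None:
        first_token = now - start      # time to first streamed chunk
    else:
        gaps.append(now - last)        # gap between consecutive chunks
    last = now

print(f"first-token latency: {first_token:.3f}s")
if gaps:
    print(f"mean inter-token latency: {sum(gaps) / len(gaps) * 1000:.1f}ms")
```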
Pricing¶
- Shared on-demand — per million input tokens + per million output tokens (BDT)
- Dedicated — per-GPU-hour (the GPU VMs rate plus a small managed-service margin)
See Pricing.
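Shared on-demand cost is linear in tokens, so a back-of-envelope is straightforward. A sketch with placeholder rates; substitute the published BDT figures from the Pricing page:

```python
# Placeholder per-million-token rates in BDT; substitute the published ones.
RATE_IN_PER_M = 40.0    # BDT per 1M input tokens (hypothetical)
RATE_OUT_PER_M = 120.0  # BDT per 1M output tokens (hypothetical)

def monthly_cost_bdt(requests_per_day, in_tokens, out_tokens, days=30):
    """Shared on-demand estimate: total tokens times the per-million rates."""
    total_in = requests_per_day * in_tokens * days
    total_out = requests_per_day * out_tokens * days
    return total_in / 1e6 * RATE_IN_PER_M + total_out / 1e6 * RATE_OUT_PER_M

# e.g. 10,000 requests/day at ~800 input + ~300 output tokens each
print(f"{monthly_cost_bdt(10_000, 800, 300):,.0f} BDT/month")
```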
Compliance¶
Prompts and completions are not logged by default beyond a redacted summary; full-prompt logging is opt-in per endpoint, and any logged data stays in Bangladesh. For regulated workloads, run dedicated endpoints with logging-off.
Related¶
- GPU VMs — what dedicated endpoints run on
- LLM-as-a-Service — the Cloud Digit-branded LLM offering on top of this
- Vector Database — pair with embedding endpoints for RAG
Operate this service¶
Managed model-serving endpoints with autoscaling, A/B testing, and integrated quota.
Endpoint types¶
| Type | Use | Scale-to-zero |
|---|---|---|
| realtime | Synchronous HTTP, low-latency | Yes (optional) |
| streaming | Token-streaming for LLM completions | No (warm pool) |
| batch | Async, queue-based, large input | N/A |
| serverless-async | Async, callback-based | Yes |
IAM¶
| Role | Can do |
|---|---|
| inference.viewer | List endpoints, view metrics |
| inference.invoker | Call endpoints (via API token) |
| inference.deployer | Create / update endpoints |
| inference.admin | Above + quota changes, model registry management |
Model registry¶
Endpoints reference a model version in the registry — not a raw image:
```bash
cd inference model register \
  --name acme-summarizer \
  --version v1.4 \
  --artifact s3://acme-models/summarizer-v1.4/ \
  --framework pytorch \
  --runtime cuda12.6
```
The model version is immutable; deploys reference acme-summarizer:v1.4. To promote, create a new version and point the endpoint at it.
Endpoint config¶
```bash
cd inference endpoint create \
  --name acme-summarize-prod \
  --model acme-summarizer:v1.4 \
  --instance-type gpu-a10-1x \
  --min-replicas 2 --max-replicas 20 \
  --target-concurrency 8 \
  --timeout 30s
```
Cost shape¶
Billed per replica-hour while warm. Scale-to-zero possible but cold-start = model load time (often 10–30 s for LLMs).
For latency-sensitive endpoints, min-replicas >= 1.
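The trade-off is a fixed warm floor versus cold starts. A sketch with a placeholder replica-hour rate (see Pricing for the real gpu-a10-1x figure):

```python
# Hypothetical gpu-a10-1x replica-hour rate in BDT; see Pricing for the real one.
REPLICA_HOUR_BDT = 350.0

def warm_floor_bdt(min_replicas, hours_per_month=730):
    """Monthly floor cost of keeping `min_replicas` warm regardless of traffic."""
    return min_replicas * hours_per_month * REPLICA_HOUR_BDT

print(warm_floor_bdt(0))  # scale-to-zero: no floor, but 10-30 s cold starts
print(warm_floor_bdt(2))  # min-replicas 2: no cold starts, fixed monthly floor
```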
Audit¶
Every inference call logged (request-id, principal, latency, status). Optional input/output logging (for accuracy debugging) — usually disabled in prod due to PII.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| inference.requests_per_sec | varies | |
| inference.latency_ms p95 | within SLO | breach |
| inference.error_rate | < 0.1% | > 1% |
| inference.gpu_utilization_pct | 60–80% | < 30% (over-provisioned) or > 90% (under-provisioned) |
| inference.queue_depth | < target | spikes (under-capacity) |
| inference.replica_count | scales with load | stuck at max |
Autoscaling tuning¶
target-concurrency is the knob. Set it to the number of concurrent requests per replica at which latency is still acceptable:
```bash
cd inference endpoint update --name acme-summarize-prod --target-concurrency 6
```
Lower → more replicas, lower latency, more cost. Higher → fewer replicas, higher latency, lower cost.
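For a starting value, size from Little's law: in-flight requests at peak ≈ peak request rate × average request duration, and replicas needed ≈ that concurrency divided by target-concurrency. A rough sizing sketch:

```python
import math

def replicas_needed(peak_rps, avg_request_seconds, target_concurrency):
    """Rough sizing: in-flight requests at peak / per-replica concurrency target."""
    peak_concurrency = peak_rps * avg_request_seconds   # Little's law
    return math.ceil(peak_concurrency / target_concurrency)

# e.g. 12 req/s at peak, ~3 s per request, target-concurrency 6
print(replicas_needed(12, 3.0, 6))   # -> 6; keep max-replicas above this
```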
A/B and canary deploys¶
```bash
# Deploy v1.5 alongside v1.4
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.4=90,v1.5=10'

# Promote if metrics look good
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.5=100'
```
A/B logs include the model-version label, so dashboards split metrics correctly.
Quantization and optimization¶
For LLMs and vision models, commonly applied optimizations:
- INT8 / FP8 quantization (TensorRT, vLLM) — 2-4× throughput, minimal accuracy loss
- Continuous batching (vLLM, TGI) — 5-10× throughput for LLMs
- Speculative decoding — 1.5-2× for some workloads
Cloud Digit ships pre-optimized servers (vLLM, Triton); use those rather than raw model.predict().
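Continuous batching only pays off when the client actually keeps requests in flight concurrently; a strictly sequential loop leaves the batcher idle. A sketch using the async OpenAI client to issue concurrent requests (the prompts are placeholders; model and endpoint as in the earlier example):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

async def summarize(text: str) -> str:
    resp = await client.chat.completions.create(
        model="llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": f"Summarize in one sentence: {text}"}],
    )
    return resp.choices[0].message.content

async def main():
    docs = ["first document ...", "second document ...", "third document ..."]
    # Concurrent requests let the server-side continuous batcher fill the GPU.
    summaries = await asyncio.gather(*(summarize(d) for d in docs))
    for s in summaries:
        print(s)

asyncio.run(main())
```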
Cold start¶
For scale-to-zero or new replica spin-up:
- Model load from S3 → GPU: 5–30 s typical (depends on model size)
- To avoid cold starts entirely, keep min-replicas >= 1
- Snapshot-based fast scale supported on H100 SKUs (sub-second from snapshot)
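If you do run scale-to-zero, one mitigation is a scheduled warm-up request shortly before expected traffic, so the replica loads the model off the critical path. A minimal sketch (the scheduler that triggers it is outside this snippet):

```python
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

def warm_up(model: str) -> None:
    """Tiny request that forces a cold replica to load the model before real traffic."""
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )

warm_up("llama-3.1-70b-instruct")  # run from a cron job ahead of peak hours
```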
Quotas¶
GPU inference shares quota with training. Plan separately if both happen in the same project:
```bash
cd inference quota set --project acme --instance-type gpu-a10-1x --max-replicas 20
```
High p95 latency¶
| Symptom | Likely cause |
|---|---|
| All requests slow | Model load misconfig (CPU instead of GPU) |
| Most fast, some slow | Queue depth spikes; under-provisioned |
| Slow during scale-up | Cold start of new replica |
| Slow only on certain inputs | Input-length-sensitive (LLM with long context) |
```bash
cd inference logs --endpoint acme-summarize-prod --filter "latency>1000ms"
```
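To confirm the input-length case from the client side, time the same request at increasing prompt lengths; latency that grows roughly with prompt size points at prefill cost rather than capacity. A rough sketch:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

filler = "the quick brown fox jumps over the lazy dog "
for repeats in (1, 50, 200, 800):   # roughly increasing prompt lengths
    prompt = filler * repeats + "\nSummarize the text above in one sentence."
    start = time.perf_counter()
    client.chat.completions.create(
        model="llama-3.1-70b-instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=64,
    )
    print(f"{repeats * 9:6d} filler words: {time.perf_counter() - start:.2f}s")
```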
Errors after model deploy¶
Symptoms: 5xx rate spikes right after a new model version went live.
- Model artifact corrupted in S3 — check checksum
- Framework/runtime mismatch (PyTorch 2.5 model on PyTorch 2.4 server)
- Quantization broke (bug in the optimization pipeline)
Roll back:
```bash
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.4=100,v1.5=0'
```
GPU under-utilized despite load¶
| Symptom | Cause |
|---|---|
| GPU util 30% with full queue | Batch size too small; enable continuous batching |
| GPU util fluctuates 0-100% | Dataloader bottleneck (rare for inference) |
| GPU util 95% but throughput low | Model is compute-bound — quantize or shard |
vLLM, TGI, Triton all support dynamic batching — enable in the server config.
Replica scale-up stuck¶
inference.replica_count not climbing despite high queue:
- Max replicas reached
- GPU capacity exhausted in the project / region — request quota
- Cold start failing (model load fails) — check replica logs
```bash
cd inference replicas --endpoint acme-summarize-prod
```
Memory leak¶
Replicas die after N requests:
- Memory leak in the model server (vLLM, custom) — find via memory metrics over time
- Caching too aggressively (KV cache for LLMs) — tune gpu_memory_utilization
- Restart policy: set --max-requests-per-replica 10000 to force a periodic restart
Quantized model accuracy regression¶
You quantized a model and quality degraded:
- INT4 / FP4 quantization is more aggressive; some models can't tolerate it
- Calibration dataset wasn't representative
- Backbone of the model is quantization-sensitive
Run a quality eval suite on the quantized version before promoting.
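A minimal version of such an eval: run the same prompts through the previous and the quantized versions and compare answers (or score with a judge model). The eval cases, the exact-match check, and the model ids below are placeholders; how registry versions map to OpenAI-style model names depends on your endpoint setup:

```python
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

# Tiny placeholder eval set; a real suite is task-specific and much larger.
EVAL = [
    {"prompt": "What is the capital of Bangladesh? Answer in one word.", "expect": "Dhaka"},
    {"prompt": "What is 2 + 2? Answer with just the number.", "expect": "4"},
]

def accuracy(model: str) -> float:
    hits = 0
    for case in EVAL:
        out = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        ).choices[0].message.content or ""
        hits += case["expect"].lower() in out.lower()
    return hits / len(EVAL)

# Hypothetical model ids for the baseline and the quantized candidate.
print("baseline :", accuracy("acme-summarizer-v1.4"))
print("quantized:", accuracy("acme-summarizer-v1.5"))
```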
API token rejected¶
ERROR 401: invalid token
- Token expired (TTL ≤ 90 days)
- Token wrong scope (issued for a different project)
- Wrong endpoint URL (region mismatch — inference endpoints are regional)
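A quick way to separate a bad token from a bad URL is to list models with the same credentials: a 401 points at the token (expired or wrong project), while a connection or not-found error usually points at the regional endpoint URL. A sketch:

```python
from openai import OpenAI, AuthenticationError

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

try:
    models = client.models.list()
    print("token OK; visible models:", [m.id for m in models.data])
except AuthenticationError as e:
    print("401: token expired or scoped to another project:", e)
except Exception as e:
    # Connection or not-found errors here usually mean a wrong regional URL.
    print("endpoint URL problem:", e)
```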