Inference Endpoints

Service ownership

Owner: ai-platform (ai-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Hosted model endpoints — autoscaled, OpenAI-compatible API surface, sovereign-resident.

What it is

A managed inference service. Bring a model artefact (or pick from the catalogue), get an HTTPS endpoint that speaks the OpenAI Chat Completions / Responses API, scaled to your traffic. Cold-start mitigated by warm-pool minimums.

Why an OpenAI-compatible API

It means every existing OpenAI SDK and plugin (LangChain, LlamaIndex, OpenAI Python client, AnythingLLM, Cursor, etc.) works against Cloud Digit endpoints with only an api_base change. No code rewrite. This is the same pattern that made vLLM, Together, Groq, etc. easy to adopt.

```python
from openai import OpenAI

client = OpenAI(
    api_key="cd-...",
    base_url="https://inf.bd-dha-1.clouddigit.ai/v1",
)
resp = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Hello, Bangladesh!"}],
)
```

Models

The catalogue is curated — open-weight models that we operate at scale, kept current.

| Family | Versions |
| --- | --- |
| Meta Llama | 3.1-8B, 3.1-70B, 3.3-70B |
| Mistral | Mistral 7B, Mixtral 8x22B, Mistral Large |
| Qwen | Qwen 2.5 7B / 32B / 72B |
| Gemma | Gemma 2 9B / 27B |
| Embeddings | bge-large, e5-mistral-7b-instruct |
| Bring your own | Any HF Transformers / GGUF / vLLM-supported model |

Deployment modes

| Mode | Use case |
| --- | --- |
| Shared, on-demand | Pay per token; multi-tenant; lowest cost for spiky use |
| Dedicated | Reserved GPU(s); single-tenant; predictable latency |
| Serverless cold | Scale-to-zero; cold start ~30 s for small models |

Throughput and latency

For the shared, on-demand pool we publish per-model:

  • Throughput target (tokens/s)
  • p50 / p95 / p99 first-token latency
  • p50 / p95 / p99 inter-token latency

See the status page for current numbers.
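
If you want to sanity-check the published numbers from the client side, here is a minimal sketch using the OpenAI Python client against a Cloud Digit endpoint (the model name and prompt are illustrative); it measures first-token and inter-token latency from a streamed response:

```python
import time
from openai import OpenAI

client = OpenAI(
    api_key="cd-...",
    base_url="https://inf.bd-dha-1.clouddigit.ai/v1",
)

start = time.perf_counter()
first_token_s = None
token_arrivals = []

stream = client.chat.completions.create(
    model="llama-3.1-70b-instruct",
    messages=[{"role": "user", "content": "Write one sentence about Dhaka."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content delta; skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        now = time.perf_counter()
        if first_token_s is None:
            first_token_s = now - start  # first-token latency
        token_arrivals.append(now)

gaps = [b - a for a, b in zip(token_arrivals, token_arrivals[1:])]
print(f"first-token latency: {first_token_s:.3f} s")
if gaps:
    print(f"mean inter-token latency: {1000 * sum(gaps) / len(gaps):.1f} ms")
```

A single call is only a point sample; run it many times to get meaningful p95 / p99 figures.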

Pricing

  • Shared on-demand — per million input tokens + per million output tokens (BDT)
  • Dedicated — per-GPU-hour (the GPU VMs rate plus a small managed-service margin)

See Pricing.

Compliance

Prompts and completions are not logged by default beyond a redacted summary; full-prompt logging is opt-in per endpoint, and any logged data stays in Bangladesh. For regulated workloads, run dedicated endpoints with logging-off.

Operate this service

Managed model-serving endpoints with autoscaling, A/B testing, and integrated quota.

Endpoint types

| Type | Use | Scale-to-zero |
| --- | --- | --- |
| realtime | Synchronous HTTP, low-latency | Yes (optional) |
| streaming | Token-streaming for LLM completions | No (warm pool) |
| batch | Async, queue-based, large input | N/A |
| serverless-async | Async, callback-based | Yes |

IAM

| Role | Can do |
| --- | --- |
| inference.viewer | List endpoints, view metrics |
| inference.invoker | Call endpoints (via API token) |
| inference.deployer | Create / update endpoints |
| inference.admin | Above + quota changes, model registry management |

Model registry

Endpoints reference a model version in the registry — not a raw image:

```bash
cd inference model register \
  --name acme-summarizer \
  --version v1.4 \
  --artifact s3://acme-models/summarizer-v1.4/ \
  --framework pytorch \
  --runtime cuda12.6
```

The model version is immutable. Deploys reference acme-summarizer:v1.4. To promote, register a new version and point the endpoint at it.

Endpoint config

```bash
cd inference endpoint create \
  --name acme-summarize-prod \
  --model acme-summarizer:v1.4 \
  --instance-type gpu-a10-1x \
  --min-replicas 2 --max-replicas 20 \
  --target-concurrency 8 \
  --timeout 30s
```

Cost shape

Billed per replica-hour while warm. Scale-to-zero is possible, but the cold start equals the model load time (often 10–30 s for LLMs).

For latency-sensitive endpoints, min-replicas >= 1.
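
As a back-of-envelope illustration of the warm-pool floor (the per-replica-hour rate below is made up; see Pricing for real BDT figures):

```python
# Hypothetical rate, for illustration only -- real BDT rates are on the Pricing page.
rate_per_replica_hour = 250.0
min_replicas = 2
hours_per_month = 730  # average month

# The floor you pay before serving any traffic: min-replicas kept warm all month.
warm_floor = min_replicas * hours_per_month * rate_per_replica_hour
print(f"Warm-pool floor: {warm_floor:,.0f} per month")
```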

Audit

Every inference call is logged (request-id, principal, latency, status). Input/output logging (for accuracy debugging) is optional and usually disabled in prod due to PII.

Metrics

| Metric | Healthy | Alert |
| --- | --- | --- |
| inference.requests_per_sec | varies | |
| inference.latency_ms | p95 within SLO | SLO breach |
| inference.error_rate | < 0.1% | > 1% |
| inference.gpu_utilization_pct | 60–80% | < 30% (over-provisioned) or > 90% (under-provisioned) |
| inference.queue_depth | < target | spikes (under-capacity) |
| inference.replica_count | scales with load | stuck at max |

Autoscaling tuning

target-concurrency is the knob. Set it to the number of concurrent requests per replica at which latency is still acceptable:

```bash
cd inference endpoint update --name acme-summarize-prod --target-concurrency 6
```

Lower → more replicas, lower latency, more cost. Higher → fewer replicas, higher latency, lower cost.
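
A rough way to reason about the knob (this is Little's-law arithmetic, not the scheduler's exact algorithm): the replica count the autoscaler converges on is roughly in-flight requests divided by target-concurrency.

```python
import math

def replicas_needed(req_per_sec: float, avg_latency_s: float, target_concurrency: int) -> int:
    """Rough sizing: Little's law says in-flight requests ~= arrival rate x latency."""
    in_flight = req_per_sec * avg_latency_s
    return max(1, math.ceil(in_flight / target_concurrency))

# e.g. 40 req/s at ~1.2 s average latency with --target-concurrency 6
print(replicas_needed(40, 1.2, 6))  # -> 8 replicas
```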

A/B and canary deploys

```bash
# Deploy v1.5 alongside v1.4
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.4=90,v1.5=10'

# Promote if metrics look good
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.5=100'
```

A/B logs include the model-version label, so dashboards split metrics correctly.
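
As a sketch of how a per-version latency split could be computed from an exported log file (the JSON-lines format and the model_version / latency_ms field names are assumptions for illustration):

```python
import json
from collections import defaultdict
from statistics import quantiles

latency_by_version = defaultdict(list)
with open("inference-logs.jsonl") as f:  # assumed export format
    for line in f:
        rec = json.loads(line)
        latency_by_version[rec["model_version"]].append(rec["latency_ms"])

for version, samples in sorted(latency_by_version.items()):
    p95 = quantiles(samples, n=20)[18]  # 95th percentile
    print(f"{version}: n={len(samples)}  p95={p95:.0f} ms")
```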

Quantization and optimization

Optimizations commonly applied to LLM and vision models:

  • INT8 / FP8 quantization (TensorRT, vLLM) — 2-4× throughput, minimal accuracy loss
  • Continuous batching (vLLM, TGI) — 5-10× throughput for LLMs
  • Speculative decoding — 1.5-2× for some workloads

Cloud Digit ships pre-optimized servers (vLLM, Triton); use those rather than raw model.predict().
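
For context on what those pre-optimized servers do under the hood, here is a minimal offline vLLM sketch combining continuous batching (automatic across the prompt list) with quantization. The model path is illustrative, and quantization="awq" assumes an AWQ-quantized checkpoint:

```python
from vllm import LLM, SamplingParams

# Illustrative model path; AWQ requires a pre-quantized checkpoint.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_memory_utilization=0.90,  # same knob referenced under "Memory leak" below
)
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches these prompts continuously on the GPU -- no manual batching code.
outputs = llm.generate(["Summarize vLLM in one line.", "What is FP8?"], params)
for out in outputs:
    print(out.outputs[0].text)
```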

Cold start

For scale-to-zero or new replica spin-up:

  • Model load from S3 → GPU: 5–30 s typical (depends on model size)
  • To avoid cold starts entirely, keep min-replicas >= 1 (the endpoint then never scales to zero)
  • Snapshot-based fast scale supported on H100 SKUs (sub-second from snapshot)

Quotas

GPU inference shares quota with training. Plan separately if both happen in the same project:

```bash
cd inference quota set --project acme --instance-type gpu-a10-1x --max-replicas 20
```

High p95 latency

| Symptom | Likely cause |
| --- | --- |
| All requests slow | Model load misconfig (CPU instead of GPU) |
| Most fast, some slow | Queue depth spikes; under-provisioned |
| Slow during scale-up | Cold start of new replica |
| Slow only on certain inputs | Input-length-sensitive (LLM with long context) |

```bash
cd inference logs --endpoint acme-summarize-prod --filter "latency>1000ms"
```

Errors after model deploy

Symptom: the 5xx rate spikes right after a new model version goes live.

  • Model artifact corrupted in S3 — check checksum
  • Framework/runtime mismatch (PyTorch 2.5 model on PyTorch 2.4 server)
  • Quantization broke (bug in the optimization pipeline)

Roll back:

```bash
cd inference endpoint update --name acme-summarize-prod \
  --traffic-split 'v1.4=100,v1.5=0'
```

GPU under-utilized despite load

| Symptom | Cause |
| --- | --- |
| GPU util 30% with full queue | Batch size too small; enable continuous batching |
| GPU util fluctuates 0–100% | Dataloader bottleneck (rare for inference) |
| GPU util 95% but throughput low | Model is compute-bound; quantize or shard |

vLLM, TGI, Triton all support dynamic batching — enable in the server config.

Replica scale-up stuck

inference.replica_count not climbing despite high queue:

  • Max replicas reached
  • GPU capacity exhausted in the project / region — request quota
  • Cold start failing (model load fails) — check replica logs

```bash
cd inference replicas --endpoint acme-summarize-prod
```

Memory leak

Replicas die after N requests:

  • Memory leak in the model server (vLLM, custom) — find via memory metrics over time
  • Caching too aggressively (KV cache for LLMs) — tune gpu_memory_utilization
  • Restart policy: set --max-requests-per-replica 10000 to force periodic restart

Quantized model accuracy regression

You quantized a model and quality degraded:

  • INT4 / FP4 quantization is more aggressive; some models can't tolerate it
  • Calibration dataset wasn't representative
  • Backbone of the model is quantization-sensitive

Run a quality eval suite on the quantized version before promoting.
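
A minimal sketch of such a check, comparing the baseline and the quantized version through the same OpenAI-compatible API (the eval cases, the keyword-match scoring, and the version-qualified model names are assumptions; substitute your real eval suite):

```python
from openai import OpenAI

client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")

# Toy eval cases; replace with a representative suite for your workload.
eval_cases = [
    {"prompt": "Summarize: the invoice is due on 30 June.", "must_contain": "30 June"},
    {"prompt": "Summarize: the shipment is delayed by two weeks.", "must_contain": "two weeks"},
]

def score(model: str) -> float:
    hits = 0
    for case in eval_cases:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        )
        if case["must_contain"].lower() in resp.choices[0].message.content.lower():
            hits += 1
    return hits / len(eval_cases)

print("baseline :", score("acme-summarizer:v1.4"))       # hypothetical model names
print("quantized:", score("acme-summarizer:v1.5-int8"))
```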

API token rejected

ERROR 401: invalid token

  • Token expired (TTL ≤ 90 days)
  • Token wrong scope (issued for a different project)
  • Wrong endpoint URL (region mismatch — inference endpoints are regional)
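
A quick way to separate token problems from region problems (assuming the endpoint exposes the standard /v1/models route, as OpenAI-compatible servers generally do):

```python
from openai import OpenAI

# If this call returns 401, the token itself is bad (expired or wrong project scope).
# If it succeeds here but fails against another base_url, the region is mismatched.
client = OpenAI(api_key="cd-...", base_url="https://inf.bd-dha-1.clouddigit.ai/v1")
for model in client.models.list():
    print(model.id)
```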