LLM-as-a-Service¶
Service ownership
Owner: application-services (apps-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Cloud Digit's branded LLM offering on top of Inference Endpoints — sovereign-resident, BDT-billed, OpenAI-API-compatible.
Why a Cloud Digit LLM service exists¶
Off-shore LLMs (OpenAI, Anthropic direct, Gemini, Cohere) ship customer data outside Bangladesh. For regulated industries (banks, IDRA-supervised insurers, government workloads) this is often a non-starter. Cloud Digit LLMaaS gives you the same OpenAI-compatible API that your existing LangChain / LlamaIndex / Cursor / app code already expects, but the inference happens on-shore.
What's included over raw Inference Endpoints¶
| Capability | Inference Endpoints | LLMaaS |
|---|---|---|
| OpenAI-compatible API | ✓ | ✓ |
| Curated open-weight model catalogue | ✓ | ✓ |
| Cloud Digit-hosted dashboard (key mgmt, usage) | | ✓ |
| Per-user / per-team rate limits and quotas | | ✓ |
| Logging dashboard with redaction controls | | ✓ |
| Prompt-injection / output filter | | ✓ |
| Bengali-aware safety classifiers | | ✓ |
| SSO with your IdP (Azure AD, Keycloak) | | ✓ |
LLMaaS is the right pick for "give me an enterprise LLM gateway"; raw Inference Endpoints is the right pick for "I'm building a single product and want raw model access."
Models in the catalogue¶
See Inference Endpoints — Models. LLMaaS exposes the same set, plus optional fine-tuned variants for Bengali specifically (Llama 3.1 8B-bn, Mistral 7B-bn) on request.
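Because the service speaks the OpenAI API, the catalogue can also be listed programmatically. A minimal sketch, assuming the gateway exposes the standard /v1/models route (most OpenAI-compatible gateways do, but this is an assumption worth verifying):

```python
import os

from openai import OpenAI

# Base URL and token as documented under "API compatibility" below.
client = OpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)

# Standard OpenAI-compatible model listing; whether the CD gateway
# implements /v1/models is an assumption.
for model in client.models.list():
    print(model.id)
```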
Per-team controls¶
- Daily and monthly token caps per team
- Per-model allow-list per team
- Audit log per request (redacted by default)
- Right-to-deletion of logged data on request
Pricing¶
- Per million input tokens + per million output tokens (BDT, by model)
- Volume discounts at committed token-volume tiers
See Pricing.
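As a back-of-the-envelope example of how the token-based formula works (the rates below are hypothetical placeholders, not published Cloud Digit prices):

```python
# Hypothetical per-million-token rates in BDT -- placeholders only,
# NOT actual Cloud Digit prices; see the Pricing page for real numbers.
IN_RATE_BDT_PER_M = 30.0
OUT_RATE_BDT_PER_M = 90.0

input_tokens = 1_200_000
output_tokens = 400_000

cost_bdt = (input_tokens / 1e6) * IN_RATE_BDT_PER_M \
         + (output_tokens / 1e6) * OUT_RATE_BDT_PER_M
print(f"Estimated cost: {cost_bdt:.2f} BDT")  # 36.00 + 36.00 = 72.00 BDT
```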
Operate this service¶
Cloud Digit hosted LLM endpoints — sovereignty-respecting, BDT-billed, no data leaves Bangladesh.
Available models¶
| Model family | Use case |
|---|---|
| llama-3.x-8b-instruct | General chat, fast inference |
| llama-3.x-70b-instruct | Complex reasoning, slower |
| bangla-llm-7b | Bengali-fluent open model |
| code-llama-13b | Code generation |
| Custom fine-tuned | Customer's fine-tune of any of the above |
All hosted on CD GPU infra, in-country. No data egress to international model providers.
IAM¶
| Role | Can do |
|---|---|
| llm.viewer | View available models, usage metrics |
| llm.invoker | Call LLM endpoints |
| llm.deployer | Deploy custom fine-tuned models |
| llm.admin | Manage quotas, fine-tuning jobs |
API compatibility¶
The API is OpenAI-compatible; existing OpenAI SDKs work with only a base URL change:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)
```
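A minimal chat call against the client above; the model ID is taken from the catalogue table, but confirm the exact name exposed in your project:

```python
resp = client.chat.completions.create(
    model="llama-3.x-8b-instruct",  # assumed catalogue name; verify per project
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Why does on-shore inference matter for banks?"},
    ],
)
print(resp.choices[0].message.content)
```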
Quota and rate limits¶
| Tier | Tokens/min | Tokens/day | Concurrent requests |
|---|---|---|---|
| llm-dev | 10k | 100k | 4 |
| llm-prod-small | 100k | 5M | 20 |
| llm-prod-large | 1M | 50M | 100 |
| llm-enterprise | Negotiated | Negotiated | Dedicated |
Custom fine-tuning¶
Bring your own data; CD trains and hosts the result:

```bash
cd llm fine-tune create \
  --base-model llama-3.1-8b-instruct \
  --training-data s3://acme-llm-data/training.jsonl \
  --validation-data s3://acme-llm-data/val.jsonl \
  --epochs 3
```
The fine-tuned model becomes a new endpoint, accessible like any base model.
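For example, once deployed, the fine-tune is invoked by its model ID like any other catalogue entry (the ID below is hypothetical):

```python
# "acme-support-8b" is a hypothetical fine-tune ID; use the ID reported
# by `cd llm fine-tune deploy` for your job.
resp = client.chat.completions.create(
    model="acme-support-8b",
    messages=[{"role": "user", "content": "Summarise this support ticket."}],
)
print(resp.choices[0].message.content)
```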
Data residency¶
Inputs and outputs never leave Bangladesh. Training data stays in your CD project. Suitable for BFRS / BB ICT 4.0 / regulated industries.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| llm.tokens_per_sec.aggregate | matches load | |
| llm.latency.first_token_ms p95 | < 500 | > 2000 |
| llm.latency.tokens_per_sec (stream) | > 50 | < 20 |
| llm.errors_per_min | < 0.1% | > 1% |
| llm.quota.used_pct | < 80% | > 90% |
| llm.cost_bdt.mtd | within budget | climbing |
Prompt engineering tips for CD-hosted models¶
- System prompt: clear instructions, examples (few-shot)
- Temperature: 0.0-0.3 for deterministic tasks; 0.7+ for creative
- max_tokens: cap to avoid runaway generations
- Streaming: enable for chat UX; reduces perceived latency
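Putting those knobs together, a minimal sketch using the client from API compatibility (deterministic settings shown; raise the temperature for creative tasks):

```python
# Streaming chat completion applying the tips above.
stream = client.chat.completions.create(
    model="llama-3.x-8b-instruct",  # assumed catalogue name
    messages=[
        # Clear system instructions; add few-shot examples here if needed.
        {"role": "system", "content": "Answer in one short paragraph."},
        {"role": "user", "content": "Explain continuous batching."},
    ],
    temperature=0.2,  # deterministic end of the 0.0-0.3 range
    max_tokens=256,   # hard cap against runaway generations
    stream=True,      # streaming reduces perceived latency in chat UX
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```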
Throughput optimization¶
For LLMs, batching matters more than anything else:

- vLLM (CD's serving framework) does continuous batching automatically
- Higher concurrent request count → higher overall throughput (up to the concurrency limit)
- Lower request rate → lower per-request latency but higher per-token cost
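A sketch of issuing requests concurrently so the server-side batcher can fill its batches; keep the fan-out under your tier's concurrent-request limit:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="llama-3.x-8b-instruct",  # assumed catalogue name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarise document {i}" for i in range(16)]
    # In-flight requests are batched server-side by vLLM's continuous
    # batching; a fan-out of 16 stays under the llm-prod-small limit of 20.
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(results), "responses")

asyncio.run(main())
```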
Fine-tuning workflow¶
- Prepare training data (JSONL with prompt/completion pairs; an example record follows below)
- Validate the format:

  ```bash
  cd llm fine-tune validate-data --file training.jsonl
  ```

- Estimate cost and duration:

  ```bash
  cd llm fine-tune estimate ...
  ```

- Submit the job
- Monitor:

  ```bash
  cd llm fine-tune status --job <id>
  ```

- Deploy:

  ```bash
  cd llm fine-tune deploy --job <id>
  ```
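An illustrative training record, written as JSONL from Python (the prompt/completion field names are assumed; validate-data is the authoritative schema check):

```python
import json

# Assumed field names based on the prompt/completion convention above;
# confirm with `cd llm fine-tune validate-data`.
record = {
    "prompt": "Customer asks: my card is blocked, what should I do?",
    "completion": "Apologise, verify identity, then raise an unblock request.",
}
with open("training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```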
Typical training time is 1-12 hours, depending on model size and data volume.
Cost monitoring¶
Pricing is token-based. Review usage daily:

```bash
cd llm usage --since 24h --group-by model,project
```

Common cost drivers:

- Repeated prompts (use caching)
- Long system prompts (shorten them, or bake the instructions into a fine-tune)
- Excessive max_tokens (cap tightly)
- Inefficient retries (a single error triggering multiple retries)
Caching¶
For repeated similar prompts, enable response caching:
```python
resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[...],
    extra_headers={"X-CD-Cache": "enable"},
)
```
Cache hit returns cached response (free); miss bills normally.
Quota exceeded¶
```
ERROR 429: rate_limit_exceeded
```

- Tokens/min hit — implement client-side throttling (see the backoff sketch below)
- Tokens/day hit — wait for the next reset (UTC midnight) or upgrade tier
- Concurrent requests hit — reduce parallelism

```bash
cd llm quota show --token <id>
```
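A minimal client-side throttling sketch that retries 429s with exponential backoff (the OpenAI SDK has built-in retries too; this just makes the idea explicit):

```python
import time

import openai

# `client` is the OpenAI-compatible client from "API compatibility".
def complete_with_backoff(prompt: str, max_attempts: int = 5):
    """Retry on 429s with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="llama-3.x-8b-instruct",  # assumed catalogue name
                messages=[{"role": "user", "content": prompt}],
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"rate limited after {max_attempts} attempts")
```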
High latency¶
llm.latency.first_token_ms p95 > 2 s:
- Model loading (cold start) — keep model warm with periodic ping
- Long input prompt (prefill is bound by input tokens; flagship LLMs handle long inputs, but slowly)
- Inference cluster under-provisioned — request capacity bump
Streaming reduces perceived latency for end users even if total time is unchanged.
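A keep-warm ping can be as simple as a scheduled one-token request; the interval below is an assumption, so match it to the observed cold-start eviction window:

```python
import time

# `client` is the OpenAI-compatible client from "API compatibility".
# A periodic tiny request keeps the model resident on the GPU; the
# 5-minute interval is an assumption, not a documented eviction window.
while True:
    client.chat.completions.create(
        model="llama-3.x-70b-instruct",  # assumed catalogue name
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    time.sleep(300)
```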
Outputs look wrong¶
| Symptom | Likely cause |
|---|---|
| Repetitive / loopy | Temperature too low; raise to 0.7 |
| Random / incoherent | Temperature too high; lower to 0.3 |
| Doesn't follow instructions | System prompt unclear or weak |
| Uses wrong language | Specify in system prompt or use bangla-llm |
Fine-tuning failed¶
```
ERROR: fine-tune job failed: invalid data format
```
- Validate data: each line is valid JSON with required fields
- Check token-length distribution (very long examples cause OOM)
- Verify model compatibility (some bases require specific format)
Re-submit with fixes.
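A quick local pre-check along these lines catches the common failures before resubmitting (the field names and length cutoff are assumptions; validate-data remains authoritative):

```python
import json

REQUIRED = {"prompt", "completion"}  # assumed required fields
MAX_CHARS = 8000  # crude character proxy for token length; tune as needed

with open("training.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            print(f"line {lineno}: not valid JSON")
            continue
        missing = REQUIRED - rec.keys()
        if missing:
            print(f"line {lineno}: missing fields {missing}")
        if sum(len(str(v)) for v in rec.values()) > MAX_CHARS:
            print(f"line {lineno}: very long example; may cause OOM")
```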
Cost spike¶
llm.cost_bdt.mtd climbing fast:

- Audit top consumers: `cd llm usage --top-tokens 10`
- A bug calling the API in a tight loop
- A long system prompt repeated on every request
- max_tokens not set; the model generating up to its maximum

Cap it with a hard quota:

```bash
cd llm quota set --token <id> --max-tokens-day 1000000
```
Model deprecation¶
CD deprecates older models periodically (security patches, license changes):

- 90 days' notice
- A migration guide to the successor
- Both old and new models running during an overlap window
Hallucinations / made-up facts¶
LLMs hallucinate. Mitigations:

- Provide grounding context (RAG with Vector DB); see the sketch below
- Give an explicit "don't make up facts" instruction
- Lower the temperature (less creative)
- Post-process to verify factual claims
No silver bullet; even SOTA models hallucinate.
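A minimal grounding sketch combining the first two mitigations; retrieval is stubbed out here, so plug in the Vector DB service for real RAG:

```python
def answer_grounded(question: str, context_docs: list[str]) -> str:
    # In a real RAG pipeline context_docs would come from a Vector DB
    # similarity search; here they are passed in directly.
    context = "\n\n".join(context_docs)
    resp = client.chat.completions.create(  # client from "API compatibility"
        model="llama-3.x-8b-instruct",  # assumed catalogue name
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer ONLY from the provided context. If the answer is "
                    "not in the context, say you don't know. Do not make up "
                    "facts.\n\nContext:\n" + context
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0.1,  # low temperature: fewer creative inventions
    )
    return resp.choices[0].message.content
```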