LLM-as-a-Service¶
Service ownership
Owner: application-services (apps-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Cloud Digit's branded LLM offering on top of Inference Endpoints — sovereign-resident, BDT-billed, OpenAI-API-compatible.
Why a Cloud Digit LLM service exists¶
Off-shore LLMs (OpenAI, Anthropic direct, Gemini, Cohere) ship customer data outside Bangladesh. For regulated industries (banks, IDRA-supervised insurers, government workloads) this is often a non-starter. Cloud Digit LLMaaS gives you the same OpenAI-compatible API that your existing LangChain / LlamaIndex / Cursor / app code already expects, but the inference happens on-shore.
What's included over raw Inference Endpoints¶
| Capability | Inference Endpoints | LLMaaS |
|---|---|---|
| OpenAI-compatible API | ✓ | ✓ |
| Curated open-weight model catalogue | ✓ | ✓ |
| Cloud Digit-hosted dashboard (key mgmt, usage) | | ✓ |
| Per-user / per-team rate limits and quotas | | ✓ |
| Logging dashboard with redaction controls | | ✓ |
| Prompt-injection / output filter | | ✓ |
| Bengali-aware safety classifiers | | ✓ |
| SSO with your IdP (Azure AD, Keycloak) | | ✓ |
LLMaaS is the right pick for "give me an enterprise LLM gateway"; raw Inference Endpoints is the right pick for "I'm building a single product and want raw model access."
Models in the catalogue¶
See Inference Endpoints — Models. LLMaaS exposes the same set, plus optional fine-tuned variants for Bengali specifically (Llama 3.1 8B-bn, Mistral 7B-bn) on request.
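Because the service speaks the OpenAI API, the catalogue can also be listed programmatically. A minimal sketch, assuming the gateway exposes the standard /v1/models route (most OpenAI-compatible gateways do, but this is an assumption worth verifying):

```python
import os

from openai import OpenAI

# Base URL and token as documented under "API compatibility" below.
client = OpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)

# Standard OpenAI-compatible model listing; whether the CD gateway
# implements /v1/models is an assumption.
for model in client.models.list():
    print(model.id)
```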
Per-team controls¶
- Daily and monthly token caps per team
- Per-model allow-list per team
- Audit log per request (redacted by default)
- Right-to-deletion of logged data on request
Pricing¶
- Per million input tokens + per million output tokens (BDT, by model)
- Volume discounts at committed token-volume tiers
See Pricing.
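As a back-of-the-envelope example of how the token-based formula works (the rates below are hypothetical placeholders, not published Cloud Digit prices):

```python
# Hypothetical per-million-token rates in BDT -- placeholders only,
# NOT actual Cloud Digit prices; see the Pricing page for real numbers.
IN_RATE_BDT_PER_M = 30.0
OUT_RATE_BDT_PER_M = 90.0

input_tokens = 1_200_000
output_tokens = 400_000

cost_bdt = (input_tokens / 1e6) * IN_RATE_BDT_PER_M \
         + (output_tokens / 1e6) * OUT_RATE_BDT_PER_M
print(f"Estimated cost: {cost_bdt:.2f} BDT")  # 36.00 + 36.00 = 72.00 BDT
```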
Operate this service¶
Cloud Digit hosted LLM endpoints — sovereignty-respecting, BDT-billed, no data leaves Bangladesh.
Available models¶
| Model family | Use case |
|---|---|
| llama-3.x-8b-instruct | General chat, fast inference |
| llama-3.x-70b-instruct | Complex reasoning, slower |
| bangla-llm-7b | Bengali-fluent open model |
| code-llama-13b | Code generation |
| Custom fine-tuned | Customer's fine-tune of any of the above |
All hosted on CD GPU infra, in-country. No data egress to international model providers.
IAM¶
| Role | Can do |
|---|---|
| llm.viewer | View available models, usage metrics |
| llm.invoker | Call LLM endpoints |
| llm.deployer | Deploy custom fine-tuned models |
| llm.admin | Manage quotas, fine-tuning jobs |
API compatibility¶
The API is OpenAI-compatible; existing OpenAI SDKs work with only a base URL change:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)
```
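A minimal chat call against the client above; the model ID is taken from the catalogue table, but confirm the exact name exposed in your project:

```python
resp = client.chat.completions.create(
    model="llama-3.x-8b-instruct",  # assumed catalogue name; verify per project
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Why does on-shore inference matter for banks?"},
    ],
)
print(resp.choices[0].message.content)
```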
Quota and rate limits¶
| Tier | Tokens/min | Tokens/day | Concurrent requests |
|---|---|---|---|
| llm-dev | 10k | 100k | 4 |
| llm-prod-small | 100k | 5M | 20 |
| llm-prod-large | 1M | 50M | 100 |
| llm-enterprise | Negotiated | Negotiated | Dedicated |
Custom fine-tuning¶
Bring your own data; CD trains and hosts the result:

```bash
cd llm fine-tune create \
  --base-model llama-3.1-8b-instruct \
  --training-data s3://acme-llm-data/training.jsonl \
  --validation-data s3://acme-llm-data/val.jsonl \
  --epochs 3
```
The fine-tuned model becomes a new endpoint, accessible like any base model.
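For example, once deployed, the fine-tune is invoked by its model ID like any other catalogue entry (the ID below is hypothetical):

```python
# "acme-support-8b" is a hypothetical fine-tune ID; use the ID reported
# by `cd llm fine-tune deploy` for your job.
resp = client.chat.completions.create(
    model="acme-support-8b",
    messages=[{"role": "user", "content": "Summarise this support ticket."}],
)
print(resp.choices[0].message.content)
```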
Data residency¶
Inputs and outputs never leave Bangladesh. Training data stays in your CD project. Suitable for BFRS / BB ICT 4.0 / regulated industries.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| llm.tokens_per_sec.aggregate | matches load | |
| llm.latency.first_token_ms p95 | < 500 | > 2000 |
| llm.latency.tokens_per_sec (stream) | > 50 | < 20 |
| llm.errors_per_min | < 0.1% | > 1% |
| llm.quota.used_pct | < 80% | > 90% |
| llm.cost_bdt.mtd | within budget | climbing |
Prompt engineering tips for CD-hosted models¶
- System prompt: clear instructions, examples (few-shot)
- Temperature: 0.0-0.3 for deterministic tasks; 0.7+ for creative
- max_tokens: cap to avoid runaway generations
- Streaming: enable for chat UX; reduces perceived latency
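Putting those knobs together, a minimal sketch using the client from API compatibility (deterministic settings shown; raise the temperature for creative tasks):

```python
# Streaming chat completion applying the tips above.
stream = client.chat.completions.create(
    model="llama-3.x-8b-instruct",  # assumed catalogue name
    messages=[
        # Clear system instructions; add few-shot examples here if needed.
        {"role": "system", "content": "Answer in one short paragraph."},
        {"role": "user", "content": "Explain continuous batching."},
    ],
    temperature=0.2,  # deterministic end of the 0.0-0.3 range
    max_tokens=256,   # hard cap against runaway generations
    stream=True,      # streaming reduces perceived latency in chat UX
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```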
Throughput optimization¶
For LLMs, batching matters more than anything else:

- vLLM (CD's serving framework) does continuous batching automatically
- Higher concurrent request count → higher overall throughput (up to the concurrency limit)
- Lower request rate → lower per-request latency but higher per-token cost
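A sketch of issuing requests concurrently so the server-side batcher can fill its batches; keep the fan-out under your tier's concurrent-request limit:

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)

async def ask(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="llama-3.x-8b-instruct",  # assumed catalogue name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

async def main() -> None:
    prompts = [f"Summarise document {i}" for i in range(16)]
    # In-flight requests are batched server-side by vLLM's continuous
    # batching; a fan-out of 16 stays under the llm-prod-small limit of 20.
    results = await asyncio.gather(*(ask(p) for p in prompts))
    print(len(results), "responses")

asyncio.run(main())
```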
Fine-tuning workflow¶
- Prepare training data (JSONL with prompt/completion pairs; an example record follows below)
- Validate the format:

  ```bash
  cd llm fine-tune validate-data --file training.jsonl
  ```

- Estimate cost and duration:

  ```bash
  cd llm fine-tune estimate ...
  ```

- Submit the job
- Monitor:

  ```bash
  cd llm fine-tune status --job <id>
  ```

- Deploy:

  ```bash
  cd llm fine-tune deploy --job <id>
  ```
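An illustrative training record, written as JSONL from Python (the prompt/completion field names are assumed; validate-data is the authoritative schema check):

```python
import json

# Assumed field names based on the prompt/completion convention above;
# confirm with `cd llm fine-tune validate-data`.
record = {
    "prompt": "Customer asks: my card is blocked, what should I do?",
    "completion": "Apologise, verify identity, then raise an unblock request.",
}
with open("training.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```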
Typical training time is 1-12 hours, depending on model size and data volume.
Cost monitoring¶
Pricing is token-based. Review usage daily:

```bash
cd llm usage --since 24h --group-by model,project
```

Common cost drivers:

- Repeated prompts (use caching)
- Long system prompts (shorten them, or bake the instructions into a fine-tune)
- Excessive max_tokens (cap tightly)
- Inefficient retries (a single error triggering multiple retries)
Caching¶
For repeated similar prompts, enable response caching:
```python
resp = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[...],
    extra_headers={"X-CD-Cache": "enable"},
)
```
Cache hit returns cached response (free); miss bills normally.
Quota exceeded¶
```
ERROR 429: rate_limit_exceeded
```

- Tokens/min hit — implement client-side throttling (see the backoff sketch below)
- Tokens/day hit — wait for the next reset (UTC midnight) or upgrade tier
- Concurrent requests hit — reduce parallelism

```bash
cd llm quota show --token <id>
```
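A minimal client-side throttling sketch that retries 429s with exponential backoff (the OpenAI SDK has built-in retries too; this just makes the idea explicit):

```python
import time

import openai

# `client` is the OpenAI-compatible client from "API compatibility".
def complete_with_backoff(prompt: str, max_attempts: int = 5):
    """Retry on 429s with exponential backoff: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(
                model="llama-3.x-8b-instruct",  # assumed catalogue name
                messages=[{"role": "user", "content": prompt}],
            )
        except openai.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"rate limited after {max_attempts} attempts")
```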
High latency¶
llm.latency.first_token_ms p95 > 2 s:
- Model loading (cold start) — keep model warm with periodic ping
- Long input prompt (prefill is bound by input tokens; flagship LLMs handle long inputs, but slowly)
- Inference cluster under-provisioned — request capacity bump
Streaming reduces perceived latency for end users even if total time is unchanged.
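A keep-warm ping can be as simple as a scheduled one-token request; the interval below is an assumption, so match it to the observed cold-start eviction window:

```python
import time

# `client` is the OpenAI-compatible client from "API compatibility".
# A periodic tiny request keeps the model resident on the GPU; the
# 5-minute interval is an assumption, not a documented eviction window.
while True:
    client.chat.completions.create(
        model="llama-3.x-70b-instruct",  # assumed catalogue name
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    time.sleep(300)
```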
Outputs look wrong¶
| Symptom | Likely cause |
|---|---|
| Repetitive / loopy | Temperature too low; raise to 0.7 |
| Random / incoherent | Temperature too high; lower to 0.3 |
| Doesn't follow instructions | System prompt unclear or weak |
| Uses wrong language | Specify in system prompt or use bangla-llm |
Fine-tuning failed¶
```
ERROR: fine-tune job failed: invalid data format
```
- Validate data: each line is valid JSON with required fields
- Check token-length distribution (very long examples cause OOM)
- Verify model compatibility (some bases require specific format)
Re-submit with fixes.
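A quick local pre-check along these lines catches the common failures before resubmitting (the field names and length cutoff are assumptions; validate-data remains authoritative):

```python
import json

REQUIRED = {"prompt", "completion"}  # assumed required fields
MAX_CHARS = 8000  # crude character proxy for token length; tune as needed

with open("training.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, 1):
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            print(f"line {lineno}: not valid JSON")
            continue
        missing = REQUIRED - rec.keys()
        if missing:
            print(f"line {lineno}: missing fields {missing}")
        if sum(len(str(v)) for v in rec.values()) > MAX_CHARS:
            print(f"line {lineno}: very long example; may cause OOM")
```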
Cost spike¶
llm.cost_bdt.mtd climbing fast:

- Audit top consumers: `cd llm usage --top-tokens 10`
- A bug calling the API in a tight loop
- A long system prompt repeated on every request
- max_tokens not set; the model generating up to its maximum

Cap it with a hard quota:

```bash
cd llm quota set --token <id> --max-tokens-day 1000000
```
Model deprecation¶
CD deprecates older models periodically (security patches, license changes):

- 90 days' notice
- A migration guide to the successor
- Both old and new models running during an overlap window
Hallucinations / made-up facts¶
LLMs hallucinate. Mitigations:

- Provide grounding context (RAG with Vector DB); see the sketch below
- Give an explicit "don't make up facts" instruction
- Lower the temperature (less creative)
- Post-process to verify factual claims
No silver bullet; even SOTA models hallucinate.
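A minimal grounding sketch combining the first two mitigations; retrieval is stubbed out here, so plug in the Vector DB service for real RAG:

```python
def answer_grounded(question: str, context_docs: list[str]) -> str:
    # In a real RAG pipeline context_docs would come from a Vector DB
    # similarity search; here they are passed in directly.
    context = "\n\n".join(context_docs)
    resp = client.chat.completions.create(  # client from "API compatibility"
        model="llama-3.x-8b-instruct",  # assumed catalogue name
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer ONLY from the provided context. If the answer is "
                    "not in the context, say you don't know. Do not make up "
                    "facts.\n\nContext:\n" + context
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0.1,  # low temperature: fewer creative inventions
    )
    return resp.choices[0].message.content
```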