
LLM-as-a-Service

Service ownership

Owner: application-services (apps-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Cloud Digit's branded LLM offering on top of Inference Endpoints — sovereign-resident, BDT-billed, OpenAI-API-compatible.

Why a Cloud Digit LLM service exists

Off-shore LLMs (OpenAI, Anthropic direct, Gemini, Cohere) ship customer data outside Bangladesh. For regulated industries (banks, IDRA-supervised insurers, government workloads) this is often a non-starter. Cloud Digit LLMaaS gives you the same OpenAI-compatible API that your existing LangChain / LlamaIndex / Cursor / app code already expects, but the inference happens on-shore.

What's included over raw Inference Endpoints

LLMaaS bundles:

  • OpenAI-compatible API
  • Curated open-weight model catalogue
  • Cloud Digit-hosted dashboard (key mgmt, usage)
  • Per-user / per-team rate limits and quotas
  • Logging dashboard with redaction controls
  • Prompt-injection / output filter
  • Bengali-aware safety classifiers
  • SSO with your IdP (Azure AD, Keycloak)

LLMaaS is the right pick for "give me an enterprise LLM gateway"; raw Inference Endpoints is the right pick for "I'm building a single product and want raw model access."

Models in the catalogue

See Inference Endpoints — Models. LLMaaS exposes the same set, plus optional fine-tuned variants for Bengali specifically (Llama 3.1 8B-bn, Mistral 7B-bn) on request.

Per-team controls

  • Daily and monthly token caps per team
  • Per-model allow-list per team
  • Audit log per request (redacted by default)
  • Right-to-deletion of logged data on request

Pricing

  • Per million input tokens + per million output tokens (BDT, by model)
  • Volume discounts at committed token-volume tiers

See Pricing.

Operate this service

Cloud Digit hosted LLM endpoints — sovereignty-respecting, BDT-billed, no data leaves Bangladesh.

Available models

| Model family | Use case |
|---|---|
| llama-3.x-8b-instruct | General chat, fast inference |
| llama-3.x-70b-instruct | Complex reasoning, slower |
| bangla-llm-7b | Bengali-fluent open model |
| code-llama-13b | Code generation |
| Custom fine-tuned | Customer's fine-tune of any of the above |

All hosted on CD GPU infra, in-country. No data egress to international model providers.

IAM

| Role | Can do |
|---|---|
| llm.viewer | View available models, usage metrics |
| llm.invoker | Call LLM endpoints |
| llm.deployer | Deploy custom fine-tuned models |
| llm.admin | Manage quotas, fine-tuning jobs |

API compatibility

The API is OpenAI-compatible; existing OpenAI SDKs work after a base-URL change:

```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["CD_LLM_TOKEN"],
    base_url="https://llm.cloudigit.bd/v1",
)
```
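
With the client pointed at the CD base URL, requests use the standard chat-completions shape. A minimal sketch, reusing the client defined above and assuming the llama-3.1-8b-instruct catalogue id:

```python
# Minimal chat completion against the CD endpoint (model id is an assumption).
response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Name three uses for an on-shore LLM gateway."},
    ],
)
print(response.choices[0].message.content)
```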

Quota and rate limits

| Tier | Tokens/min | Tokens/day | Concurrent requests |
|---|---|---|---|
| llm-dev | 10k | 100k | 4 |
| llm-prod-small | 100k | 5M | 20 |
| llm-prod-large | 1M | 50M | 100 |
| llm-enterprise | Negotiated | Negotiated | Dedicated |

Custom fine-tuning

Bring your data; CD trains and hosts:

```bash
cd llm fine-tune create \
  --base-model llama-3.1-8b-instruct \
  --training-data s3://acme-llm-data/training.jsonl \
  --validation-data s3://acme-llm-data/val.jsonl \
  --epochs 3
```

The fine-tuned model becomes a new endpoint, accessible in the same way as the base models.

Data residency

Inputs and outputs never leave Bangladesh. Training data stays in your CD project. Suitable for BFRS / BB ICT 4.0 / regulated industries.

Metrics

| Metric | Healthy | Alert |
|---|---|---|
| llm.tokens_per_sec.aggregate | Matches load | |
| llm.latency.first_token_ms | p95 < 500 ms | p95 > 2000 ms |
| llm.latency.tokens_per_sec (stream) | > 50 | < 20 |
| llm.errors_per_min | < 0.1% | > 1% |
| llm.quota.used_pct | < 80% | > 90% |
| llm.cost_bdt.mtd | Within budget | Climbing |

Prompt engineering tips for CD-hosted models

  • System prompt: clear instructions, examples (few-shot)
  • Temperature: 0.0-0.3 for deterministic tasks; 0.7+ for creative
  • max_tokens: cap to avoid runaway generations
  • Streaming: enable for chat UX; reduces perceived latency
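
A short sketch putting these tips together for a deterministic classification task (model id assumed from the catalogue):

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["CD_LLM_TOKEN"], base_url="https://llm.cloudigit.bd/v1")

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed catalogue id
    messages=[
        # Clear system prompt with a one-shot example.
        {
            "role": "system",
            "content": (
                "Classify each support ticket as billing, technical, or other. "
                "Example: 'My invoice is wrong' -> billing. Reply with one word."
            ),
        },
        {"role": "user", "content": "The API has returned 500 errors since this morning."},
    ],
    temperature=0.0,  # deterministic task: keep sampling tight
    max_tokens=5,     # tight cap avoids runaway generation
)
print(response.choices[0].message.content)
```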

Throughput optimization

For LLMs, batching matters more than anything else:

  • vLLM (CD's serving framework) does continuous batching automatically
  • Higher concurrent request count → higher overall throughput (up to the concurrency limit)
  • Lower request rate → lower per-request latency but higher per-token cost
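
A minimal concurrency sketch (model id assumed; keep parallelism under your tier's concurrent-request limit):

```python
import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(api_key=os.environ["CD_LLM_TOKEN"], base_url="https://llm.cloudigit.bd/v1")

async def summarize(text: str) -> str:
    resp = await client.chat.completions.create(
        model="llama-3.1-8b-instruct",  # assumed catalogue id
        messages=[{"role": "user", "content": f"Summarize in one sentence: {text}"}],
        max_tokens=64,
    )
    return resp.choices[0].message.content

async def main(texts: list[str]) -> list[str]:
    # In-flight requests are batched together by vLLM, so N concurrent calls
    # finish far sooner than N sequential ones (up to the concurrency limit).
    return await asyncio.gather(*(summarize(t) for t in texts))

print(asyncio.run(main(["First document text.", "Second document text."])))
```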

Fine-tuning workflow

  1. Prepare training data (JSONL with prompt/completion pairs)
  2. Validate format: cd llm fine-tune validate-data --file training.jsonl
  3. Estimate cost + duration: cd llm fine-tune estimate ...
  4. Submit job
  5. Monitor: cd llm fine-tune status --job <id>
  6. Deploy: cd llm fine-tune deploy --job <id>
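
For step 1, a sketch of preparing the training file; the prompt/completion field names are an assumption taken from the prose, so confirm the exact schema with validate-data before submitting:

```python
import json

# Hypothetical prompt/completion schema; confirm with
# `cd llm fine-tune validate-data` before submitting the job.
examples = [
    {"prompt": "Translate to Bengali: Good morning", "completion": "শুভ সকাল"},
    {"prompt": "Translate to Bengali: Thank you", "completion": "ধন্যবাদ"},
]
with open("training.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```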

Training time: 1-12 hours typical depending on model size and data volume.

Cost monitoring

Token-based pricing. Daily review:

bash cd llm usage --since 24h --group-by model,project

Common cost drivers:

  • Repeated prompts (use caching)
  • Long system prompts (shorten, or use a system-only fine-tune)
  • Excessive max_tokens (cap tightly)
  • Inefficient retries (a single error → multiple retries)

Caching

For repeated similar prompts, enable response caching:

```python
client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[...],
    extra_headers={"X-CD-Cache": "enable"},
)
```

Cache hit returns cached response (free); miss bills normally.

Quota exceeded

ERROR 429: rate_limit_exceeded

  • Tokens/min hit — implement client-side throttling
  • Tokens/day hit — wait for next reset (UTC midnight) or upgrade tier
  • Concurrent requests hit — reduce parallelism

Check current usage against the quota:

```bash
cd llm quota show --token <id>
```
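
For the tokens/min case, a minimal client-side backoff sketch (hypothetical wrapper; model id assumed):

```python
import os
import time

from openai import OpenAI, RateLimitError

client = OpenAI(api_key=os.environ["CD_LLM_TOKEN"], base_url="https://llm.cloudigit.bd/v1")

def complete_with_backoff(messages, retries: int = 5):
    # Hypothetical helper: retries a chat completion with exponential backoff
    # instead of hammering the endpoint after a 429.
    delay = 1.0
    for _ in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.1-8b-instruct",  # assumed catalogue id
                messages=messages,
                max_tokens=256,
            )
        except RateLimitError:
            time.sleep(delay)
            delay *= 2
    raise RuntimeError("rate limit still exceeded after retries")
```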

High latency

When llm.latency.first_token_ms p95 > 2 s, likely causes:

  • Model loading (cold start) — keep model warm with periodic ping
  • Long input prompt (time to first token scales with input length; large models handle long inputs, but slowly)
  • Inference cluster under-provisioned — request capacity bump

Streaming reduces perceived latency for end users even if total time is unchanged.
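
A streaming sketch (model id assumed), printing tokens as they arrive:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["CD_LLM_TOKEN"], base_url="https://llm.cloudigit.bd/v1")

stream = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed catalogue id
    messages=[{"role": "user", "content": "Explain data residency in two sentences."}],
    stream=True,  # tokens arrive as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```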

Outputs look wrong

| Symptom | Likely cause |
|---|---|
| Repetitive / loopy | Temperature too low; raise to 0.7 |
| Random / incoherent | Temperature too high; lower to 0.3 |
| Doesn't follow instructions | System prompt unclear or weak |
| Uses wrong language | Specify the language in the system prompt, or use bangla-llm-7b |

Fine-tuning failed

ERROR: fine-tune job failed: invalid data format

  • Validate data: each line is valid JSON with required fields
  • Check token-length distribution (very long examples cause OOM)
  • Verify model compatibility (some bases require specific format)

Re-submit with fixes.
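
A quick local pre-check in the spirit of the first bullet (the required field names are an assumption; cd llm fine-tune validate-data remains the authoritative check):

```python
import json

REQUIRED_FIELDS = {"prompt", "completion"}  # assumed schema
MAX_CHARS = 8000                            # rough guard against very long examples

with open("training.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            print(f"line {lineno}: invalid JSON ({exc})")
            continue
        if not isinstance(record, dict):
            print(f"line {lineno}: not a JSON object")
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            print(f"line {lineno}: missing fields {sorted(missing)}")
        if len(line) > MAX_CHARS:
            print(f"line {lineno}: very long example ({len(line)} chars), may cause OOM")
```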

Cost spike

llm.cost_bdt.mtd climbing fast: audit top consumers with cd llm usage --top-tokens 10. Common culprits:

  • A bug calling the API in a tight loop
  • A long system prompt repeated on every request
  • max_tokens not set, so the model generates up to its maximum

Cap with hard quota:

```bash
cd llm quota set --token <id> --max-tokens-day 1000000
```

Model deprecation

CD deprecates older models periodically (security patches, license changes):

  • 90 days notice
  • Migration guide to successor
  • Both old and new run during overlap window

Hallucinations / made-up facts

LLMs hallucinate. Mitigations:

  • Provide grounding context (RAG with Vector DB)
  • Explicit "don't make up facts" instruction
  • Lower temperature (less creative)
  • Post-process to verify factual claims
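
A grounding sketch (the context chunks stand in for a real Vector DB lookup; model id assumed):

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["CD_LLM_TOKEN"], base_url="https://llm.cloudigit.bd/v1")

# Stand-in for chunks retrieved from your Vector DB.
context_chunks = [
    "Policy 7.2: refunds are processed within 10 business days.",
    "Policy 7.3: refunds over BDT 50,000 need manager approval.",
]

response = client.chat.completions.create(
    model="llama-3.1-8b-instruct",  # assumed catalogue id
    messages=[
        {
            "role": "system",
            "content": (
                "Answer only from the provided context. "
                "If the answer is not in the context, say you don't know."
            ),
        },
        {
            "role": "user",
            "content": "Context:\n" + "\n".join(context_chunks)
            + "\n\nQuestion: How long do refunds take?",
        },
    ],
    temperature=0.1,  # lower temperature: less creative, fewer fabrications
)
print(response.choices[0].message.content)
```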

No silver bullet; even SOTA models hallucinate.