Load Balancer¶
Service ownership
Owner: network-platform (network-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed L4 (TCP/UDP) and L7 (HTTP/HTTPS) load balancing with health checks, sticky sessions, and integration with Auto Scaling Groups.
What it is¶
Two flavors:
| Flavor | Layer | Protocols | Use case |
|---|---|---|---|
| Network LB | L4 | TCP, UDP, TCP_PROXY | High-throughput, lowest latency, non-HTTP |
| Application LB | L7 | HTTP/1.1, HTTP/2, HTTPS, gRPC | Routing on path/host/headers; TLS termination; WAF integration |
Application LB features¶
- TLS termination — pin a certificate from your store or use Cloud Digit-managed (auto-renewing)
- Path / host routing — route by URL prefix, host header, query, headers, source IP
- Sticky sessions — cookie-based or source-IP-based
- HTTP/2 + gRPC to the backend
- Integration with WAF — attach a WAF policy at the LB
- DDoS protection — every LB is fronted by DDoS Basic; upgrade to Premium per LB
- Access logs to Object Storage or SIEM
Network LB features¶
- Static or pre-warmed — both supported; pre-warm for known traffic spikes
- TCP_PROXY to preserve client IP without HTTP-level termination
- UDP load balancing for game servers, VoIP, custom protocols
- TLS passthrough for SNI-based routing without termination
Health checks¶
| Check | Where it makes sense |
|---|---|
| TCP | Anything, lowest cost |
| HTTP/HTTPS | Web/app servers |
| gRPC | gRPC services |
| Custom (script) | Available on request via support |
Performance¶
| Property | Value |
|---|---|
| Throughput per LB | Up to 100 Gbps |
| Connections per second | 1M (Application LB), 5M (Network LB) |
| Backends per target group | 200 default, 1,000 cap |
Pricing¶
Hourly LB charge + per-LCU (load-balancer-capacity unit, like AWS) + per-GB international egress. Domestic egress over BDIX is free. See Pricing.
Operate this service¶
L4 + L7 load balancers with TLS termination, health checks, and target-group routing.
LB types¶
| Type | Layer | Use |
|---|---|---|
| lb-net | L4 | TCP/UDP passthrough, raw throughput |
| lb-app | L7 | HTTP/HTTPS, path/host-based routing |
| lb-gw | L3/L4 | Inline traffic inspection (paired with WAF) |
Pick lb-app for anything that speaks HTTP. lb-net is for non-HTTP (databases, custom protocols).
IAM¶
| Role | Can do |
|---|---|
| lb.viewer | List LBs, view metrics |
| lb.builder | Create / modify LB, listeners, target groups |
| lb.cert-admin | Upload / manage TLS certificates |
| lb.admin | Above + delete LBs, modify cross-zone policies |
lb.cert-admin is a separate role because cert mishandling has audit implications.
TLS posture¶
- TLS 1.3 default; TLS 1.2 allowed; TLS 1.0/1.1 disabled.
- Certificates from Cloud Digit ACM (free, auto-renewal) or BYO (PEM upload).
- For regulated workloads: enable mTLS with a customer-controlled CA.
```bash
cd lb listener create \
  --lb acme-web-lb \
  --port 443 \
  --protocol https \
  --tls-policy modern-tls-only \
  --cert-arn arn:cd:acm:::cert/abcd
```
Cross-zone load balancing¶
Default: ON. Traffic is distributed across targets in all AZs, not only those in the AZ that received the request.
Turn OFF for:
- Stateful workloads with sticky requirements
- Cost optimization (cross-AZ traffic is metered)
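The effect of this setting on per-target load can be estimated with a little arithmetic. The sketch below is illustrative only: it assumes that, with cross-zone OFF, each AZ receives an equal slice of incoming traffic and splits it among its own targets, which this page does not state explicitly. The function name `per_target_share` is hypothetical, and every AZ is assumed to have at least one target.

```bash
#!/usr/bin/env bash
# Estimate the percentage of total traffic each target receives.
# Usage: per_target_share <on|off> <targets_in_az1> <targets_in_az2> ...
# Assumption: with cross-zone OFF, each AZ gets an equal slice of traffic
# and divides it among its own targets only.
per_target_share() {
  local mode=$1; shift
  local az_count=$# total=0 i=1 n
  for n in "$@"; do total=$((total + n)); done
  for n in "$@"; do
    if [ "$mode" = "on" ]; then
      # Every target in every AZ gets the same share.
      awk -v i="$i" -v t="$total" 'BEGIN { printf "az%d %.2f\n", i, 100 / t }'
    else
      # Each AZ gets 100/az_count percent, split across its n targets.
      awk -v i="$i" -v a="$az_count" -v n="$n" 'BEGIN { printf "az%d %.2f\n", i, 100 / a / n }'
    fi
    i=$((i + 1))
  done
}

per_target_share on 2 8    # each target in either AZ carries 10.00%
per_target_share off 2 8   # az1 targets carry 25.00% each, az2 targets 6.25%
```

With 2 targets in one AZ and 8 in another, turning cross-zone OFF quadruples the load on the smaller AZ's targets relative to the larger one, so balance target counts per AZ before disabling it.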
Target group hygiene¶
- Health check path: /health, returning 200 only when ready
- Healthy threshold: 2 (avoid flap on single bad sample)
- Unhealthy threshold: 3
- Timeout: > app's worst-case response time
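To see how the two thresholds interact, the following sketch replays a sequence of check results. It assumes the common semantics where thresholds count consecutive results and a target starts out healthy; the LB's exact behavior is not specified here, and `target_state` is a hypothetical helper, not part of the CLI.

```bash
#!/usr/bin/env bash
# Replay a sequence of health-check results and print the final target state.
# Usage: target_state <healthy_threshold> <unhealthy_threshold> <result>...
# Each result is "ok" or "fail". Assumption: the target starts out healthy.
target_state() {
  local healthy_threshold=$1 unhealthy_threshold=$2
  shift 2
  local state="healthy" ok_streak=0 fail_streak=0 result
  for result in "$@"; do
    if [ "$result" = "ok" ]; then
      ok_streak=$((ok_streak + 1)); fail_streak=0
      # Enough consecutive passes bring an unhealthy target back.
      if [ "$state" = "unhealthy" ] && [ "$ok_streak" -ge "$healthy_threshold" ]; then
        state="healthy"
      fi
    else
      fail_streak=$((fail_streak + 1)); ok_streak=0
      # Enough consecutive failures take a healthy target out of rotation.
      if [ "$state" = "healthy" ] && [ "$fail_streak" -ge "$unhealthy_threshold" ]; then
        state="unhealthy"
      fi
    fi
  done
  echo "$state"
}

target_state 2 3 ok fail ok ok         # single bad sample: stays healthy
target_state 2 3 fail fail fail        # three failures in a row: unhealthy
target_state 2 3 fail fail fail ok     # one pass is not enough to recover
target_state 2 3 fail fail fail ok ok  # two passes: healthy again
```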
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| lb.request_count_per_target | balanced ± 20% | one target 3× others (placement) |
| lb.target_5xx_per_min | 0 | > 0 |
| lb.target_response_time_p99 | < 500 ms | > 2 s |
| lb.healthy_target_count | matches expected | drops |
| lb.tls_handshake_failures | 0 | > 0 (cert / protocol mismatch) |
| lb.connection_resets | low | climbing |
Certificate rotation¶
ACM certs auto-renew at 60 days remaining. BYO certs require manual rotation:
```bash
cd lb cert upload --cert-pem cert.pem --key-pem key.pem --chain-pem chain.pem
cd lb listener update --lb acme-web-lb --port 443 --cert-arn
# Brief overlap; old cert stays valid until removed
cd lb cert delete --cert-arn
```
Set a calendar reminder 30 days before a BYO cert expires. A lapsed cert means a full outage.
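The 30-day reminder can also be backed by a scripted check. This is a minimal sketch, not an official tool: `days_until_expiry` and `needs_rotation` are hypothetical helpers that work on Unix epoch seconds, and the sample timestamps are made up.

```bash
#!/usr/bin/env bash
# Whole days between now and a certificate's expiry, both as Unix epoch seconds.
days_until_expiry() {
  local expiry_epoch=$1 now_epoch=$2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Succeeds (exit 0) when the cert expires within the reminder window.
# The window defaults to 30 days, matching the reminder interval above.
needs_rotation() {
  local expiry_epoch=$1 now_epoch=$2 window_days=${3:-30}
  [ "$(days_until_expiry "$expiry_epoch" "$now_epoch")" -le "$window_days" ]
}

now=1700000000  # sample timestamp for illustration
needs_rotation $((now + 45 * 86400)) "$now" || echo "45 days left: no action"
needs_rotation $((now + 10 * 86400)) "$now" && echo "10 days left: rotate now"
```

For a PEM file on disk, `openssl x509 -enddate -noout -in cert.pem` prints the expiry date, and `openssl x509 -checkend 2592000 -noout -in cert.pem` exits non-zero when the cert expires within 30 days (2,592,000 seconds).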
Connection draining¶
Before terminating an instance:
```bash
cd lb target deregister --tg
# Existing connections served; new connections rejected; default drain 30s
```
For long-lived connections (WebSockets, gRPC streams), tune the drain timeout up.
Sticky sessions¶
Options:
- Application-controlled: the app sets a cookie and the LB respects it (HTTP only)
- Duration-based: LB-issued cookie, 1h-24h sticky window
- None (default): round-robin
Avoid stickiness unless you have to. It interferes with scaling and recovery.
Cross-region LB¶
A single LB is region-local. For multi-region failover, route at the DNS level with a health-checked DNS record.
Access logging¶
```bash
cd lb access-logs enable --lb acme-web-lb --destination s3://acme-lb-logs/
```
5-minute batched delivery to S3. Combine with a lifecycle rule for retention.
Target showing unhealthy¶
Diagnostic order:
- Is the app actually healthy? SSH in, curl the health endpoint
- Health check path: does the LB use /health (correct) or / (often a 302 redirect, which fails a strict check)?
- Health check port: same as the target port?
- Security group: allows LB CIDR on the health-check port?
- Health check response code: LB expects 200 by default; configure it to allow others if your app returns 204
```bash
cd lb target health --tg
# Returns: status, last-check-time, reason for failure
```
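When curling the health endpoint by hand, it helps to interpret the status code the same way a strict check would. The sketch below is an illustration under stated assumptions: `classify_health_status` and `probe` are hypothetical helpers, and the accepted-code list defaults to 200 as described in the list above.

```bash
#!/usr/bin/env bash
# Explain how a strict health check would treat an HTTP status code.
# Usage: classify_health_status <status_code> ["<accepted codes, space-separated>"]
classify_health_status() {
  local status=$1 accepted=${2:-200} code
  for code in $accepted; do
    if [ "$status" = "$code" ]; then
      echo "pass"
      return 0
    fi
  done
  case "$status" in
    2??) echo "fail: $status is a success code but not in the accepted list ($accepted)" ;;
    3??) echo "fail: $status is a redirect; point the check at the final path" ;;
    000) echo "fail: no HTTP response (connection refused, timeout, or blocked port)" ;;
    *)   echo "fail: $status returned by the app" ;;
  esac
  return 1
}

# Probe a target directly; curl prints 000 when it gets no HTTP response.
probe() {
  curl -s -o /dev/null -m 5 -w '%{http_code}' "$1"
}

for code in 200 302 204; do
  classify_health_status "$code" || true
done
classify_health_status 204 "200 204"
```

From a host that can reach the target, something like `classify_health_status "$(probe http://<target-ip>:<port>/health)"` ties the two together; the address is a placeholder to fill in.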
All targets unhealthy¶
Probably a config drift — health check or SG.
If it happens after a deploy: the new image broke the /health endpoint. Roll back.
If it happens at scale: a downstream dependency (DB, cache) went down — health check rightly fails.
504 Gateway Timeout¶
LB couldn't get a response from target within idle-timeout:
- Increase listener idle-timeout (default 60s; up to 4000s)
- Or fix the slow target (app perf, DB query)
502 Bad Gateway¶
Different from 504. The target accepted the connection then closed unexpectedly:
- App is crashing on request
- App killed by OOM
- TLS handshake fails (LB → target with TLS enabled, target's cert misconfigured)
Use `cd lb access-logs query` to filter the access logs for 502 events.
Uneven traffic distribution¶
lb.request_count_per_target shows one target taking 5× others:
- Sticky sessions — disable temporarily to verify
- Cross-zone LB disabled + uneven targets per AZ
- Slow target: fast targets process more requests in the same window. Check lb.target_response_time per target.
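A quick way to spot the outlier is to compare each target's request count with the mean of the others. The sketch below assumes the per-target counts have already been exported as plain `<target> <count>` lines; that format, the target names, and the `hot_targets` helper are all hypothetical.

```bash
#!/usr/bin/env bash
# Print targets whose request count exceeds <factor> times the mean of the others.
# Input on stdin: one "<target> <request_count>" pair per line.
hot_targets() {
  local factor=${1:-3}
  awk -v factor="$factor" '
    { name[NR] = $1; count[NR] = $2; total += $2 }
    END {
      if (NR < 2) exit   # nothing to compare against
      for (i = 1; i <= NR; i++) {
        others_mean = (total - count[i]) / (NR - 1)
        if (count[i] > factor * others_mean)
          printf "%s %d\n", name[i], count[i]
      }
    }'
}

# Sample data: web-3 carries roughly 5x the load of its peers.
printf '%s\n' "web-1 1000" "web-2 1100" "web-3 5200" "web-4 950" | hot_targets 3
```

A factor of 3 mirrors the alert threshold in the metrics table; lower it to catch milder skew.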
TLS handshake failures¶
WARN: lb.tls_handshake_failures > 0
Causes:
- Client speaks TLS 1.0/1.1 (disabled by default)
- Cipher mismatch (rare with modern-tls-only)
- SNI mismatch (client sends wrong server name)
- Expired cert (verify with `cd lb listener show`)
For more detail, enable lb.detailed_logging and inspect the per-request handshake errors.
ACM cert renewal failed¶
ACM auto-renews certificates at 60 days remaining. Renewal fails when:
- Domain ownership changed (DNS validation broke)
- Cert is on a domain no longer pointed at the LB
Logs: console LB → Certificates → Renewal history. Re-validate domain ownership and retry.