
Auto Scaling Groups

Service ownership

Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Policy- or schedule-driven scaling of VM fleets, integrated with Load Balancer and health checks.

What it is

An Auto Scaling Group (ASG) defines:

  • A launch template — flavor, image, VPC, security groups, user-data
  • A capacity envelope — min, max, desired
  • Scaling policies — metric-based, schedule-based, or manual

The ASG keeps the running fleet inside the envelope, replacing unhealthy instances and scaling on signal.

When to use it

  • Web tier behind a public LB with diurnal traffic
  • API workers responding to queue depth
  • CI runners that scale from zero on PR open
  • Game / event-day surge capacity

Scaling policies

| Trigger source | Example |
| --- | --- |
| CPU utilisation | "Add 1 instance when CPU > 70% for 5 min" |
| Memory utilisation | "Add 1 instance when used RAM > 80% for 5 min" |
| Custom metric (CloudWatch-equivalent) | "Scale on RabbitMQ queue depth" |
| LB request count per target | Scale on LB-reported per-target throughput |
| Schedule (cron) | "Scale to 10 every weekday 08:00; back to 2 at 22:00" |
| Manual API call | Set desired directly |
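The metric-based rows boil down to a sustained-breach check: the signal must stay above the threshold for the whole evaluation window before the action fires. A minimal sketch (illustrative Python, not the platform's actual policy engine; one sample per minute is an assumption):

```python
from collections import deque

def breach_sustained(samples, threshold, window):
    """True when the last `window` samples ALL exceed `threshold`."""
    recent = list(samples)[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

# One CPU sample per minute; "CPU > 70% for 5 min" => threshold=70, window=5.
cpu = deque(maxlen=10)
for sample in [65, 72, 74, 71, 73, 75]:
    cpu.append(sample)

if breach_sustained(cpu, threshold=70, window=5):
    desired_delta = 1   # the policy's "add 1 instance" action
```

A single sample back under the threshold resets the window, which is why short spikes don't trigger scale-out.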

Health checks

ASG honours the LB target-group health check. Unhealthy → terminate → replace. Cool-down windows prevent thrash.

Lifecycle hooks

Standard hooks for pending, terminating, and terminating-wait. Useful for graceful drain (e.g., kubectl cordon a node before terminate, finish in-flight uploads, deregister from service mesh).

Integration

| Pairs with | What for |
| --- | --- |
| Load Balancer | Front the fleet |
| Snapshots & Custom Images | The launch template uses a custom image |
| Block Storage | Boot/data disks per instance |
| Managed Kubernetes | Node pools that scale on K8s metrics |

Pricing

ASG itself is free — you pay for the running instances at standard VM rates (per-second, 60-second minimum). See Virtual Machines and Pricing.

Limits

  • Up to 200 ASGs per project
  • Up to 500 instances per ASG
  • All quotas bumpable

Operate this service

Designing an ASG that scales on the right signal, not against itself.

Launch template hygiene

The launch template is the stamp that's pressed onto every new instance. Treat it as code:

  • Reference a pinned custom image ID, not a moving "latest" tag. Newer instances landing on a half-tested image during a scale-out is a classic 03:00 incident.
  • Bake the app in. Boot-time package installs add 60–180 s of warm-up. The whole point of an ASG is fast scaling.
  • Embed only the secrets-pull script in user-data — fetch real secrets from Secrets Management at boot, don't bake.
  • Tag every instance with asg-name, env, cost-center for downstream rollup.

Capacity envelope

| Knob | Wrong guess | Better default |
| --- | --- | --- |
| min | 1 (single point of failure) | ≥ 2 across AZs |
| max | 10× desired (silent runaway risk) | 3× normal peak |
| desired | 2 (set and forget) | Whatever the policy says today |

A max set too high means a single bad metric or runaway policy bills you 5× normal before anyone notices. Set the max just above your historical worst day.
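To pick a sane max, it helps to put the exposure in numbers (a sketch; the 12 BDT/instance-hour rate and the other figures are assumptions for illustration, not clouddigit pricing):

```python
def runaway_exposure_bdt(max_cap, normal_desired, hourly_rate_bdt, hours_unnoticed):
    """Worst-case extra spend if a runaway policy pins the fleet at max."""
    return (max_cap - normal_desired) * hourly_rate_bdt * hours_unnoticed

# max=50 vs a normal fleet of 5, at an assumed 12 BDT/instance-hour,
# unnoticed overnight for 8 hours:
runaway_exposure_bdt(max_cap=50, normal_desired=5, hourly_rate_bdt=12,
                     hours_unnoticed=8)  # -> 4320 BDT
```

Run the same arithmetic with your own flavor rate and worst-day headcount before settling on a max.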

Policy design

Pick a scaling metric that the workload causes, not one it experiences:

| Workload | Good metric | Bad metric |
| --- | --- | --- |
| Stateless web tier | LB request count per target | Instance CPU (lags the LB; oscillates) |
| Queue worker | Queue depth / lag | Instance CPU |
| Build farm | CI queue depth | Instance CPU |
| Real-time encoding | Active session count | Instance CPU |

CPU is fine as a second signal, but anchoring on CPU alone creates oscillation: more CPU → scale up → more capacity → less CPU per instance → scale down → repeat. Use request-rate or queue-depth as the primary signal.
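Anchoring on request rate is effectively target tracking: size the fleet so each instance sees a fixed share of the load. A sketch (illustrative; the 500 req/s-per-instance target is an assumption):

```python
import math

def desired_from_request_rate(total_rps, target_rps_per_instance, minimum, maximum):
    """Target tracking: size the fleet so each instance sees ~target_rps_per_instance."""
    want = math.ceil(total_rps / target_rps_per_instance)
    return max(minimum, min(maximum, want))

# 4200 req/s total at a 500 req/s-per-instance target:
desired_from_request_rate(total_rps=4200, target_rps_per_instance=500,
                          minimum=2, maximum=20)  # -> 9
```

Because the input (offered load) doesn't shrink when you add capacity, the computed desired is stable, unlike CPU, which drops as soon as the fleet grows.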

Cool-downs

After a scale event, the ASG ignores further signal for the cool-down window:

| Action | Default cool-down | When to override |
| --- | --- | --- |
| Scale-out (add) | 300 s | Lower (60 s) for spiky workloads |
| Scale-in (remove) | 600 s | Raise (1800 s) for warm-up-sensitive workloads |

Aggressive scale-in is the enemy. The cost of an extra instance for 10 minutes is ~30 BDT; the cost of evicting a warm instance you'll need 30 seconds later is much higher.
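The gating logic itself is simple: after a scale event, further actions of that kind are ignored until the per-action cool-down elapses. A sketch using the default windows above (illustrative, not the platform's implementation):

```python
def allow_scale(now, last_event_at, action, cooldowns=None):
    """Gate a scale action on its per-action cool-down (all times in seconds)."""
    cooldowns = cooldowns or {"out": 300, "in": 600}
    if last_event_at is None:
        return True
    return now - last_event_at >= cooldowns[action]

# Last event 200 s ago: scale-out (300 s cool-down) is still blocked.
allow_scale(now=1000, last_event_at=800, action="out")  # -> False
```

The asymmetry (longer scale-in than scale-out cool-down) encodes the guidance above: be quick to add, slow to remove.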

Cost guardrails

  • Per-ASG monthly budget alert — set in console Cost Explorer → Budgets; alert at 80% and 110%.
  • Instance-hours-per-day cap — set in the ASG policy; the ASG refuses to scale beyond the cap, enforced as a 24 h rolling average.
  • Spot-mix policy (where supported) — allow up to N% of capacity from spot/preemptible flavors for cost savings; reserve baseline as on-demand.

IAM

| Role | Can do |
| --- | --- |
| asg.viewer | List, see policies and history |
| asg.operator | Manual scale (set desired), suspend policies |
| asg.builder | Create/modify ASGs, launch templates, policies |
| asg.admin | Above + delete ASGs, modify cost guardrails |

Bind asg.operator to oncall — they may need to force-scale during incidents. Keep asg.builder to platform engineers.

Running an ASG day-to-day: watching for thrash, draining gracefully, and rolling out new launch templates.

Key metrics

| Metric | Healthy | Alert |
| --- | --- | --- |
| asg.desired_capacity | Matches expected curve | Pinned at max for >30 min |
| asg.in_service_instances | Equals desired | Lags by >2 for >5 min |
| asg.unhealthy_instances | 0 | ≥ 1 for >10 min |
| asg.scale_events_24h | < 20 (typical) | > 50 (thrash) |
| asg.lifecycle_hook_timeouts_24h | 0 | > 0 |

asg.scale_events_24h > 50 means the policy is fighting itself. Tune cool-downs or change the primary metric (see Policy design).

Lifecycle hooks

Hooks pause an instance at a known state so external systems can react. Use them for:

  • Pending → InService: wait for app to pass a deep health check before exposing to LB.
  • Terminating-wait: drain (kubectl cordon, deregister from service mesh, finish uploads) before the instance dies.

```bash
cd compute asg hook create \
  --asg web-prod \
  --name graceful-drain \
  --transition terminating \
  --timeout 300 \
  --default-result CONTINUE
```

If a hook times out, the instance is either terminated anyway under default-result=CONTINUE (safer for the ASG) or left stuck in the wait state under ABANDON (safer for in-flight work). Pick per workload.

Rolling deployments

Three patterns:

  1. Instance refresh (preferred) — ASG replaces instances one batch at a time, honouring min-healthy=N%:

     ```bash
     cd compute asg refresh start --asg web-prod \
       --launch-template-version 47 \
       --min-healthy-percent 90
     ```

  2. Blue/green via LB target groups — bring up a parallel ASG with the new template, shift LB weight, retire the old.
  3. Suspend + scale-out — suspend the scale-in policy, scale out to 2× desired with the new template, then scale in the old instances.

Instance refresh is the default for non-critical workloads; blue/green for everything else.

Scheduled scaling for known patterns

Bangladesh's traffic has reliable daily and weekly patterns. Use scheduled actions to anticipate rather than react:

```bash
# Pre-warm for the morning peak
cd compute asg schedule create --asg web-prod \
  --name morning-peak \
  --cron "0 7 * * 1-5" --tz Asia/Dhaka \
  --desired 12

# Scale down for the night
cd compute asg schedule create --asg web-prod \
  --name nightly-quiet \
  --cron "0 23 * * *" --tz Asia/Dhaka \
  --desired 3
```

Scheduled actions and metric-based policies coexist — the schedule sets a floor, the metrics can scale above it.
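The floor semantics can be sketched as taking the larger of the scheduled and metric-driven targets, clamped to the envelope (illustrative Python; the function and parameter names are assumptions, not platform API):

```python
def effective_desired(metric_desired, scheduled_floor, minimum, maximum):
    """Scheduled actions set a floor; metric policies may scale above it,
    and the result always stays inside the min/max envelope."""
    return max(minimum, min(maximum, max(metric_desired, scheduled_floor)))

# 07:00 schedule sets a floor of 12 while the metrics only ask for 8:
effective_desired(metric_desired=8, scheduled_floor=12, minimum=2, maximum=30)  # -> 12
```

If traffic exceeds the forecast, the metric target wins (e.g. a metric-driven 15 beats the floor of 12), so pre-warming never caps reactive scaling.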

Suspending policies (incident mode)

During an incident, you may want to freeze the ASG to prevent thrash:

```bash
cd compute asg policy suspend --asg web-prod --all

# Investigate without auto-scaling adding/removing instances

cd compute asg policy resume --asg web-prod --all
```

Always resume before closing the incident. A common late-night oversight: suspending during a 3 AM incident and forgetting at 8 AM peak.

Spot/preemptible interruptions

If your ASG uses spot/preemptible flavors, expect a 2-minute interruption notice delivered as a metadata event. Handle in the instance:

```bash
# Inside the guest — watch for the notice
curl -s http://169.254.169.254/latest/meta-data/spot/interruption-notice 2>/dev/null \
  && /usr/local/bin/drain.sh
```

The terminating-wait hook will fire when the ASG receives the notice — use it for cluster-level drain.

Health checks: tune for honesty

ASG honours the LB target-group health check. Tune it to:

  • Healthy threshold 2 — survive a single bad sample
  • Unhealthy threshold 3 — don't terminate on a single hiccup
  • Path — a real /health endpoint, not / (which 200s on a half-broken app)

Health checks too strict → flapping → terminate → replace → thrash. Health checks too loose → bad instances stay in rotation → user-visible errors.
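The thresholds above translate directly into reaction time (illustrative arithmetic; the 10 s check interval is an assumption, not a stated platform default):

```python
def seconds_to_flip(interval_s, threshold):
    """Consecutive check results needed before the LB flips an instance's health state."""
    return interval_s * threshold

# With a 10 s check interval: unhealthy threshold 3 means ~30 s of
# sustained failures before the ASG terminates and replaces the instance,
# while healthy threshold 2 readmits a recovered instance after ~20 s.
seconds_to_flip(interval_s=10, threshold=3)  # -> 30
```

When tuning, check that interval × unhealthy-threshold comfortably exceeds your app's worst-case GC pause or deploy blip.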

ASG won't scale out

asg.desired_capacity should rise but doesn't:

| Cause | Check | Fix |
| --- | --- | --- |
| Already at max | `cd compute asg show <name>` | Raise max (mind the cost guardrail) |
| Policy in cool-down | Console → ASG → Recent events | Wait, or lower cool-down |
| Policy suspended | Console → ASG → Policies | Resume policy |
| Custom-metric pipeline broken | Cost Explorer → metric still emitting? | Fix metric source |
| Project vCPU/RAM quota hit | Project → Quotas | Request quota bump |

`cd compute asg diagnose --asg <name>` returns the most-recent scaling-decision reason.

Thrash (scale up, scale down, repeat)

asg.scale_events_24h > 50 is the signal. Causes:

  1. CPU as primary metric on a workload where CPU drops faster than instances drain → ASG sees "we have too many" and scales in → CPU spikes on remaining → ASG scales out. Switch primary metric to LB-request-count or queue-depth.
  2. Cool-down too short for warm-up time. Raise scale-in cool-down to 2× your app's warm-up time.
  3. Health check too strict — instances flap healthy/unhealthy. See tune health checks.

Instances pinned at max

If desired = max for hours: either your max is too low for current load (real demand), or a policy is stuck firing scale-out and never satisfied. Check console ASG → Recent scaling activities for a scale-out that's been retrying.

Instance refresh stuck

```
INFO: Refresh paused: min-healthy threshold not met (current 85%, threshold 90%)
```

The refresh waits for the fleet to recover before continuing. Causes:

  • New launch template's image is broken — instances fail health check
  • LB target group registration delayed
  • A lifecycle hook is timing out

```bash
cd compute asg refresh status --asg web-prod
# Shows per-instance refresh state and the wait reason

cd compute asg refresh cancel --asg web-prod   # if stuck
```

Cancel doesn't roll back — instances already refreshed stay on the new template.

Lifecycle hook timeout

```
WARN: Lifecycle hook 'graceful-drain' timed out after 300 s
```

The external system handling the hook didn't call back in time. Two patterns:

  1. The drain script crashed — check the instance's cd-agent log
  2. The drain genuinely takes too long — raise the hook timeout, or kick off drain earlier (e.g., on pending for graceful warm-up, on terminating-wait for shutdown)

The instance is terminated regardless (with default-result=CONTINUE); ongoing work is lost. Investigate and fix before the next scale-in event.

New launch template versions not applied

Existing instances aren't auto-replaced when you change the launch template — that's by design. To apply:

  • Run an instance refresh (preferred), or
  • Manually terminate instances and let the ASG re-launch them

Don't expect "save the template and walk away" semantics.

Spot interruption surprises

Symptoms: requests start failing with 502s after a recent capacity drop you didn't trigger.

  • Check asg.spot_interruptions_24h — if non-zero, the platform reclaimed instances
  • Ensure the terminating-wait hook actually drains; check hook timeout metrics
  • If interruptions are too frequent (>3/day), reduce the spot mix percentage

Spot is for cost-tolerant capacity, not the critical path.

ASG marked unhealthy but instance looks fine

The instance passes its OS-level checks but the LB target-group reports unhealthy. Common causes:

| Symptom | Check |
| --- | --- |
| Health check on / returns 302 redirect | LB expects 200; configure to allow 200,302 |
| Health check timeout < app response time | Raise health-check timeout |
| Health check from LB blocked by security group | Allow LB CIDR on health-check port |
| App listening on wrong port | App config vs LB target port |

Sudden BDT bill spike

asg.scale_events_24h and the AZ-level capacity gauges are the diagnostic surface:

  1. Did desired hit max for hours? Real demand, expected cost.
  2. Did the ASG thrash? Operational bug — fix the policy.
  3. Did a new launch template change the flavor to something larger? Audit the change.

Use the Cost Explorer → ASG breakdown to see BDT-by-ASG-by-day.