
Auto Scaling Groups

Service ownership

Owner: compute-platform (compute-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Policy- or schedule-driven scaling of VM fleets, integrated with Load Balancer and health checks.

What it is

An Auto Scaling Group (ASG) defines:

  • A launch template — flavor, image, VPC, security groups, user-data
  • A capacity envelope — min, max, desired
  • Scaling policies — metric-based, schedule-based, or manual

The ASG keeps the running fleet inside the envelope, replacing unhealthy instances and scaling on signal.

When to use it

  • Web tier behind a public LB with diurnal traffic
  • API workers responding to queue depth
  • CI runners that scale from zero on PR open
  • Game / event-day surge capacity

Scaling policies

| Trigger source | Example |
| --- | --- |
| CPU utilisation | "Add 1 instance when CPU > 70% for 5 min" |
| Memory utilisation | "Add 1 instance when used RAM > 80% for 5 min" |
| Custom metric (CloudWatch-equivalent) | "Scale on RabbitMQ queue depth" |
| LB request count per target | Scale on LB-reported per-target throughput |
| Schedule (cron) | "Scale to 10 every weekday 08:00; back to 2 at 22:00" |
| Manual API call | Set desired directly |
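The metric-based rows boil down to a sustained-breach check: the signal must stay above the threshold for the whole evaluation window before the action fires. A minimal sketch (illustrative Python, not the platform's actual policy engine; one sample per minute is an assumption):

```python
from collections import deque

def breach_sustained(samples, threshold, window):
    """True when the last `window` samples ALL exceed `threshold`."""
    recent = list(samples)[-window:]
    return len(recent) == window and all(s > threshold for s in recent)

# One CPU sample per minute; "CPU > 70% for 5 min" => threshold=70, window=5.
cpu = deque(maxlen=10)
for sample in [65, 72, 74, 71, 73, 75]:
    cpu.append(sample)

if breach_sustained(cpu, threshold=70, window=5):
    desired_delta = 1   # the policy's "add 1 instance" action
```

A single sample back under the threshold resets the window, which is why short spikes don't trigger scale-out.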

Health checks

ASG honours the LB target-group health check. Unhealthy → terminate → replace. Cool-down windows prevent thrash.

Lifecycle hooks

Standard hooks for pending, terminating, and terminating-wait. Useful for graceful drain (e.g., kubectl cordon a node before terminate, finish in-flight uploads, deregister from service mesh).

Integration

| Pairs with | What for |
| --- | --- |
| Load Balancer | Front the fleet |
| Snapshots & Custom Images | The launch template uses a custom image |
| Block Storage | Boot/data disks per instance |
| Managed Kubernetes | Node pools that scale on K8s metrics |

Pricing

ASG itself is free — you pay for the running instances at standard VM rates (per-second, 60-second minimum). See Virtual Machines and Pricing.

Limits

  • Up to 200 ASGs per project
  • Up to 500 instances per ASG
  • All quotas bumpable

Operate this service

Designing an ASG that scales on the right signal, not against itself.

Launch template hygiene

The launch template is the stamp that's pressed onto every new instance. Treat it as code:

  • Reference a pinned custom image ID, not a moving "latest" tag. Newer instances landing on a half-tested image during a scale-out is a classic 03:00 incident.
  • Bake the app in. Boot-time package installs add 60–180 s of warm-up. The whole point of an ASG is fast scaling.
  • Embed only the secrets-pull script in user-data — fetch real secrets from Secrets Management at boot, don't bake.
  • Tag every instance with asg-name, env, cost-center for downstream rollup.

Capacity envelope

| Knob | Wrong guess | Better default |
| --- | --- | --- |
| min | 1 (single point of failure) | ≥ 2 across AZs |
| max | 10× desired (silent runaway risk) | 3× normal peak |
| desired | 2 (set and forget) | Whatever the policy says today |

A max set too high means a single bad metric or runaway policy bills you 5× normal before anyone notices. Set the max just above your historical worst day.
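To pick a sane max, it helps to put the exposure in numbers (a sketch; the 12 BDT/instance-hour rate and the other figures are assumptions for illustration, not clouddigit pricing):

```python
def runaway_exposure_bdt(max_cap, normal_desired, hourly_rate_bdt, hours_unnoticed):
    """Worst-case extra spend if a runaway policy pins the fleet at max."""
    return (max_cap - normal_desired) * hourly_rate_bdt * hours_unnoticed

# max=50 vs a normal fleet of 5, at an assumed 12 BDT/instance-hour,
# unnoticed overnight for 8 hours:
runaway_exposure_bdt(max_cap=50, normal_desired=5, hourly_rate_bdt=12,
                     hours_unnoticed=8)  # -> 4320 BDT
```

Run the same arithmetic with your own flavor rate and worst-day headcount before settling on a max.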

Policy design

Pick a scaling metric that the workload causes, not one it experiences:

| Workload | Good metric | Bad metric |
| --- | --- | --- |
| Stateless web tier | LB request count per target | Instance CPU (lags the LB; oscillates) |
| Queue worker | Queue depth / lag | Instance CPU |
| Build farm | CI queue depth | Instance CPU |
| Real-time encoding | Active session count | Instance CPU |

CPU is fine as a second signal, but anchoring on CPU alone creates oscillation: more CPU → scale up → more capacity → less CPU per instance → scale down → repeat. Use request-rate or queue-depth as the primary signal.
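Anchoring on request rate is effectively target tracking: size the fleet so each instance sees a fixed share of the load. A sketch (illustrative; the 500 req/s-per-instance target is an assumption):

```python
import math

def desired_from_request_rate(total_rps, target_rps_per_instance, minimum, maximum):
    """Target tracking: size the fleet so each instance sees ~target_rps_per_instance."""
    want = math.ceil(total_rps / target_rps_per_instance)
    return max(minimum, min(maximum, want))

# 4200 req/s total at a 500 req/s-per-instance target:
desired_from_request_rate(total_rps=4200, target_rps_per_instance=500,
                          minimum=2, maximum=20)  # -> 9
```

Because the input (offered load) doesn't shrink when you add capacity, the computed desired is stable, unlike CPU, which drops as soon as the fleet grows.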

Cool-downs

After a scale event, the ASG ignores further signal for the cool-down window:

| Action | Default cool-down | When to override |
| --- | --- | --- |
| Scale-out (add) | 300 s | Lower (60 s) for spiky workloads |
| Scale-in (remove) | 600 s | Raise (1800 s) for warm-up-sensitive workloads |

Aggressive scale-in is the enemy. The cost of an extra instance for 10 minutes is ~30 BDT; the cost of evicting a warm instance you'll need 30 seconds later is much higher.
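The gating logic itself is simple: after a scale event, further actions of that kind are ignored until the per-action cool-down elapses. A sketch using the default windows above (illustrative, not the platform's implementation):

```python
def allow_scale(now, last_event_at, action, cooldowns=None):
    """Gate a scale action on its per-action cool-down (all times in seconds)."""
    cooldowns = cooldowns or {"out": 300, "in": 600}
    if last_event_at is None:
        return True
    return now - last_event_at >= cooldowns[action]

# Last event 200 s ago: scale-out (300 s cool-down) is still blocked.
allow_scale(now=1000, last_event_at=800, action="out")  # -> False
```

The asymmetry (longer scale-in than scale-out cool-down) encodes the guidance above: be quick to add, slow to remove.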

Cost guardrails

  • Per-ASG monthly budget alert — set in console Cost Explorer → Budgets; alert at 80% and 110%.
  • Instance-hours-per-day cap — set in the ASG policy; the ASG refuses to scale beyond the cap, enforced as a 24 h rolling average.
  • Spot-mix policy (where supported) — allow up to N% of capacity from spot/preemptible flavors for cost savings; reserve baseline as on-demand.

IAM

| Role | Can do |
| --- | --- |
| asg.viewer | List, see policies and history |
| asg.operator | Manual scale (set desired), suspend policies |
| asg.builder | Create/modify ASGs, launch templates, policies |
| asg.admin | Above + delete ASGs, modify cost guardrails |

Bind asg.operator to oncall — they may need to force-scale during incidents. Keep asg.builder to platform engineers.

Running an ASG day-to-day: watching for thrash, draining gracefully, and rolling out new launch templates.

Key metrics

| Metric | Healthy | Alert |
| --- | --- | --- |
| asg.desired_capacity | Matches expected curve | Pinned at max for >30 min |
| asg.in_service_instances | Equals desired | Lags by >2 for >5 min |
| asg.unhealthy_instances | 0 | ≥ 1 for >10 min |
| asg.scale_events_24h | < 20 (typical) | > 50 (thrash) |
| asg.lifecycle_hook_timeouts_24h | 0 | > 0 |

asg.scale_events_24h > 50 means the policy is fighting itself. Tune cool-downs or change the primary metric (see Policy design).

Lifecycle hooks

Hooks pause an instance at a known state so external systems can react. Use them for:

  • Pending → InService: wait for app to pass a deep health check before exposing to LB.
  • Terminating-wait: drain (kubectl cordon, deregister from service mesh, finish uploads) before the instance dies.

```bash
cd compute asg hook create \
  --asg web-prod \
  --name graceful-drain \
  --transition terminating \
  --timeout 300 \
  --default-result CONTINUE
```

If a hook times out, the instance is either terminated anyway under default-result=CONTINUE (safer for the ASG) or left stuck in the wait state under ABANDON (safer for in-flight work). Pick per workload.

Rolling deployments

Three patterns:

  1. Instance refresh (preferred) — ASG replaces instances one batch at a time, honouring min-healthy=N%:

     ```bash
     cd compute asg refresh start --asg web-prod \
       --launch-template-version 47 \
       --min-healthy-percent 90
     ```

  2. Blue/green via LB target groups — bring up a parallel ASG with the new template, shift LB weight, retire the old.
  3. Suspend + scale-out — suspend the scale-in policy, scale out to 2× desired with the new template, then scale in the old instances.

Instance refresh is the default for non-critical workloads; blue/green for everything else.

Scheduled scaling for known patterns

Bangladesh's traffic has reliable daily and weekly patterns. Use scheduled actions to anticipate rather than react:

```bash
# Pre-warm for the morning peak
cd compute asg schedule create --asg web-prod \
  --name morning-peak \
  --cron "0 7 * * 1-5" --tz Asia/Dhaka \
  --desired 12

# Scale down for the night
cd compute asg schedule create --asg web-prod \
  --name nightly-quiet \
  --cron "0 23 * * *" --tz Asia/Dhaka \
  --desired 3
```

Scheduled actions and metric-based policies coexist — the schedule sets a floor, the metrics can scale above it.
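The floor semantics can be sketched as taking the larger of the scheduled and metric-driven targets, clamped to the envelope (illustrative Python; the function and parameter names are assumptions, not platform API):

```python
def effective_desired(metric_desired, scheduled_floor, minimum, maximum):
    """Scheduled actions set a floor; metric policies may scale above it,
    and the result always stays inside the min/max envelope."""
    return max(minimum, min(maximum, max(metric_desired, scheduled_floor)))

# 07:00 schedule sets a floor of 12 while the metrics only ask for 8:
effective_desired(metric_desired=8, scheduled_floor=12, minimum=2, maximum=30)  # -> 12
```

If traffic exceeds the forecast, the metric target wins (e.g. a metric-driven 15 beats the floor of 12), so pre-warming never caps reactive scaling.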

Suspending policies (incident mode)

During an incident, you may want to freeze the ASG to prevent thrash:

```bash
cd compute asg policy suspend --asg web-prod --all

# Investigate without auto-scaling adding/removing instances

cd compute asg policy resume --asg web-prod --all
```

Always resume before closing the incident. A common late-night oversight: suspending during a 3 AM incident and forgetting at 8 AM peak.

Spot/preemptible interruptions

If your ASG uses spot/preemptible flavors, expect a 2-minute interruption notice delivered as a metadata event. Handle in the instance:

```bash
# Inside the guest — watch for the notice
curl -s http://169.254.169.254/latest/meta-data/spot/interruption-notice 2>/dev/null \
  && /usr/local/bin/drain.sh
```

The terminating-wait hook will fire when the ASG receives the notice — use it for cluster-level drain.

Health checks: tune for honesty

ASG honours the LB target-group health check. Tune it to:

  • Healthy threshold 2 — survive a single bad sample
  • Unhealthy threshold 3 — don't terminate on a single hiccup
  • Path — a real /health endpoint, not / (which 200s on a half-broken app)

Health checks too strict → flapping → terminate → replace → thrash. Health checks too loose → bad instances stay in rotation → user-visible errors.
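The thresholds above translate directly into reaction time (illustrative arithmetic; the 10 s check interval is an assumption, not a stated platform default):

```python
def seconds_to_flip(interval_s, threshold):
    """Consecutive check results needed before the LB flips an instance's health state."""
    return interval_s * threshold

# With a 10 s check interval: unhealthy threshold 3 means ~30 s of
# sustained failures before the ASG terminates and replaces the instance,
# while healthy threshold 2 readmits a recovered instance after ~20 s.
seconds_to_flip(interval_s=10, threshold=3)  # -> 30
```

When tuning, check that interval × unhealthy-threshold comfortably exceeds your app's worst-case GC pause or deploy blip.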

ASG won't scale out

asg.desired_capacity should rise but doesn't:

| Cause | Check | Fix |
| --- | --- | --- |
| Already at max | `cd compute asg show <name>` | Raise max (mind the cost guardrail) |
| Policy in cool-down | Console → ASG → Recent events | Wait, or lower cool-down |
| Policy suspended | Console → ASG → Policies | Resume policy |
| Custom-metric pipeline broken | Cost Explorer → metric still emitting? | Fix metric source |
| Project vCPU/RAM quota hit | Project → Quotas | Request quota bump |

`cd compute asg diagnose --asg <name>` returns the most-recent scaling-decision reason.

Thrash (scale up, scale down, repeat)

asg.scale_events_24h > 50 is the signal. Causes:

  1. CPU as primary metric on a workload where CPU drops faster than instances drain → ASG sees "we have too many" and scales in → CPU spikes on remaining → ASG scales out. Switch primary metric to LB-request-count or queue-depth.
  2. Cool-down too short for warm-up time. Raise scale-in cool-down to 2× your app's warm-up time.
  3. Health check too strict — instances flap healthy/unhealthy. See tune health checks.

Instances pinned at max

If desired = max for hours: either your max is too low for current load (real demand), or a policy is stuck firing scale-out and never satisfied. Check console ASG → Recent scaling activities for a scale-out that's been retrying.

Instance refresh stuck

```
INFO: Refresh paused: min-healthy threshold not met (current 85%, threshold 90%)
```

The refresh waits for the fleet to recover before continuing. Causes:

  • New launch template's image is broken — instances fail health check
  • LB target group registration delayed
  • A lifecycle hook is timing out

```bash
cd compute asg refresh status --asg web-prod
# Shows per-instance refresh state and the wait reason

cd compute asg refresh cancel --asg web-prod   # if stuck
```

Cancel doesn't roll back — instances already refreshed stay on the new template.

Lifecycle hook timeout

```
WARN: Lifecycle hook 'graceful-drain' timed out after 300 s
```

The external system handling the hook didn't call back in time. Two patterns:

  1. The drain script crashed — check the instance's cd-agent log
  2. The drain genuinely takes too long — raise the hook timeout, or kick off drain earlier (e.g., on pending for graceful warm-up, on terminating-wait for shutdown)

The instance is terminated regardless (with default-result=CONTINUE); ongoing work is lost. Investigate and fix before the next scale-in event.

New launch template versions not applied

Existing instances aren't auto-replaced when you change the launch template — that's by design. To apply:

  • Run an instance refresh (preferred), or
  • Manually terminate instances and let the ASG re-launch them

Don't expect "save the template and walk away" semantics.

Spot interruption surprises

Symptoms: requests start failing with 502s after a recent capacity drop you didn't trigger.

  • Check asg.spot_interruptions_24h — if non-zero, the platform reclaimed instances
  • Ensure the terminating-wait hook actually drains; check hook timeout metrics
  • If interruptions are too frequent (>3/day), reduce the spot mix percentage

Spot is for cost-tolerant capacity, not the critical path.

ASG marked unhealthy but instance looks fine

The instance passes its OS-level checks but the LB target-group reports unhealthy. Common causes:

| Symptom | Check |
| --- | --- |
| Health check on / returns 302 redirect | LB expects 200; configure to allow 200,302 |
| Health check timeout < app response time | Raise health-check timeout |
| Health check from LB blocked by security group | Allow LB CIDR on health-check port |
| App listening on wrong port | App config vs LB target port |

Sudden BDT bill spike

asg.scale_events_24h and the AZ-level capacity gauges are the diagnostic surface:

  1. Did desired hit max for hours? Real demand, expected cost.
  2. Did the ASG thrash? Operational bug — fix the policy.
  3. Did a new launch template change the flavor to something larger? Audit the change.

Use the Cost Explorer → ASG breakdown to see BDT-by-ASG-by-day.