
Managed OpenSearch

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed OpenSearch 2.x with OpenSearch Dashboards bundled. Search, log analytics, and security analytics in one engine.

What it is

OpenSearch (the Apache 2.0-licensed Elasticsearch fork) provisioned and operated by Cloud Digit. Includes OpenSearch Dashboards for visualization; alerting, anomaly-detection, and security-analytics plugins are available.

Versions

OpenSearch 2.11, 2.13, 2.15.

Topologies

Topology | Use case
Single-node | Dev / small data
3-node HA | Standard production, ≤ 1 TB
Multi-node sharded | Larger production (with dedicated master nodes)
Hot-warm-cold tiers | Time-series logs with tiered retention

Use cases

  • Application search (faceted, full-text)
  • Log analytics — fed by SIEM or directly from app logs
  • Metrics analytics
  • Security analytics (with the SIEM plugin)
  • Vector search (k-NN plugin; or use the dedicated Vector Database)

Pricing

Per-node-hour by flavor + storage. See Pricing.

Operate this service

OpenSearch 2.x clusters for full-text search, log analytics, and metrics.

Sizing

Workload | Master nodes | Data nodes | Per-data-node RAM
Dev / small search | 0 (combined) | 2 | 8 GB
Production search | 3 | 3+ | 16–64 GB
Logs / observability | 3 | 5+ | 32–64 GB
Vector search at scale | 3 | 5+ (GPU-able) | 64+ GB

Run 3 or more dedicated master nodes for stable cluster operations under load. Don't combine master and data roles in production.

IAM

Role | Can do
os.viewer | Read cluster metadata, indices
os.reader | Search assigned indices
os.writer | Index documents
os.index-admin | Create/delete indices, manage mappings
os.cluster-admin | Above + node management, snapshot policy

Fine-grained access control: index patterns + field-level restrictions for sensitive logs.

Index lifecycle (ISM)

Index State Management policies for log workloads:

```yaml
policy:
  states:
    - name: hot
      actions: []
      transitions:
        - state_name: warm
          conditions: {min_index_age: '7d'}
    - name: warm
      actions:
        - force_merge: {max_num_segments: 1}
      transitions:
        - state_name: cold
          conditions: {min_index_age: '30d'}
    - name: cold
      actions:
        - allocation: {require: {box_type: warm}}
      transitions:
        - state_name: delete
          conditions: {min_index_age: '365d'}
    - name: delete
      actions:
        - delete: {}
```

Without ISM, log indices grow unbounded.
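Assuming the cluster exposes the ISM plugin's REST API directly, a policy like the one above can be registered with a PUT to the policies endpoint. The policy name, the JSON file, and the $OS_URL endpoint variable are placeholders, not platform defaults:

```bash
# Register an ISM policy (hypothetical name and file; $OS_URL points at the cluster endpoint).
curl -s -X PUT "$OS_URL/_plugins/_ism/policies/acme-logs-retention" \
  -H 'Content-Type: application/json' \
  -d @acme-logs-ism-policy.json
```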

Sharding strategy

A single shard should be 20–50 GB. For a 1 TB index: ~20–50 shards. Too few = hot shards; too many = overhead.

```bash
cd db os index template put \
  --name acme-logs \
  --pattern 'acme-logs-*' \
  --shards 10 \
  --replicas 1
```
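The 20–50 GB rule reduces to simple arithmetic. A minimal sketch, where the 1 TB size and the 40 GB target (an assumed midpoint of the guidance, not a platform default) are illustrative values:

```bash
# Suggest a primary-shard count from the expected index size.
index_gb=1024          # expected primary data: ~1 TB
target_gb=40           # assumed midpoint of the 20-50 GB/shard guidance
shards=$(( (index_gb + target_gb - 1) / target_gb ))   # ceiling division
echo "suggested primary shards: $shards"
```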

Backup

Daily snapshot to S3 (managed); retention 7 days default. Cross-region replication via cross-cluster snapshot copy.

Metrics

Metric | Healthy | Alert
os.cluster.health | green | yellow / red
os.cluster.relocating_shards | 0–N (during ops) | sustained > 0
os.cluster.unassigned_shards | 0 | > 0
os.jvm.heap_used_pct | < 75% | > 85%
os.indices.search.latency_ms | p99 < 100 ms | > 500 ms
os.indices.indexing.latency_ms | p99 < 50 ms | > 200 ms
os.thread_pool.search.queue | < 100 | > 1000
os.thread_pool.search.rejected_24h | 0 | > 0

Index template management

Templates apply to new indices matching a pattern:

```bash
cd db os index template put --name acme-logs \
  --pattern 'acme-logs-*' \
  --settings '{"number_of_shards": 10, "number_of_replicas": 1}' \
  --mappings @mappings/acme-logs-mapping.json
```

Changing a template doesn't affect existing indices — only new ones. Reindex or wait for ISM rollover.

Reindex for breaking schema changes

```bash
cd db os reindex \
  --source acme-logs-2026-04 \
  --dest acme-logs-2026-04-v2 \
  --query '{"match_all":{}}'
```

Async; track progress via the _tasks API. Test on a copy before touching production indices.
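Assuming direct REST access to the cluster ($OS_URL is a placeholder for its endpoint), in-flight reindex jobs are visible through the _tasks API:

```bash
# List running reindex tasks with per-task progress counters.
curl -s "$OS_URL/_tasks?actions=*reindex&detailed=true&pretty"
```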

Slow logs

```bash
cd db os param set --cluster acme-search-prod \
  --search-slowlog-threshold 1s \
  --indexing-slowlog-threshold 500ms
```

Slow query logs ship to an S3 bucket via the platform.

Cluster restart

Rolling restart for a version upgrade:

```bash
cd db os upgrade --cluster acme-search-prod --target-version 2.13 --strategy rolling
```

Each node drains, restarts, rejoins, recovers shards. Service remains online (with reduced capacity during the rolling window).

Snapshot policy

```bash
cd db os snapshot policy set --cluster acme-search-prod \
  --schedule "0 2 * * *" --tz Asia/Dhaka \
  --retention-days 30 \
  --indices 'acme-*'
```

Cluster red

os.cluster.health = red means at least one primary shard is missing.

```bash
cd db os cluster health --cluster acme-search-prod
cd db os cluster reroute --cluster acme-search-prod --reason "explain unassigned"
```

Common causes:

  • Node down with no replica
  • Disk full preventing shard allocation
  • ISM policy moved a shard to a non-existent node group
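Assuming direct REST access to the cluster ($OS_URL is a placeholder), the cluster allocation explain API reports exactly why a shard cannot be placed:

```bash
# With no request body, explain picks the first unassigned shard it finds.
curl -s -X GET "$OS_URL/_cluster/allocation/explain?pretty"
```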

Heap pressure

os.jvm.heap_used_pct > 85% sustained:

  • Field-data cache filled by aggregations on large keyword fields — review aggregations
  • Lots of small shards (each shard has heap overhead) — reduce shard count
  • Query patterns with deep pagination (from > 10000) — switch to search-after or scroll

```bash
cd db os jvm stats --cluster acme-search-prod
```
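The deep-pagination fix can look like the following sketch, assuming direct REST access: sort on a timestamp plus a unique tiebreaker, then feed the last hit's sort values back in. The index pattern and the event.id tiebreaker field are illustrative, not part of the platform schema:

```bash
# First page: search_after needs a deterministic sort with a unique tiebreaker.
curl -s "$OS_URL/acme-logs-*/_search" -H 'Content-Type: application/json' -d '{
  "size": 100,
  "sort": [{"@timestamp": "desc"}, {"event.id": "asc"}],
  "query": {"match_all": {}}
}'
# Next page: same body plus "search_after": [<last @timestamp>, "<last event.id>"].
```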

Search rejected

os.thread_pool.search.rejected_24h > 0:

  • Search thread pool saturated. Scale data nodes (more cores)
  • Or your query mix has very-expensive queries — find via slow log
  • Or you have too many small shards (each searched in parallel, overhead adds up)
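Per-node queue depth and rejection counts are visible through the _cat thread pool API, assuming direct REST access ($OS_URL is a placeholder):

```bash
# One row per node for the search thread pool.
curl -s "$OS_URL/_cat/thread_pool/search?v&h=node_name,active,queue,rejected"
```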

Indexing lag

os.indices.indexing.latency_ms climbing:

  • Disk I/O constrained (check disk.write_iops)
  • Refresh interval too aggressive (default 1s; raise to 30s for bulk-ingest)
  • Mapping bloat (too many dynamic fields)
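Relaxing the refresh interval is a one-line index settings change, assuming direct REST access (the index name is illustrative; revert after the bulk load):

```bash
# Raise refresh_interval during bulk ingest; set back to "1s" (or null) when done.
curl -s -X PUT "$OS_URL/acme-logs-2026-04/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```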

Mapping explosion

ERROR: Limit of total fields [1000] in index has been exceeded

Dynamic mapping added thousands of fields from a sloppy log structure.

  • Raise index.mapping.total_fields.limit (only if you must)
  • Better: fix the producer to emit a consistent schema, or map free-form payloads with the flat_object field type (OpenSearch's equivalent of Elasticsearch's flattened)
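A minimal mapping sketch using OpenSearch's flat_object catch-all type, assuming direct REST access (the index and field names are illustrative):

```bash
# Store a free-form object as a single flat_object field so dynamic mapping cannot explode.
curl -s -X PUT "$OS_URL/acme-logs-demo" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "payload": {"type": "flat_object"}
    }
  }
}'
```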

Slow search after a bulk delete

OpenSearch doesn't reclaim space immediately. Force-merge:

```bash
cd db os indices force-merge --indices acme-logs-2026-04 --max-num-segments 1
```

Force-merge is expensive — do during low-traffic windows.

Snapshot restore stuck

Restore is async; check cd db os snapshot restore status. Common causes for slow restore:

  • Target cluster has less capacity than the source (shards can't all fit)
  • S3 throughput limited (rare)
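Shard-level restore progress is also visible through the _cat recovery API, assuming direct REST access ($OS_URL is a placeholder):

```bash
# Show only shards currently recovering, with progress percentages.
curl -s "$OS_URL/_cat/recovery?v&active_only=true"
```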