Managed OpenSearch¶
Service ownership
Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11
Managed OpenSearch 2.x with OpenSearch Dashboards bundled. Search, log analytics, and security analytics in one engine.
What it is¶
OpenSearch (the Apache 2.0-licensed Elasticsearch fork) provisioned and operated by Cloud Digit. Includes the Dashboards plugin for visualization; alerting, anomaly-detection, security-analytics plugins are available.
Versions¶
OpenSearch 2.11, 2.13, 2.15.
Topologies¶
| Topology | Use case |
|---|---|
| Single-node | Dev / small data |
| 3-node HA | Standard production, ≤ 1 TB |
| Multi-node sharded | Larger production (with dedicated master nodes) |
| Hot-warm-cold tiers | Time-series logs with tiered retention |
Use cases¶
- Application search (faceted, full-text)
- Log analytics — fed by SIEM or directly from app logs
- Metrics analytics
- Security analytics (with the SIEM plugin)
- Vector search (k-NN plugin; or use the dedicated Vector Database)
Pricing¶
Per-node-hour by flavor + storage. See Pricing.
Related¶
- Managed SIEM — common downstream consumer
- Vector Database
Operate this service¶
OpenSearch 2.x clusters for full-text search, log analytics, and metrics.
Sizing¶
| Workload | Master nodes | Data nodes | Per-data RAM |
|---|---|---|---|
| Dev / small search | 0 (combined) | 2 | 8 GB |
| Production search | 3 | 3+ | 16–64 GB |
| Logs / observability | 3 | 5+ | 32–64 GB |
| Vector search at scale | 3 | 5+ (GPU-able) | 64+ GB |
Use 3+ dedicated master nodes for stable cluster operations under load. Don't combine master and data roles on the same node in production.
IAM¶
| Role | Can do |
|---|---|
| os.viewer | Read cluster metadata, indices |
| os.reader | Search assigned indices |
| os.writer | Index documents |
| os.index-admin | Create/delete indices, manage mappings |
| os.cluster-admin | Above + node management, snapshot policy |
Fine-grained access control: index patterns + field-level restrictions for sensitive logs.
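As a sketch of what that looks like at the security-plugin level (role name, index pattern, field list, endpoint variable, and credentials are all illustrative, not platform defaults):

```bash
# Illustrative only: a reader role scoped to acme-logs-* that hides the raw
# message field via field-level security. $OS_URL and credentials are placeholders.
curl -s -XPUT "$OS_URL/_plugins/_security/api/roles/acme_logs_reader" \
  -u "$OS_USER:$OS_PASS" -H 'Content-Type: application/json' -d '{
    "index_permissions": [{
      "index_patterns": ["acme-logs-*"],
      "allowed_actions": ["read"],
      "fls": ["~message"]
    }]
  }'
```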
Index lifecycle (ISM)¶
Index State Management policies for log workloads:
```yaml
policy:
  states:
    - name: hot
      actions: []
      transitions:
        - state_name: warm
          conditions: {min_index_age: '7d'}
    - name: warm
      actions:
        - force_merge: {max_num_segments: 1}
      transitions:
        - state_name: cold
          conditions: {min_index_age: '30d'}
    - name: cold
      actions:
        - allocation: {require: {box_type: warm}}
      transitions:
        - state_name: delete
          conditions: {min_index_age: '365d'}
    - name: delete
      actions:
        - delete: {}
```
Without ISM, log indices grow unbounded.
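If you manage the policy yourself rather than through the platform, the equivalent call against the ISM plugin's REST API looks roughly like this (policy ID, file path, and endpoint variable are placeholders; the JSON file is the policy above plus an ism_template so new indices pick it up):

```bash
# Register the retention policy under a hypothetical ID. Include an
# "ism_template" with index_patterns ["acme-logs-*"] in the JSON so ISM
# attaches the policy to newly created indices automatically.
curl -s -XPUT "$OS_URL/_plugins/_ism/policies/acme-logs-retention" \
  -H 'Content-Type: application/json' \
  -d @policies/acme-logs-retention.json
```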
Sharding strategy¶
A single shard should be 20–50 GB. For a 1 TB index: ~20–50 shards. Too few = hot shards; too many = overhead.
```bash
cd db os index template put \
  --name acme-logs \
  --pattern 'acme-logs-*' \
  --shards 10 \
  --replicas 1
```
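To sanity-check the result, you can list shard sizes straight from the cluster (endpoint variable assumed; auth flags omitted) and confirm primaries land in the 20–50 GB range:

```bash
# Largest shards first; "store" is the on-disk size per shard copy.
curl -s "$OS_URL/_cat/shards/acme-logs-*?v&h=index,shard,prirep,store&s=store:desc"
```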
Backup¶
Daily snapshot to S3 (managed); retention 7 days default. Cross-region replication via cross-cluster snapshot copy.
Metrics¶
| Metric | Healthy | Alert |
|---|---|---|
| os.cluster.health | green | yellow / red |
| os.cluster.relocating_shards | 0–N (during ops) | sustained > 0 |
| os.cluster.unassigned_shards | 0 | > 0 |
| os.jvm.heap_used_pct | < 75% | > 85% |
| os.indices.search.latency_ms p99 | < 100 ms (search) | > 500 ms |
| os.indices.indexing.latency_ms p99 | < 50 ms | > 200 ms |
| os.thread_pool.search.queue | < 100 | > 1000 |
| os.thread_pool.search.rejected_24h | 0 | > 0 |
Index template management¶
Templates apply to new indices matching a pattern:
```bash
cd db os index template put --name acme-logs \
  --pattern 'acme-logs-*' \
  --settings '{"number_of_shards": 10, "number_of_replicas": 1}' \
  --mappings @mappings/acme-logs-mapping.json
```
Changing a template doesn't affect existing indices — only new ones. Reindex or wait for ISM rollover.
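At the REST level this corresponds to a composable index template; the sketch below shows the shape (endpoint variable and priority are illustrative, and whether the CLI maps one-to-one onto this call is an assumption):

```bash
# Composable template: applies settings (and optionally mappings) to any
# new index whose name matches acme-logs-*.
curl -s -XPUT "$OS_URL/_index_template/acme-logs" \
  -H 'Content-Type: application/json' -d '{
    "index_patterns": ["acme-logs-*"],
    "template": {
      "settings": {"number_of_shards": 10, "number_of_replicas": 1}
    },
    "priority": 100
  }'
```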
Reindex for breaking schema changes¶
```bash
cd db os reindex \
  --source acme-logs-2026-04 \
  --dest acme-logs-2026-04-v2 \
  --query '{"match_all":{}}'
```
Async; track via the _tasks API. Test on a copy before reindexing production indices.
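A minimal way to follow the reindex via the _tasks API (endpoint variable assumed; the task ID placeholder comes from the reindex response, if your client captured one):

```bash
# List running reindex tasks with per-task progress counters.
curl -s "$OS_URL/_tasks?actions=*reindex&detailed=true&pretty"

# Or query a specific task if you have its ID (placeholder below).
# curl -s "$OS_URL/_tasks/<task_id>?pretty"
```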
Slow logs¶
```bash
cd db os param set --cluster acme-search-prod \
  --search-slowlog-threshold 1s \
  --indexing-slowlog-threshold 500ms
```
Slow query logs ship to an S3 bucket via the platform.
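For reference, the same thresholds expressed as raw index settings (a sketch, assuming direct settings access; index pattern and endpoint variable are placeholders):

```bash
# Warn-level slow-log thresholds for search and indexing on matching indices.
curl -s -XPUT "$OS_URL/acme-logs-*/_settings" \
  -H 'Content-Type: application/json' -d '{
    "index.search.slowlog.threshold.query.warn": "1s",
    "index.indexing.slowlog.threshold.index.warn": "500ms"
  }'
```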
Cluster restart¶
Rolling restart for major version upgrade:
```bash
cd db os upgrade --cluster acme-search-prod --target-version 2.13 --strategy rolling
```
Each node drains, restarts, rejoins, recovers shards. Service remains online (with reduced capacity during the rolling window).
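If you want to follow the rolling window yourself, two read-only checks are useful (endpoint variable assumed):

```bash
# Shards currently being recovered onto rejoining nodes.
curl -s "$OS_URL/_cat/recovery?v&active_only=true"

# Block until the cluster is green again (or the timeout expires).
curl -s "$OS_URL/_cluster/health?wait_for_status=green&timeout=120s"
```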
Snapshot policy¶
```bash
cd db os snapshot policy set --cluster acme-search-prod \
  --schedule "0 2 * * *" --tz Asia/Dhaka \
  --retention-days 30 \
  --indices 'acme-*'
```
Cluster red¶
os.cluster.health = red means at least one primary shard is missing.
```bash
cd db os cluster health --cluster acme-search-prod
cd db os cluster reroute --cluster acme-search-prod --reason "explain unassigned"
```
Common causes:
- Node down with no replica
- Disk full preventing shard allocation
- ISM policy moved shard to a non-existent node group
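To get the allocator's own reason for the missing primary, the allocation-explain API (endpoint variable assumed) reports why the first unassigned shard cannot be placed:

```bash
# With no request body, this explains the first unassigned shard it finds.
curl -s "$OS_URL/_cluster/allocation/explain?pretty"
```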
Heap pressure¶
os.jvm.heap_used_pct > 85% sustained:
- Field-data cache filled by aggregations on large keyword fields — review aggregations
- Lots of small shards (each shard has heap overhead) — reduce shard count
- Query patterns with deep pagination (from > 10000) — switch to search_after or scroll (see the sketch at the end of this section)
```bash
cd db os jvm stats --cluster acme-search-prod
```
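For the deep-pagination case above, a sketch of a search_after request (index, field names, sort values, and endpoint variable are illustrative; the search_after values are simply the sort values of the last hit from the previous page):

```bash
# Page through results without a large "from" offset. The second sort key
# acts as a tiebreaker and should be unique per document (hypothetical field).
curl -s -XGET "$OS_URL/acme-logs-2026-04/_search" \
  -H 'Content-Type: application/json' -d '{
    "size": 100,
    "sort": [{"@timestamp": "desc"}, {"request_id": "asc"}],
    "search_after": ["2026-04-30T23:59:59Z", "req-000123"],
    "query": {"match": {"level": "error"}}
  }'
```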
Search rejected¶
os.thread_pool.search.rejected_24h > 0:
- Search thread pool saturated. Scale data nodes (more cores)
- Or your query mix has very-expensive queries — find via slow log
- Or you have too many small shards (each searched in parallel, overhead adds up)
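To see which nodes are rejecting, the thread-pool cat API gives a per-node view (endpoint variable assumed):

```bash
# Non-zero "rejected" counters identify the saturated data nodes.
curl -s "$OS_URL/_cat/thread_pool/search?v&h=node_name,active,queue,rejected"
```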
Indexing lag¶
os.indices.indexing.latency_ms climbing:
- Disk I/O constrained (check disk.write_iops)
- Refresh interval too aggressive (default 1s; raise to 30s for bulk ingest; see the example below)
- Mapping bloat (too many dynamic fields)
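A sketch of relaxing the refresh interval during bulk ingest (index pattern and endpoint variable are placeholders; set it back afterwards):

```bash
# Fewer refreshes means larger segments and lower indexing overhead while bulk-loading.
curl -s -XPUT "$OS_URL/acme-logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```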
Mapping explosion¶
ERROR: Limit of total fields [1000] in index has been exceeded
Dynamic mapping added thousands of fields from a sloppy log structure.
- Raise index.mapping.total_fields.limit (only if you must)
- Better: fix the producer to emit a consistent schema, or use the flat_object field type (OpenSearch's equivalent of Elasticsearch's flattened type) for nested unknowns (see the mapping sketch below)
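A minimal mapping sketch for the second option (index and field names are illustrative): keys inside a flat_object field remain searchable but are not added to the mapping as individual fields, so they cannot push past the field limit.

```bash
# Map the unpredictable sub-object as flat_object instead of letting
# dynamic mapping create one field per key.
curl -s -XPUT "$OS_URL/acme-logs-2026-05" \
  -H 'Content-Type: application/json' -d '{
    "mappings": {
      "properties": {
        "labels": {"type": "flat_object"}
      }
    }
  }'
```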
Slow search after a bulk delete¶
OpenSearch doesn't reclaim space immediately. Force-merge:
```bash
cd db os indices force-merge --indices acme-logs-2026-04 --max-num-segments 1
```
Force-merge is expensive — do during low-traffic windows.
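The REST-level equivalent, if you need it outside the CLI (endpoint variable assumed; the call does not return until the merge finishes):

```bash
# Merge down to a single segment to reclaim space from deleted documents.
curl -s -XPOST "$OS_URL/acme-logs-2026-04/_forcemerge?max_num_segments=1"
```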
Snapshot restore stuck¶
Restore is async; check cd db os snapshot restore status. Common causes for slow restore:
- Target cluster has less capacity than source (shards can't all fit)
- S3 throughput limited (rare)
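Two API-level cross-checks while a restore is running (endpoint variable, repository, and snapshot names are placeholders):

```bash
# Restored shards show up as active recoveries of type "snapshot".
curl -s "$OS_URL/_cat/recovery?v&active_only=true"

# Per-shard progress for the snapshot being restored.
curl -s "$OS_URL/_snapshot/<repository>/<snapshot>/_status?pretty"
```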