
Managed OpenSearch

Service ownership

Owner: data-platform (data-pm@clouddigit.ai) — Status: GA — Last audited: 2026-05-11

Managed OpenSearch 2.x with OpenSearch Dashboards bundled. Search, log analytics, and security analytics in one engine.

What it is

OpenSearch (the Apache 2.0-licensed Elasticsearch fork) provisioned and operated by Cloud Digit. Includes OpenSearch Dashboards for visualization; alerting, anomaly-detection, and security-analytics plugins are available.

Versions

OpenSearch 2.11, 2.13, 2.15.

Topologies

Topology | Use case
Single-node | Dev / small data
3-node HA | Standard production, ≤ 1 TB
Multi-node sharded | Larger production (with dedicated master nodes)
Hot-warm-cold tiers | Time-series logs with tiered retention

Use cases

  • Application search (faceted, full-text)
  • Log analytics — fed by SIEM or directly from app logs
  • Metrics analytics
  • Security analytics (with the SIEM plugin)
  • Vector search (k-NN plugin; or use the dedicated Vector Database)

Pricing

Per-node-hour by flavor + storage. See Pricing.

Operate this service

OpenSearch 2.x clusters for full-text search, log analytics, and metrics.

Sizing

Workload | Master nodes | Data nodes | Per-data-node RAM
Dev / small search | 0 (combined) | 2 | 8 GB
Production search | 3 | 3+ | 16–64 GB
Logs / observability | 3 | 5+ | 32–64 GB
Vector search at scale | 3 | 5+ (GPU-able) | 64+ GB

Run 3 or more dedicated master nodes for stable cluster operations under load. Don't combine master and data roles in production.

IAM

Role | Can do
os.viewer | Read cluster metadata, indices
os.reader | Search assigned indices
os.writer | Index documents
os.index-admin | Create/delete indices, manage mappings
os.cluster-admin | Above + node management, snapshot policy

Fine-grained access control: index patterns + field-level restrictions for sensitive logs.

Index lifecycle (ISM)

Index State Management policies for log workloads:

```yaml
policy:
  states:
    - name: hot
      actions: []
      transitions:
        - state_name: warm
          conditions: {min_index_age: '7d'}
    - name: warm
      actions:
        - force_merge: {max_num_segments: 1}
      transitions:
        - state_name: cold
          conditions: {min_index_age: '30d'}
    - name: cold
      actions:
        - allocation: {require: {box_type: warm}}
      transitions:
        - state_name: delete
          conditions: {min_index_age: '365d'}
    - name: delete
      actions:
        - delete: {}
```

Without ISM, log indices grow unbounded.
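Assuming the cluster exposes the ISM plugin's REST API directly, a policy like the one above can be registered with a PUT to the policies endpoint. The policy name, the JSON file, and the $OS_URL endpoint variable are placeholders, not platform defaults:

```bash
# Register an ISM policy (hypothetical name and file; $OS_URL points at the cluster endpoint).
curl -s -X PUT "$OS_URL/_plugins/_ism/policies/acme-logs-retention" \
  -H 'Content-Type: application/json' \
  -d @acme-logs-ism-policy.json
```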

Sharding strategy

A single shard should be 20–50 GB. For a 1 TB index: ~20–50 shards. Too few = hot shards; too many = overhead.

```bash
cd db os index template put \
  --name acme-logs \
  --pattern 'acme-logs-*' \
  --shards 10 \
  --replicas 1
```
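The 20–50 GB rule reduces to simple arithmetic. A minimal sketch, where the 1 TB size and the 40 GB target (an assumed midpoint of the guidance, not a platform default) are illustrative values:

```bash
# Suggest a primary-shard count from the expected index size.
index_gb=1024          # expected primary data: ~1 TB
target_gb=40           # assumed midpoint of the 20-50 GB/shard guidance
shards=$(( (index_gb + target_gb - 1) / target_gb ))   # ceiling division
echo "suggested primary shards: $shards"
```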

Backup

Daily snapshot to S3 (managed); retention 7 days default. Cross-region replication via cross-cluster snapshot copy.

Metrics

Metric | Healthy | Alert
os.cluster.health | green | yellow / red
os.cluster.relocating_shards | 0–N (during ops) | sustained > 0
os.cluster.unassigned_shards | 0 | > 0
os.jvm.heap_used_pct | < 75% | > 85%
os.indices.search.latency_ms | p99 < 100 ms | > 500 ms
os.indices.indexing.latency_ms | p99 < 50 ms | > 200 ms
os.thread_pool.search.queue | < 100 | > 1000
os.thread_pool.search.rejected_24h | 0 | > 0

Index template management

Templates apply to new indices matching a pattern:

```bash
cd db os index template put --name acme-logs \
  --pattern 'acme-logs-*' \
  --settings '{"number_of_shards": 10, "number_of_replicas": 1}' \
  --mappings @mappings/acme-logs-mapping.json
```

Changing a template doesn't affect existing indices — only new ones. Reindex or wait for ISM rollover.

Reindex for breaking schema changes

```bash
cd db os reindex \
  --source acme-logs-2026-04 \
  --dest acme-logs-2026-04-v2 \
  --query '{"match_all":{}}'
```

Async; track progress via the _tasks API. Test on a copy before touching production indices.
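Assuming direct REST access to the cluster ($OS_URL is a placeholder for its endpoint), in-flight reindex jobs are visible through the _tasks API:

```bash
# List running reindex tasks with per-task progress counters.
curl -s "$OS_URL/_tasks?actions=*reindex&detailed=true&pretty"
```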

Slow logs

```bash
cd db os param set --cluster acme-search-prod \
  --search-slowlog-threshold 1s \
  --indexing-slowlog-threshold 500ms
```

Slow query logs ship to an S3 bucket via the platform.

Cluster restart

Rolling restart for a version upgrade:

```bash
cd db os upgrade --cluster acme-search-prod --target-version 2.13 --strategy rolling
```

Each node drains, restarts, rejoins, recovers shards. Service remains online (with reduced capacity during the rolling window).

Snapshot policy

```bash
cd db os snapshot policy set --cluster acme-search-prod \
  --schedule "0 2 * * *" --tz Asia/Dhaka \
  --retention-days 30 \
  --indices 'acme-*'
```

Cluster red

os.cluster.health = red means at least one primary shard is missing.

```bash
cd db os cluster health --cluster acme-search-prod
cd db os cluster reroute --cluster acme-search-prod --reason "explain unassigned"
```

Common causes:

  • Node down with no replica
  • Disk full preventing shard allocation
  • ISM policy moved a shard to a non-existent node group
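Assuming direct REST access to the cluster ($OS_URL is a placeholder), the cluster allocation explain API reports exactly why a shard cannot be placed:

```bash
# With no request body, explain picks the first unassigned shard it finds.
curl -s -X GET "$OS_URL/_cluster/allocation/explain?pretty"
```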

Heap pressure

os.jvm.heap_used_pct > 85% sustained:

  • Field-data cache filled by aggregations on large keyword fields — review aggregations
  • Lots of small shards (each shard has heap overhead) — reduce shard count
  • Query patterns with deep pagination (from > 10000) — switch to search-after or scroll

```bash
cd db os jvm stats --cluster acme-search-prod
```
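The deep-pagination fix can look like the following sketch, assuming direct REST access: sort on a timestamp plus a unique tiebreaker, then feed the last hit's sort values back in. The index pattern and the event.id tiebreaker field are illustrative, not part of the platform schema:

```bash
# First page: search_after needs a deterministic sort with a unique tiebreaker.
curl -s "$OS_URL/acme-logs-*/_search" -H 'Content-Type: application/json' -d '{
  "size": 100,
  "sort": [{"@timestamp": "desc"}, {"event.id": "asc"}],
  "query": {"match_all": {}}
}'
# Next page: same body plus "search_after": [<last @timestamp>, "<last event.id>"].
```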

Search rejected

os.thread_pool.search.rejected_24h > 0:

  • Search thread pool saturated. Scale data nodes (more cores)
  • Or your query mix has very-expensive queries — find via slow log
  • Or you have too many small shards (each searched in parallel, overhead adds up)
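Per-node queue depth and rejection counts are visible through the _cat thread pool API, assuming direct REST access ($OS_URL is a placeholder):

```bash
# One row per node for the search thread pool.
curl -s "$OS_URL/_cat/thread_pool/search?v&h=node_name,active,queue,rejected"
```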

Indexing lag

os.indices.indexing.latency_ms climbing:

  • Disk I/O constrained (check disk.write_iops)
  • Refresh interval too aggressive (default 1s; raise to 30s for bulk-ingest)
  • Mapping bloat (too many dynamic fields)
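Relaxing the refresh interval is a one-line index settings change, assuming direct REST access (the index name is illustrative; revert after the bulk load):

```bash
# Raise refresh_interval during bulk ingest; set back to "1s" (or null) when done.
curl -s -X PUT "$OS_URL/acme-logs-2026-04/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "30s"}}'
```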

Mapping explosion

ERROR: Limit of total fields [1000] in index has been exceeded

Dynamic mapping added thousands of fields from a sloppy log structure.

  • Raise index.mapping.total_fields.limit (only if you must)
  • Better: fix the producer to emit a consistent schema, or map free-form payloads with the flat_object field type (OpenSearch's equivalent of Elasticsearch's flattened)
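A minimal mapping sketch using OpenSearch's flat_object catch-all type, assuming direct REST access (the index and field names are illustrative):

```bash
# Store a free-form object as a single flat_object field so dynamic mapping cannot explode.
curl -s -X PUT "$OS_URL/acme-logs-demo" -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "payload": {"type": "flat_object"}
    }
  }
}'
```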

Slow search after a bulk delete

OpenSearch doesn't reclaim space immediately. Force-merge:

```bash
cd db os indices force-merge --indices acme-logs-2026-04 --max-num-segments 1
```

Force-merge is expensive — do during low-traffic windows.

Snapshot restore stuck

Restore is async; check cd db os snapshot restore status. Common causes for slow restore:

  • Target cluster has less capacity than the source (shards can't all fit)
  • S3 throughput limited (rare)
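Shard-level restore progress is also visible through the _cat recovery API, assuming direct REST access ($OS_URL is a placeholder):

```bash
# Show only shards currently recovering, with progress percentages.
curl -s "$OS_URL/_cat/recovery?v&active_only=true"
```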