Logstash becomes unavailable under high load #914
-
Migrating this from an email thread. Original message:
-
Here are a few thoughts based on what you've said. In my experience, tuning this involves a lot of trial and error.
-
https://localhost/mapi/logstash/_health_report?pretty
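If you hit that endpoint, the JSON should tell you which indicator is unhappy. A rough sketch of flattening it into something readable — the payload here is made up, and the exact response shape can vary by Logstash version:

```python
def summarize_health(report: dict) -> list[str]:
    """Flatten a health-report-style payload into 'path: status -- symptom' lines."""
    lines = [f"overall: {report.get('status', 'unknown')}"]

    def walk(indicators: dict, prefix: str = "") -> None:
        for name, ind in indicators.items():
            path = f"{prefix}{name}"
            lines.append(f"{path}: {ind.get('status', '?')} -- {ind.get('symptom', '')}")
            walk(ind.get("indicators", {}), prefix=f"{path}/")

    walk(report.get("indicators", {}))
    return lines

# Illustrative payload only -- not a real response from this cluster
sample = {
    "status": "yellow",
    "indicators": {
        "pipelines": {
            "status": "yellow",
            "symptom": "1 pipeline is degraded",
            "indicators": {
                "malcolm-input": {"status": "yellow",
                                  "symptom": "worker utilization is high"}
            },
        }
    },
}

for line in summarize_health(sample):
    print(line)
```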
-
_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles cluster_manager name
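The `_cat` APIs return whitespace-aligned tables, so if you want to watch heap or load across repeated runs you can parse them with something like this — the data row below is invented, only the header matches what was pasted:

```python
def parse_cat(text: str) -> list[dict[str, str]]:
    """Parse whitespace-separated `_cat/...?v` output (header row + data rows)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, row.split())) for row in lines[1:]]

# Invented sample row -- only the header matches the real output
sample = """\
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles  cluster_manager name
192.168.1.10 71           98          4   1.20    1.05    0.98     dim       data,ingest *               opensearch-0
"""

for node in parse_cat(sample):
    if int(node["heap.percent"]) > 85:
        print(f"{node['name']}: heap pressure ({node['heap.percent']}%)")
```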
-
_cat/thread_pool/write?v
node_name name active queue rejected
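Same idea for the write thread pool — nonzero `queue` or `rejected` there would point back at OpenSearch itself. A quick sketch with an invented, healthy-looking row:

```python
def write_pool_pressure(cat_output: str) -> list[str]:
    """Flag nodes whose write thread pool shows queued or rejected tasks."""
    lines = [ln for ln in cat_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    flagged = []
    for row in lines[1:]:
        rec = dict(zip(header, row.split()))
        if int(rec["queue"]) > 0 or int(rec["rejected"]) > 0:
            flagged.append(f"{rec['node_name']}: queue={rec['queue']} rejected={rec['rejected']}")
    return flagged

# Invented sample -- a healthy pool, matching what the thread reports
sample = """\
node_name    name  active queue rejected
opensearch-0 write 2      0     0
"""
print(write_pool_pressure(sample))  # → [] (no backpressure from the write pool)
```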
-
Our OpenSearch cluster uses HDDs on a shared drive of the cluster; iostat reports utilization ranging from 50 to 75% on average each time I run it.
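50-75% on an HDD is already a lot for sustained small writes. If you want to check whether it stays above a threshold across samples rather than eyeballing single iostat runs, something trivial like this works — the readings below are hypothetical:

```python
def sustained_util(samples: list[float], threshold: float = 50.0) -> bool:
    """True if the average %util across iostat samples stays above the threshold."""
    return sum(samples) / len(samples) > threshold

# Hypothetical %util readings in the 50-75 range described above
readings = [52.0, 61.0, 75.0, 58.0]
print(sustained_util(readings))  # average 61.5 -> True
```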
-
pipeline.workers=8
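For reference, the related knobs in logstash.yml look roughly like this — the values are illustrative starting points, not recommendations, and `path.queue` on a separate disk is the one most likely to matter if the shared HDD turns out to be the bottleneck:

```yaml
# Illustrative starting points only -- tune against your own hardware
pipeline.workers: 8          # usually <= number of CPU cores
pipeline.batch.size: 250     # larger batches amortize per-request overhead on slow disks
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 4gb
# path.queue: /fast-disk/logstash-queue   # moving the PQ off the shared HDD can help
```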
-
From the metrics you sent, IMO the slowdown is probably coming from disk writes on the HDD storage. Logstash is reporting that the malcolm-input pipeline workers are 100% blocked, which usually means they’re waiting on something downstream rather than being busy with CPU work. On the OpenSearch side, the cluster stats show very low CPU usage and no queued or rejected write tasks, so it doesn’t look like OpenSearch itself is overloaded. That usually leaves disk I/O as the limiting factor.

Your iostat output shows the main disk sitting around 50-75% utilization, and since both Logstash’s persistent queue and OpenSearch indexing are writing to the same shared HDD storage, they’re competing for the same disk. HDDs tend to struggle with lots of small concurrent writes, especially on shared storage, so even if utilization doesn’t look maxed out it can still introduce enough latency that OpenSearch takes longer to finish writes. When that happens, Logstash has to wait for those indexing requests to complete, which causes the pipeline workers to block like we’re seeing. There is not a ton we can do if this is the issue.
The other thing that seems weird to me is: if you grab … But if I were betting, I'd place it on the HDD being the problem.
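One way to put a number on "workers are blocked": Logstash's `_node/stats/pipelines` output includes `events.duration_in_millis` and `events.queue_push_duration_in_millis`, and when inputs spend most of their time waiting to push events into the queue, the workers aren't draining it. A rough ratio — the numbers below are made up, and the exact interpretation of these fields can vary by version:

```python
def backpressure_ratio(pipeline_stats: dict) -> float:
    """Rough share of time spent waiting to push events into the queue.

    Uses the `events.duration_in_millis` and `events.queue_push_duration_in_millis`
    fields from a `_node/stats/pipelines`-style payload.
    """
    ev = pipeline_stats["events"]
    total = ev["duration_in_millis"] + ev["queue_push_duration_in_millis"]
    return ev["queue_push_duration_in_millis"] / total if total else 0.0

# Made-up numbers: inputs spend most of their time blocked pushing into the queue
sample = {"events": {"in": 10_000, "out": 9_200,
                     "duration_in_millis": 120_000,
                     "queue_push_duration_in_millis": 480_000}}
print(f"{backpressure_ratio(sample):.0%}")  # → 80%
```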