Logstash becomes unavailable under high load #914
-
Migrating this from an email thread. Original message:
-
Here are a few thoughts based on what you've said. In my experience, tuning this involves a lot of trial and error.
-
https://localhost/mapi/logstash/_health_report?pretty
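If you hit that endpoint, the JSON should tell you which indicator is unhappy. A rough sketch of flattening it into something readable — the payload here is made up, and the exact response shape can vary by Logstash version:

```python
def summarize_health(report: dict) -> list[str]:
    """Flatten a health-report-style payload into 'path: status -- symptom' lines."""
    lines = [f"overall: {report.get('status', 'unknown')}"]

    def walk(indicators: dict, prefix: str = "") -> None:
        for name, ind in indicators.items():
            path = f"{prefix}{name}"
            lines.append(f"{path}: {ind.get('status', '?')} -- {ind.get('symptom', '')}")
            walk(ind.get("indicators", {}), prefix=f"{path}/")

    walk(report.get("indicators", {}))
    return lines

# Illustrative payload only -- not a real response from this cluster
sample = {
    "status": "yellow",
    "indicators": {
        "pipelines": {
            "status": "yellow",
            "symptom": "1 pipeline is degraded",
            "indicators": {
                "malcolm-input": {"status": "yellow",
                                  "symptom": "worker utilization is high"}
            },
        }
    },
}

for line in summarize_health(sample):
    print(line)
```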
-
_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles cluster_manager name
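The `_cat` APIs return whitespace-aligned tables, so if you want to watch heap or load across repeated runs you can parse them with something like this — the data row below is invented, only the header matches what was pasted:

```python
def parse_cat(text: str) -> list[dict[str, str]]:
    """Parse whitespace-separated `_cat/...?v` output (header row + data rows)."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, row.split())) for row in lines[1:]]

# Invented sample row -- only the header matches the real output
sample = """\
ip           heap.percent ram.percent cpu load_1m load_5m load_15m node.role node.roles  cluster_manager name
192.168.1.10 71           98          4   1.20    1.05    0.98     dim       data,ingest *               opensearch-0
"""

for node in parse_cat(sample):
    if int(node["heap.percent"]) > 85:
        print(f"{node['name']}: heap pressure ({node['heap.percent']}%)")
```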
-
_cat/thread_pool/write?v
node_name name active queue rejected
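Same idea for the write thread pool — nonzero `queue` or `rejected` there would point back at OpenSearch itself. A quick sketch with an invented, healthy-looking row:

```python
def write_pool_pressure(cat_output: str) -> list[str]:
    """Flag nodes whose write thread pool shows queued or rejected tasks."""
    lines = [ln for ln in cat_output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    flagged = []
    for row in lines[1:]:
        rec = dict(zip(header, row.split()))
        if int(rec["queue"]) > 0 or int(rec["rejected"]) > 0:
            flagged.append(f"{rec['node_name']}: queue={rec['queue']} rejected={rec['rejected']}")
    return flagged

# Invented sample -- a healthy pool, matching what the thread reports
sample = """\
node_name    name  active queue rejected
opensearch-0 write 2      0     0
"""
print(write_pool_pressure(sample))  # → [] (no backpressure from the write pool)
```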
-
Our OpenSearch cluster uses HDDs on a shared drive of the cluster; iostat reports utilization ranging from 50 to 75% on average each time I run it.
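50-75% on an HDD is already a lot for sustained small writes. If you want to check whether it stays above a threshold across samples rather than eyeballing single iostat runs, something trivial like this works — the readings below are hypothetical:

```python
def sustained_util(samples: list[float], threshold: float = 50.0) -> bool:
    """True if the average %util across iostat samples stays above the threshold."""
    return sum(samples) / len(samples) > threshold

# Hypothetical %util readings in the 50-75 range described above
readings = [52.0, 61.0, 75.0, 58.0]
print(sustained_util(readings))  # average 61.5 -> True
```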
-
pipeline.workers=8
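For reference, the related knobs in logstash.yml look roughly like this — the values are illustrative starting points, not recommendations, and `path.queue` on a separate disk is the one most likely to matter if the shared HDD turns out to be the bottleneck:

```yaml
# Illustrative starting points only -- tune against your own hardware
pipeline.workers: 8          # usually <= number of CPU cores
pipeline.batch.size: 250     # larger batches amortize per-request overhead on slow disks
pipeline.batch.delay: 50
queue.type: persisted
queue.max_bytes: 4gb
# path.queue: /fast-disk/logstash-queue   # moving the PQ off the shared HDD can help
```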
-
From the metrics you sent, IMO the slowdown is probably coming from disk writes on the HDD storage. Logstash is reporting that the malcolm-input pipeline workers are 100% blocked, which usually means they’re waiting on something downstream rather than being busy with CPU work. On the OpenSearch side, the cluster stats show very low CPU usage and no queued or rejected write tasks, so it doesn’t look like OpenSearch itself is overloaded. That usually leaves disk I/O as the limiting factor.

Your iostat output shows the main disk sitting around 50-75% utilization, and since both Logstash’s persistent queue and OpenSearch indexing are writing to the same shared HDD storage, they’re competing for the same disk. HDDs tend to struggle with lots of small concurrent writes, especially on shared storage, so even if utilization doesn’t look maxed out it can still introduce enough latency that OpenSearch takes longer to finish writes. When that happens, Logstash has to wait for those indexing requests to complete, which causes the pipeline workers to block like we’re seeing. There is not a ton we can do if this is the issue.
The other thing that seems weird to me is: if you grab … But if I were betting, I'd place it on the HDD being the problem.
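One way to put a number on "workers are blocked": Logstash's `_node/stats/pipelines` output includes `events.duration_in_millis` and `events.queue_push_duration_in_millis`, and when inputs spend most of their time waiting to push events into the queue, the workers aren't draining it. A rough ratio — the numbers below are made up, and the exact interpretation of these fields can vary by version:

```python
def backpressure_ratio(pipeline_stats: dict) -> float:
    """Rough share of time spent waiting to push events into the queue.

    Uses the `events.duration_in_millis` and `events.queue_push_duration_in_millis`
    fields from a `_node/stats/pipelines`-style payload.
    """
    ev = pipeline_stats["events"]
    total = ev["duration_in_millis"] + ev["queue_push_duration_in_millis"]
    return ev["queue_push_duration_in_millis"] / total if total else 0.0

# Made-up numbers: inputs spend most of their time blocked pushing into the queue
sample = {"events": {"in": 10_000, "out": 9_200,
                     "duration_in_millis": 120_000,
                     "queue_push_duration_in_millis": 480_000}}
print(f"{backpressure_ratio(sample):.0%}")  # → 80%
```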