Reproduction and fix for backpressure processing stalling#254

Merged
ShogunPanda merged 14 commits into platformatic:main from kibertoad:fix/backpressure-stall
Mar 19, 2026
Conversation

@kibertoad
Contributor

Problem

When MessagesStream is consumed via pipeline() through a downstream Duplex that triggers backpressure (e.g. a batching stream that groups messages by topic-partition), the internal fetch loop can die permanently, leaving unconsumed messages in Kafka.

This happens in production when a consumer uses pipeline(consumerStream, batchStream) with a batch-accumulating Duplex. Under moderate load (e.g. 15 topics × 1000 messages), the consumer stalls after processing ~30% of messages.

Root cause

Two bugs in the fetch loop lifecycle:

  1. canPush gate in #pushRecords kills the loop

After pushing fetched messages to the readable buffer, the next #fetch() was only scheduled when push() returned true (buffer below highWaterMark). When push() returned false, the loop relied on _read() to restart it. In pull mode this works, but in flowing mode with pipeline(), _read() is not reliably called again when the buffer is already empty - Node.js considers the stream "already flowing" and skips the _read() call. The fetch loop dies.

  2. resume() doesn't restart the fetch loop after backpressure

When the downstream Duplex's write() returns false, pipeline()'s internal pipe() calls pause() on the consumer stream, setting #paused = true. If a previously scheduled process.nextTick(#fetch) fires while #paused is still true, #fetch() returns early without scheduling another iteration — the loop is now dead. When pipe() later calls resume() (after the downstream drains), super.resume() does not reliably trigger _read() when the readable buffer is empty and the stream is in flowing mode.

Sequence leading to stall

  1. #pushRecords pushes N messages → push() returns false (buffer > highWaterMark)
  2. process.nextTick(#fetch) is scheduled
  3. pipeline's pipe() calls pause() → #paused = true
    (downstream batch stream's write() returned false — backpressure)
  4. nextTick fires → #fetch() sees #paused = true → returns (loop dead)
  5. Downstream drains → pipe() calls resume() → #paused = false
  6. super.resume() does NOT trigger _read() (stream already in flowing mode)
  7. No more fetches. Consumer sits idle with unconsumed messages in Kafka.

Fix

Fix 1: Remove the canPush gate — always schedule process.nextTick(#fetch) after #pushRecords. This is safe because
#fetch() checks #paused, #closed, and other guards before issuing a Kafka fetch request.

Fix 2: In resume(), when transitioning from paused → unpaused, schedule process.nextTick(#fetch) to explicitly restart
the loop. A wasPaused guard prevents firing during initial pipeline() setup (where resume() is called before
_construct() completes).

Reproduction

The load test in playground/load-tests/ reproduces the stall deterministically:

Start Kafka

docker compose up -d

Without fix: stalls at ~4500/15000 (30%)

npm run load:backpressure:light

With fix: 15000/15000 consumed

npm run load:backpressure:light

The test mirrors the real-world setup: consumer starts before publishing (fetches partial results as messages arrive),
messages are published interleaved across 15 topics, and a batch-accumulating Duplex downstream triggers
backpressure.

---
Signed-off-by: Igor Savin <iselwin@gmail.com>
```js
//
// Unconditionally scheduling is safe because #fetch() checks #paused,
// #closed, and other guards before issuing a Kafka fetch request.
process.nextTick(() => {
```
Contributor Author

Actual fix 1

Member

@kibertoad can you clarify why it's safe and provide some references?

Contributor Author

See:

```ts
  #fetch () {
    /* c8 ignore next 4 - Hard to test */
    if (this.#closed || this.closed || this.destroyed) {
      this.push(null)
      return
    }
```
  1. If the stream was closed/destroyed during record processing, #fetch() returns immediately at line 487-490;
  2. If backpressure kicked in (pipeline called pause(), setting #paused = true), #fetch() returns at line 493 without issuing a Kafka request. The resume() override (line 336-361) will restart the loop when backpressure clears;
  3. If readableFlowing === null (no consumer attached yet), it also bails out at line 493;
  4. If offset refresh is happening, same early return.

Did I miss anything?

Member

Can you add a test that verifies under extreme load and non-consumption (stream paused), the memory is not ballooning? I see the tests are verifying non-stalling, not avoiding leaks.

```diff
- return super.resume()
+ const result = super.resume()
+
+ // Restart the fetch loop when transitioning from paused → unpaused.
```
Contributor Author

Actual fix 2

@kibertoad
Contributor Author

@ShogunPanda could you please take a look?

…/backpressure-stall

# Conflicts:
#	src/clients/consumer/messages-stream.ts
@ShogunPanda
Contributor

@kibertoad In general it LGTM. I'm waiting for @mcollina to confirm that removing canPush won't cause harm.

In the meantime, do you think you can provide a test?

@kibertoad
Contributor Author

@ShogunPanda I tried to, but it doesn't reproduce well within a test, unfortunately, as it needs significant load to surface. Let me try to reproduce exactly what the load script is doing...

@kibertoad
Contributor Author

kibertoad commented Mar 19, 2026

@ShogunPanda go figure - asking Claude to reproduce exactly the setup in the load test did the trick - added the test that fails without the fix.

@kibertoad
Contributor Author

@mcollina what pattern would you recommend for checking memory usage in tests?

@mcollina
Member

@kibertoad kibertoad marked this pull request as draft March 19, 2026 10:38
@kibertoad
Contributor Author

@mcollina I've added the tests. Memory use one is not executed in CI, as it's too beefy for it, unfortunately.

@kibertoad kibertoad marked this pull request as ready for review March 19, 2026 13:18
@kibertoad
Contributor Author

@mcollina failing test is just flaky, I think, could you rerun?

@ShogunPanda
Contributor

@kibertoad Done

@ShogunPanda
Contributor

@kibertoad In order to have the memory one running, I had to add:

memory-fix.patch

Do you mind checking it locally and, if it works, pushing it to your branch?

@kibertoad
Contributor Author

@ShogunPanda applied

Member

@mcollina mcollina left a comment


lgtm

@ShogunPanda ShogunPanda merged commit 9daffe3 into platformatic:main Mar 19, 2026
34 of 43 checks passed
@kibertoad
Contributor Author

@mcollina @ShogunPanda thank you! would it be possible to release a new version now?
