Reproduction and fix for backpressure processing stalling#254

Merged
ShogunPanda merged 14 commits into platformatic:main from kibertoad:fix/backpressure-stall
Mar 19, 2026
Conversation

@kibertoad
Contributor

Problem

When MessagesStream is consumed via pipeline() through a downstream Duplex that triggers backpressure (e.g. a batching stream that groups messages by topic-partition), the internal fetch loop can die permanently, leaving unconsumed messages in Kafka.

This happens in production when a consumer uses pipeline(consumerStream, batchStream) with a batch-accumulating Duplex. Under moderate load (e.g. 15 topics × 1000 messages), the consumer stalls after processing ~30% of messages.

Root cause

Two bugs in the fetch loop lifecycle:

  1. canPush gate in #pushRecords kills the loop

After pushing fetched messages to the readable buffer, the next #fetch() was only scheduled when push() returned true (buffer below highWaterMark). When push() returned false, the loop relied on _read() to restart it. In pull mode this works, but in flowing mode with pipeline(), _read() is not reliably called again when the buffer is already empty - Node.js considers the stream "already flowing" and skips the _read() call. The fetch loop dies.

  2. resume() doesn't restart the fetch loop after backpressure

When the downstream Duplex's write() returns false, pipeline()'s internal pipe() calls pause() on the consumer stream, setting #paused = true. If a previously scheduled process.nextTick(#fetch) fires while #paused is still true, #fetch() returns early without scheduling another iteration — the loop is now dead. When pipe() later calls resume() (after the downstream drains), super.resume() does not reliably trigger _read() when the readable buffer is empty and the stream is in flowing mode.

Sequence leading to stall

  1. #pushRecords pushes N messages → push() returns false (buffer > highWaterMark)
  2. process.nextTick(#fetch) is scheduled
  3. pipeline's pipe() calls pause() → #paused = true
    (downstream batch stream's write() returned false — backpressure)
  4. nextTick fires → #fetch() sees #paused = true → returns (loop dead)
  5. Downstream drains → pipe() calls resume() → #paused = false
  6. super.resume() does NOT trigger _read() (stream already in flowing mode)
  7. No more fetches. Consumer sits idle with unconsumed messages in Kafka.

Fix

Fix 1: Remove the canPush gate — always schedule process.nextTick(#fetch) after #pushRecords. This is safe because
#fetch() checks #paused, #closed, and other guards before issuing a Kafka fetch request.

Fix 2: In resume(), when transitioning from paused → unpaused, schedule process.nextTick(#fetch) to explicitly restart
the loop. A wasPaused guard prevents firing during initial pipeline() setup (where resume() is called before
_construct() completes).

Reproduction

The load test in playground/load-tests/ reproduces the stall deterministically:

Start Kafka

docker compose up -d

Without fix: stalls at ~4500/15000 (30%)

npm run load:backpressure:light

With fix: 15000/15000 consumed

npm run load:backpressure:light

The test mirrors the real-world setup: consumer starts before publishing (fetches partial results as messages arrive),
messages are published interleaved across 15 topics, and a batch-accumulating Duplex downstream triggers
backpressure.

---
Signed-off-by: Igor Savin <iselwin@gmail.com>
```js
//
// Unconditionally scheduling is safe because #fetch() checks #paused,
// #closed, and other guards before issuing a Kafka fetch request.
process.nextTick(() => {
```
Contributor Author

Actual fix 1

Member

@kibertoad can you clarify why it's safe and provide some references?

Contributor Author

See:

```ts
  #fetch () {
    /* c8 ignore next 4 - Hard to test */
    if (this.#closed || this.closed || this.destroyed) {
      this.push(null)
      return
    }
```
  1. If the stream was closed/destroyed during record processing, #fetch() returns immediately at line 487-490;
  2. If backpressure kicked in (pipeline called pause(), setting #paused = true), #fetch() returns at line 493 without issuing a Kafka request. The resume() override (line 336-361) will restart the loop when backpressure clears;
  3. If readableFlowing === null (no consumer attached yet), it also bails out at line 493;
  4. If offset refresh is happening, same early return.

Did I miss anything?

Member

Can you add a test that verifies under extreme load and non-consumption (stream paused), the memory is not ballooning? I see the tests are verifying non-stalling, not avoiding leaks.

```diff
- return super.resume()
+ const result = super.resume()
+
+ // Restart the fetch loop when transitioning from paused → unpaused.
```
Contributor Author

Actual fix 2

@kibertoad
Contributor Author

@ShogunPanda could you please take a look?

…/backpressure-stall

# Conflicts:
#	src/clients/consumer/messages-stream.ts
@ShogunPanda
Contributor

@kibertoad In general it LGTM. I'm waiting for @mcollina to confirm that removing canPush won't cause harm.

In the meantime, do you think you can provide a test?

@kibertoad
Contributor Author

@ShogunPanda I tried to, but it doesn't reproduce well within a test, unfortunately, as it needs significant load to surface. Let me try to reproduce exactly what the load script is doing...

@kibertoad
Contributor Author

kibertoad commented Mar 19, 2026

@ShogunPanda go figure - asking Claude to reproduce exactly the setup in the load test did the trick - added the test that fails without the fix.

@kibertoad
Contributor Author

@mcollina what pattern would you recommend for checking memory usage in tests?

@mcollina
Member

@kibertoad kibertoad marked this pull request as draft March 19, 2026 10:38
@kibertoad
Contributor Author

@mcollina I've added the tests. Memory use one is not executed in CI, as it's too beefy for it, unfortunately.

@kibertoad kibertoad marked this pull request as ready for review March 19, 2026 13:18
@kibertoad
Contributor Author

@mcollina failing test is just flaky, I think, could you rerun?

@ShogunPanda
Contributor

@kibertoad Done

@ShogunPanda
Contributor

@kibertoad In order to have the memory one running, I had to add:

memory-fix.patch

Do you mind checking it locally and, if it works, pushing it to your branch?

@kibertoad
Contributor Author

@ShogunPanda applied

Member

@mcollina mcollina left a comment


lgtm

@ShogunPanda ShogunPanda merged commit 9daffe3 into platformatic:main Mar 19, 2026
34 of 43 checks passed
@kibertoad
Contributor Author

@mcollina @ShogunPanda thank you! would it be possible to release a new version now?
