Continuous batching: output queue requeue starvation and request-scoped iterator does not terminate on completion #42943

@pythongiant

Description

There are two related correctness issues in the continuous batching result consumption logic that can lead to unfairness and non-terminating iterators under concurrent workloads.

1. Starvation and incorrect timeout handling in get_result

ContinuousBatchingManager.get_result currently retrieves a single item from the shared output queue; if the item's request_id does not match, it immediately re-queues the item and returns None. Under concurrent requests, this can lead to:

  • Starvation when mismatched outputs are repeatedly re-queued
  • Timeout semantics that are not respected, since re-queueing returns early instead of continuing to search within the remaining timeout
  • Unfair consumption behavior that depends on queue ordering rather than request progress

This behavior is observable when multiple streaming requests are active and results are interleaved in the output queue.
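A minimal sketch of the problematic pattern, using a plain queue.Queue and a dict-shaped output (the function name and item shape here are illustrative, not the actual Transformers internals): a mismatched item at the head of the queue causes an early None return even though a matching result is sitting right behind it.

```python
import queue

def get_result_buggy(output_queue, request_id, timeout=None):
    """Hypothetical sketch of the current pattern: one get(), then an
    immediate re-queue and early None return on a request_id mismatch."""
    try:
        result = output_queue.get(timeout=timeout)
    except queue.Empty:
        return None
    if result["request_id"] != request_id:
        output_queue.put(result)  # re-queued at the tail...
        return None               # ...and the search gives up early
    return result

# With interleaved outputs, the matching result one slot back is never reached:
q = queue.Queue()
q.put({"request_id": "req-B", "text": "other"})
q.put({"request_id": "req-A", "text": "mine"})
print(get_result_buggy(q, "req-A", timeout=0.1))  # None, despite "mine" waiting
```

If two consumers each keep hitting the other's output at the head of the queue, both can spin on re-queue/return-None cycles, which is the starvation described above.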

2. request_id_iter does not terminate after normal completion

request_id_iter currently exits only when a request is cancelled or the generation thread terminates. For requests that complete normally, the iterator continues polling indefinitely after the final FINISHED output has been yielded.

This can result in:

  • Infinite iteration loops for request-scoped consumers
  • Unexpected blocking behavior in streaming-style usage
  • Reliance on caller-side logic to manually stop iteration

The iterator should terminate once a terminal FINISHED output is observed for the given request.
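The terminating behavior can be sketched as follows. This is a simplified stand-in, not the real implementation: RequestStatus, GenerationOutput, and the get_result callable are illustrative, and the real iterator would also exit on cancellation or generation-thread termination.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RequestStatus(Enum):
    ACTIVE = auto()
    FINISHED = auto()

@dataclass
class GenerationOutput:
    request_id: str
    text: str
    status: RequestStatus

def request_id_iter(get_result, request_id):
    """Hypothetical sketch: yield outputs for one request, and stop once
    a terminal FINISHED output has been yielded (in the real code this
    would be in addition to cancellation / thread-termination exits)."""
    while True:
        out = get_result(request_id)
        if out is None:
            continue  # nothing for us yet; keep polling
        yield out
        if out.status is RequestStatus.FINISHED:
            return  # terminate instead of polling forever

# Usage with a canned result source:
results = iter([
    GenerationOutput("req-A", "Hello", RequestStatus.ACTIVE),
    GenerationOutput("req-A", " world", RequestStatus.FINISHED),
])
chunks = [out.text for out in request_id_iter(lambda rid: next(results), "req-A")]
print(chunks)  # ['Hello', ' world']
```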


Expected behavior

  • get_result should fairly search for a matching result within the specified timeout without starvation or early return.
  • request_id_iter should stop iterating once the request reaches a terminal finished state, in addition to cancellation or thread termination.

Proposed fix

A minimal, backward-compatible fix (#42942) can:

  • Defer re-queuing mismatched outputs until a matching result is found or the timeout expires
  • Explicitly terminate request_id_iter when a GenerationOutput reports a finished state

This preserves existing APIs, streaming semantics, and benchmarking behavior while fixing the correctness issues.
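The deferred re-queue idea can be sketched like this, again against a plain queue.Queue with illustrative names rather than the actual Transformers code: drain within a single monotonic deadline, hold mismatched outputs locally, and only put them back once the search ends, so other consumers' items are restored in order and the timeout is honored as one budget.

```python
import queue
import time

def get_result_fixed(output_queue, request_id, timeout=None):
    """Hypothetical sketch of the deferred-re-queue fix: search the queue
    within one deadline, parking mismatched outputs locally, and re-queue
    them only once a match is found or the timeout expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    mismatched = []
    found = None
    try:
        while found is None:
            remaining = None if deadline is None else deadline - time.monotonic()
            if remaining is not None and remaining <= 0:
                break  # timeout budget exhausted
            try:
                item = output_queue.get(timeout=remaining)
            except queue.Empty:
                break
            if item["request_id"] == request_id:
                found = item
            else:
                mismatched.append(item)  # hold instead of re-queuing immediately
    finally:
        for item in mismatched:  # restore other requests' outputs in order
            output_queue.put(item)
    return found

q = queue.Queue()
q.put({"request_id": "req-B", "text": "other"})
q.put({"request_id": "req-A", "text": "mine"})
result = get_result_fixed(q, "req-A", timeout=0.1)
print(result["text"])  # mine
```

Because mismatched items are held until the search completes, no consumer can repeatedly push another's result behind its own, and the timeout applies to the whole search rather than to a single get.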


Environment

  • Transformers version: main
  • Feature: continuous batching
  • Device: CPU / CUDA (independent of backend)

Additional context

These issues are easiest to reproduce with multiple concurrent streaming requests sharing a single ContinuousBatchingManager.
