
rt: shard the multi-thread inject queue to reduce remote spawn contention #7973

Open
alex wants to merge 1 commit into tokio-rs:master from alex:shard-remote-lock

Conversation

alex (Contributor) commented Mar 13, 2026

The multi-threaded scheduler's inject queue was protected by a single global mutex (shared with idle coordination state). Every remote task spawn — any spawn from outside a worker thread — acquired this lock, serializing concurrent spawners and limiting throughput.

This change introduces inject::Sharded, which splits the inject queue into up to 8 independent shards, each an existing Shared/Synced pair with its own mutex and cache-line padding.

Design:

  • Push: each thread is assigned a home shard on first push (via a global counter) and sticks with it. This keeps consecutive pushes from one thread cache-local while spreading distinct threads across distinct locks.
  • Pop: workers rotate through shards starting at their own index, skipping empty shards via per-shard atomic length. pop_n drains from one shard at a time to keep critical sections bounded.
  • Shard count: capped at 8 (and 1 under loom). Contention falls off steeply past a handful of shards, and is_empty()/len() run in the worker hot loop and must scan every shard, so more shards make those checks more expensive.
  • is_closed: a single Release atomic set after all shards are closed, so the shutdown check stays lock-free.
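
The mechanics above can be sketched in simplified form with std primitives. This is illustrative only, not the PR's actual code: Tokio's real shards wrap its internal Shared/Synced pair, and all names here (Sharded, Shard, HOME_SHARD, etc.) are hypothetical.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Mutex;

const SHARD_CNT: usize = 8;

// One shard: its own mutex-protected queue, plus an atomic length
// so poppers can skip empty shards without touching the lock.
struct Shard<T> {
    queue: Mutex<VecDeque<T>>,
    len: AtomicUsize,
}

pub struct Sharded<T> {
    shards: Vec<Shard<T>>,
    closed: AtomicBool,
}

// Global counter handing out home shards round-robin; each thread
// caches its assignment in a thread-local on first use.
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);
thread_local! {
    static HOME_SHARD: usize =
        NEXT_SHARD.fetch_add(1, Ordering::Relaxed) % SHARD_CNT;
}

impl<T> Sharded<T> {
    pub fn new() -> Self {
        Sharded {
            shards: (0..SHARD_CNT)
                .map(|_| Shard {
                    queue: Mutex::new(VecDeque::new()),
                    len: AtomicUsize::new(0),
                })
                .collect(),
            closed: AtomicBool::new(false),
        }
    }

    // Sticky push: always use the calling thread's home shard, so
    // consecutive pushes from one thread stay on hot cache lines.
    pub fn push(&self, task: T) {
        let idx = HOME_SHARD.with(|s| *s);
        let shard = &self.shards[idx];
        let mut q = shard.queue.lock().unwrap();
        q.push_back(task);
        shard.len.store(q.len(), Ordering::Release);
    }

    // Pop: rotate through shards starting at the caller's index
    // (a worker would pass its own), skipping empty shards via the
    // atomic length instead of taking every lock.
    pub fn pop(&self, start: usize) -> Option<T> {
        for i in 0..SHARD_CNT {
            let shard = &self.shards[(start + i) % SHARD_CNT];
            if shard.len.load(Ordering::Acquire) == 0 {
                continue;
            }
            let mut q = shard.queue.lock().unwrap();
            if let Some(task) = q.pop_front() {
                shard.len.store(q.len(), Ordering::Release);
                return Some(task);
            }
        }
        None
    }

    // The shutdown check stays lock-free: one flag, set with
    // Release ordering after all shards have been closed.
    pub fn close(&self) {
        self.closed.store(true, Ordering::Release);
    }

    pub fn is_closed(&self) -> bool {
        self.closed.load(Ordering::Acquire)
    }
}
```

Because a single thread always pushes to one shard, per-shard FIFO order is preserved for that producer; only cross-shard ordering is relaxed.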

Random shard selection via context::thread_rng_n (as used in #7757 for the blocking pool) was measured and found to be 20-33% slower on remote_spawn at 8+ threads. The inject workload is a tight loop of trivial pushes where producer-side cache locality dominates: with RNG, a hot thread bounces between shard cache lines on every push; with sticky assignment it stays hot on one mutex and list tail. RNG did win slightly (5-9%) on single-producer benchmarks where spreading tasks lets workers pop in parallel, but not enough to offset the regression at scale.
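
The two selection policies being compared can be sketched side by side. The xorshift generator below is only a stand-in for Tokio's context::thread_rng_n, and the function names are illustrative:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

const SHARD_CNT: usize = 8;

// RNG policy: a fresh shard index on every push. A hot producer
// then touches a different mutex/list cache line each time; per
// the benchmarks this cost 20-33% at 8+ spawner threads.
thread_local! {
    static RNG_STATE: Cell<u64> = Cell::new(0x9E3779B97F4A7C15);
}

fn rng_shard() -> usize {
    RNG_STATE.with(|s| {
        // xorshift64: stand-in for context::thread_rng_n.
        let mut x = s.get();
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        s.set(x);
        (x % SHARD_CNT as u64) as usize
    })
}

// Sticky policy (what the PR adopts): a one-time round-robin
// assignment cached per thread, so every later push from that
// thread stays on the same shard's mutex and list tail.
static NEXT_SHARD: AtomicUsize = AtomicUsize::new(0);
thread_local! {
    static HOME_SHARD: usize =
        NEXT_SHARD.fetch_add(1, Ordering::Relaxed) % SHARD_CNT;
}

fn sticky_shard() -> usize {
    HOME_SHARD.with(|h| *h)
}
```

The trade-off follows directly: sticky_shard returns the same index for a given thread forever (cache-local, but one producer cannot spread work), while rng_shard scatters a single producer's pushes across shards (parallel pops, at the cost of producer-side cache misses).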

The inject state is removed from the global Synced mutex, which now only guards idle coordination. This also helps the single-threaded path since remote pushes no longer contend with worker park/unpark.

Results on remote_spawn benchmark (12,800 no-op tasks, N spawner threads, 64-core box):

| threads | before | after | improvement |
|--------:|-------:|------:|------------:|
| 1 | 9.38 ms | 7.33 ms | -22% |
| 2 | 14.94 ms | 6.64 ms | -56% |
| 4 | 23.69 ms | 5.34 ms | -77% |
| 8 | 34.81 ms | 4.69 ms | -87% |
| 16 | 32.33 ms | 4.54 ms | -86% |
| 32 | 30.37 ms | 4.73 ms | -84% |
| 64 | 26.59 ms | 5.34 ms | -80% |

rt_multi_threaded benchmarks: spawn_many_local -8%, spawn_many_remote_idle -7%, yield_many -1%, rest neutral.

Developed in conjunction with Claude.

@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR labels Mar 13, 2026
@alex alex force-pushed the shard-remote-lock branch 2 times, most recently from e4fa3a2 to adfa19f Compare March 13, 2026 23:54
@alex alex force-pushed the shard-remote-lock branch from adfa19f to de52c66 Compare March 13, 2026 23:59
ADD-SP (Member) commented Mar 14, 2026

Is the inject queue still a FIFO queue?

@ADD-SP ADD-SP added A-tokio Area: The main tokio crate M-runtime Module: tokio/runtime T-performance Topic: performance and benchmarks labels Mar 14, 2026
alex (Contributor, Author) commented Mar 14, 2026

Approximately -- each queue shard is FIFO on its own, but nothing attempts to enforce ordering across shards.
