Skip to content

util: add AtomicBool for lock-free is_cancelled check#7874

Open
HueCodes wants to merge 4 commits intotokio-rs:masterfrom
HueCodes:atomic-is-cancelled
Open

util: add AtomicBool for lock-free is_cancelled check#7874
HueCodes wants to merge 4 commits intotokio-rs:masterfrom
HueCodes:atomic-is-cancelled

Conversation

@HueCodes
Copy link

@HueCodes HueCodes commented Jan 31, 2026

Motivation

CancellationToken::is_cancelled() acquires a mutex on every call, causing >2ms delays on real-time systems.

Fixes: #7775

Solution

Add an AtomicBool field to TreeNode that mirrors the is_cancelled state. The atomic is updated with Release ordering when cancellation occurs, and read with Acquire ordering in is_cancelled(), providing lock-free reads without mutex contention.

Add an AtomicBool field to TreeNode that mirrors the is_cancelled
state. The atomic is updated with Release ordering when cancellation
occurs, and read with Acquire ordering in is_cancelled(), providing
lock-free reads without mutex contention.

Fixes: tokio-rs#7775
@github-actions github-actions bot added the R-loom-util Run loom tokio-util tests on this PR label Jan 31, 2026
@ADD-SP ADD-SP added A-tokio-util Area: The tokio-util crate M-sync Module: tokio/sync labels Feb 3, 2026
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please delete this field to avoid duplication.

Remove the `is_cancelled: bool` field from `Inner` to avoid
duplication with the `AtomicBool` on `TreeNode`. All reads and
writes now go through the atomic. The atomic stores are moved
before the lock drop so that locked readers see a consistent
value.
Comment on lines 85 to +87
/// Returns whether or not the node is cancelled
pub(crate) fn is_cancelled(node: &Arc<TreeNode>) -> bool {
node.inner.lock().unwrap().is_cancelled
node.is_cancelled.load(Ordering::Acquire)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Acquire is not actually strong enough here. When WaitForCancellationFuture::poll() returns pending in parallel with being woken up, we currently rely on the following logic:

  1. Inside WaitForCancellationFuture, we create a Notified future.
  2. Inside WaitForCancellationFuture, we read false from is_cancelled.
  3. Inside CancellationToken::cancel(), we write true to is_cancelled.
  4. Inside CancellationToken::cancel(), we call notify_waiters().

And here it is really important that notify_waiters() will wake up the Notified future created in step 1. Otherwise the WaitForCancellationFuture might sleep forever. With a mutex, that's fine, but with the current atomics, it's not actually satisfied!

That's because what we need is that step 2 happens-before step 3. However, for that to be true, operation 2 must be Release, and operation 3 must be Acquire. Unfortunately, it's reversed! Operation 2 is Acquire, and operation 3 is Release.

Unfortunately, a load() can't be Release and a store() can't be Acquire, so we have to upgrade them to read-modify-write operations for this to work:

  1. Use AtomicBool::swap(true, AcqRel) to write true.
  2. Use AtomicBool::compare_exchange(false, false, AcqRel, Acquire) to read. (fetch_and(true, AcqRel) would also work)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we can just take the mutex in WaitForCancellationFuture::poll(), and leave the AtomicBool optimization for users of is_cancelled() only.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the review and explaining the memory order issue. I am learning a ton.

I've gone with your simpler suggestion and added a separate
is_cancelled_with_lock() function that takes the mutex for proper synchronization in
WaitForCancellationFuture::poll(), while keeping the lock-free AtomicBool path for direct
is_cancelled() calls.

I've also just pushed a fix for the cargo-spellcheck CI failures

Let me know if you need anything else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO using both AtomicBool and Mutex for the same purpose but for different use cases is not a good idea. It makes the reasoning harder - should I use the atomics based solution or the mutex one for use case 3 (a use case from the future) ?!
If it could be implemented fully with atomics (swap + compare_exchange) it would be better!

A good start would be to add an integration (or bench) test that reproduces the problem with Acquire+Release and then fixing it by using swap+compare_exchange.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by a use-case from the future? As #7775 indicates, there are users today that call is_cancelled() instead of .awaiting a future, and such users would get the benefit today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and such users would get the benefit today

This is clear!

What bothers me is that some use cases could use is_cancelled() (lock-free, for both external and internal users) and other use cases (internal user) like WaitForCancellationFuture::poll() can't and they need to depend on the Mutex.
With a "use case from the future" I mean the second - a new Tokio internal functionality that needs to know whether a token is cancelled or not. What should the developer use - is_cancelled() or is_cancelled_with_lock() ?! I guess the answer is it depends.
But if by using swap+compare_exchange there won't be a need of is_cancelled_with_lock() then the decision will be simple.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are any changes I should make let me know!

Add is_cancelled_with_lock() method that acquires the mutex before
checking the atomic flag. Use this in WaitForCancellationFuture::poll()
and WaitForCancellationFutureOwned::poll() to ensure proper memory
ordering guarantees.

The original Acquire/Release atomic ordering was insufficient to
guarantee that notify_waiters() would wake the Notified future created
in poll(). The mutex lock ensures the required happens-before relationship
between the is_cancelled check and the cancellation flag write.

The lock-free atomic optimization remains in place for general
is_cancelled() calls, preserving the fix for issue tokio-rs#7775.
Fix cargo-spellcheck CI failures by wrapping type and method names
(WaitForCancellationFuture::poll(), Mutex) in backticks so they are
ignored by the spellchecker.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-tokio-util Area: The tokio-util crate M-sync Module: tokio/sync R-loom-util Run loom tokio-util tests on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CancellationToken is_cancelled locks on a mutex, caused occasional >2ms delays

4 participants