Fix NVL domain tag collision in CtranTestBootstrap by creating isolated bootstrap for nvl domain #2078

Closed
Regina8023 wants to merge 1 commit into meta-pytorch:main from Regina8023:export-D98649856

@Regina8023
Contributor

Summary:
D98025086 introduced CtranTestBootstrap, which implements NVL domain
operations (allGatherNvlDomain, barrierNvlDomain) via pairwise send/recv
with auto-incrementing tags starting from tag=0.

During ctranInit(), CtranAlgo::SharedResource calls these NVL domain
operations, which write data through TcpStoreBootstrap under a key
determined solely by (sender, receiver, tag). Later, test code that calls
bootstrap_->send()/recv() directly with tag=0 (e.g. sockSend/sockRecv
in CtranIbDistUT) maps to the same keys and collides with the NVL domain data.

Since TcpStoreBootstrap::recv does store_->wait(key) followed by
store_->get(key), and the key already exists from the NVL domain
operation, wait() returns immediately with stale NVL data (wrong size),
causing a data size mismatch and test failure.
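
The failure mode above can be reproduced in miniature. The following is only a sketch under stated assumptions: `ToyStore`, `makeKey`, `recvSize`, and `demoStaleKeyCollision` are hypothetical names, the key format is invented for illustration, and `ready()` stands in for a blocking `wait()`. It is not the TcpStoreBootstrap or c10d implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical in-memory stand-in for the key-value store; not the c10d API.
class ToyStore {
 public:
  void set(const std::string& key, Bytes value) { data_[key] = std::move(value); }
  // Models wait(): in this sketch it only reports whether the key exists.
  bool ready(const std::string& key) const { return data_.count(key) != 0; }
  const Bytes& get(const std::string& key) const { return data_.at(key); }

 private:
  std::map<std::string, Bytes> data_;
};

// Key derived only from (sender, receiver, tag); the exact format is assumed.
std::string makeKey(int sender, int receiver, int tag) {
  return std::to_string(sender) + ":" + std::to_string(receiver) + ":" +
      std::to_string(tag);
}

// Size of the payload a recv would observe, or 0 if it would still block.
std::size_t recvSize(const ToyStore& store, int sender, int receiver, int tag) {
  const std::string key = makeKey(sender, receiver, tag);
  return store.ready(key) ? store.get(key).size() : 0;
}

// Returns true when the stale-key collision is observed.
bool demoStaleKeyCollision() {
  ToyStore store;

  // NVL domain operation during init: rank 0 -> rank 1, tag 0, 8 bytes.
  store.set(makeKey(0, 1, 0), Bytes(8, 0xAB));
  assert(recvSize(store, 0, 1, 0) == 8);  // the NVL recv itself is fine

  // Later, the test expects a 4-byte message on the same (0, 1, 0) triple.
  // The peer has not written yet, but the key already exists, so the recv
  // does not block and observes the stale 8-byte NVL payload instead.
  const std::size_t expected = 4;
  const std::size_t seen = recvSize(store, 0, 1, 0);
  return seen != expected;
}
```

Calling `demoStaleKeyCollision()` returns `true` in this model because the key written during initialization is never removed, which is the property the summary attributes to the collision.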

Fix: add IBootstrap::duplicate(), a virtual method that each bootstrap type implements to create a new instance with an isolated send/recv namespace:

  • TcpStoreBootstrap: wraps the underlying store in c10d::PrefixStore so all keys are prefixed and never collide with the original bootstrap.
  • MpiBootstrap: duplicates the communicator via MPI_Comm_dup so tags on the new communicator are independent.
  • Default: returns nullptr (bootstrap types that support duplication must override it).
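
As a rough illustration of the key-prefixing idea, here is a minimal sketch. `BootstrapSketch`, `PrefixedStoreBootstrap`, `hasMessage`, and the `dupN/` prefix scheme are all hypothetical and use a plain `std::map` rather than c10d::PrefixStore; the real `IBootstrap` signatures are not shown in this PR description and are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;
using KeyValueMap = std::map<std::string, Bytes>;

// Hypothetical interface; the default duplicate() returns nullptr.
class BootstrapSketch {
 public:
  virtual ~BootstrapSketch() = default;
  virtual void send(int sender, int receiver, int tag, Bytes data) = 0;
  virtual bool hasMessage(int sender, int receiver, int tag) const = 0;
  virtual Bytes recv(int sender, int receiver, int tag) const = 0;
  virtual std::unique_ptr<BootstrapSketch> duplicate() { return nullptr; }
};

// Store-backed bootstrap whose keys all carry a prefix (assumed scheme).
class PrefixedStoreBootstrap : public BootstrapSketch {
 public:
  explicit PrefixedStoreBootstrap(
      std::shared_ptr<KeyValueMap> store, std::string prefix = "")
      : store_(std::move(store)), prefix_(std::move(prefix)) {}

  void send(int sender, int receiver, int tag, Bytes data) override {
    (*store_)[key(sender, receiver, tag)] = std::move(data);
  }
  bool hasMessage(int sender, int receiver, int tag) const override {
    return store_->count(key(sender, receiver, tag)) != 0;
  }
  Bytes recv(int sender, int receiver, int tag) const override {
    return store_->at(key(sender, receiver, tag));
  }

  // Same underlying store, but a fresh prefix, so the duplicate's keys can
  // never equal the original's keys for the same (sender, receiver, tag).
  std::unique_ptr<BootstrapSketch> duplicate() override {
    return std::make_unique<PrefixedStoreBootstrap>(
        store_, prefix_ + "dup" + std::to_string(nextDupId_++) + "/");
  }

 private:
  std::string key(int sender, int receiver, int tag) const {
    return prefix_ + std::to_string(sender) + ":" + std::to_string(receiver) +
        ":" + std::to_string(tag);
  }

  std::shared_ptr<KeyValueMap> store_;
  std::string prefix_;
  int nextDupId_ = 0;
};
```

With this sketch, a message sent with tag 0 on the duplicate is invisible to the original, and vice versa, even though both share one store. The MPI variant described above would get the same isolation from a duplicated communicator rather than from key prefixes.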

CtranTestBootstrap calls duplicate() in its constructor and asserts the result is non-null. NVL-domain operations use the isolated bootstrap for pairwise send/recv, while global operations (allGather, barrier, send, recv, broadcast) delegate to the primary bootstrap. No tag restrictions remain.
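
The routing described above might look roughly like the following single-process sketch. `ToyChannel`, `TestBootstrapSketch`, and the simplified `allGatherNvlDomain` signature are hypothetical; the sketch takes the isolated channel as a constructor argument instead of calling `duplicate()`, and it simulates all ranks in one process, so it only illustrates the delegation pattern, not the actual CtranTestBootstrap code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <tuple>
#include <utility>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Hypothetical channel: messages keyed by (sender, receiver, tag), never erased.
struct ToyChannel {
  std::map<std::tuple<int, int, int>, Bytes> messages;

  void send(int sender, int receiver, int tag, Bytes data) {
    messages[std::make_tuple(sender, receiver, tag)] = std::move(data);
  }
  bool hasMessage(int sender, int receiver, int tag) const {
    return messages.count(std::make_tuple(sender, receiver, tag)) != 0;
  }
  const Bytes& recv(int sender, int receiver, int tag) const {
    return messages.at(std::make_tuple(sender, receiver, tag));
  }
};

class TestBootstrapSketch {
 public:
  TestBootstrapSketch(
      std::shared_ptr<ToyChannel> primary, std::shared_ptr<ToyChannel> isolated)
      : primary_(std::move(primary)), nvl_(std::move(isolated)) {
    // Mirrors the non-null assertion on the duplicated bootstrap.
    assert(primary_ != nullptr && nvl_ != nullptr);
  }

  // Global operations delegate to the primary channel.
  void send(int sender, int receiver, int tag, Bytes data) {
    primary_->send(sender, receiver, tag, std::move(data));
  }
  bool hasMessage(int sender, int receiver, int tag) const {
    return primary_->hasMessage(sender, receiver, tag);
  }
  const Bytes& recv(int sender, int receiver, int tag) const {
    return primary_->recv(sender, receiver, tag);
  }

  // NVL-domain allGather as pairwise send/recv on the isolated channel, with
  // an auto-incrementing tag that starts at 0. All peers are simulated here.
  std::vector<Bytes> allGatherNvlDomain(
      const std::vector<Bytes>& contributions, int myRank) {
    const int tag = nextNvlTag_++;
    const int numRanks = static_cast<int>(contributions.size());
    for (int peer = 0; peer < numRanks; ++peer) {
      nvl_->send(peer, myRank, tag, contributions[peer]);
    }
    std::vector<Bytes> gathered;
    for (int peer = 0; peer < numRanks; ++peer) {
      gathered.push_back(nvl_->recv(peer, myRank, tag));
    }
    return gathered;
  }

 private:
  std::shared_ptr<ToyChannel> primary_;
  std::shared_ptr<ToyChannel> nvl_;
  int nextNvlTag_ = 0;
};
```

In this model, an NVL-domain allGather that uses tag 0 internally leaves the primary channel untouched, so a later direct send/recv with tag 0 sees only its own payload.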

New tests added:

  • DirectSendRecvAndNvlDomainWithSameTagAreIndependent: NVL-domain allGather followed by direct send/recv with tag=0 — confirms no collision.
  • InterleavedDirectSendRecvAndNvlDomainAreIndependent: 3 rounds of interleaved direct send/recv and NVL-domain allGather, both using tag=0 — confirms repeated operations never interfere.

Reviewed By: minsii

Differential Revision: D98649856

meta-cla bot added the CLA Signed label on Apr 15, 2026.
meta-codesync bot commented Apr 15, 2026

@Regina8023 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D98649856.

meta-codesync bot commented Apr 15, 2026

This pull request has been merged in 41b8b99.

Labels: CLA Signed, fb-exported, Merged, meta-exported
