Fix MultiCommTest port conflict by destroying _root_store between tests by rmahidhar · Pull Request #2075 · meta-pytorch/torchcomms

rmahidhar · 2026-04-14T23:28:54Z

Summary:

Impact

Category	Issues	Root Cause	Fixed by this change?
Python 1x8	40	`_root_store` persistence	YES — verified with `buck test`
Python 2x8 (real failures)	~4	Same `_root_store` persistence	YES — same bug, same fix (if GPU available)
Python 2x8 (infra)	~4	CI cancelled/quota	No — needs GPU capacity
C++ 2x8	19	CI `RESOURCE_EXHAUSTED`	No — needs GPU capacity
C++ 1x8	0	No failures exist	N/A

This change addresses ~44 of the 67 open MultiCommTest test issues. The remaining 19 C++ issues and ~4 Python 2x8 issues are CI GPU quota problems — no code fix needed.

Investigation

Investigated 67 open MultiCommTest test issues (48 Python, 19 C++) across the ncclx oncall.

C++ (19 issues, all 2x8): All recent CI runs show RESOURCE_EXHAUSTED or cancelled — pure infrastructure failures due to GPU quota exhaustion. The tests never actually executed. C++ 1x8 tests all pass, confirming no code bug exists on the C++ side.

Python (48 issues, 40 1x8 + 8 2x8): Real test failures. CI logs show the test binary hangs during the second communicator bootstrap (SIGTERM after 180s timeout). The first communicator (comms_test_1) initializes successfully, but the second one hangs during TorchCommNCCLXBootstrap unique ID exchange.

Root Cause

The Python test helper TorchCommTestHelpers.py maintains a module-level _root_store (a TCPStore bound to MASTER_PORT) that persists across test methods:

_root_store = None  # module-level global

def create_store():
    global _root_store
    if _root_store is None:
        _root_store = TCPStore(host, port=MASTER_PORT, ...)  # binds port
    return PrefixStore(f"test_comm_{N}", _root_store)

Python unittest runs test methods alphabetically within the MultiCommTest class. Store-based tests (e.g., test_mixed_ops_separate_stores) call create_store(), initializing _root_store on MASTER_PORT. This store is never destroyed.

When a subsequent no-store test (e.g., test_two_comms_no_store) creates TorchCommTestWrapper() without a store, the NCCLx bootstrap internally calls createPrefixStore() which tries to create a new TCPStore on the same MASTER_PORT — but it is already held by the persistent _root_store, causing a hang.

Options Considered

Workaround (D99444078): Replace all TorchCommTestWrapper() (no-store) calls with TorchCommTestWrapper(store=create_store()). Makes tests pass but eliminates no-store path test coverage and does not fix the underlying bug. New tests using the no-store path would hit the same issue.
Destroy _root_store in tearDown: Clean up after every test. Rejected because tearDown runs on each rank independently without synchronization — rank 0 could destroy the TCPStore server while other ranks still hold stale client references, causing DistNetworkError on the next test.
Destroy _root_store after barrier in store-based tests (chosen): Each store-based test already ends with store_deletion_barrier() which synchronizes all ranks via NCCL barrier. Adding destroy_root_store() after the barrier ensures all ranks have released their PrefixStore references before the TCPStore is destroyed. This is safe because the barrier guarantees synchronization.

Fix Details

TorchCommTestHelpers.py: Added destroy_root_store() helper that sets the global _root_store to None, destroying the TCPStore and freeing MASTER_PORT.

MultiCommTest.py:

Store-based tests (3 tests): Added destroy_root_store() after the existing store_deletion_barrier() at the end of each test, freeing MASTER_PORT for subsequent no-store tests.
No-store tests (3 tests): Reverted from D99444078 workaround back to original TorchCommTestWrapper() without stores, restoring actual no-store bootstrap path coverage.
Mixed-store tests (3 tests): Call store_deletion_barrier() then destroy_root_store() before creating the no-store communicator, freeing MASTER_PORT for the internal bootstrap.

Verification

Tested with buck test across 7 variants:

ranksize={none,auto,manual} x ncclx baseline: all PASS
ranksize=none x gloo: PASS
ranksize=mpi x ncclx fast init: PASS
ranksize=auto x ncclx expseg+ctran: PASS
C++ 1x8 ncclx: PASS (unaffected by Python-only change)

Without fix: FAIL (test_two_comms_no_store hangs, killed by SIGTERM)
With fix: PASS (all 9 test methods pass)

Reviewed By: lilyjanjigian

Differential Revision: D100866929

Summary: ## Impact | Category | Issues | Root Cause | Fixed by this change? | |----------|--------|------------|----------------------| | Python 1x8 | 40 | `_root_store` persistence | **YES** — verified with `buck test` | | Python 2x8 (real failures) | ~4 | Same `_root_store` persistence | **YES** — same bug, same fix (if GPU available) | | Python 2x8 (infra) | ~4 | CI cancelled/quota | No — needs GPU capacity | | C++ 2x8 | 19 | CI `RESOURCE_EXHAUSTED` | No — needs GPU capacity | | C++ 1x8 | 0 | No failures exist | N/A | **This change addresses ~44 of the 67 open MultiCommTest test issues.** The remaining 19 C++ issues and ~4 Python 2x8 issues are CI GPU quota problems — no code fix needed. ## Investigation Investigated 67 open MultiCommTest test issues (48 Python, 19 C++) across the ncclx oncall. **C++ (19 issues, all 2x8):** All recent CI runs show `RESOURCE_EXHAUSTED` or `cancelled` — pure infrastructure failures due to GPU quota exhaustion. The tests never actually executed. C++ 1x8 tests all pass, confirming no code bug exists on the C++ side. **Python (48 issues, 40 1x8 + 8 2x8):** Real test failures. CI logs show the test binary hangs during the second communicator bootstrap (SIGTERM after 180s timeout). The first communicator (`comms_test_1`) initializes successfully, but the second one hangs during `TorchCommNCCLXBootstrap` unique ID exchange. ## Root Cause The Python test helper `TorchCommTestHelpers.py` maintains a module-level `_root_store` (a `TCPStore` bound to `MASTER_PORT`) that persists across test methods: ```python _root_store = None # module-level global def create_store(): global _root_store if _root_store is None: _root_store = TCPStore(host, port=MASTER_PORT, ...) # binds port return PrefixStore(f"test_comm_{N}", _root_store) ``` Python unittest runs test methods alphabetically within the `MultiCommTest` class. Store-based tests (e.g., `test_mixed_ops_separate_stores`) call `create_store()`, initializing `_root_store` on `MASTER_PORT`. This store is never destroyed. When a subsequent no-store test (e.g., `test_two_comms_no_store`) creates `TorchCommTestWrapper()` without a store, the NCCLx bootstrap internally calls `createPrefixStore()` which tries to create a new `TCPStore` on the same `MASTER_PORT` — but it is already held by the persistent `_root_store`, causing a hang. ## Options Considered 1. **Workaround (D99444078):** Replace all `TorchCommTestWrapper()` (no-store) calls with `TorchCommTestWrapper(store=create_store())`. Makes tests pass but eliminates no-store path test coverage and does not fix the underlying bug. New tests using the no-store path would hit the same issue. 2. **Destroy `_root_store` in `tearDown`:** Clean up after every test. Rejected because `tearDown` runs on each rank independently without synchronization — rank 0 could destroy the TCPStore server while other ranks still hold stale client references, causing `DistNetworkError` on the next test. 3. **Destroy `_root_store` after barrier in store-based tests (chosen):** Each store-based test already ends with `store_deletion_barrier()` which synchronizes all ranks via NCCL barrier. Adding `destroy_root_store()` after the barrier ensures all ranks have released their PrefixStore references before the TCPStore is destroyed. This is safe because the barrier guarantees synchronization. ## Fix Details **`TorchCommTestHelpers.py`:** Added `destroy_root_store()` helper that sets the global `_root_store` to `None`, destroying the `TCPStore` and freeing `MASTER_PORT`. **`MultiCommTest.py`:** - **Store-based tests (3 tests):** Added `destroy_root_store()` after the existing `store_deletion_barrier()` at the end of each test, freeing `MASTER_PORT` for subsequent no-store tests. - **No-store tests (3 tests):** Reverted from D99444078 workaround back to original `TorchCommTestWrapper()` without stores, restoring actual no-store bootstrap path coverage. - **Mixed-store tests (3 tests):** Call `store_deletion_barrier()` then `destroy_root_store()` before creating the no-store communicator, freeing `MASTER_PORT` for the internal bootstrap. ## Verification Tested with `buck test` across 7 variants: - `ranksize={none,auto,manual}` x ncclx baseline: **all PASS** - `ranksize=none` x gloo: **PASS** - `ranksize=mpi` x ncclx fast init: **PASS** - `ranksize=auto` x ncclx expseg+ctran: **PASS** - C++ 1x8 ncclx: **PASS** (unaffected by Python-only change) Without fix: FAIL (`test_two_comms_no_store` hangs, killed by SIGTERM) With fix: PASS (all 9 test methods pass) Reviewed By: lilyjanjigian Differential Revision: D100866929

meta-codesync · 2026-04-14T23:29:02Z

@rmahidhar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100866929.

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 14, 2026

meta-codesync bot added fb-exported meta-exported labels Apr 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix MultiCommTest port conflict by destroying _root_store between tests#2075

Fix MultiCommTest port conflict by destroying _root_store between tests#2075
rmahidhar wants to merge 1 commit intometa-pytorch:mainfrom
rmahidhar:export-D100866929

rmahidhar commented Apr 14, 2026

Uh oh!

meta-codesync bot commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rmahidhar commented Apr 14, 2026

Impact

Investigation

Root Cause

Options Considered

Fix Details

Verification

Uh oh!

meta-codesync bot commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant