Fix MultiCommTest port conflict by destroying _root_store between tests #2075

Open
rmahidhar wants to merge 1 commit into meta-pytorch:main from rmahidhar:export-D100866929

Summary:
## Impact

| Category | Issues | Root Cause | Fixed by this change? |
|----------|--------|------------|----------------------|
| Python 1x8 | 40 | `_root_store` persistence | **YES** — verified with `buck test` |
| Python 2x8 (real failures) | ~4 | Same `_root_store` persistence | **YES** — same bug, same fix (if GPU available) |
| Python 2x8 (infra) | ~4 | CI cancelled/quota | No — needs GPU capacity |
| C++ 2x8 | 19 | CI `RESOURCE_EXHAUSTED` | No — needs GPU capacity |
| C++ 1x8 | 0 | No failures exist | N/A |

**This change addresses ~44 of the 67 open MultiCommTest test issues.** The remaining 19 C++ issues and ~4 Python 2x8 issues are CI GPU quota problems — no code fix needed.

## Investigation

Investigated 67 open MultiCommTest test issues (48 Python, 19 C++) across the ncclx oncall.

**C++ (19 issues, all 2x8):** All recent CI runs show `RESOURCE_EXHAUSTED` or `cancelled` — pure infrastructure failures due to GPU quota exhaustion. The tests never actually executed. C++ 1x8 tests all pass, confirming no code bug exists on the C++ side.

**Python (48 issues, 40 1x8 + 8 2x8):** Real test failures. CI logs show the test binary hangs during the second communicator bootstrap (SIGTERM after 180s timeout). The first communicator (`comms_test_1`) initializes successfully, but the second one hangs during `TorchCommNCCLXBootstrap` unique ID exchange.

## Root Cause

The Python test helper `TorchCommTestHelpers.py` maintains a module-level `_root_store` (a `TCPStore` bound to `MASTER_PORT`) that persists across test methods:

```python
_root_store = None  # module-level global

def create_store():
    global _root_store
    if _root_store is None:
        _root_store = TCPStore(host, port=MASTER_PORT, ...)  # binds port
    return PrefixStore(f"test_comm_{N}", _root_store)
```

Python's `unittest` loader runs test methods in alphabetical order by name within the `MultiCommTest` class. Store-based tests (e.g., `test_mixed_ops_separate_stores`) call `create_store()`, which initializes `_root_store` on `MASTER_PORT`. Nothing destroys this store afterward, so it outlives the test that created it.
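The ordering claim can be checked with the standard library alone. The following is a minimal sketch, not the real test class: `OrderingDemo` and `discovered_order` are hypothetical names, and only the two method names are taken from the examples above. The methods are declared in reverse order to show that declaration order does not matter.

```python
import unittest


class OrderingDemo(unittest.TestCase):
    # Declared out of alphabetical order on purpose.
    def test_two_comms_no_store(self):
        pass

    def test_mixed_ops_separate_stores(self):
        pass


def discovered_order(case_cls):
    """Return test method names in the order the default loader yields them."""
    return list(unittest.TestLoader().getTestCaseNames(case_cls))


if __name__ == "__main__":
    for name in discovered_order(OrderingDemo):
        print(name)
```

With the default loader, the store-based name sorts ahead of the no-store name, which matches the sequence described above.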

When a subsequent no-store test (e.g., `test_two_comms_no_store`) creates `TorchCommTestWrapper()` without a store, the NCCLx bootstrap internally calls `createPrefixStore()`, which tries to create a new `TCPStore` on the same `MASTER_PORT`. That port is still held by the persistent `_root_store`, so the bootstrap hangs.
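The underlying conflict is that one port cannot be bound by two listeners at once. The sketch below illustrates this with plain sockets rather than `TCPStore`; `second_bind_fails` is a hypothetical helper name. Note one difference from the behavior reported above: a raw socket fails immediately with `EADDRINUSE`, whereas the bootstrap in this PR is observed to hang instead.

```python
import errno
import socket


def second_bind_fails(host="127.0.0.1"):
    """Bind a listener, then try to bind a second socket to the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS choose a free port
        first.listen()
        port = first.getsockname()[1]

        second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            second.bind((host, port))  # same port, still held by `first`
        except OSError as exc:
            return exc.errno == errno.EADDRINUSE
        finally:
            second.close()
        return False
    finally:
        first.close()


if __name__ == "__main__":
    print("second bind rejected:", second_bind_fails())
```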

## Options Considered

1. **Workaround (D99444078):** Replace all `TorchCommTestWrapper()` (no-store) calls with `TorchCommTestWrapper(store=create_store())`. Makes tests pass but eliminates no-store path test coverage and does not fix the underlying bug. New tests using the no-store path would hit the same issue.

2. **Destroy `_root_store` in `tearDown`:** Clean up after every test. Rejected because `tearDown` runs on each rank independently without synchronization — rank 0 could destroy the TCPStore server while other ranks still hold stale client references, causing `DistNetworkError` on the next test.

3. **Destroy `_root_store` after barrier in store-based tests (chosen):** Each store-based test already ends with `store_deletion_barrier()`, which synchronizes all ranks via an NCCL barrier. Calling `destroy_root_store()` after the barrier ensures all ranks have released their `PrefixStore` references before the `TCPStore` is destroyed, which avoids the stale-client problem of option 2.
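The barrier-then-destroy ordering of option 3 can be sketched with threads standing in for ranks. Everything here is an assumption made for illustration: `SharedStore`, `run_ranks`, and `WORLD_SIZE` are hypothetical, `threading.Barrier` stands in for the NCCL barrier, and only one simulated rank destroys the store, which simplifies the real multi-process setup.

```python
import threading

WORLD_SIZE = 4  # hypothetical number of ranks, simulated with threads


class SharedStore:
    """Stand-in for the root store; raises if used after destruction."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}
        self.alive = True

    def set(self, key, value):
        with self._lock:
            if not self.alive:
                raise RuntimeError("store used after destruction")
            self._data[key] = value

    def destroy(self):
        with self._lock:
            self.alive = False


def run_ranks():
    store = SharedStore()
    barrier = threading.Barrier(WORLD_SIZE)
    errors = []

    def rank_main(rank):
        try:
            store.set(f"rank_{rank}", rank)  # last use of the store
            barrier.wait()                   # all ranks are done with the store
            if rank == 0:
                store.destroy()              # safe: no rank touches it again
        except Exception as exc:
            errors.append(exc)

    threads = [
        threading.Thread(target=rank_main, args=(rank,))
        for rank in range(WORLD_SIZE)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors, store.alive


if __name__ == "__main__":
    errors, alive = run_ranks()
    print("errors:", errors, "store alive:", alive)
```

Without the `barrier.wait()` call, rank 0 could destroy the store before a slower rank reaches its `set`, which is the failure mode that ruled out option 2.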

## Fix Details

**`TorchCommTestHelpers.py`:** Added a `destroy_root_store()` helper that sets the global `_root_store` to `None`, which destroys the `TCPStore` and frees `MASTER_PORT`.

**`MultiCommTest.py`:**
- **Store-based tests (3 tests):** Added `destroy_root_store()` after the existing `store_deletion_barrier()` at the end of each test, freeing `MASTER_PORT` for subsequent no-store tests.
- **No-store tests (3 tests):** Reverted from D99444078 workaround back to original `TorchCommTestWrapper()` without stores, restoring actual no-store bootstrap path coverage.
- **Mixed-store tests (3 tests):** Call `store_deletion_barrier()` then `destroy_root_store()` before creating the no-store communicator, freeing `MASTER_PORT` for the internal bootstrap.
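The create, destroy, and rebind sequence described in these changes can be sketched as follows. This is not the code from `TorchCommTestHelpers.py`: `PortHolder` is a hypothetical stand-in for `TCPStore`, `demo` is an illustrative driver, and the stand-in closes its socket explicitly so the sketch does not depend on garbage-collection timing, whereas the helper in this PR is described as only setting the global to `None`.

```python
import socket

_root_store = None  # module-level global, mirroring the helper's pattern


class PortHolder:
    """Hypothetical stand-in for TCPStore: holds a listening port."""

    def __init__(self, host, port):
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            self._sock.bind((host, port))
            self._sock.listen()
        except OSError:
            self._sock.close()
            raise
        self.port = self._sock.getsockname()[1]

    def close(self):
        self._sock.close()


def create_store(host="127.0.0.1", port=0):
    """Create the root store once and reuse it on later calls."""
    global _root_store
    if _root_store is None:
        _root_store = PortHolder(host, port)  # binds the port
    return _root_store


def destroy_root_store():
    """Release the root store so its port can be bound again."""
    global _root_store
    if _root_store is not None:
        _root_store.close()  # explicit close: an assumption of this sketch
        _root_store = None


def demo():
    """Store-based phase, cleanup, then a fresh bind on the same port."""
    port = create_store().port
    destroy_root_store()
    rebound = PortHolder("127.0.0.1", port)  # works once the port is freed
    rebound.close()
    return True


if __name__ == "__main__":
    print("rebind after destroy succeeded:", demo())
```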

## Verification

Tested with `buck test` across 7 variants:
- `ranksize={none,auto,manual}` x ncclx baseline: **all PASS**
- `ranksize=none` x gloo: **PASS**
- `ranksize=mpi` x ncclx fast init: **PASS**
- `ranksize=auto` x ncclx expseg+ctran: **PASS**
- C++ 1x8 ncclx: **PASS** (unaffected by Python-only change)

- Without fix: **FAIL** (`test_two_comms_no_store` hangs, killed by SIGTERM)
- With fix: **PASS** (all 9 test methods pass)

Reviewed By: lilyjanjigian

Differential Revision: D100866929
meta-cla bot added the CLA Signed label on Apr 14, 2026

meta-codesync bot commented Apr 14, 2026

@rmahidhar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100866929.

Labels

CLA Signed, fb-exported, meta-exported
