Fix MultiCommTest port conflict by destroying _root_store between tests#2075
Open
rmahidhar wants to merge 1 commit intometa-pytorch:mainfrom
Open
Fix MultiCommTest port conflict by destroying _root_store between tests#2075rmahidhar wants to merge 1 commit intometa-pytorch:mainfrom
rmahidhar wants to merge 1 commit intometa-pytorch:mainfrom
Conversation
Summary:
## Impact
| Category | Issues | Root Cause | Fixed by this change? |
|----------|--------|------------|----------------------|
| Python 1x8 | 40 | `_root_store` persistence | **YES** — verified with `buck test` |
| Python 2x8 (real failures) | ~4 | Same `_root_store` persistence | **YES** — same bug, same fix (if GPU available) |
| Python 2x8 (infra) | ~4 | CI cancelled/quota | No — needs GPU capacity |
| C++ 2x8 | 19 | CI `RESOURCE_EXHAUSTED` | No — needs GPU capacity |
| C++ 1x8 | 0 | No failures exist | N/A |
**This change addresses ~44 of the 67 open MultiCommTest test issues.** The remaining 19 C++ issues and ~4 Python 2x8 issues are CI GPU quota problems — no code fix needed.
## Investigation
Investigated 67 open MultiCommTest test issues (48 Python, 19 C++) across the ncclx oncall.
**C++ (19 issues, all 2x8):** All recent CI runs show `RESOURCE_EXHAUSTED` or `cancelled` — pure infrastructure failures due to GPU quota exhaustion. The tests never actually executed. C++ 1x8 tests all pass, confirming no code bug exists on the C++ side.
**Python (48 issues, 40 1x8 + 8 2x8):** Real test failures. CI logs show the test binary hangs during the second communicator bootstrap (SIGTERM after 180s timeout). The first communicator (`comms_test_1`) initializes successfully, but the second one hangs during `TorchCommNCCLXBootstrap` unique ID exchange.
## Root Cause
The Python test helper `TorchCommTestHelpers.py` maintains a module-level `_root_store` (a `TCPStore` bound to `MASTER_PORT`) that persists across test methods:
```python
_root_store = None # module-level global
def create_store():
global _root_store
if _root_store is None:
_root_store = TCPStore(host, port=MASTER_PORT, ...) # binds port
return PrefixStore(f"test_comm_{N}", _root_store)
```
Python unittest runs test methods alphabetically within the `MultiCommTest` class. Store-based tests (e.g., `test_mixed_ops_separate_stores`) call `create_store()`, initializing `_root_store` on `MASTER_PORT`. This store is never destroyed.
When a subsequent no-store test (e.g., `test_two_comms_no_store`) creates `TorchCommTestWrapper()` without a store, the NCCLx bootstrap internally calls `createPrefixStore()` which tries to create a new `TCPStore` on the same `MASTER_PORT` — but it is already held by the persistent `_root_store`, causing a hang.
## Options Considered
1. **Workaround (D99444078):** Replace all `TorchCommTestWrapper()` (no-store) calls with `TorchCommTestWrapper(store=create_store())`. Makes tests pass but eliminates no-store path test coverage and does not fix the underlying bug. New tests using the no-store path would hit the same issue.
2. **Destroy `_root_store` in `tearDown`:** Clean up after every test. Rejected because `tearDown` runs on each rank independently without synchronization — rank 0 could destroy the TCPStore server while other ranks still hold stale client references, causing `DistNetworkError` on the next test.
3. **Destroy `_root_store` after barrier in store-based tests (chosen):** Each store-based test already ends with `store_deletion_barrier()` which synchronizes all ranks via NCCL barrier. Adding `destroy_root_store()` after the barrier ensures all ranks have released their PrefixStore references before the TCPStore is destroyed. This is safe because the barrier guarantees synchronization.
## Fix Details
**`TorchCommTestHelpers.py`:** Added `destroy_root_store()` helper that sets the global `_root_store` to `None`, destroying the `TCPStore` and freeing `MASTER_PORT`.
**`MultiCommTest.py`:**
- **Store-based tests (3 tests):** Added `destroy_root_store()` after the existing `store_deletion_barrier()` at the end of each test, freeing `MASTER_PORT` for subsequent no-store tests.
- **No-store tests (3 tests):** Reverted from D99444078 workaround back to original `TorchCommTestWrapper()` without stores, restoring actual no-store bootstrap path coverage.
- **Mixed-store tests (3 tests):** Call `store_deletion_barrier()` then `destroy_root_store()` before creating the no-store communicator, freeing `MASTER_PORT` for the internal bootstrap.
## Verification
Tested with `buck test` across 7 variants:
- `ranksize={none,auto,manual}` x ncclx baseline: **all PASS**
- `ranksize=none` x gloo: **PASS**
- `ranksize=mpi` x ncclx fast init: **PASS**
- `ranksize=auto` x ncclx expseg+ctran: **PASS**
- C++ 1x8 ncclx: **PASS** (unaffected by Python-only change)
Without fix: FAIL (`test_two_comms_no_store` hangs, killed by SIGTERM)
With fix: PASS (all 9 test methods pass)
Reviewed By: lilyjanjigian
Differential Revision: D100866929
Contributor
|
@rmahidhar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100866929. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Impact
_root_storepersistencebuck test_root_storepersistenceRESOURCE_EXHAUSTEDThis change addresses ~44 of the 67 open MultiCommTest test issues. The remaining 19 C++ issues and ~4 Python 2x8 issues are CI GPU quota problems — no code fix needed.
Investigation
Investigated 67 open MultiCommTest test issues (48 Python, 19 C++) across the ncclx oncall.
C++ (19 issues, all 2x8): All recent CI runs show
RESOURCE_EXHAUSTEDorcancelled— pure infrastructure failures due to GPU quota exhaustion. The tests never actually executed. C++ 1x8 tests all pass, confirming no code bug exists on the C++ side.Python (48 issues, 40 1x8 + 8 2x8): Real test failures. CI logs show the test binary hangs during the second communicator bootstrap (SIGTERM after 180s timeout). The first communicator (
comms_test_1) initializes successfully, but the second one hangs duringTorchCommNCCLXBootstrapunique ID exchange.Root Cause
The Python test helper
TorchCommTestHelpers.pymaintains a module-level_root_store(aTCPStorebound toMASTER_PORT) that persists across test methods:Python unittest runs test methods alphabetically within the
MultiCommTestclass. Store-based tests (e.g.,test_mixed_ops_separate_stores) callcreate_store(), initializing_root_storeonMASTER_PORT. This store is never destroyed.When a subsequent no-store test (e.g.,
test_two_comms_no_store) createsTorchCommTestWrapper()without a store, the NCCLx bootstrap internally callscreatePrefixStore()which tries to create a newTCPStoreon the sameMASTER_PORT— but it is already held by the persistent_root_store, causing a hang.Options Considered
Workaround (D99444078): Replace all
TorchCommTestWrapper()(no-store) calls withTorchCommTestWrapper(store=create_store()). Makes tests pass but eliminates no-store path test coverage and does not fix the underlying bug. New tests using the no-store path would hit the same issue.Destroy
_root_storeintearDown: Clean up after every test. Rejected becausetearDownruns on each rank independently without synchronization — rank 0 could destroy the TCPStore server while other ranks still hold stale client references, causingDistNetworkErroron the next test.Destroy
_root_storeafter barrier in store-based tests (chosen): Each store-based test already ends withstore_deletion_barrier()which synchronizes all ranks via NCCL barrier. Addingdestroy_root_store()after the barrier ensures all ranks have released their PrefixStore references before the TCPStore is destroyed. This is safe because the barrier guarantees synchronization.Fix Details
TorchCommTestHelpers.py: Addeddestroy_root_store()helper that sets the global_root_storetoNone, destroying theTCPStoreand freeingMASTER_PORT.MultiCommTest.py:destroy_root_store()after the existingstore_deletion_barrier()at the end of each test, freeingMASTER_PORTfor subsequent no-store tests.TorchCommTestWrapper()without stores, restoring actual no-store bootstrap path coverage.store_deletion_barrier()thendestroy_root_store()before creating the no-store communicator, freeingMASTER_PORTfor the internal bootstrap.Verification
Tested with
buck testacross 7 variants:ranksize={none,auto,manual}x ncclx baseline: all PASSranksize=nonex gloo: PASSranksize=mpix ncclx fast init: PASSranksize=autox ncclx expseg+ctran: PASSWithout fix: FAIL (
test_two_comms_no_storehangs, killed by SIGTERM)With fix: PASS (all 9 test methods pass)
Reviewed By: lilyjanjigian
Differential Revision: D100866929