Skip to content

Fix colltrace/proxytrace test failures after NcclxBaseTestFixture migration#2070

Open
SuhitK wants to merge 2 commits intometa-pytorch:mainfrom
SuhitK:export-D100844598
Open

Fix colltrace/proxytrace test failures after NcclxBaseTestFixture migration#2070
SuhitK wants to merge 2 commits intometa-pytorch:mainfrom
SuhitK:export-D100844598

Conversation

@SuhitK
Copy link
Copy Markdown
Contributor

@SuhitK SuhitK commented Apr 14, 2026

Summary:
Fixes all test failures introduced by the NcclxBaseTestFixture migration in D100723002.

Root causes and fixes:

  1. NCCL cvars not initialized before initEnv() call_once: NcclxBaseTestFixture::SetUp() calls initEnv() which has std::call_once — the NCCL debug logger is initialized exactly once per process. If NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_COLLTRACE, and NCCL_PROXYTRACE are not set in SetUp() envs, they're empty when initEnv() runs, and the logger/colltrace init uses empty defaults. Per-test EnvRAII overrides come too late.

    • Fix: Added NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=INIT,COLL, NCCL_COLLTRACE=trace to CollTraceDistTest/MapperTraceDistTest SetUp envs. Added NCCL_PROXYTRACE=trace, NCCL_DEBUG, NCCL_DEBUG_SUBSYS to ProxyTraceDistTest SetUp envs.
  2. AlgoConfig overrides cvar globals: EnvRAII sets cvar globals but AlgoConfig caches algorithm selection at init. MixedCtranBaseline used EnvRAII for NCCL_ALLGATHER_ALGO which doesn't notify AlgoConfig.

    • Fix: Use testinfra::AlgoRAII (calls testOnlySetAlgo()) for NCCL_ALLGATHER_ALGO in CollTraceDistTest::MixedCtranBaseline. Added test_algo_utils_lib dep to colltrace_dist_nolocal.
  3. AlgoConfig singleton double registration: ncclx_meta_cpp_library creates shared libs that each embed nccl-internal (including AlgoConfig singleton). Test binary also links nccl-internal directly. At runtime, singleton registers 3x → crash.

    • Fix: Added preferred_linkage="static" to ncclx_meta_cpp_library in def_meta_test.bzl. Test utility libs now statically link into the binary — single singleton.
  4. proxyTraceAddBasicInfo only called for registered ring AllGather: In v2_29 enqueue.cc, the call was inside an if (proxyOp->reg && RING && handles) block. Regular AllReduce never hit this path → nChannels=0, coll=default.

    • Fix: Moved proxyTraceAddBasicInfo call before the if-block so it runs for all collectives.
  5. ProxyMock hook missing in v2.29: ProxyMockNetSendFailure::mock() was only called in v2_27/src/transport/net.cc. Never ported to v2.29 → mock configured but never intercepted sends → hang tests saw no active ops.

    • Fix: Added ProxyMock hook in v2_29/src/transport/net.cc around ncclNet->isend().

Reviewed By: lilyjanjigian

Differential Revision: D100844598

minsii and others added 2 commits April 14, 2026 12:01
…lxBaseTestFixture

Summary:
Migrate 5 colltrace test files from the legacy NcclxBaseTest (TestsDistUtils.h) to NcclxBaseTestFixture (comms/ncclx/meta/tests/NcclxBaseTest.h):
- CollTraceDistTest.cc, MapperTraceDistTest.cc, NewCollTraceDistTestLocal.cc, NewCollTraceDistTestNoLocal.cc, ProxyTraceDistTest.cc: replace base class, use NcclxEnvs param in SetUp(), add NcclxBaseTestFixture::TearDown(), remove redundant cudaSetDevice, use ncclx::test::NcclCommRAII
- MapperTraceDistTest.cc: convert 3-arg MPI-based NcclCommRAII to 4-arg bootstrap-based constructor
- NewCollTraceDistTestNoLocal.cc: fix MixedCtranBaseline to use AlgoRAII instead of EnvRAII for NCCL_ALLGATHER_ALGO override
- CollTraceDistTest.cc: add NCCL_COLLTRACE_USE_NEW_COLLTRACE=0 to avoid segfault from uninitialized new colltrace (FIXME for colltrace owner to migrate)
- BUCK: add version_deps for ncclx_base_test and nccl_comm_utils, add test_algo_utils_lib dep to new_colltrace_dist_nolocal

Differential Revision: D100723002
…ration

Summary:
Fixes all test failures introduced by the NcclxBaseTestFixture migration in D100723002.

Root causes and fixes:
1. **NCCL cvars not initialized before initEnv() call_once**: NcclxBaseTestFixture::SetUp() calls initEnv() which has std::call_once — the NCCL debug logger is initialized exactly once per process. If NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_COLLTRACE, and NCCL_PROXYTRACE are not set in SetUp() envs, they're empty when initEnv() runs, and the logger/colltrace init uses empty defaults. Per-test EnvRAII overrides come too late.
   - Fix: Added NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=INIT,COLL, NCCL_COLLTRACE=trace to CollTraceDistTest/MapperTraceDistTest SetUp envs. Added NCCL_PROXYTRACE=trace, NCCL_DEBUG, NCCL_DEBUG_SUBSYS to ProxyTraceDistTest SetUp envs.

2. **AlgoConfig overrides cvar globals**: EnvRAII sets cvar globals but AlgoConfig caches algorithm selection at init. MixedCtranBaseline used EnvRAII for NCCL_ALLGATHER_ALGO which doesn't notify AlgoConfig.
   - Fix: Use testinfra::AlgoRAII (calls testOnlySetAlgo()) for NCCL_ALLGATHER_ALGO in CollTraceDistTest::MixedCtranBaseline. Added test_algo_utils_lib dep to colltrace_dist_nolocal.

3. **AlgoConfig singleton double registration**: ncclx_meta_cpp_library creates shared libs that each embed nccl-internal (including AlgoConfig singleton). Test binary also links nccl-internal directly. At runtime, singleton registers 3x → crash.
   - Fix: Added preferred_linkage="static" to ncclx_meta_cpp_library in def_meta_test.bzl. Test utility libs now statically link into the binary — single singleton.

4. **proxyTraceAddBasicInfo only called for registered ring AllGather**: In v2_29 enqueue.cc, the call was inside an `if (proxyOp->reg && RING && handles)` block. Regular AllReduce never hit this path → nChannels=0, coll=default.
   - Fix: Moved proxyTraceAddBasicInfo call before the if-block so it runs for all collectives.

5. **ProxyMock hook missing in v2.29**: ProxyMockNetSendFailure::mock() was only called in v2_27/src/transport/net.cc. Never ported to v2.29 → mock configured but never intercepted sends → hang tests saw no active ops.
   - Fix: Added ProxyMock hook in v2_29/src/transport/net.cc around ncclNet->isend().

Reviewed By: lilyjanjigian

Differential Revision: D100844598
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 14, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync bot commented Apr 14, 2026

@SuhitK has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100844598.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants