Fix colltrace/proxytrace test failures after NcclxBaseTestFixture migration#2070
Open
SuhitK wants to merge 2 commits intometa-pytorch:mainfrom
Open
Fix colltrace/proxytrace test failures after NcclxBaseTestFixture migration#2070SuhitK wants to merge 2 commits intometa-pytorch:mainfrom
SuhitK wants to merge 2 commits intometa-pytorch:mainfrom
Conversation
…lxBaseTestFixture Summary: Migrate 5 colltrace test files from the legacy NcclxBaseTest (TestsDistUtils.h) to NcclxBaseTestFixture (comms/ncclx/meta/tests/NcclxBaseTest.h): - CollTraceDistTest.cc, MapperTraceDistTest.cc, NewCollTraceDistTestLocal.cc, NewCollTraceDistTestNoLocal.cc, ProxyTraceDistTest.cc: replace base class, use NcclxEnvs param in SetUp(), add NcclxBaseTestFixture::TearDown(), remove redundant cudaSetDevice, use ncclx::test::NcclCommRAII - MapperTraceDistTest.cc: convert 3-arg MPI-based NcclCommRAII to 4-arg bootstrap-based constructor - NewCollTraceDistTestNoLocal.cc: fix MixedCtranBaseline to use AlgoRAII instead of EnvRAII for NCCL_ALLGATHER_ALGO override - CollTraceDistTest.cc: add NCCL_COLLTRACE_USE_NEW_COLLTRACE=0 to avoid segfault from uninitialized new colltrace (FIXME for colltrace owner to migrate) - BUCK: add version_deps for ncclx_base_test and nccl_comm_utils, add test_algo_utils_lib dep to new_colltrace_dist_nolocal Differential Revision: D100723002
…ration Summary: Fixes all test failures introduced by the NcclxBaseTestFixture migration in D100723002. Root causes and fixes: 1. **NCCL cvars not initialized before initEnv() call_once**: NcclxBaseTestFixture::SetUp() calls initEnv() which has std::call_once — the NCCL debug logger is initialized exactly once per process. If NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_COLLTRACE, and NCCL_PROXYTRACE are not set in SetUp() envs, they're empty when initEnv() runs, and the logger/colltrace init uses empty defaults. Per-test EnvRAII overrides come too late. - Fix: Added NCCL_DEBUG=INFO, NCCL_DEBUG_SUBSYS=INIT,COLL, NCCL_COLLTRACE=trace to CollTraceDistTest/MapperTraceDistTest SetUp envs. Added NCCL_PROXYTRACE=trace, NCCL_DEBUG, NCCL_DEBUG_SUBSYS to ProxyTraceDistTest SetUp envs. 2. **AlgoConfig overrides cvar globals**: EnvRAII sets cvar globals but AlgoConfig caches algorithm selection at init. MixedCtranBaseline used EnvRAII for NCCL_ALLGATHER_ALGO which doesn't notify AlgoConfig. - Fix: Use testinfra::AlgoRAII (calls testOnlySetAlgo()) for NCCL_ALLGATHER_ALGO in CollTraceDistTest::MixedCtranBaseline. Added test_algo_utils_lib dep to colltrace_dist_nolocal. 3. **AlgoConfig singleton double registration**: ncclx_meta_cpp_library creates shared libs that each embed nccl-internal (including AlgoConfig singleton). Test binary also links nccl-internal directly. At runtime, singleton registers 3x → crash. - Fix: Added preferred_linkage="static" to ncclx_meta_cpp_library in def_meta_test.bzl. Test utility libs now statically link into the binary — single singleton. 4. **proxyTraceAddBasicInfo only called for registered ring AllGather**: In v2_29 enqueue.cc, the call was inside an `if (proxyOp->reg && RING && handles)` block. Regular AllReduce never hit this path → nChannels=0, coll=default. - Fix: Moved proxyTraceAddBasicInfo call before the if-block so it runs for all collectives. 5. **ProxyMock hook missing in v2.29**: ProxyMockNetSendFailure::mock() was only called in v2_27/src/transport/net.cc. Never ported to v2.29 → mock configured but never intercepted sends → hang tests saw no active ops. - Fix: Added ProxyMock hook in v2_29/src/transport/net.cc around ncclNet->isend(). Reviewed By: lilyjanjigian Differential Revision: D100844598
Contributor
|
@SuhitK has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100844598. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Fixes all test failures introduced by the NcclxBaseTestFixture migration in D100723002.
Root causes and fixes:
NCCL cvars not initialized before initEnv() call_once: NcclxBaseTestFixture::SetUp() calls initEnv() which has std::call_once — the NCCL debug logger is initialized exactly once per process. If NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_COLLTRACE, and NCCL_PROXYTRACE are not set in SetUp() envs, they're empty when initEnv() runs, and the logger/colltrace init uses empty defaults. Per-test EnvRAII overrides come too late.
AlgoConfig overrides cvar globals: EnvRAII sets cvar globals but AlgoConfig caches algorithm selection at init. MixedCtranBaseline used EnvRAII for NCCL_ALLGATHER_ALGO which doesn't notify AlgoConfig.
AlgoConfig singleton double registration: ncclx_meta_cpp_library creates shared libs that each embed nccl-internal (including AlgoConfig singleton). Test binary also links nccl-internal directly. At runtime, singleton registers 3x → crash.
proxyTraceAddBasicInfo only called for registered ring AllGather: In v2_29 enqueue.cc, the call was inside an
if (proxyOp->reg && RING && handles)block. Regular AllReduce never hit this path → nChannels=0, coll=default.ProxyMock hook missing in v2.29: ProxyMockNetSendFailure::mock() was only called in v2_27/src/transport/net.cc. Never ported to v2.29 → mock configured but never intercepted sends → hang tests saw no active ops.
Reviewed By: lilyjanjigian
Differential Revision: D100844598