Skip to content

Fold initNvlFabricTopologies into createCommStateXFromNcclComm and convert CommStateXTest to distributed (#2152)#2152

Open
minsii wants to merge 6 commits intometa-pytorch:mainfrom
minsii:export-D101586421
Open

Fold initNvlFabricTopologies into createCommStateXFromNcclComm and convert CommStateXTest to distributed (#2152)#2152
minsii wants to merge 6 commits intometa-pytorch:mainfrom
minsii:export-D101586421

Conversation

@minsii
Copy link
Copy Markdown
Contributor

@minsii minsii commented Apr 20, 2026

Summary:

Refactor CommStateX initialization to consolidate the two-step factory pattern into a single call:

  • Change createCommStateXFromNcclComm to accept CtranComm* instead of IBootstrap*, accessing bootstrap internally via ctranComm->bootstrap_; return ncclResult_t and assign ctranComm->statex_ directly inside the function
  • Fold initNvlFabricTopologies into the factory function (moved to unnamed namespace), replacing raw NCCL C bootstrapAllGather with the IBootstrap::allGather abstraction (matching how initRankStatesTopology already works)
  • Remove outdated TODO comment about removing ncclx fields from all 3 init.cc versions
  • Enrich the CommStateX init log with commDesc, commHash, rank, nRanks, nLocalRanks
  • Convert CommStateXTest from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
  • DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to vnode_node_N and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D101586421

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 20, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync Bot commented Apr 20, 2026

@minsii has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101586421.

@minsii minsii force-pushed the export-D101586421 branch from 89df137 to 9b38de7 Compare April 23, 2026 20:51
@meta-codesync meta-codesync Bot changed the title Fold initNvlFabricTopologies into createCommStateXFromNcclComm and convert CommStateXTest to distributed Fold initNvlFabricTopologies into createCommStateXFromNcclComm and convert CommStateXTest to distributed (#2152) Apr 24, 2026
minsii added a commit to minsii/torchcomms-1 that referenced this pull request Apr 24, 2026
…nvert CommStateXTest to distributed (meta-pytorch#2152)

Summary:

Refactor CommStateX initialization to consolidate the two-step factory pattern into a single call:
- Change `createCommStateXFromNcclComm` to accept `CtranComm*` instead of `IBootstrap*`, accessing bootstrap internally via `ctranComm->bootstrap_`; return `ncclResult_t` and assign `ctranComm->statex_` directly inside the function
- Fold `initNvlFabricTopologies` into the factory function (moved to unnamed namespace), replacing raw NCCL C `bootstrapAllGather` with the `IBootstrap::allGather` abstraction (matching how `initRankStatesTopology` already works)
- Remove outdated TODO comment about removing ncclx fields from all 3 init.cc versions
- Enrich the CommStateX init log with commDesc, commHash, rank, nRanks, nLocalRanks
- Convert CommStateXTest from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
- DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to `vnode_node_N` and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D101586421
@minsii minsii force-pushed the export-D101586421 branch 2 times, most recently from 2c23dab to 8f87dc6 Compare April 24, 2026 21:03
minsii added a commit to minsii/torchcomms-1 that referenced this pull request Apr 24, 2026
…nvert CommStateXTest to distributed (meta-pytorch#2152)

Summary:
Pull Request resolved: meta-pytorch#2152

Refactor CommStateX initialization to consolidate the two-step factory pattern into a single call:
- Change `createCommStateXFromNcclComm` to accept `CtranComm*` instead of `IBootstrap*`, accessing bootstrap internally via `ctranComm->bootstrap_`; return `ncclResult_t` and assign `ctranComm->statex_` directly inside the function
- Fold `initNvlFabricTopologies` into the factory function (moved to unnamed namespace), replacing raw NCCL C `bootstrapAllGather` with the `IBootstrap::allGather` abstraction (matching how `initRankStatesTopology` already works)
- Remove outdated TODO comment about removing ncclx fields from all 3 init.cc versions
- Enrich the CommStateX init log with commDesc, commHash, rank, nRanks, nLocalRanks
- Convert CommStateXTest from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
- DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to `vnode_node_N` and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D101586421
minsii added a commit to minsii/torchcomms-1 that referenced this pull request Apr 24, 2026
…nvert CommStateXTest to distributed (meta-pytorch#2152)

Summary:

Refactor CommStateX initialization to consolidate the two-step factory pattern into a single call:
- Change `createCommStateXFromNcclComm` to accept `CtranComm*` instead of `IBootstrap*`, accessing bootstrap internally via `ctranComm->bootstrap_`; return `ncclResult_t` and assign `ctranComm->statex_` directly inside the function
- Fold `initNvlFabricTopologies` into the factory function (moved to unnamed namespace), replacing raw NCCL C `bootstrapAllGather` with the `IBootstrap::allGather` abstraction (matching how `initRankStatesTopology` already works)
- Remove outdated TODO comment about removing ncclx fields from all 3 init.cc versions
- Enrich the CommStateX init log with commDesc, commHash, rank, nRanks, nLocalRanks
- Convert CommStateXTest from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
- DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to `vnode_node_N` and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D101586421
@minsii minsii force-pushed the export-D101586421 branch 2 times, most recently from 321d1c3 to 6173d46 Compare April 27, 2026 20:09
minsii added a commit to minsii/torchcomms-1 that referenced this pull request Apr 27, 2026
…nvert CommStateXTest to distributed (meta-pytorch#2152)

Summary:

Refactor CommStateX initialization to consolidate the two-step factory pattern into a single call:
- Change `createCommStateXFromNcclComm` to accept `CtranComm*` instead of `IBootstrap*`, accessing bootstrap internally via `ctranComm->bootstrap_`; return `ncclResult_t` and assign `ctranComm->statex_` directly inside the function
- Fold `initNvlFabricTopologies` into the factory function (moved to unnamed namespace), replacing raw NCCL C `bootstrapAllGather` with the `IBootstrap::allGather` abstraction (matching how `initRankStatesTopology` already works)
- Remove outdated TODO comment about removing ncclx fields from all 3 init.cc versions
- Enrich the CommStateX init log with commDesc, commHash, rank, nRanks, nLocalRanks
- Convert CommStateXTest from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
- DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to `vnode_node_N` and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D101586421
yulunwang and others added 6 commits May 2, 2026 19:44
Summary:
Remove the old colltrace fallback paths that were guarded by NCCL_COLLTRACE_USE_NEW_COLLTRACE. The new colltrace is now the only path.

Changes:
- CollTraceFunc.cc: collTraceBaselineGetHandle() now always uses the new colltrace path. Delete all old event acquisition/recording functions (collTraceAquireEvent*, collTraceRecordStartEvent, collTraceRecordEndEvent, etc.) that served the old CollTraceEvent pipeline.
- CollTraceFunc.h: Remove declarations for deleted functions. Remove CollTrace.h include.
- Delete CollTraceLegacyHandle.h/.cc (adapter from old events to ICollTraceHandle interface, no longer used).
- ctran CollTraceWrapper: Remove legacyFunc static variable, setCollTraceLegacyHandleFunc(), and the CVAR-guarded fallback in getCollTraceHandle(). Always use getNewCollTraceHandle().
- v2_27/v2_28/v2_29 param.cc: Delete initLegacyColltraceForCtran() and its call_once invocation. Remove includes for CollTraceFunc.h, CollTraceLegacyHandle.h, and ctran CollTraceWrapper.h.
- Delete old colltrace tests: CollTraceDistTest.cc (tested old colltrace with USE_NEW_COLLTRACE=0) and SlowCollReporterUT.cc (tested SlowCollReporter from old CollTrace). Equivalent coverage exists in NewCollTraceDistTest{NoLocal,Local}.cc.
- AllToAllTest.cc: Remove commented-out old colltrace dump code.
- CommWithCtranTest.cc: Remove assertion comparing old collTrace_ field.

Differential Revision: D102560243
Differential Revision: D102736359
…istributed

Summary:
Deduplicate statex gPid/host validation logic repeated across ncclx tests into a shared helper:
- Add `VerifyCommStateXHelper` class in `ctran/tests/VerifyCommStateXUtil.{h,cc}` (namespace `ctran::testing`) with `RankIdentity::local()` factory, `verifyAllHosts()`, and `verifyAllGPids()` methods
- Convert `CommStateXTest` from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
- DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to `vnode_node_N` and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D102739420
…nvert CommStateXTest to distributed (meta-pytorch#2152)

Summary:
Pull Request resolved: meta-pytorch#2152

Refactor CommStateX initialization to consolidate the two-step factory pattern into a single call:
- Change `createCommStateXFromNcclComm` to accept `CtranComm*` instead of `IBootstrap*`, accessing bootstrap internally via `ctranComm->bootstrap_`; return `ncclResult_t` and assign `ctranComm->statex_` directly inside the function
- Fold `initNvlFabricTopologies` into the factory function (moved to unnamed namespace), replacing raw NCCL C `bootstrapAllGather` with the `IBootstrap::allGather` abstraction (matching how `initRankStatesTopology` already works)
- Remove outdated TODO comment about removing ncclx fields from all 3 init.cc versions
- Enrich the CommStateX init log with commDesc, commHash, rank, nRanks, nLocalRanks
- Convert CommStateXTest from single-rank unit test to 1x8 distributed test using real ncclComm, testing default topology, noLocal (via global hint), and vCliqueSize (via per-comm ncclx::Hints config)
- DISABLED_CreateVCliqueSizeFromNcclComm: vCliqueSize uses vnode topology override which sets hostname to `vnode_node_N` and pid to -1, causing gPid collisions. Fixed in follow-up diff.

Differential Revision: D101586421
@minsii minsii force-pushed the export-D101586421 branch from 6173d46 to f65e4a2 Compare May 3, 2026 03:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant