Revert gPid rank suffix and fix nolocal/vnode topology host assignment by Regina8023 · Pull Request #2074 · meta-pytorch/torchcomms

Regina8023 · 2026-04-14T22:33:19Z

Summary:

Summary

Reverts D92879823 which appended :rank to gPid() format (hostname:pid:rank). This contradicts the purpose of the IPC import cache (ipcRemRegMap_), which should be keyed by process identity (hostname:pid) since CUDA IPC handles are per-process, not per-rank. Using per-rank keys creates duplicate cache entries for ranks in the same process and causes key mismatches between ipcRemRegMap_ (keyed by gPid) and remImportMap_ (keyed by localPeerId_ which is hostname:pid).

D92879823 was a workaround for MCCL tests using NOLOCAL topology mode, where initRankTopologyNolocal() never assigned rankState.host (stayed "") or rankState.pid (stayed -1), causing all ranks to have identical gPid = ":-1". The proper fix is to assign the synthetic hostname to rankState.host so that gPid() naturally returns unique values without needing the rank suffix.

Changes

CommStateX.cc:

initRankTopologyNolocal(): Assign rankState.host = nolocalHost ("nolocal_node_<r>") so gPid() returns "nolocal_node_0:-1", "nolocal_node_1:-1", etc. — unique per rank without the :rank suffix.
initRankTopologyVnode(): Same fix — assign rankState.host = vNodeHost ("vnode_node_<nodeId>") for consistency.
gPid(): Revert to hostname:pid format (remove :rank suffix), restoring alignment with localPeerId_ and IPC cache key semantics.

Why

IPC handles are per-process: cudaIpcOpenMemHandle returns the same mapped base regardless of which rank in the process calls it. hostname:pid is the natural process identifier.
Deduplication: With hostname:pid, same-process ranks share one cache entry with proper refcounting. With hostname:pid:rank, they create duplicate entries.

Reviewed By: elvinlife

Differential Revision: D100055198

…gmentMem test (meta-pytorch#2073) Summary: **Summary:** Fix `ExportMultiSegmentMem` test and `AsyncSocket` variable-length message receive bug. 1. `CtranMapper::regMem` has a `FB_CHECKABORT(segments.size() == 1)` that fatally aborts when called with a buffer backed by multiple physical segments (e.g., `kCuMemAllocDisjoint`). This test allocates disjoint memory with `CTRAN_IPC_INLINE_SEGMENTS + 1` physical segments, which always triggers the abort. The fix replaces `mapper->regMem`/`deregMem` with `RegCache::globalRegister`/`globalDeregister`, which natively support multi-segment buffers. 2. `AsyncSocket.cc`: Fix IOBuf under-reservation in computeTotalSize(). `IOBuf::reserve(minHeadroom, minTailroom)` sets an absolute minimum tailroom, not a delta. When `IOBuf::create(sizeof(IpcReq))` rounds capacity up (e.g., jemalloc gives 512 for a 500-byte request), partial tailroom remains after reading the header. The old code reserve(0, needed - tailroom) requested too few bytes — after the next read consumed the available tailroom, 0 bytes remained for subsequent reads, causing folly to error with getReadBuffer() returned empty buffer. Fix: `reserve(0, needed)` to guarantee enough tailroom for the full remaining payload. 3. Also renames the BUCK target from `regcache_init_none` to `regcache_dist_ut_init_none` to match the intended target name. Reviewed By: dsjohns2 Differential Revision: D100862851

Summary: ### Summary Reverts [D92879823](https://www.internalfb.com/diff/D92879823) which appended `:rank` to `gPid()` format (`hostname:pid:rank`). This contradicts the purpose of the IPC import cache (`ipcRemRegMap_`), which should be keyed by process identity (`hostname:pid`) since CUDA IPC handles are per-process, not per-rank. Using per-rank keys creates duplicate cache entries for ranks in the same process and causes key mismatches between `ipcRemRegMap_` (keyed by `gPid`) and `remImportMap_` (keyed by `localPeerId_` which is `hostname:pid`). D92879823 was a workaround for MCCL tests using NOLOCAL topology mode, where `initRankTopologyNolocal()` never assigned `rankState.host` (stayed `""`) or `rankState.pid` (stayed `-1`), causing all ranks to have identical `gPid = ":-1"`. The proper fix is to assign the synthetic hostname to `rankState.host` so that `gPid()` naturally returns unique values without needing the rank suffix. ### Changes **`CommStateX.cc`**: - `initRankTopologyNolocal()`: Assign `rankState.host = nolocalHost` (`"nolocal_node_<r>"`) so `gPid()` returns `"nolocal_node_0:-1"`, `"nolocal_node_1:-1"`, etc. — unique per rank without the `:rank` suffix. - `initRankTopologyVnode()`: Same fix — assign `rankState.host = vNodeHost` (`"vnode_node_<nodeId>"`) for consistency. - `gPid()`: Revert to `hostname:pid` format (remove `:rank` suffix), restoring alignment with `localPeerId_` and IPC cache key semantics. ### Why - **IPC handles are per-process**: `cudaIpcOpenMemHandle` returns the same mapped base regardless of which rank in the process calls it. `hostname:pid` is the natural process identifier. - **Deduplication**: With `hostname:pid`, same-process ranks share one cache entry with proper refcounting. With `hostname:pid:rank`, they create duplicate entries. Reviewed By: elvinlife Differential Revision: D100055198

meta-codesync · 2026-04-14T22:33:29Z

@Regina8023 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100055198.

meta-codesync · 2026-04-15T16:53:36Z

This pull request has been merged in f6779a6.

Regina8023 added 2 commits April 14, 2026 15:33

pytorch-bot bot added the ci-no-td label Apr 14, 2026

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 14, 2026

meta-codesync bot added fb-exported meta-exported labels Apr 14, 2026

meta-codesync bot closed this in f6779a6 Apr 15, 2026

facebook-github-tools bot added the Merged label Apr 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revert gPid rank suffix and fix nolocal/vnode topology host assignment#2074

Revert gPid rank suffix and fix nolocal/vnode topology host assignment#2074
Regina8023 wants to merge 2 commits intometa-pytorch:mainfrom
Regina8023:export-D100055198

Regina8023 commented Apr 14, 2026

Uh oh!

meta-codesync bot commented Apr 14, 2026

Uh oh!

meta-codesync bot commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Regina8023 commented Apr 14, 2026

Summary

Changes

Why

Uh oh!

meta-codesync bot commented Apr 14, 2026

Uh oh!

meta-codesync bot commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant