Fix ENABLE_PIPES vs TORCHCOMMS_HAS_NCCL_DEVICE_API gating mismatch by lilyjanjigian · Pull Request #2076 · meta-pytorch/torchcomms

lilyjanjigian · 2026-04-15T00:38:31Z

Summary:
The explicit template instantiation of TorchCommWindowNCCLX<PipesDeviceBackend> was nested inside #ifdef TORCHCOMMS_HAS_NCCL_DEVICE_API, but the type alias (TorchCommWindowNCCLXPipes) and the runtime window creation in new_window() are gated by #if defined(ENABLE_PIPES) only. This creates a mismatch when ENABLE_PIPES=1 but TORCHCOMMS_HAS_NCCL_DEVICE_API is not set (e.g., NCCLX 2.27): the type exists and new_window() tries to instantiate it, but the linker can't find the symbols because the template was never explicitly instantiated.

In internal Buck builds this doesn't currently manifest because both flags are always enabled together — ENABLE_PIPES comes from ctran_lib via torchcomm-device-pipes, which is only a dependency when has_ncclx_device_api() returns true (NCCLX >= 2.28). However, in CMakeLists.txt builds (conda feedstock), ENABLE_PIPES is controlled independently via an environment variable, so it's possible to set ENABLE_PIPES=1 with an NCCLX version that doesn't have device API headers.

Fix: Move the Pipes explicit template instantiation out of the TORCHCOMMS_HAS_NCCL_DEVICE_API guard so it's gated by ENABLE_PIPES only, matching the header and new_window() runtime usage. When device API is unavailable, host-side window operations (put, signal, wait_signal) still work; device API methods (get_device_window, register_local_buffer) fall back to the base class default that throws "not yet supported".

Broken MAST run: h100-6b4b53d83792-260408-1412_perception_v5g_tpp_800_TIER4_-wqtnkwmg

Reviewed By: lilyjanjigian

Differential Revision: D100887961

Summary: Several diffs landed over the past few months that introduced ncclx-only types (ncclWindow_t, ncclx::Hints, NCCL_FAST_INIT_MODE_RING) into torchcomms code without version guards. This broke the build when using hpc_comms.use_nccl=stable (upstream NCCL v2_27), which doesn't define these types. The ~15-20 backend_nccl and backend_gloo tests in TestX that build with this config were all failing at compile time. Fixes: - NcclxApi.hpp: Replace constexpr NCCL_WIN_DEFAULT with #ifndef/#define guard to avoid collision with the macro in nccl.h - TorchCommNCCLXBootstrap.hpp/.cpp: Wrap ncclx::Hints and NCCL_FAST_INIT_MODE_RING usage with #ifdef NCCLX_CONFIG_SUPPORTED, with fallback paths for upstream NCCL - TorchCommNCCLX.cpp: Same ncclx::Hints guard in the split function - TorchCommWindowNCCLX.cpp: Wrap get_attr() body with #ifdef NCCL_RMA_SUPPORTED - DeviceBackendTraits.hpp: Conditional Window type alias (ncclWindow_t vs void*) based on NCCL_RMA_SUPPORTED - PipesDeviceBackend.hpp: Added NcclWin type alias with same conditional - ir_include/nccl.h: Added missing NCCL_RMA_SUPPORTED define to the IR stub header used by the device_window_bitcode genrule Reviewed By: goelayu Differential Revision: D100670686

Summary: The explicit template instantiation of `TorchCommWindowNCCLX<PipesDeviceBackend>` was nested inside `#ifdef TORCHCOMMS_HAS_NCCL_DEVICE_API`, but the type alias (`TorchCommWindowNCCLXPipes`) and the runtime window creation in `new_window()` are gated by `#if defined(ENABLE_PIPES)` only. This creates a mismatch when `ENABLE_PIPES=1` but `TORCHCOMMS_HAS_NCCL_DEVICE_API` is not set (e.g., NCCLX 2.27): the type exists and `new_window()` tries to instantiate it, but the linker can't find the symbols because the template was never explicitly instantiated. In internal Buck builds this doesn't currently manifest because both flags are always enabled together — `ENABLE_PIPES` comes from `ctran_lib` via `torchcomm-device-pipes`, which is only a dependency when `has_ncclx_device_api()` returns true (NCCLX >= 2.28). However, in CMakeLists.txt builds (conda feedstock), `ENABLE_PIPES` is controlled independently via an environment variable, so it's possible to set `ENABLE_PIPES=1` with an NCCLX version that doesn't have device API headers. Fix: Move the Pipes explicit template instantiation out of the `TORCHCOMMS_HAS_NCCL_DEVICE_API` guard so it's gated by `ENABLE_PIPES` only, matching the header and `new_window()` runtime usage. When device API is unavailable, host-side window operations (put, signal, wait_signal) still work; device API methods (get_device_window, register_local_buffer) fall back to the base class default that throws "not yet supported". Broken MAST run: h100-6b4b53d83792-260408-1412_perception_v5g_tpp_800_TIER4_-wqtnkwmg Reviewed By: lilyjanjigian Differential Revision: D100887961

meta-codesync · 2026-04-15T00:38:41Z

@lilyjanjigian has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100887961.

lilyjanjigian and others added 2 commits April 14, 2026 17:38

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 15, 2026

meta-codesync bot added fb-exported meta-exported labels Apr 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix ENABLE_PIPES vs TORCHCOMMS_HAS_NCCL_DEVICE_API gating mismatch#2076

Fix ENABLE_PIPES vs TORCHCOMMS_HAS_NCCL_DEVICE_API gating mismatch#2076
lilyjanjigian wants to merge 2 commits intometa-pytorch:mainfrom
lilyjanjigian:export-D100887961

lilyjanjigian commented Apr 15, 2026

Uh oh!

meta-codesync bot commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

lilyjanjigian commented Apr 15, 2026

Uh oh!

meta-codesync bot commented Apr 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant