Skip to content

Fix ENABLE_PIPES vs TORCHCOMMS_HAS_NCCL_DEVICE_API gating mismatch#2076

Open
lilyjanjigian wants to merge 2 commits intometa-pytorch:mainfrom
lilyjanjigian:export-D100887961
Open

Fix ENABLE_PIPES vs TORCHCOMMS_HAS_NCCL_DEVICE_API gating mismatch#2076
lilyjanjigian wants to merge 2 commits intometa-pytorch:mainfrom
lilyjanjigian:export-D100887961

Conversation

@lilyjanjigian
Copy link
Copy Markdown
Contributor

Summary:
The explicit template instantiation of TorchCommWindowNCCLX<PipesDeviceBackend> was nested inside #ifdef TORCHCOMMS_HAS_NCCL_DEVICE_API, but the type alias (TorchCommWindowNCCLXPipes) and the runtime window creation in new_window() are gated by #if defined(ENABLE_PIPES) only. This creates a mismatch when ENABLE_PIPES=1 but TORCHCOMMS_HAS_NCCL_DEVICE_API is not set (e.g., NCCLX 2.27): the type exists and new_window() tries to instantiate it, but the linker can't find the symbols because the template was never explicitly instantiated.

In internal Buck builds this doesn't currently manifest because both flags are always enabled together — ENABLE_PIPES comes from ctran_lib via torchcomm-device-pipes, which is only a dependency when has_ncclx_device_api() returns true (NCCLX >= 2.28). However, in CMakeLists.txt builds (conda feedstock), ENABLE_PIPES is controlled independently via an environment variable, so it's possible to set ENABLE_PIPES=1 with an NCCLX version that doesn't have device API headers.

Fix: Move the Pipes explicit template instantiation out of the TORCHCOMMS_HAS_NCCL_DEVICE_API guard so it's gated by ENABLE_PIPES only, matching the header and new_window() runtime usage. When device API is unavailable, host-side window operations (put, signal, wait_signal) still work; device API methods (get_device_window, register_local_buffer) fall back to the base class default that throws "not yet supported".

Broken MAST run: h100-6b4b53d83792-260408-1412_perception_v5g_tpp_800_TIER4_-wqtnkwmg

Reviewed By: lilyjanjigian

Differential Revision: D100887961

lilyjanjigian and others added 2 commits April 14, 2026 17:38
Summary:

Several diffs landed over the past few months that introduced ncclx-only types (ncclWindow_t, ncclx::Hints, NCCL_FAST_INIT_MODE_RING) into torchcomms code without version guards. This broke the build when using hpc_comms.use_nccl=stable (upstream NCCL v2_27), which doesn't define these types. The ~15-20 backend_nccl and backend_gloo tests in TestX that build with this config were all failing at compile time.

Fixes:
  - NcclxApi.hpp: Replace constexpr NCCL_WIN_DEFAULT with #ifndef/#define guard to avoid collision with the macro in nccl.h
  - TorchCommNCCLXBootstrap.hpp/.cpp: Wrap ncclx::Hints and NCCL_FAST_INIT_MODE_RING usage with #ifdef NCCLX_CONFIG_SUPPORTED, with fallback paths for upstream NCCL
  - TorchCommNCCLX.cpp: Same ncclx::Hints guard in the split function
  - TorchCommWindowNCCLX.cpp: Wrap get_attr() body with #ifdef NCCL_RMA_SUPPORTED
  - DeviceBackendTraits.hpp: Conditional Window type alias (ncclWindow_t vs void*) based on NCCL_RMA_SUPPORTED
  - PipesDeviceBackend.hpp: Added NcclWin type alias with same conditional
  - ir_include/nccl.h: Added missing NCCL_RMA_SUPPORTED define to the IR stub header used by the device_window_bitcode genrule

Reviewed By: goelayu

Differential Revision: D100670686
Summary:
The explicit template instantiation of `TorchCommWindowNCCLX<PipesDeviceBackend>` was nested inside `#ifdef TORCHCOMMS_HAS_NCCL_DEVICE_API`, but the type alias (`TorchCommWindowNCCLXPipes`) and the runtime window creation in `new_window()` are gated by `#if defined(ENABLE_PIPES)` only. This creates a mismatch when `ENABLE_PIPES=1` but `TORCHCOMMS_HAS_NCCL_DEVICE_API` is not set (e.g., NCCLX 2.27): the type exists and `new_window()` tries to instantiate it, but the linker can't find the symbols because the template was never explicitly instantiated.

In internal Buck builds this doesn't currently manifest because both flags are always enabled together — `ENABLE_PIPES` comes from `ctran_lib` via `torchcomm-device-pipes`, which is only a dependency when `has_ncclx_device_api()` returns true (NCCLX >= 2.28). However, in CMakeLists.txt builds (conda feedstock), `ENABLE_PIPES` is controlled independently via an environment variable, so it's possible to set `ENABLE_PIPES=1` with an NCCLX version that doesn't have device API headers.

Fix: Move the Pipes explicit template instantiation out of the `TORCHCOMMS_HAS_NCCL_DEVICE_API` guard so it's gated by `ENABLE_PIPES` only, matching the header and `new_window()` runtime usage. When device API is unavailable, host-side window operations (put, signal, wait_signal) still work; device API methods (get_device_window, register_local_buffer) fall back to the base class default that throws "not yet supported".

Broken MAST run: h100-6b4b53d83792-260408-1412_perception_v5g_tpp_800_TIER4_-wqtnkwmg

Reviewed By: lilyjanjigian

Differential Revision: D100887961
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 15, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync bot commented Apr 15, 2026

@lilyjanjigian has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100887961.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant