Fix ENABLE_PIPES vs TORCHCOMMS_HAS_NCCL_DEVICE_API gating mismatch #2076
Open
lilyjanjigian wants to merge 2 commits into meta-pytorch:main
Summary: Several diffs landed over the past few months that introduced ncclx-only types (ncclWindow_t, ncclx::Hints, NCCL_FAST_INIT_MODE_RING) into torchcomms code without version guards. This broke the build when using hpc_comms.use_nccl=stable (upstream NCCL v2_27), which doesn't define these types. The ~15-20 backend_nccl and backend_gloo tests in TestX that build with this config were all failing at compile time.

Fixes:
- NcclxApi.hpp: Replace constexpr NCCL_WIN_DEFAULT with #ifndef/#define guard to avoid collision with the macro in nccl.h
- TorchCommNCCLXBootstrap.hpp/.cpp: Wrap ncclx::Hints and NCCL_FAST_INIT_MODE_RING usage with #ifdef NCCLX_CONFIG_SUPPORTED, with fallback paths for upstream NCCL
- TorchCommNCCLX.cpp: Same ncclx::Hints guard in the split function
- TorchCommWindowNCCLX.cpp: Wrap get_attr() body with #ifdef NCCL_RMA_SUPPORTED
- DeviceBackendTraits.hpp: Conditional Window type alias (ncclWindow_t vs void*) based on NCCL_RMA_SUPPORTED
- PipesDeviceBackend.hpp: Added NcclWin type alias with same conditional
- ir_include/nccl.h: Added missing NCCL_RMA_SUPPORTED define to the IR stub header used by the device_window_bitcode genrule

Reviewed By: goelayu

Differential Revision: D100670686
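The guard patterns listed in the fixes above can be illustrated with a minimal sketch. This is not the torchcomms code: `DEMO_RMA_SUPPORTED`, `DEMO_WIN_DEFAULT`, `DemoWindow`, and both helper functions are hypothetical stand-ins for the real macros and types, chosen only to show a conditional type alias and an `#ifndef`/`#define` guard compiling cleanly when the feature macro is absent.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for a feature macro that a vendor header would define.
// It is left undefined here to simulate a build against a library without RMA.
// #define DEMO_RMA_SUPPORTED 1

#ifdef DEMO_RMA_SUPPORTED
struct DemoWindowImpl {};           // stand-in for a vendor-only handle type
using DemoWindow = DemoWindowImpl*;
#else
using DemoWindow = void*;           // opaque fallback keeps signatures compiling
#endif

// Define the default only if no included header already provides it as a macro.
// A constexpr variable with the same name would collide with such a macro.
#ifndef DEMO_WIN_DEFAULT
#define DEMO_WIN_DEFAULT 0
#endif

// Reports which branch of the guard was compiled in.
inline std::string describe_window_support() {
#ifdef DEMO_RMA_SUPPORTED
  return "rma";
#else
  return "fallback";
#endif
}

inline int default_window_flags() { return DEMO_WIN_DEFAULT; }
```

With the feature macro undefined, `DemoWindow` resolves to `void*` and callers that only pass the handle around need no changes; code that dereferences the handle still has to sit behind the same guard.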
Summary: The explicit template instantiation of `TorchCommWindowNCCLX<PipesDeviceBackend>` was nested inside `#ifdef TORCHCOMMS_HAS_NCCL_DEVICE_API`, but the type alias (`TorchCommWindowNCCLXPipes`) and the runtime window creation in `new_window()` are gated by `#if defined(ENABLE_PIPES)` only. This creates a mismatch when `ENABLE_PIPES=1` but `TORCHCOMMS_HAS_NCCL_DEVICE_API` is not set (e.g., NCCLX 2.27): the type exists and `new_window()` tries to instantiate it, but the linker can't find the symbols because the template was never explicitly instantiated.

In internal Buck builds this doesn't currently manifest because both flags are always enabled together: `ENABLE_PIPES` comes from `ctran_lib` via `torchcomm-device-pipes`, which is only a dependency when `has_ncclx_device_api()` returns true (NCCLX >= 2.28). However, in CMakeLists.txt builds (conda feedstock), `ENABLE_PIPES` is controlled independently via an environment variable, so it's possible to set `ENABLE_PIPES=1` with an NCCLX version that doesn't have device API headers.

Fix: Move the Pipes explicit template instantiation out of the `TORCHCOMMS_HAS_NCCL_DEVICE_API` guard so it's gated by `ENABLE_PIPES` only, matching the header and `new_window()` runtime usage. When device API is unavailable, host-side window operations (put, signal, wait_signal) still work; device API methods (get_device_window, register_local_buffer) fall back to the base class default that throws "not yet supported".

Broken MAST run: h100-6b4b53d83792-260408-1412_perception_v5g_tpp_800_TIER4_-wqtnkwmg

Reviewed By: lilyjanjigian

Differential Revision: D100887961
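The gating rule described above can be sketched in a single file. All names here (`DEMO_ENABLE_PIPES`, `DemoWindowBase`, `DemoWindow`, `DemoPipesBackend`, `new_demo_window`) are hypothetical stand-ins, not the torchcomms classes. The point is only that the explicit instantiation sits under the same guard as the alias and the factory that uses it, and that device-only methods fall through to a throwing base-class default. In a real multi-file layout the member definitions would live in a `.cpp` file, which is where a missing explicit instantiation turns into an undefined-symbol link error; a single translation unit like this one cannot reproduce that failure.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

#define DEMO_ENABLE_PIPES 1  // hypothetical stand-in for ENABLE_PIPES
// A second macro (stand-in for the device API flag) is deliberately undefined,
// simulating an older library version without device API headers.

struct DemoWindowBase {
  virtual ~DemoWindowBase() = default;
  virtual std::string put() = 0;  // host-side operation
  // Base-class default for device-only methods when no device API exists.
  virtual std::string get_device_window() {
    throw std::runtime_error("not yet supported");
  }
};

struct DemoPipesBackend {
  static const char* name() { return "pipes"; }
};

template <typename Backend>
struct DemoWindow : DemoWindowBase {
  std::string put() override;
};

// Out-of-class member definition (would live in the .cpp in a split layout).
template <typename Backend>
std::string DemoWindow<Backend>::put() {
  return std::string("put via ") + Backend::name();
}

#if defined(DEMO_ENABLE_PIPES)
using DemoWindowPipes = DemoWindow<DemoPipesBackend>;
// Explicit instantiation under the same guard as the alias and the factory.
template struct DemoWindow<DemoPipesBackend>;
#endif

inline std::unique_ptr<DemoWindowBase> new_demo_window() {
#if defined(DEMO_ENABLE_PIPES)
  return std::unique_ptr<DemoWindowBase>(new DemoWindowPipes());
#else
  return nullptr;
#endif
}
```

Calling `put()` on the returned window succeeds, while `get_device_window()` throws `std::runtime_error`, mirroring the host-side versus device-side split in the summary.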
@lilyjanjigian has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100887961.