Add isInitialized() API to TorchCommBackend and update TorchCommMCCL::reconfigure initialization state (#2071)
Open

Scusemua wants to merge 1 commit into meta-pytorch:main from a branch on Scusemua/torchcomms
Conversation
Contributor

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100673963.
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request on Apr 14, 2026

…:reconfigure initialization state (meta-pytorch#2071)

Summary: Add `isInitialized()` virtual method to `TorchCommBackend` so callers can check whether a backend is ready for collective operations before issuing them. The MCCL implementation now correctly ties initialization state to a successful `reconfigure()` result rather than unconditionally marking itself initialized, preventing use of a backend whose reconfigure failed.

Specifically, we now set `initState_ = UNINITIALIZED` before calling `mccl_comm_->reconfigure()`, then only promote to `INITIALIZED` inside the existing success block (where `result->code == commSuccess`) alongside the `rank_` and `commSize_` updates. Previously `initState_ = INITIALIZED` was set unconditionally after `createWork()`, which allowed subsequent operations to proceed on a broken communicator after a failed reconfigure.

Differential Revision: D100673963
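The summary above maps to a small state-machine pattern. Below is a minimal C++ sketch of that flow, not the actual torchcomms code: `initState_`, `rank_`, `commSize_`, and the `result->code == commSuccess` check come from the summary, while the enum values, `ReconfigureResult` type, `doReconfigure()` hook, and `ToyBackend` are hypothetical stand-ins.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory>

enum class InitState { UNINITIALIZED, INITIALIZED };

constexpr int commSuccess = 0;  // assumed success code

struct ReconfigureResult {
  int code;  // commSuccess on success
  int rank;
  std::size_t commSize;
};

class TorchCommBackendSketch {
 public:
  virtual ~TorchCommBackendSketch() = default;

  // The new query from this PR: is the backend ready for collectives?
  virtual bool isInitialized() const {
    return initState_ == InitState::INITIALIZED;
  }

  void reconfigure() {
    // Drop to UNINITIALIZED *before* the call, so a failed reconfigure
    // leaves the backend unusable rather than silently "initialized".
    initState_ = InitState::UNINITIALIZED;
    auto result = doReconfigure();  // stands in for mccl_comm_->reconfigure()
    if (result->code == commSuccess) {
      // Promote to INITIALIZED only on success, alongside rank_/commSize_.
      rank_ = result->rank;
      commSize_ = result->commSize;
      initState_ = InitState::INITIALIZED;
    }
  }

 protected:
  virtual std::unique_ptr<ReconfigureResult> doReconfigure() = 0;

  InitState initState_ = InitState::UNINITIALIZED;
  int rank_ = 0;
  std::size_t commSize_ = 0;
};

// Toy concrete backend so the sketch runs end to end.
class ToyBackend : public TorchCommBackendSketch {
 protected:
  std::unique_ptr<ReconfigureResult> doReconfigure() override {
    return std::make_unique<ReconfigureResult>(
        ReconfigureResult{commSuccess, /*rank=*/0, /*commSize=*/4});
  }
};

int main() {
  ToyBackend backend;
  backend.reconfigure();
  std::printf("initialized: %s\n", backend.isInitialized() ? "yes" : "no");
  return 0;
}
```

If `doReconfigure()` instead returned a failing code, the backend would stay `UNINITIALIZED` and `isInitialized()` would report false, which is exactly the behavior the fix describes.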
Force-pushed from 231c3d6 to 591b242 (Compare)
Force-pushed from 591b242 to 8bc6040 (Compare)
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request on Apr 15, 2026

…:reconfigure initialization state (meta-pytorch#2071)

Summary: Add `isInitialized()` virtual method to `TorchCommBackend` so callers can check whether a backend is ready for collective operations before issuing them. The MCCL implementation now correctly ties initialization state to a successful `reconfigure()` result rather than unconditionally marking itself initialized, preventing use of a backend whose reconfigure failed.

Specifically, we now set `initState_ = UNINITIALIZED` before calling `mccl_comm_->reconfigure()`, then only promote to `INITIALIZED` inside the existing success block (where `result->code == commSuccess`) alongside the `rank_` and `commSize_` updates. Previously `initState_ = INITIALIZED` was set unconditionally after `createWork()`, which allowed subsequent operations to proceed on a broken communicator after a failed reconfigure.

**We also** reset `commSize_` to 0 instead of -1, and we move the reset of `commSize_` and `rank_` to sit alongside `initState_ = UNINITIALIZED`. For one thing, -1 (as an int) silently becomes `SIZE_MAX` when cast to `size_t`, which already caused one production SEV. And if `commSize_` ever leaks past these guards, 0 is harmless while -1 is catastrophic (it can crash the workload/application).

**Concession:** 0 is a softer sentinel insofar as it is a valid communicator size, so you cannot distinguish "uninitialized" from "genuinely zero participants" by looking at `commSize_` alone; check `initState_` for that instead.

Differential Revision: D100673963
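The -1 versus 0 sentinel argument in this revision comes down to C++ integer conversion rules. The standalone program below (not torchcomms code) demonstrates the hazard the summary describes: a stale -1 wraps to `SIZE_MAX` the moment it is converted to `size_t`, while a stale 0 merely makes size-driven loops and allocations no-ops.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  int staleSentinel = -1;  // the old reset value for a size-like member
  // Unsigned conversion is modular, so -1 wraps to the largest size_t value.
  std::size_t n = static_cast<std::size_t>(staleSentinel);
  std::printf("-1 as size_t: %zu (SIZE_MAX = %zu)\n", n, SIZE_MAX);
  // Any code that sizes a buffer or loop off the stale value, e.g.
  // std::vector<float> buf(n), would try to allocate ~SIZE_MAX elements.

  int safeSentinel = 0;  // the new reset value
  std::size_t m = static_cast<std::size_t>(safeSentinel);
  std::printf(" 0 as size_t: %zu\n", m);  // loops/allocations become no-ops
  return 0;
}
```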
Summary:

Add `isInitialized()` virtual method to `TorchCommBackend` so callers can check whether a backend is ready for collective operations before issuing them. The MCCL implementation now correctly ties initialization state to a successful `reconfigure()` result rather than unconditionally marking itself initialized, preventing use of a backend whose reconfigure failed.

Specifically, we now set `initState_ = UNINITIALIZED` before calling `mccl_comm_->reconfigure()`, then only promote to `INITIALIZED` inside the existing success block (where `result->code == commSuccess`) alongside the `rank_` and `commSize_` updates. Previously `initState_ = INITIALIZED` was set unconditionally after `createWork()`, which allowed subsequent operations to proceed on a broken communicator after a failed reconfigure.

Differential Revision: D100673963
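As a usage note, the point of exposing `isInitialized()` is that callers can fail fast rather than issue collectives on a broken communicator. A hypothetical caller-side guard might look like the sketch below; the `Backend` interface and `issueCollective` helper are illustrative stand-ins, not the real torchcomms API.

```cpp
#include <stdexcept>

// Minimal stand-in for the backend interface; the real TorchCommBackend
// exposes far more than this single query.
struct Backend {
  virtual ~Backend() = default;
  virtual bool isInitialized() const = 0;
};

// Hypothetical caller-side guard: after a failed reconfigure the backend
// stays uninitialized, so refuse early instead of crashing mid-collective.
void issueCollective(Backend& backend) {
  if (!backend.isInitialized()) {
    throw std::runtime_error("backend not initialized (reconfigure failed?)");
  }
  // ... safe to issue collective operations (allreduce, broadcast, ...) ...
}
```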