
Add reconfigurable support to match MCCL behavior (#2077)

Open
d4l3k wants to merge 1 commit into meta-pytorch:main from d4l3k:export-D100904495



d4l3k (Member) commented Apr 15, 2026

Summary:

Implements fault-tolerant reconfiguration for the Gloo backend by adding store-based atomic rank assignment during `reconfigure()`. Each process now obtains a unique rank via `rankStore->add("rank_counter", 1) - 1`, scoped to a per-reconfigure UUID prefix, matching MCCL's dynamic regime behavior. The implementation defers context initialization when `enable_reconfigure` is set and resets state during reconfiguration.
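For readers unfamiliar with the pattern, here is a minimal sketch of counter-based rank assignment under a per-reconfigure prefix. It is not the code from this diff: `InMemoryStore`, `PrefixedStore`, `claimRank`, and `assignRanks` are hypothetical stand-ins, the prefix strings are placeholders for the real UUID, and the loop simulates participants sequentially rather than as separate processes. Only the `add("rank_counter", 1) - 1` expression is taken from the summary above.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory stand-in for a key-value store with an atomic add().
class InMemoryStore {
 public:
  // Atomically adds `delta` to the counter at `key` and returns the new value.
  // A missing key starts at 0.
  std::int64_t add(const std::string& key, std::int64_t delta) {
    std::lock_guard<std::mutex> lock(mu_);
    return counters_[key] += delta;
  }

 private:
  std::mutex mu_;
  std::unordered_map<std::string, std::int64_t> counters_;
};

// Hypothetical view of a store that prepends a prefix to every key, so each
// reconfigure round gets its own independent counter.
class PrefixedStore {
 public:
  PrefixedStore(InMemoryStore& base, std::string prefix)
      : base_(base), prefix_(std::move(prefix)) {}

  std::int64_t add(const std::string& key, std::int64_t delta) {
    return base_.add(prefix_ + "/" + key, delta);
  }

 private:
  InMemoryStore& base_;
  std::string prefix_;
};

// Claims the next free rank: add() returns the post-increment value, so
// subtracting 1 yields a zero-based rank that no other caller receives.
std::int64_t claimRank(PrefixedStore& rankStore) {
  return rankStore.add("rank_counter", 1) - 1;
}

// Simulates `worldSize` participants joining one reconfigure round, one after
// another. Real participants would be separate processes sharing the store.
std::vector<std::int64_t> assignRanks(InMemoryStore& store,
                                      const std::string& reconfigureId,
                                      int worldSize) {
  assert(worldSize >= 0);
  PrefixedStore rankStore(store, reconfigureId);
  std::vector<std::int64_t> ranks;
  for (int i = 0; i < worldSize; ++i) {
    ranks.push_back(claimRank(rankStore));
  }
  return ranks;
}
```

In this sketch, `assignRanks(store, "reconfigure-a", 4)` yields ranks 0 through 3, and a later call with a different prefix starts again from 0 because the counter key is scoped to that prefix. Ranks follow arrival order at the store, so a process may receive a different rank in each round.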

Adds comprehensive test coverage including basic reconfigure, unordered handle sets, send/recv after reconfigure, and allreduce verification.


AI generated Summary & Test Plan from DEV175861935

Reviewed By: dolpm

Differential Revision: D100904495

meta-cla bot added the CLA Signed label Apr 15, 2026
meta-codesync bot (Contributor) commented Apr 15, 2026

@d4l3k has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100904495.

meta-codesync bot changed the title from "Add reconfigurable support to match MCCL behavior" to "Add reconfigurable support to match MCCL behavior (#2077)" Apr 15, 2026
d4l3k added a commit to d4l3k/torchcomms-1 that referenced this pull request Apr 15, 2026
d4l3k force-pushed the export-D100904495 branch from ec099d9 to 039db9b April 15, 2026 03:42
d4l3k force-pushed the export-D100904495 branch from 039db9b to bfcabfe April 15, 2026 03:42

Labels

CLA Signed, fb-exported, meta-exported
