[Bugfix] Resolve Rank index out of range during BWD when sp_size < world_size in Ulysses#7809

Merged
tohtana merged 4 commits into deepspeedai:master from Flink-ddd:fix/issue-7672-ulysses-sp-backward-stability
Jan 28, 2026

Conversation

@Flink-ddd
Contributor

@Flink-ddd Flink-ddd commented Jan 23, 2026

Description

This PR addresses Issue #7672.

When sequence_parallel_size is smaller than world_size (e.g., sp_size=2 on 4 GPUs) with PyTorch < 2.3, using torch.distributed.nn.functional.all_gather for loss aggregation triggers an IndexError: tuple index out of range during the backward pass. This is due to a known PyTorch issue where the backward hook accesses the global rank instead of the group rank.
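
For context, here is a minimal sketch of the failing usage pattern (the helper name, loss tensor, and `sp_group` are illustrative assumptions, not code from DeepSpeed or from the issue reporter's script):

```python
# Illustrative sketch only: aggregating a per-rank loss across the
# sequence-parallel group with the autograd-aware all_gather.
import torch
from torch.distributed.nn.functional import all_gather

def aggregate_loss_via_all_gather(loss: torch.Tensor, sp_group) -> torch.Tensor:
    # all_gather returns one tensor per rank in sp_group and supports autograd.
    gathered = all_gather(loss, group=sp_group)
    # On PyTorch < 2.3 the backward of this op indexes the gathered results
    # with the global rank instead of the rank within sp_group, so with
    # sp_size=2 on 4 GPUs ranks 2 and 3 hit
    # "IndexError: tuple index out of range" during loss.backward().
    return torch.stack(gathered).mean()
```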

Solution

  1. Regression Test & Workaround: Updated the regression test TestUlyssesLossBackward to implement a Weighted All-Reduce pattern (see the sketch after this list).
  • Before: all_gather -> manual sum (vulnerable to rank-indexing mismatch on older PyTorch).
  • After: all_reduce(weighted_loss) / all_reduce(total_weight) (robust and supports weighted averaging).
  2. Runtime Warning: Added a version check (required_torch_version) in DeepSpeedEngine. It now logs a warning if Sequence Parallelism is enabled on PyTorch < 2.3, providing a link to the workaround test case.
  3. Documentation: Updated ulysses-alst-sequence-parallelism.md with a note regarding legacy PyTorch versions and the recommended workaround.
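
A minimal sketch of the weighted all-reduce pattern from item 1; the function name, `num_tokens`, `sp_group`, and the use of the autograd-aware collective are assumptions for illustration, not the exact code in the regression test:

```python
# Sketch only; see tests/unit/sequence_parallelism/test_ulysses.py for the
# actual regression test.
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce

def weighted_average_loss(loss: torch.Tensor, num_tokens: int, sp_group) -> torch.Tensor:
    # Weight the local loss by this rank's token count.
    weight = torch.tensor(float(num_tokens), device=loss.device)
    # The autograd-aware all_reduce just all-reduces the gradient in its
    # backward pass; it never indexes results by rank, so it is safe when
    # sp_size < world_size even on PyTorch < 2.3.
    total_loss = all_reduce(loss * weight, op=dist.ReduceOp.SUM, group=sp_group)
    total_weight = all_reduce(weight, op=dist.ReduceOp.SUM, group=sp_group)
    return total_loss / total_weight
```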

Verification

Added and verified the regression test tests/unit/sequence_parallelism/test_ulysses.py which now validates the weighted averaging logic.

1. Reproduction (Before Fix)
Confirmed IndexError crash on Rank 2/3 with sp_size=2 on a 4-GPU setup.
[Screenshot 2026-01-23 at 23 53 42]

2. Verification (After Fix)
Verified the fix using the regression test logic on 4x RTX A6000. The backward pass now completes successfully on all ranks without error.
[Screenshot 2026-01-23 at 23 52 54]

Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd Flink-ddd force-pushed the fix/issue-7672-ulysses-sp-backward-stability branch from 2b386ab to 4dc7846 on January 23, 2026 15:58
@tohtana
Collaborator

tohtana commented Jan 23, 2026

@Flink-ddd Thank you for opening this PR! I only see changes in tests. Did you miss committing some changes?

@Flink-ddd
Contributor Author

Hi @tohtana, thanks for the review. There are no missing commits. The issue reported in #7672 stems from using all_gather for loss aggregation in the user's training loop, rather than from a bug within DeepSpeed's internal runtime.

Since we cannot patch user scripts directly, I submitted this regression test to:

  1. Verify that the correct approach (using all_reduce) works stably when sp_size < world_size.
  2. Prevent future regressions or confusion regarding this usage pattern.

But please let me know if you have another option in mind.

@tohtana
Collaborator

tohtana commented Jan 24, 2026

Thank you for your clarification, @Flink-ddd! It looks like a bug in PyTorch. In AllGather’s backward pass, we should use the rank within the given process group. It appears this was fixed in v2.3.

  • v2.3: rank = dist.get_rank(group=ctx.group)
  • v2.2: rank = dist.get_rank()

As you said, we can’t force client code to implement loss calculation in a particular way. So I’m wondering whether we should simply add an assertion to check the PyTorch version when SP is enabled. We could also note that SP requires v2.3 or later in the document, even though the DeepSpeed code itself doesn’t have an issue with older versions.
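
For illustration, a hypothetical sketch of such a version guard (the merged change uses DeepSpeed's required_torch_version helper and logs a warning rather than asserting; the function name and message below are assumptions):

```python
# Hypothetical sketch of a PyTorch-version guard for sequence parallelism.
import logging
import torch

def check_sp_torch_version(sequence_parallel_size: int) -> None:
    # Parse the major/minor version, ignoring local build suffixes like "+cu121".
    major, minor = (int(v) for v in torch.__version__.split("+")[0].split(".")[:2])
    if sequence_parallel_size > 1 and (major, minor) < (2, 3):
        logging.warning(
            "Sequence parallelism with torch.distributed.nn.functional.all_gather "
            "for loss aggregation can fail in backward on PyTorch < 2.3 when "
            "sp_size < world_size; prefer a weighted all_reduce for loss averaging."
        )
```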

It would still be good to add a regression test. One concern is that the all-reduce approach can’t implement weighted loss averaging, which is used in the original example.

What are your thoughts?

@Flink-ddd
Contributor Author

Flink-ddd commented Jan 25, 2026

Hi @tohtana Thanks for the suggestion. I agree that simulating the weighted averaging pattern is better for real-world scenarios. I will update the test case to implement the weighted all-reduce pattern (reducing both the weighted loss and total weights separately) to address this.

@tohtana
Collaborator

tohtana commented Jan 25, 2026

Hi @Flink-ddd
Do you think we should support SP with v2.2 or older?

@Flink-ddd
Contributor Author

Hi @tohtana, yes, I believe we should. Many production environments and clusters are still pinned to PyTorch v2.1 or v2.2 due to CUDA driver constraints or stability requirements, so maintaining support for SP on these older versions adds significant value to DeepSpeed's compatibility. This regression test ensures that we continue to support these users stably. However, it depends on your perspective. If you think it's unnecessary, we can instead assert that the PyTorch version is >= 2.3 and close this PR.

@tohtana
Collaborator

tohtana commented Jan 25, 2026

Okay, then let's keep supporting the older versions. But adding a regression test alone doesn't prevent the strange error from confusing users. How about the following?

  • Update your new regression test for weighted loss averaging
  • Add a note in the tutorial about the allgather issue and a link to your new regression test (or the code snippet of your solution)
  • Show a warning message during initialization when SP is enabled, telling users that all-gather might cause an issue, and include a link. We can also link to the new test case (or the relevant section of the tutorial).

Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd
Contributor Author

Hi @tohtana, thanks for your advice. That sounds like a solid plan. I agree that adding a warning and updating the documentation will greatly improve the user experience for those on older PyTorch versions. I have already pushed these updates.

Collaborator

@tohtana tohtana left a comment


Thank you for the update! I left one more comment. Once it is fixed and tests pass, let's merge this.

Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd
Contributor Author

Hi @tohtana, this is ready for review again. Thank you for your suggestions and help.

@tohtana tohtana enabled auto-merge (squash) January 28, 2026 07:19
@Flink-ddd
Contributor Author

Hi @tohtana, all CI tests pass now. Could you please approve it when you have a chance? Thanks.

@Flink-ddd Flink-ddd requested a review from tohtana January 28, 2026 10:25
@tohtana tohtana merged commit bb250a2 into deepspeedai:master Jan 28, 2026
13 checks passed
@tohtana
Collaborator

tohtana commented Jan 28, 2026

Merged. Thank you for your contribution, @Flink-ddd!

phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request Jan 29, 2026
…rld_size in Ulysses (deepspeedai#7809)

ksugama pushed a commit to ksugama/DeepSpeed that referenced this pull request Feb 9, 2026
…rld_size in Ulysses (deepspeedai#7809)
