[Bugfix] Resolve rank index out of range during BWD when sp_size < world_size in Ulysses #7809
Conversation
Signed-off-by: vensen <vensenmu@gmail.com>
Force-pushed from 2b386ab to 4dc7846

@Flink-ddd Thank you for opening this PR! I only see changes in tests. Did you miss committing some changes?
Hi @tohtana, thanks for the review. There are no missing commits. The issue reported in #7672 stems from using all_gather for loss aggregation in the user's training loop, rather than a bug within DeepSpeed's internal runtime. Since we cannot patch user scripts directly, I submitted this regression test to:

But please tell me if you have another option.
Thank you for your clarification, @Flink-ddd! It looks like a bug in PyTorch: in AllGather's backward pass, we should use the local rank within the given process group. It appears this was fixed in v2.3. As you said, we can't force client code to implement loss calculation in a particular way, so I'm wondering whether we should simply add an assertion to check the PyTorch version when SP is enabled. We could also note in the documentation that SP requires v2.3 or later, even though the DeepSpeed code itself doesn't have an issue with older versions. It would still be good to add a regression test. One concern is that the all-reduce approach can't implement weighted loss averaging, which is used in the original example. What are your thoughts?
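For context, a minimal sketch of the failing pattern being discussed; the names `local_loss` and `sp_group` are illustrative and not taken from #7672. The autograd-aware all_gather runs over an SP sub-group, and on PyTorch < 2.3 its backward indexes the gathered outputs with the global rank instead of the rank within the group:

```python
# Illustrative sketch only; assumes an already-initialized process group and an
# SP sub-group handle `sp_group` that is smaller than the world size.
import torch
from torch.distributed.nn.functional import all_gather


def aggregate_loss_via_gather(local_loss: torch.Tensor, sp_group) -> torch.Tensor:
    # Autograd-aware gather: returns one tensor per rank in `sp_group`.
    gathered = all_gather(local_loss, group=sp_group)
    # On PyTorch < 2.3 the backward of this gather indexes `gathered` with the
    # global rank, so ranks outside the first sub-group raise
    # "IndexError: tuple index out of range" during loss.backward().
    return torch.stack(gathered).sum()
```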
Hi @tohtana, thanks for the suggestion. I agree that simulating the weighted averaging pattern is better for real-world scenarios. I will update the test case to implement the weighted all-reduce pattern (reducing both the weighted loss and the total weight separately) to address this.
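For reference, a minimal sketch of that pattern, assuming an SP sub-group handle `sp_group` and a per-rank scalar weight such as a token count; these names are illustrative rather than the final test code:

```python
# Sketch of weighted loss averaging without all_gather: the weighted loss goes
# through the autograd-aware all_reduce so gradients flow back, while the
# total weight uses the plain collective since it needs no gradient.
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce


def weighted_mean_loss(local_loss: torch.Tensor,
                       local_weight: torch.Tensor,
                       sp_group) -> torch.Tensor:
    # Differentiable sum of the weighted losses across the SP sub-group.
    weighted_sum = all_reduce(local_loss * local_weight, group=sp_group)
    # Non-differentiable (in-place) sum of the weights.
    total_weight = local_weight.detach().clone()
    dist.all_reduce(total_weight, group=sp_group)
    return weighted_sum / total_weight
```

Because no rank-indexed gather appears in the backward graph, this works even when sp_size < world_size on older PyTorch.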
Hi @Flink-ddd
Hi @tohtana, yes, I believe we should. Many production environments and clusters are still pinned to PyTorch v2.1 or v2.2 due to CUDA driver constraints or stability requirements, so maintaining support for SP on these older versions adds significant value to DeepSpeed's compatibility. This regression test ensures that we continue to support these users reliably. However, it depends on your perspective. If you think it's unnecessary, we can set
Okay, then let's keep supporting the older versions. But adding a regression test doesn't prevent the strange error from confusing users. How about these?
- Log a warning when sequence parallelism is enabled on PyTorch < 2.3.
- Note the limitation and the recommended workaround in the documentation.
Signed-off-by: vensen <vensenmu@gmail.com>
Hi @tohtana, thanks for your advice. That sounds like a solid plan. I agree that adding a warning and updating the documentation will greatly improve the user experience for those on older PyTorch versions. I have already pushed these updates.
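For illustration, the runtime warning could look roughly like the sketch below. The function name, its placement, and the use of packaging for version parsing are assumptions for this sketch; per the PR description, the actual change uses the required_torch_version helper inside DeepSpeedEngine.

```python
# Rough sketch of a version-gated warning for SP on legacy PyTorch; not the
# actual DeepSpeedEngine code path.
import torch
from packaging import version
from deepspeed.utils import logger


def warn_if_sp_on_legacy_torch(sequence_parallel_size: int) -> None:
    if sequence_parallel_size > 1 and version.parse(torch.__version__).release < (2, 3):
        logger.warning(
            "Sequence parallelism is enabled on PyTorch < 2.3. Using "
            "torch.distributed.nn.functional.all_gather over an SP sub-group for "
            "loss aggregation may raise 'IndexError: tuple index out of range' in "
            "backward; prefer the weighted all-reduce pattern shown in "
            "tests/unit/sequence_parallelism/test_ulysses.py.")
```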
tohtana left a comment
Thank you for the update! I left one more comment. Once it is fixed and tests pass, let's merge this.
Signed-off-by: vensen <vensenmu@gmail.com>
Hi @tohtana, ready for review again. Thank you for the suggestions and help.
Hi @tohtana, all CI tests pass now. Could you please approve it when you have time? Thanks.
Merged. Thank you for your contribution, @Flink-ddd!
Description
This PR addresses Issue #7672.
When sequence_parallel_size is smaller than world_size (e.g., sp_size=2 on 4 GPUs) with PyTorch < 2.3, using torch.distributed.nn.functional.all_gather for loss aggregation triggers an IndexError: tuple index out of range during the backward pass. This is due to a known PyTorch issue where the backward hook accesses the global rank instead of the group rank.
Solution
1. Regression Test & Workaround: Updated the regression test TestUlyssesLossBackward to implement a weighted all-reduce pattern.
   - Before: all_gather -> manual sum (vulnerable to the rank indexing mismatch on older PyTorch).
   - After: all_reduce(weighted_loss) / all_reduce(total_weight) (robust and supports weighted averaging).
2. Runtime Warning: Added a version check (required_torch_version) in DeepSpeedEngine. It now logs a warning if Sequence Parallelism is enabled on PyTorch < 2.3, providing a link to the workaround test case.
3. Documentation: Updated ulysses-alst-sequence-parallelism.md with a note regarding legacy PyTorch versions and the recommended workaround.
Verification
Added and verified the regression test tests/unit/sequence_parallelism/test_ulysses.py which now validates the weighted averaging logic.
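For orientation, a rough sketch of what such a regression test can look like, assuming DeepSpeed's DistributedTest harness from tests/unit/common.py; the actual test in test_ulysses.py may be structured differently. The key points are that every rank participates in creating every SP sub-group and that loss aggregation uses the weighted all-reduce pattern sketched earlier.

```python
# Hypothetical test sketch, not the actual TestUlyssesLossBackward code.
import torch
import torch.distributed as dist
from torch.distributed.nn.functional import all_reduce
from deepspeed.accelerator import get_accelerator
from unit.common import DistributedTest


class TestWeightedLossBackward(DistributedTest):
    world_size = 4

    def test_bwd_with_sp_subgroup(self):
        rank = dist.get_rank()
        world_size = dist.get_world_size()
        sp_size = 2  # sp_size < world_size: the configuration that used to crash

        # Every rank must take part in creating every sub-group, then pick its own.
        groups = [dist.new_group(list(range(start, start + sp_size)))
                  for start in range(0, world_size, sp_size)]
        sp_group = groups[rank // sp_size]

        device = get_accelerator().current_device_name()
        x = torch.randn(8, device=device, requires_grad=True)
        local_loss = x.mean()
        weight = torch.tensor(float(rank + 1), device=device)

        # Weighted all-reduce aggregation (no rank-indexed gather in backward).
        weighted_sum = all_reduce(local_loss * weight, group=sp_group)
        total_weight = weight.clone()
        dist.all_reduce(total_weight, group=sp_group)

        loss = weighted_sum / total_weight
        loss.backward()  # previously raised IndexError on ranks 2/3 with all_gather
        assert x.grad is not None
```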
1. Reproduction (Before Fix)

Confirmed IndexError crash on Rank 2/3 with sp_size=2 on a 4-GPU setup.
2. Verification (After Fix)

Verified the fix using the regression test logic on 4x RTX A6000. The backward pass now completes successfully on all ranks without error.