Closed
Description
How is this issue impacting you?
Data corruption
Share Your Debug Logs
This is an instance of an issue first reported against PyTorch.
More details here: pytorch/pytorch#168092
Steps to Reproduce the Issue
No response
NCCL Version
NCCL version 2.27.5+cuda12.9
Your platform details
H100
Error Message & Behavior
The last 8 elements of the output buffer appear to be unprocessed (they are left as zeros):
msg=f"Last entries are: {output[-10:]} vs {input_tensor[-10:]}",
AssertionError: Last entries are: tensor([-1.1953, 0.2295, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000], device='cuda:0', dtype=torch.bfloat16) vs tensor([-1.1953, 0.2295, 1.6016, 0.2891, 1.9844, -0.5703, -0.1904, 0.6172,
1.4141, 1.2578], device='cuda:0', dtype=torch.bfloat16)
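To quantify how much of the tail is affected, a small check like the one below may help. It is only a sketch: `trailing_mismatch_count` is a hypothetical helper, not part of PyTorch or NCCL, and it works on plain Python lists, so a tensor would first need to be converted (for example with `output.float().tolist()`). The values are copied from the assertion message above.

```python
def trailing_mismatch_count(output, expected, tol=0.0):
    """Count consecutive elements at the END of `output` that differ from `expected`.

    Hypothetical helper for illustration; stops at the first matching pair
    when walking backwards from the last element.
    """
    if len(output) != len(expected):
        raise ValueError("buffers must have the same length")
    count = 0
    for got, want in zip(reversed(output), reversed(expected)):
        if abs(got - want) <= tol:
            break  # first matching pair from the end: the tail of mismatches ends here
        count += 1
    return count


# Last 10 entries of each buffer, as printed in the assertion message.
output_tail = [-1.1953, 0.2295, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
input_tail = [-1.1953, 0.2295, 1.6016, 0.2891, 1.9844,
              -0.5703, -0.1904, 0.6172, 1.4141, 1.2578]

print(trailing_mismatch_count(output_tail, input_tail))  # prints 8
```

Running the same check over the full buffers, rather than only the last 10 entries, would show whether the unprocessed region is exactly 8 elements or extends further back.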