[Issue]: One-rank avg reduction missing elements #1950

@kwen2501

How is this issue impacting you?

Data corruption

Share Your Debug Logs

This is an instance of a PyTorch issue.

More details here: pytorch/pytorch#168092

Steps to Reproduce the Issue

No response

NCCL Version

NCCL version 2.27.5+cuda12.9

Your platform details

H100

Error Message & Behavior

The last 8 elements of the output buffer appear to be unprocessed (left as zeros):

msg=f"Last entries are: {output[-10:]} vs {input_tensor[-10:]}",

AssertionError: Last entries are: tensor([-1.1953,  0.2295,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
         0.0000,  0.0000], device='cuda:0', dtype=torch.bfloat16) vs tensor([-1.1953,  0.2295,  1.6016,  0.2891,  1.9844, -0.5703, -0.1904,  0.6172,
         1.4141,  1.2578], device='cuda:0', dtype=torch.bfloat16)
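For anyone triaging, here is a minimal sketch of how the size of the untouched tail can be measured from the two printed tensors. It is plain Python with no PyTorch or NCCL dependency. The helper name `count_unprocessed_tail` and the `tol` parameter are hypothetical and not part of the original test. It assumes, as the assertion above does, that with a single rank the averaged output should equal the input.

```python
def count_unprocessed_tail(output, expected, tol=0.0):
    """Count how many trailing elements of `output` differ from `expected`.

    Hypothetical helper for illustration only: it walks both sequences from
    the end and stops at the first pair that matches within `tol`.
    """
    if len(output) != len(expected):
        raise ValueError("sequences must have the same length")
    count = 0
    for out, exp in zip(reversed(output), reversed(expected)):
        if abs(out - exp) <= tol:
            break  # first matching pair from the end: the tail stops here
        count += 1
    return count


# Values copied from the assertion message above (last 10 entries of each tensor).
output_tail = [-1.1953, 0.2295, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
input_tail = [-1.1953, 0.2295, 1.6016, 0.2891, 1.9844,
              -0.5703, -0.1904, 0.6172, 1.4141, 1.2578]

unprocessed = count_unprocessed_tail(output_tail, input_tail)
print(f"Trailing elements that differ: {unprocessed}")
```

On the values printed in the error message this reports 8 differing trailing elements, which matches the observation above. Running the same check on the full tensors (converted with `.tolist()`) should show whether the problem is limited to the tail.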
