NCCL Error 1: unhandled cuda error #9

@ShuJackson

Description

When I run the training script with ./run.sh, it fails with an instance of 'std::runtime_error':
what(): NCCL Error 1: unhandled cuda error

This happens every time during the evaluation step of the train.py script: the 'convert squad examples to features' step completes successfully, and the error occurs right after 'Evaluating: 0%' is printed.

I have made sure torch can pick up the cuda info:

print(torch.cuda.is_available())
True
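
Since torch.cuda.is_available() only confirms that CUDA is usable in general, a fuller check may help narrow this down. The following is a minimal sketch, not part of train.py: collect_cuda_diagnostics is a hypothetical helper name, and it assumes nothing beyond standard PyTorch calls. It records the PyTorch and CUDA build versions, whether the NCCL backend is available, and whether each visible GPU can actually run a trivial operation.

```python
import os


def collect_cuda_diagnostics():
    """Gather CUDA/NCCL facts worth pasting into a bug report (hypothetical helper)."""
    info = {"NCCL_DEBUG": os.environ.get("NCCL_DEBUG")}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built against
    info["cuda_available"] = torch.cuda.is_available()
    info["device_count"] = torch.cuda.device_count()

    import torch.distributed as dist
    info["nccl_available"] = dist.is_available() and dist.is_nccl_available()

    # Touch every visible GPU; is_available() alone does not do this.
    info["devices"] = {}
    if info["cuda_available"]:
        for i in range(info["device_count"]):
            try:
                torch.ones(1, device=f"cuda:{i}")  # tiny allocation on GPU i
                torch.cuda.synchronize(i)          # surface asynchronous errors
                info["devices"][i] = torch.cuda.get_device_name(i)
            except RuntimeError as exc:
                info["devices"][i] = f"FAILED: {exc}"
    return info


if __name__ == "__main__":
    for key, value in collect_cuda_diagnostics().items():
        print(f"{key}: {value}")
```

Running the failing command with the NCCL environment variable NCCL_DEBUG=INFO set (for example, NCCL_DEBUG=INFO ./run.sh) should also make NCCL log more detail about which CUDA call failed.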

