Description
When I run the training script with
./run.sh
it aborts with an instance of 'std::runtime_error':
what(): NCCL Error 1: unhandled cuda error
This happens every time during the evaluation step of train.py. The 'convert squad examples to features' step completes successfully, and the crash occurs right after 'Evaluating: 0%' is printed.
I have confirmed that PyTorch can see CUDA:
print(torch.cuda.is_available())
True
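torch.cuda.is_available() returning True only shows that at least one GPU is usable. It does not say much about the multi-GPU communication that NCCL handles. The sketch below collects a few more facts that are usually worth including in a report like this. It is an illustrative diagnostic, not part of the original script. The helper name describe_cuda_env is hypothetical, and the sketch reads only standard PyTorch attributes (torch.__version__, torch.version.cuda, torch.cuda.device_count(), torch.cuda.get_device_name()). Separately, NCCL honors the NCCL_DEBUG environment variable, so running NCCL_DEBUG=INFO ./run.sh should make NCCL log more detail about the call that fails.

```python
import os


def describe_cuda_env():
    """Hypothetical helper: gather environment facts useful when reporting NCCL errors."""
    info = {
        # Setting NCCL_DEBUG=INFO makes NCCL print diagnostic logging.
        "NCCL_DEBUG": os.environ.get("NCCL_DEBUG"),
        # GPUs the process is allowed to see (None means no restriction is set).
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against (None for CPU-only builds).
    info["torch_cuda"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    # NCCL handles communication between GPUs, so the visible device count matters.
    info["device_count"] = torch.cuda.device_count()
    if info["cuda_available"]:
        info["devices"] = [
            torch.cuda.get_device_name(i) for i in range(info["device_count"])
        ]
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_env().items():
        print(f"{key}: {value}")
```

Running it in the same environment as ./run.sh shows, for example, whether several GPUs are visible to the evaluation step.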
