Hi Fangzhou,
Thank you for your excellent work. The codebase is well-organized and easy to follow.
When I try to train on mini-ImageNet with anywhere from 2 to 8 GPUs using the following command,
python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3
Python keeps reporting the error shown below:
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    main(config)
  File "train.py", line 130, in main
    logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/PyTorch-MAML/models/maml.py", line 223, in forward
    updated_params = self._adapt(
  File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
    params, mom_buffer = self._inner_iter(
  File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
    grads = autograd.grad(loss, params.values(),
  File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.
The code runs correctly only when I use a single GPU. Since n_episode=4, I assumed it should also work on 2 or 4 GPUs.
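To make that assumption concrete, here is a minimal pure-Python sketch of how I expect 4 episodes to be divided across GPUs. It assumes the model is wrapped in nn.DataParallel (as the traceback suggests), that episodes sit on the leading dimension of the inputs, and that the scatter step splits that dimension the way torch.chunk does. The helper name `split_episodes` is hypothetical and is not part of the repo. It only illustrates the batch split and does not by itself explain the error above.

```python
import math


def split_episodes(n_episode, n_gpu):
    """Per-GPU episode counts, mimicking a torch.chunk-style split on dim 0.

    Hypothetical helper for illustration only; not taken from PyTorch-MAML.
    """
    if n_episode <= 0 or n_gpu <= 0:
        raise ValueError("n_episode and n_gpu must be positive")
    # Every chunk except possibly the last holds ceil(n_episode / n_gpu) items.
    chunk = math.ceil(n_episode / n_gpu)
    sizes = []
    remaining = n_episode
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


if __name__ == "__main__":
    for n_gpu in (1, 2, 4, 8):
        print(n_gpu, "GPU(s):", split_episodes(4, n_gpu))
```

Under these assumptions, 2 GPUs would each receive 2 episodes and 4 GPUs would each receive 1, so every replica gets a non-empty batch. With 8 GPUs only 4 chunks exist, so at most 4 devices would receive data.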
Framework Versions:
python: 3.8
pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0
Our ultimate goal is to adapt this repo for our own project, where we see the same error. Any hints or help would be highly appreciated. Thanks!