MultiGPU Support (DataParallel)

Hi Fangzhou,

Thank you for your excellent work. The codebase is well-organized and easy to follow. 


When I tried to train `mini-imagenet` using either 2 - 8 GPUs by the following command,
```sh
python train.py --config=configs/convnet4/mini-imagenet/5_way_1_shot/train_reproduce.yaml --gpu=0,1,2,3
``` 
python keeps reporting errors shown as below,
```sh
meta-train set: torch.Size([5, 3, 84, 84]) (x800), 64
meta-val set: torch.Size([5, 3, 84, 84]) (x800), 16
num params: 32.9K
Traceback (most recent call last):
  File "train.py", line 265, in <module>
    main(config)
  File "train.py", line 130, in main
    logits = model(x_shot, x_query, y_shot, inner_args, meta_train=True)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "**/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "**/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
ValueError: Caught ValueError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "**/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "**/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "**/PyTorch-MAML/models/maml.py", line 223, in forward
    updated_params = self._adapt(
  File "**/PyTorch-MAML/models/maml.py", line 185, in _adapt
    params, mom_buffer = self._inner_iter(
  File "**/PyTorch-MAML/models/maml.py", line 99, in _inner_iter
    grads = autograd.grad(loss, params.values(),
  File "**/lib/python3.8/site-packages/torch/autograd/__init__.py", line 234, in grad
    return Variable._execution_engine.run_backward(
ValueError: grad requires non-empty inputs.
```

However, the code only works while using 1 GPU. When `n_episode=4`, I assume the code should work on 2 or 4 GPUs.

Framework Versions:
+ `python: 3.8`
+ `pytorch: 1.10.1 py3.8_cuda11.3_cudnn8.2.0_0`

Our ultimate goal is to transfer this repo to our project but find the same errors reported. Any hints or help are highly appreciated. Thanks!



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MultiGPU Support (DataParallel) #12

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

MultiGPU Support (DataParallel) #12

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions