This document describes how to set up a GPU-accelerated deep learning environment using:
- ROCm 6.4
- PyTorch (ROCm build)
- DGL (Deep Graph Library)
- Docker containerization
This approach avoids dependency conflicts and ensures compatibility with AMD GPUs.
Host System
- Linux distribution: Ubuntu 22.04 / 24.04 (adjust as appropriate)
- ROCm installed: 6.4.x
- GPU: AMD GPU compatible with ROCm
- Docker: Installed and verified
Container Base Image
- Repository: rocm/dgl
- Example tag used: dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1
After installing Docker, verification was performed using:
sudo docker run hello-world
Successful execution confirmed:
- Docker daemon is running
- The user can execute Docker commands (via sudo in this case)
The official AMD DGL image was pulled from Docker Hub:
sudo docker pull rocm/dgl:dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1
This image includes:
- ROCm 6.4 runtime
- PyTorch 2.4.1 (ROCm build)
- DGL preinstalled
- Python 3.10
- Ubuntu 22.04 base
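The image tag itself encodes the component versions listed above. As a minimal sketch, the following Python snippet splits a tag into those components. The naming pattern is an assumption inferred only from the single example tag in this document, and `parse_dgl_tag` is a hypothetical helper, not part of any AMD or Docker tooling.

```python
import re

# Assumed naming pattern, inferred only from the example tag in this document.
TAG_PATTERN = re.compile(
    r"dgl-(?P<dgl>[\d.]+)"
    r"_rocm(?P<rocm>[\d.]+)"
    r"_ubuntu(?P<ubuntu>[\d.]+)"
    r"_py(?P<python>[\d.]+)"
    r"_pytorch_release_(?P<pytorch>[\d.]+)"
)


def parse_dgl_tag(tag: str) -> dict:
    """Split a rocm/dgl image tag into its component versions."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag does not follow the assumed pattern: {tag!r}")
    return match.groupdict()


if __name__ == "__main__":
    tag = "dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1"
    for component, version in parse_dgl_tag(tag).items():
        print(f"{component}: {version}")
```

A check like this can help catch a mistyped tag before a pull, but tags that follow a different scheme will simply be rejected with a `ValueError`.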
To enable ROCm GPU access inside Docker, the container was launched with the following command:
sudo docker run -it \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 8G \
rocm/dgl:dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1

| Flag | Purpose |
|---|---|
| --device=/dev/kfd | Exposes AMD compute device |
| --device=/dev/dri | Exposes GPU rendering interface |
| --group-add video | Grants GPU access permissions |
| --ipc=host | Improves shared memory handling |
| --shm-size 8G | Prevents DataLoader memory issues |
| --cap-add=SYS_PTRACE | Required for debugging |
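When the launch command is scripted, it can help to assemble the flags described above programmatically and to check that the GPU device nodes exist on the host first. The following is a rough sketch rather than a definitive launcher: `missing_devices` and `build_run_command` are hypothetical helper names, and the script only prints the command instead of running it.

```python
import os
import shlex

IMAGE = "rocm/dgl:dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1"
GPU_DEVICES = ("/dev/kfd", "/dev/dri")


def missing_devices(devices=GPU_DEVICES):
    """Return the device nodes that do not exist on this host."""
    return [d for d in devices if not os.path.exists(d)]


def build_run_command(image=IMAGE, shm_size="8G"):
    """Assemble the same docker run arguments used in this document."""
    cmd = ["sudo", "docker", "run", "-it",
           "--cap-add=SYS_PTRACE",
           "--security-opt", "seccomp=unconfined"]
    cmd += [f"--device={d}" for d in GPU_DEVICES]
    cmd += ["--group-add", "video", "--ipc=host", "--shm-size", shm_size, image]
    return cmd


if __name__ == "__main__":
    absent = missing_devices()
    if absent:
        # Without these nodes the container starts but cannot see the GPU.
        print("Missing device nodes:", ", ".join(absent))
    print(shlex.join(build_run_command()))
```

Keeping the arguments in a list makes it straightforward to pass them to `subprocess.run` later without shell quoting issues.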
rocminfo | head
This confirms that the GPU is visible.
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
Expected:
- The expected PyTorch version (2.4.1 for this image)
- A HIP version string rather than None
Check GPU availability:
python3 -c "import torch; print(torch.cuda.is_available())"
Expected output:
True
(Note: ROCm builds of PyTorch expose the GPU through the torch.cuda namespace.)
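`torch.cuda.is_available()` only reports that a device was detected. A slightly stronger check is to run a small computation on the GPU and compare it against the CPU result. The sketch below is one way to do that. The function name `gpu_smoke_test`, the matrix size, and the tolerances are illustrative choices, not values from this setup.

```python
def gpu_smoke_test():
    """Return a short status string describing what the check found."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    if not torch.cuda.is_available():
        return "no-gpu"

    device = torch.device("cuda")  # on a ROCm build this is the AMD GPU
    a = torch.rand(256, 256, device=device)
    b = torch.rand(256, 256, device=device)
    c = a @ b
    torch.cuda.synchronize()  # wait for the GPU kernel to finish

    # Recompute on the CPU and compare with a loose float32 tolerance.
    expected = a.cpu() @ b.cpu()
    ok = torch.allclose(c.cpu(), expected, rtol=1e-3, atol=1e-3)
    return "ok" if ok else "mismatch"


if __name__ == "__main__":
    print("GPU smoke test:", gpu_smoke_test())
```

Inside the container described here, the expected status would be `ok`; any other value points at the step that failed.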
export DGLBACKEND=pytorch
python3 - << 'EOF'
import dgl
print("DGL version:", dgl.__version__)
EOF
A successful import confirms:
- DGL properly installed
- Linked correctly with PyTorch backend
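Importing DGL does not by itself exercise the GPU. As a hedged follow-up, the sketch below builds a tiny graph and moves it to the GPU when one is available. The graph contents and the helper name `dgl_smoke_test` are made up for illustration.

```python
import os

# Same backend selection as the export above; must be set before importing dgl.
os.environ.setdefault("DGLBACKEND", "pytorch")


def dgl_smoke_test():
    """Build a 4-node path graph and report where it lives."""
    try:
        import torch
        import dgl
    except (ImportError, OSError):
        # OSError covers shared libraries that fail to load.
        return "import-failed"

    src = torch.tensor([0, 1, 2])
    dst = torch.tensor([1, 2, 3])
    g = dgl.graph((src, dst))  # edges 0->1, 1->2, 2->3
    if torch.cuda.is_available():
        g = g.to("cuda")  # on ROCm this targets the AMD GPU
    return f"{g.num_nodes()} nodes, {g.num_edges()} edges on {g.device}"


if __name__ == "__main__":
    print(dgl_smoke_test())
```

If the graph reports a `cuda` device, DGL and the ROCm build of PyTorch are working together on the GPU.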
Using Docker provides:
- Version isolation (no system contamination)
- Component compatibility (the image is prebuilt and validated by AMD)
- Reproducibility across machines
- Simplified dependency management
This avoids:
- Pip version conflicts
- ROCm wheel mismatch errors
- Manual compilation complexity
To use local project files:
sudo docker run -it \
--device=/dev/kfd \
--device=/dev/dri \
--group-add video \
--ipc=host \
--shm-size 8G \
-v "$(pwd)":/workspace \
rocm/dgl:dgl-2.4_rocm6.4_ubuntu22.04_py3.10_pytorch_release_2.4.1
This mounts the current directory into /workspace inside the container.
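Once inside the container, a quick way to confirm that the mount worked is to check that the directory exists and accepts writes. The following is a minimal sketch. `check_workspace` is a hypothetical helper, and `/workspace` is simply the mount point chosen in the command above.

```python
import os
import tempfile


def check_workspace(path="/workspace"):
    """Report whether the mounted directory exists and accepts writes."""
    if not os.path.isdir(path):
        return "missing"
    try:
        # Create and immediately delete a temporary file in the directory.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return "read-only"
    return "writable"


if __name__ == "__main__":
    print("/workspace is", check_workspace())
```

A `missing` result usually means the container was started without the `-v` flag.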
The Docker-based ROCm installation successfully provides:
- Stable PyTorch + ROCm 6.4 integration
- Working DGL framework
- GPU acceleration inside container
- Controlled and reproducible environment
This setup is recommended for research workflows requiring ROCm compatibility and graph-based deep learning.