Summary
The NGC NeMo RL v0.5.0 container (nvcr.io/nvidia/nemo-rl:v0.5.0) does not include /opt/amazon/aws-ofi-nccl/lib or /opt/amazon/efa/lib in its LD_LIBRARY_PATH. When running on AWS EFA instances (P5, P5en, G6E, etc.), NCCL cannot find the libnccl-net.so OFI plugin and falls back to the built-in Socket transport for all inter-node communication.
This means multi-node training runs at Socket speeds (~1-5 Gbps) instead of EFA speeds (~100-400 Gbps per NIC), making multi-node GRPO training impractically slow.
Default LD_LIBRARY_PATH in NGC v0.5.0
```
/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
```
Missing paths:
- `/opt/amazon/aws-ofi-nccl/lib`: contains `libnccl-net.so` (the NCCL OFI plugin)
- `/opt/amazon/efa/lib`: contains `libfabric.so` (required by the OFI plugin)
How NCCL discovers the OFI plugin
NCCL searches for libnccl-net.so via:
- `NCCL_NET_PLUGIN` env var (if set, used as a dlopen path hint)
- Standard `dlopen()` search: `LD_LIBRARY_PATH`, `/etc/ld.so.conf.d/`, default paths
When neither path is in LD_LIBRARY_PATH and no ld.so.conf.d entry exists, dlopen("libnccl-net.so") fails silently and NCCL falls back to Socket.
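For debugging, the search order above can be approximated from the shell. This is a rough sketch of the lookup, not NCCL's actual dlopen code; the function name is made up for illustration:

```shell
# Rough approximation of NCCL's libnccl-net.so lookup (sketch, not NCCL code):
# 1. explicit NCCL_NET_PLUGIN override, 2. each LD_LIBRARY_PATH directory,
# 3. the ldconfig cache (which covers /etc/ld.so.conf.d/ and default paths).
find_nccl_plugin() {
    if [ -n "$NCCL_NET_PLUGIN" ] && [ -e "$NCCL_NET_PLUGIN" ]; then
        echo "override: $NCCL_NET_PLUGIN"
        return 0
    fi
    old_ifs=$IFS; IFS=:
    for dir in $LD_LIBRARY_PATH; do
        if [ -e "$dir/libnccl-net.so" ]; then
            IFS=$old_ifs
            echo "found: $dir/libnccl-net.so"
            return 0
        fi
    done
    IFS=$old_ifs
    if ldconfig -p 2>/dev/null | grep -q 'libnccl-net\.so'; then
        echo "found: in ldconfig cache"
        return 0
    fi
    echo "not found: NCCL will fall back to Socket"
    return 1
}

find_nccl_plugin || true
```

Running this inside the stock v0.5.0 container prints the "not found" branch, matching the Socket fallback seen in the logs.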
Symptom in NCCL logs (NCCL_DEBUG=INFO):
```
NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
NCCL INFO NET/Socket: Using [eth0:10.0.x.x]
```
Expected (with correct LD_LIBRARY_PATH):
```
NCCL INFO NET/OFI aws-ofi-nccl v1.x.x
NCCL INFO NET/OFI Using Amazon EFA
```
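A small grep over the captured log is enough to tell which transport a run used. A sketch, assuming the log path is wherever your launcher wrote stdout/stderr:

```shell
# Classify the NCCL transport from a NCCL_DEBUG=INFO log (helper name is
# illustrative; pass the path to your job's captured log).
check_transport() {
    if grep -q 'NET/OFI' "$1"; then
        echo "ofi"
    elif grep -q 'NET/Socket' "$1"; then
        echo "socket"
    else
        echo "unknown"
    fi
}

# Demo against the fallback lines quoted above:
printf '%s\n' \
    'NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)' \
    'NCCL INFO NET/Socket: Using [eth0:10.0.x.x]' > /tmp/nccl-demo.log
check_transport /tmp/nccl-demo.log   # socket
```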
Proposed fix
Add the EFA and aws-ofi-nccl library paths to the container's LD_LIBRARY_PATH in the Dockerfile:
```dockerfile
ENV LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

This is a no-op on non-EFA instances (the directories don't exist, so no libraries are loaded). On EFA instances where the EFA installer has been mounted or installed, NCCL immediately discovers the OFI plugin.
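The `${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}` idiom appends the prior value only when it is non-empty, so the result never ends in a stray colon. A quick shell demonstration of the expansion (variable names are illustrative):

```shell
# ${var:+:${var}} expands to ":<value>" when var is set and non-empty,
# and to nothing when var is empty or unset.
EFA_LIBS="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib"

prev="/usr/local/cuda/compat/lib"
with_prev="${EFA_LIBS}${prev:+:${prev}}"
echo "$with_prev"      # /opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/cuda/compat/lib

prev=""
without_prev="${EFA_LIBS}${prev:+:${prev}}"
echo "$without_prev"   # /opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib
```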
Alternative: Add an ld.so.conf.d entry and run ldconfig:
```dockerfile
RUN echo "/opt/amazon/aws-ofi-nccl/lib" > /etc/ld.so.conf.d/aws-ofi-nccl.conf && \
    echo "/opt/amazon/efa/lib" > /etc/ld.so.conf.d/efa.conf && \
    ldconfig
```

Current workaround
Users must add the paths manually in their Kubernetes manifests, Docker run commands, or entrypoint scripts:
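For the entrypoint-script variant, a minimal wrapper might look like this (a sketch; the prepended paths mirror the proposed fix above):

```shell
#!/bin/sh
# Entrypoint wrapper sketch: prepend the EFA library paths, then exec the
# wrapped command ("$@"). Safe when LD_LIBRARY_PATH is empty or unset.
export LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```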
```yaml
# Kubernetes pod spec
env:
  - name: LD_LIBRARY_PATH
    value: "/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
```

Or in their custom Dockerfiles (which is what we do):
```dockerfile
FROM nvcr.io/nvidia/nemo-rl:v0.5.0
ENV LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

Impact
All AWS EFA instances running multi-node NeMo RL training with the NGC container are affected:
- P5.48xlarge (32x EFA): Falls back to Socket
- P5en.48xlarge (16x EFA): Falls back to Socket
- G6E (1x EFA): Falls back to Socket
Environment
- Container: `nvcr.io/nvidia/nemo-rl:v0.5.0`
- Instances tested: P5.48xlarge, P5en.48xlarge, g6e.8xlarge
- EFA installer: 1.38.0 (libfabric 2.4.0)
- aws-ofi-nccl: 1.14.0 (host-mounted) and 1.18.0 (built from source)