NGC v0.5.0 container LD_LIBRARY_PATH missing /opt/amazon/aws-ofi-nccl/lib (NCCL falls back to Socket on EFA) #1973

@dmvevents

Description

Summary

The NGC NeMo RL v0.5.0 container (nvcr.io/nvidia/nemo-rl:v0.5.0) does not include /opt/amazon/aws-ofi-nccl/lib or /opt/amazon/efa/lib in its LD_LIBRARY_PATH. When running on AWS EFA instances (P5, P5en, G6E, etc.), NCCL cannot find the libnccl-net.so OFI plugin and falls back to the built-in Socket transport for all inter-node communication.

This means multi-node training runs at Socket speeds (~1-5 Gbps) instead of EFA speeds (~100-400 Gbps per NIC), making multi-node GRPO training impractically slow.

Default LD_LIBRARY_PATH in NGC v0.5.0

/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

Missing paths:

  • /opt/amazon/aws-ofi-nccl/lib — contains libnccl-net.so (the NCCL OFI plugin)
  • /opt/amazon/efa/lib — contains libfabric.so (required by the OFI plugin)

How NCCL discovers the OFI plugin

NCCL searches for libnccl-net.so via:

  1. NCCL_NET_PLUGIN env var (if set, used as a dlopen path hint)
  2. Standard dlopen() search: LD_LIBRARY_PATH, /etc/ld.so.conf.d/, default paths

When neither path is on LD_LIBRARY_PATH and no ld.so.conf.d entry exists, dlopen("libnccl-net.so") fails silently (the failure is only visible with NCCL_DEBUG=INFO) and NCCL falls back to Socket.
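The search-path gap can be sanity-checked with a small shell sketch (`check_plugin_path` is a hypothetical helper; the directory is the EFA-installer default quoted above):

```shell
# Hypothetical helper: report whether an LD_LIBRARY_PATH-style string
# contains the aws-ofi-nccl plugin directory from this issue.
check_plugin_path() {
  case ":$1:" in
    *:/opt/amazon/aws-ofi-nccl/lib:*) echo "plugin dir present" ;;
    *) echo "plugin dir missing -> NCCL will fall back to Socket" ;;
  esac
}

# Default NGC v0.5.0 LD_LIBRARY_PATH (EFA dirs absent):
check_plugin_path "/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
# -> plugin dir missing -> NCCL will fall back to Socket

# With the proposed fix applied:
check_plugin_path "/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/nvidia/lib64"
# -> plugin dir present
```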

Symptom in NCCL logs (NCCL_DEBUG=INFO):

NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
NCCL INFO NET/Socket: Using [eth0:10.0.x.x]

Expected (with correct LD_LIBRARY_PATH):

NCCL INFO NET/OFI aws-ofi-nccl v1.x.x
NCCL INFO NET/OFI Using Amazon EFA
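A captured NCCL_DEBUG=INFO log can also be checked mechanically. A minimal sketch, where `detect_transport` is a hypothetical helper keyed on the log lines quoted above:

```shell
# Hypothetical helper: classify the transport from a NCCL_DEBUG=INFO log file.
detect_transport() {
  if grep -q 'NET/OFI' "$1"; then
    echo "EFA (OFI plugin loaded)"
  elif grep -q 'NET/Socket' "$1"; then
    echo "Socket fallback"
  else
    echo "no transport line found"
  fi
}

# Example using the symptom lines quoted in this issue:
printf '%s\n' \
  'NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)' \
  'NCCL INFO NET/Socket: Using [eth0:10.0.x.x]' > /tmp/nccl_sample.log
detect_transport /tmp/nccl_sample.log   # -> Socket fallback
```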

Proposed fix

Add the EFA and aws-ofi-nccl library paths to the container's LD_LIBRARY_PATH in the Dockerfile:

ENV LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

This is a no-op on non-EFA instances: the directories simply don't exist, so nothing extra is loaded. On EFA instances where the EFA installer has been mounted or installed, NCCL immediately discovers the OFI plugin.
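The `${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}` idiom also guards against a dangling colon when the variable is unset or empty. A quick sketch of the expansion, evaluated by the shell:

```shell
# Same expansion as the proposed ENV line, with the variable unset:
unset LD_LIBRARY_PATH
echo "/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
# -> /opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib   (no trailing colon)

# And with an existing value, which is preserved after a colon:
LD_LIBRARY_PATH="/usr/local/cuda/compat/lib"
echo "/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
# -> /opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/cuda/compat/lib
```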

Alternative: add an ld.so.conf.d entry and run ldconfig. Note that ldconfig caches only libraries present at build time, so this variant helps only when the EFA libraries are baked into the image rather than host-mounted at runtime:

RUN echo "/opt/amazon/aws-ofi-nccl/lib" > /etc/ld.so.conf.d/aws-ofi-nccl.conf && \
    echo "/opt/amazon/efa/lib" > /etc/ld.so.conf.d/efa.conf && \
    ldconfig

Current workaround

Users must add the paths manually in their Kubernetes manifests, Docker run commands, or entrypoint scripts:

# Kubernetes pod spec
env:
  - name: LD_LIBRARY_PATH
    value: "/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64"
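For plain `docker run`, the same override can be passed with `-e`. A sketch, using the image tag and paths from this issue; the final command is a placeholder for your actual training entrypoint:

```shell
docker run --rm --gpus all \
  -e LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64" \
  -e NCCL_DEBUG=INFO \
  nvcr.io/nvidia/nemo-rl:v0.5.0 \
  bash -c 'echo "$LD_LIBRARY_PATH"'   # placeholder; replace with your training command
```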

Or in their custom Dockerfiles (which is what we do):

FROM nvcr.io/nvidia/nemo-rl:v0.5.0
ENV LD_LIBRARY_PATH="/opt/amazon/aws-ofi-nccl/lib:/opt/amazon/efa/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

Impact

All AWS EFA instances running multi-node NeMo RL training with the NGC container are affected:

  • P5.48xlarge (32x EFA): Falls back to Socket
  • P5en.48xlarge (16x EFA): Falls back to Socket
  • G6E (1x EFA): Falls back to Socket

Environment

  • Container: nvcr.io/nvidia/nemo-rl:v0.5.0
  • Instances tested: P5.48xlarge, P5en.48xlarge, g6e.8xlarge
  • EFA installer: 1.38.0 (libfabric 2.4.0)
  • aws-ofi-nccl: 1.14.0 (host-mounted) and 1.18.0 (built from source)
