## Summary

When using non-colocated inference (`colocated_inference=false`) with `inference_world_size=1` (a single vLLM GPU), NeMo RL still creates a `PyNcclCommunicator` in `vllm_backend.py::init_collective()` for weight transfer. This NCCL communicator is unnecessary because:
- With only 1 inference GPU, there is no inter-node weight broadcast needed
- The same weight transfer can be done via the existing ZMQ/IPC path (used in colocated mode)
On EFA instances with limited NICs (e.g., g6e.8xlarge with 1 EFA NIC), this unnecessary NCCL communicator consumes Queue Pair (QP) budget, causing the important DTensor policy training communicator (used for gradient AllReduce) to fall back to Socket transport, dramatically reducing training throughput.
## Reproduction

```yaml
# Config that triggers the issue
cluster:
  num_nodes: 2
  gpus_per_node: 1
inference:
  num_nodes: 1
  gpus_per_node: 1
  colocated: false
```

On g6e.8xlarge (1 EFA NIC):

- vLLM `init_collective()` creates a `PyNcclCommunicator`, consuming EFA QPs
- DTensor `init_process_group("nccl")` falls back to Socket (no QPs left)
- Training AllReduce runs over Socket at ~1 Gbps instead of EFA at ~100 Gbps
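The fallback can be confirmed with NCCL's debug logging before launching the run. This is a diagnostic sketch: the launch command is a placeholder, and the exact log lines depend on the NCCL and aws-ofi-nccl versions in use.

```shell
# Enable NCCL debug logging so each communicator reports its transport at init.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
# <launch the GRPO run here>   # placeholder for your usual launch command
# A Socket fallback appears as "NET/Socket" lines in the init log, while the
# EFA path is reported by the aws-ofi-nccl plugin instead.
```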
## Code path

`nemo_rl/algorithms/grpo.py`, in `setup()`:

```python
# This runs unconditionally when colocated_inference=False, even when
# inference_world_size=1 and no NCCL broadcast is needed
if not colocated_inference:
    futures_train = policy.init_collective(ip, port, world_size, ...)
    futures_inference = policy_generation.init_collective(ip, port, world_size, ...)
    ray.get(futures_train + futures_inference)
```

`nemo_rl/models/generation/vllm/vllm_backend.py`, in `init_collective()`:

```python
def init_collective(self, rank_prefix, ip, port, world_size, train_world_size):
    from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
    # Always creates NCCL communicator, even when inference has only 1 GPU
    self.model_update_group = PyNcclCommunicator(pg, device=self.device)
```

## Proposed fix
When `inference_world_size <= 1`:

- `grpo.py` `setup()`: skip `init_collective()` entirely and set a flag
- `grpo.py` `refit_policy_generation()`: use the ZMQ/IPC weight-transfer path (same as colocated) instead of the NCCL broadcast
- `vllm_backend.py` `init_collective()`: guard: if `inference_world_size <= 1`, set `self.model_update_group = None` and return
- `vllm_backend.py` `update_weights_from_collective()`: return `True` immediately if `model_update_group is None`
We have a working patch script (`patch_vllm_skip_nccl.py`) that implements all four changes. Happy to submit a PR.
## Impact

- On g6e.8xlarge: training AllReduce goes from Socket (~1 Gbps) to EFA (~100 Gbps)
- On instances with more EFA NICs: frees QPs for other communicators and reduces NCCL init time
- No behavior change for `inference_world_size > 1` or `colocated_inference = true`
## Environment

- NeMo RL v0.5.0 (NGC container `nvcr.io/nvidia/nemo-rl:v0.5.0`)
- Instance: g6e.8xlarge (1x L40S, 1x EFA NIC)
- Also applicable to any EFA instance with a limited NIC count