vLLM creates unnecessary NCCL communicator when inference_world_size=1 (wastes EFA QPs) #1972

@dmvevents

Description

Summary

When using non-colocated inference (colocated_inference=false) with inference_world_size=1 (single vLLM GPU), NeMo RL still creates a PyNcclCommunicator in vllm_backend.py::init_collective() for weight transfer. This NCCL communicator is unnecessary because:

  1. With only 1 inference GPU, there is no inter-node weight broadcast needed
  2. The same weight transfer can be done via the existing ZMQ/IPC path (used in colocated mode)

On EFA instances with few NICs (e.g., g6e.8xlarge with a single EFA NIC), this extra NCCL communicator consumes Queue Pair (QP) budget. As a result, the DTensor policy-training communicator used for gradient AllReduce falls back to Socket transport, dramatically reducing training throughput.

Reproduction

# Config that triggers the issue
cluster:
  num_nodes: 2
  gpus_per_node: 1

inference:
  num_nodes: 1
  gpus_per_node: 1
  colocated: false

On g6e.8xlarge (1 EFA NIC):

  1. vLLM init_collective() creates PyNcclCommunicator — consumes EFA QPs
  2. DTensor init_process_group("nccl") — falls back to Socket (no QPs left)
  3. Training AllReduce runs over Socket at ~1 Gbps instead of EFA at ~100 Gbps

Code path

In nemo_rl/algorithms/grpo.py, setup():

# This runs unconditionally when colocated_inference=False, even when
# inference_world_size=1 and no NCCL broadcast is needed
if not colocated_inference:
    futures_train = policy.init_collective(ip, port, world_size, ...)
    futures_inference = policy_generation.init_collective(ip, port, world_size, ...)
    ray.get(futures_train + futures_inference)
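A minimal sketch of the call-site guard the fix implies. The predicate below is a hypothetical helper, not existing NeMo RL code; its name and signature are assumptions:

```python
def needs_weight_sync_collective(colocated_inference: bool,
                                 inference_world_size: int) -> bool:
    """Decide whether a dedicated NCCL group is needed for weight transfer.

    Colocated mode already moves weights over the ZMQ/IPC path, and a
    single inference GPU has no peers to broadcast to, so neither case
    needs an extra NCCL communicator (or its EFA QPs).
    """
    return (not colocated_inference) and inference_world_size > 1
```

In setup(), the init_collective() calls would be wrapped in this predicate, and the resulting flag reused later to pick the ZMQ/IPC path during refit.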

In nemo_rl/models/generation/vllm/vllm_backend.py, init_collective():

def init_collective(self, rank_prefix, ip, port, world_size, train_world_size):
    from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
    # ... pg is built earlier from ip/port (elided here)
    # Always creates an NCCL communicator, even when inference has only 1 GPU
    self.model_update_group = PyNcclCommunicator(pg, device=self.device)

Proposed fix

When inference_world_size <= 1:

  1. grpo.py setup(): Skip init_collective() entirely, set a flag
  2. grpo.py refit_policy_generation(): Use the ZMQ/IPC weight transfer path (same as colocated) instead of NCCL broadcast
  3. vllm_backend.py init_collective(): Guard — if inference_world_size <= 1, set self.model_update_group = None and return
  4. vllm_backend.py update_weights_from_collective(): Return True immediately if model_update_group is None
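Points 3 and 4 together could look like the following sketch. The class, its attributes, and the simplified signatures are illustrative stand-ins for the real vLLM backend worker, not the actual NeMo RL code; `_build_process_group` is a hypothetical placeholder for the elided process-group setup:

```python
class VllmWeightSyncSketch:
    """Simplified stand-in for the vLLM backend worker (illustrative only)."""

    def __init__(self, inference_world_size, device=None):
        self.inference_world_size = inference_world_size
        self.device = device
        self.model_update_group = None

    def init_collective(self, rank_prefix, ip, port, world_size, train_world_size):
        # Guard: a single inference GPU has no peers to broadcast to, so
        # skip communicator creation and leave the group unset (point 3).
        if self.inference_world_size <= 1:
            self.model_update_group = None
            return
        # Multi-GPU path (unchanged): create the weight-update communicator.
        from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
        pg = self._build_process_group(ip, port, world_size)  # hypothetical helper
        self.model_update_group = PyNcclCommunicator(pg, device=self.device)

    def update_weights_from_collective(self):
        # With no communicator, weights arrive via the ZMQ/IPC path instead,
        # so report success immediately (point 4).
        if self.model_update_group is None:
            return True
        ...  # normal NCCL broadcast receive (elided)
```

Because the early return happens before the vllm import, the single-GPU path never touches NCCL at all, which is what frees the EFA QPs.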

We have a working patch script (patch_vllm_skip_nccl.py) that implements all four changes. Happy to submit a PR.

Impact

  • On g6e.8xlarge: Training AllReduce goes from Socket (~1 Gbps) to EFA (~100 Gbps)
  • On instances with more EFA NICs: Frees QPs for other communicators, reduces NCCL init time
  • No behavior change for inference_world_size > 1 or colocated_inference = true

Environment

  • NeMo RL v0.5.0 (NGC container nvcr.io/nvidia/nemo-rl:v0.5.0)
  • Instance: g6e.8xlarge (1x L40S, 1x EFA NIC)
  • Also applicable to any EFA instance with limited NIC count
