vLLM creates unnecessary NCCL communicator when inference_world_size=1 (wastes EFA QPs) #1972

@dmvevents

Description

Summary

When using non-colocated inference (colocated_inference=false) with inference_world_size=1 (single vLLM GPU), NeMo RL still creates a PyNcclCommunicator in vllm_backend.py::init_collective() for weight transfer. This NCCL communicator is unnecessary because:

  1. With only 1 inference GPU, there is no inter-node weight broadcast needed
  2. The same weight transfer can be done via the existing ZMQ/IPC path (used in colocated mode)

On EFA instances with few NICs (e.g., g6e.8xlarge with a single EFA NIC), this extra NCCL communicator consumes Queue Pair (QP) budget. As a result, the DTensor policy-training communicator used for gradient AllReduce falls back to Socket transport, dramatically reducing training throughput.

Reproduction

# Config that triggers the issue
cluster:
  num_nodes: 2
  gpus_per_node: 1

inference:
  num_nodes: 1
  gpus_per_node: 1
  colocated: false

On g6e.8xlarge (1 EFA NIC):

  1. vLLM init_collective() creates PyNcclCommunicator — consumes EFA QPs
  2. DTensor init_process_group("nccl") — falls back to Socket (no QPs left)
  3. Training AllReduce runs over Socket at ~1 Gbps instead of EFA at ~100 Gbps

Code path

In nemo_rl/algorithms/grpo.py, setup():

# This runs unconditionally when colocated_inference=False, even when
# inference_world_size=1 and no NCCL broadcast is needed
if not colocated_inference:
    futures_train = policy.init_collective(ip, port, world_size, ...)
    futures_inference = policy_generation.init_collective(ip, port, world_size, ...)
    ray.get(futures_train + futures_inference)
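A minimal sketch of the call-site guard the fix implies. The predicate below is a hypothetical helper, not existing NeMo RL code; its name and signature are assumptions:

```python
def needs_weight_sync_collective(colocated_inference: bool,
                                 inference_world_size: int) -> bool:
    """Decide whether a dedicated NCCL group is needed for weight transfer.

    Colocated mode already moves weights over the ZMQ/IPC path, and a
    single inference GPU has no peers to broadcast to, so neither case
    needs an extra NCCL communicator (or its EFA QPs).
    """
    return (not colocated_inference) and inference_world_size > 1
```

In setup(), the init_collective() calls would be wrapped in this predicate, and the resulting flag reused later to pick the ZMQ/IPC path during refit.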

In nemo_rl/models/generation/vllm/vllm_backend.py, init_collective():

def init_collective(self, rank_prefix, ip, port, world_size, train_world_size):
    from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
    # ... pg is built earlier from ip/port (elided here)
    # Always creates an NCCL communicator, even when inference has only 1 GPU
    self.model_update_group = PyNcclCommunicator(pg, device=self.device)

Proposed fix

When inference_world_size <= 1:

  1. grpo.py setup(): Skip init_collective() entirely, set a flag
  2. grpo.py refit_policy_generation(): Use the ZMQ/IPC weight transfer path (same as colocated) instead of NCCL broadcast
  3. vllm_backend.py init_collective(): Guard — if inference_world_size <= 1, set self.model_update_group = None and return
  4. vllm_backend.py update_weights_from_collective(): Return True immediately if model_update_group is None
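Points 3 and 4 together could look like the following sketch. The class, its attributes, and the simplified signatures are illustrative stand-ins for the real vLLM backend worker, not the actual NeMo RL code; `_build_process_group` is a hypothetical placeholder for the elided process-group setup:

```python
class VllmWeightSyncSketch:
    """Simplified stand-in for the vLLM backend worker (illustrative only)."""

    def __init__(self, inference_world_size, device=None):
        self.inference_world_size = inference_world_size
        self.device = device
        self.model_update_group = None

    def init_collective(self, rank_prefix, ip, port, world_size, train_world_size):
        # Guard: a single inference GPU has no peers to broadcast to, so
        # skip communicator creation and leave the group unset (point 3).
        if self.inference_world_size <= 1:
            self.model_update_group = None
            return
        # Multi-GPU path (unchanged): create the weight-update communicator.
        from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator
        pg = self._build_process_group(ip, port, world_size)  # hypothetical helper
        self.model_update_group = PyNcclCommunicator(pg, device=self.device)

    def update_weights_from_collective(self):
        # With no communicator, weights arrive via the ZMQ/IPC path instead,
        # so report success immediately (point 4).
        if self.model_update_group is None:
            return True
        ...  # normal NCCL broadcast receive (elided)
```

Because the early return happens before the vllm import, the single-GPU path never touches NCCL at all, which is what frees the EFA QPs.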

We have a working patch script (patch_vllm_skip_nccl.py) that implements all four changes. Happy to submit a PR.

Impact

  • On g6e.8xlarge: Training AllReduce goes from Socket (~1 Gbps) to EFA (~100 Gbps)
  • On instances with more EFA NICs: Frees QPs for other communicators, reduces NCCL init time
  • No behavior change for inference_world_size > 1 or colocated_inference = true

Environment

  • NeMo RL v0.5.0 (NGC container nvcr.io/nvidia/nemo-rl:v0.5.0)
  • Instance: g6e.8xlarge (1x L40S, 1x EFA NIC)
  • Also applicable to any EFA instance with limited NIC count
