Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
295 changes: 295 additions & 0 deletions examples/oke-bm-gpu-gb200-rocev2-dranet/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,295 @@
# OKE BM.GPU.GB200-v3.4 RoCEv2 dranet Demo

End-to-end demo of topologically-aware GPU + RoCEv2 NIC allocation using
[Dynamic Resource Allocation (DRA)](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)
on Oracle Kubernetes Engine (OKE) with [BM.GPU.GB200-v3.4](https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu) shapes.

## Context

### Shape: BM.GPU.GB200-v3.4

Each node has:

| Resource | Count | Detail |
|---|---|---|
| GPU | 4 x NVIDIA GB200 | 189 GB HBM3e, Blackwell architecture, NVLink-18 all-to-all |
| NIC | 8 x Mellanox ConnectX-8 | 400 Gb/s RoCEv2, 4x NDR per NIC |
| NUMA nodes | 2 | 2 GPUs + 4 NICs per NUMA node |

### GPU-NIC topology

On GB200, GPUs connect to the Grace CPU via **NVLink C2C** (chip-to-chip), while
NICs connect via PCIe. Because GPUs and NICs are on fundamentally different
interconnects, `nvidia-smi topo -m` reports **SYS** for every GPU-NIC pair:

| | GPU0 | GPU1 | GPU2 | GPU3 | NIC0 | NIC1 | NIC2 | NIC3 | NIC4 | NIC5 | NIC6 | NIC7 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|
| GPU0 | X | NV18 | NV18 | NV18 | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS |
| GPU1 | NV18 | X | NV18 | NV18 | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS |
| GPU2 | NV18 | NV18 | X | NV18 | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS |
| GPU3 | NV18 | NV18 | NV18 | X | SYS | SYS | SYS | SYS | SYS | SYS | SYS | SYS |

NIC mapping: NIC0=mlx5_0/rdma0 (NUMA 0), ... NIC3=mlx5_3/rdma3 (NUMA 0), NIC4=mlx5_5/rdma4 (NUMA 1), ... NIC7=mlx5_8/rdma7 (NUMA 1)

> **Key difference from Azure GB300:** On Azure, GPU-NIC pairs on the same NUMA
> node have **NODE** affinity. On OKE GB200, all pairs report **SYS** because the
> C2C link is not visible to the PCIe topology. Despite this, NCCL enables GDR
> via the `NCCL_NET_GDR_C2C=1` flag for NUMA-local NICs, achieving comparable
> bandwidth. The practical performance difference is NUMA-local vs cross-NUMA.

### DRA device attributes

**GPU** (driver: `gpu.nvidia.com`):

| Device | pciBusID | pcieRoot | NUMA |
|---|---|---|---|
| gpu-0 | 0008:06:00.0 | pci0008:00 | 0 |
| gpu-1 | 0009:06:00.0 | pci0009:00 | 0 |
| gpu-2 | 0018:06:00.0 | pci0018:00 | 1 |
| gpu-3 | 0019:06:00.0 | pci0019:00 | 1 |

**NIC** (driver: `dra.net`):

| Device | ifName | pciAddress | NUMA | pcieRoot |
|---|---|---|---|---|
| pci-0000-03-00-0 | rdma0 | 0000:03:00.0 | 0 | pci0000:00 |
| pci-0000-03-00-1 | rdma1 | 0000:03:00.1 | 0 | pci0000:00 |
| pci-0002-03-00-0 | rdma2 | 0002:03:00.0 | 0 | pci0002:00 |
| pci-0002-03-00-1 | rdma3 | 0002:03:00.1 | 0 | pci0002:00 |
| pci-0010-03-00-0 | rdma4 | 0010:03:00.0 | 1 | pci0010:00 |
| pci-0010-03-00-1 | rdma5 | 0010:03:00.1 | 1 | pci0010:00 |
| pci-0012-03-00-0 | rdma6 | 0012:03:00.0 | 1 | pci0012:00 |
| pci-0012-03-00-1 | rdma7 | 0012:03:00.1 | 1 | pci0012:00 |

### OKE topology attributes (oke.dra.net)

Each NIC device carries node-level RDMA topology attributes sourced from the
OCI Instance Metadata Service (`GET /opc/v2/host/`):

| Attribute | Description |
|---|---|
| `oke.dra.net/hpcIslandId` | HPC Island -- largest topology grouping (~2000 nodes) |
| `oke.dra.net/networkBlockId` | Network Block -- mid-level grouping (~64-128 nodes) |
| `oke.dra.net/localBlockId` | Local Block -- closest grouping (~8-32 nodes) |
| `oke.dra.net/rackId` | Physical rack identifier |
| `oke.dra.net/gpuMemoryFabricId` | GPU memory fabric ID (populated on GB200/GB300) |

> **Note:** Topology data must be enabled for your OCI tenancy. dranet logs
> `"Please turn on TopologyData for your Tenancy"` at startup if the `/host/`
> endpoint does not provide `rdmaTopologyData`.

### RoCEv2 and IPv6 on OKE

The ConnectX-8 NICs use **RoCEv2** (RDMA over Converged Ethernet v2). On OKE,
each RDMA NIC receives a globally-routable IPv6 address via Router Advertisement.
This address populates a routable GID in the NIC's GID table, which NCCL uses
for inter-node communication (`NCCL_IB_GID_INDEX=3`).

**Challenge:** In single-stack IPv4 Kubernetes clusters, the container runtime
sets `net.ipv6.conf.all.disable_ipv6=1` in pod namespaces. This prevents the
RA-assigned IPv6 address from being applied to RDMA NICs in the pod, leaving
only link-local GIDs (which are not routable on the OKE fabric).

**dranet fix:** The OKE cloud provider returns `EnableIPv6: true` for RDMA
devices on GPU fabric shapes. When set, dranet:

1. Soft-fails the initial IPv6 address application (EACCES due to disabled IPv6)
2. Enables IPv6 per-interface via `net.ipv6.conf.<ifname>.disable_ipv6=0`
3. Re-applies the IPv6 address, populating the routable GID at index 3

## Files

| File | Description |
|---|---|
| `resource-claim-template.yaml` | Three `ResourceClaimTemplate` objects for the three test cases |
| `mpi-job.yaml` | `MPIJob` that runs `nccl_tests/all_reduce_perf` across 2 workers |
| `resourceslice-gpu.yaml` | Live GPU `ResourceSlice` from a GB200 node (reference) |
| `resourceslice-dranet.yaml` | Live NIC `ResourceSlice` from a GB200 node (reference) |

## Installation

### 1. Uninstall the existing dranet

```bash
helm uninstall dranet -n kube-system
kubectl wait --for=delete pod -l k8s-app=dranet -n kube-system --timeout=120s
```

### 2. Install your local dranet build

Build and push your image, then install from the local Helm chart:

```bash
helm install dranet ./deployments/helm/dranet \
--namespace kube-system \
--set image.repository=<your-registry>/dranet \
--set image.tag=<your-tag> \
--set image.pullPolicy=Always
kubectl rollout status daemonset/dranet -n kube-system
```

## Usage

```bash
# Install MPI Operator (if not already installed)
kubectl apply --server-side -k "https://github.com/kubeflow/mpi-operator/manifests/overlays/standalone?ref=v0.7.0"

# Apply ResourceClaimTemplates
kubectl apply -f resource-claim-template.yaml

# Select a test case: edit mpi-job.yaml resourceClaimTemplateName to one of:
# 1nic-aligned | 2nic-aligned | 1nic-unaligned
kubectl apply -f mpi-job.yaml

# Wait for workers then stream launcher logs
kubectl wait --for=condition=ready pod \
-l training.kubeflow.org/job-name=nccl-test-dra,training.kubeflow.org/job-role=worker \
--timeout=300s
launcher=$(kubectl get pods \
-l training.kubeflow.org/job-name=nccl-test-dra,training.kubeflow.org/job-role=launcher \
-o jsonpath='{.items[0].metadata.name}')
kubectl logs -f "${launcher}"
```

## ResourceClaimTemplates

Three templates are defined, each allocating 1 GPU + N NICs per worker pod.
Update `mpi-job.yaml` `resourceClaimTemplateName:` to switch between them.

### `1nic-aligned` -- 1 GPU + 1 NIC, same NUMA

gpu-0 (`0008:06:00.0`, NUMA 0) + rdma3 (`0002:03:00.1`, NUMA 0). NCCL enables
GDR via C2C with `NCCL_NET_GDR_C2C=1`, transport: `NET/IB/0/GDRDMA(PCI)`.

### `2nic-aligned` -- 1 GPU + 2 NICs, same NUMA

gpu-0 (`0008:06:00.0`) + rdma2 + rdma3 (both NUMA 0, PCIe domain `0002`).
Doubles available RoCEv2 bandwidth and NCCL channels (8 vs 4).

### `1nic-unaligned` -- 1 GPU + 1 NIC, cross-NUMA

gpu-0 (`0008:06:00.0`, NUMA 0) + rdma4 (`0010:03:00.0`, NUMA 1). GDR is
disabled by NCCL; expect significantly lower bandwidth due to cross-NUMA memory
traffic and fewer NCCL channels (2 vs 4).

## Running the full test suite

Each test requires deleting the previous MPIJob since the resource claims are
immutable. Between tests, orphaned NICs may need PCI rebinding (see next section).

```bash
# --- Test 1: 1nic-aligned ---
# Ensure resourceClaimTemplateName: 1nic-aligned in mpi-job.yaml
kubectl apply -f resource-claim-template.yaml
kubectl apply -f mpi-job.yaml
# Wait for results ...
kubectl delete mpijob nccl-test-dra

# --- Recover orphaned NICs before next test ---
# See "Recovering orphaned RDMA NICs" below

# --- Test 2: 1nic-unaligned ---
# Edit mpi-job.yaml: resourceClaimTemplateName: 1nic-unaligned
kubectl apply -f mpi-job.yaml
# Wait for results ...
kubectl delete mpijob nccl-test-dra

# --- Recover orphaned NICs before next test ---

# --- Test 3: 2nic-aligned ---
# Edit mpi-job.yaml: resourceClaimTemplateName: 2nic-aligned
kubectl apply -f mpi-job.yaml
# Wait for results ...
kubectl delete mpijob nccl-test-dra
```

## Recovering orphaned RDMA NICs

When a pod is deleted, dranet may not return the RDMA NIC from the pod namespace
to the host namespace. The NIC disappears from both the host and the ResourceSlice
(`ifName: null, rdma: false`). This is a pre-existing dranet bug, not
OKE-specific.

**Symptoms:** Workers stuck in `Pending` with `cannot allocate all claims`.

**Check which NICs are missing:**

```bash
kubectl get resourceslice -o json | python3 -c "
import json, sys
data = json.load(sys.stdin)
for rs in data['items']:
if rs['spec'].get('driver') != 'dra.net': continue
node = rs['spec']['nodeName']
for d in rs['spec'].get('devices', []):
attrs = d.get('attributes', {})
if attrs.get('dra.net/rdma', {}).get('bool') and \
not attrs.get('dra.net/virtual', {}).get('bool', True):
print(f'{node}: {d[\"name\"]} ifName={attrs.get(\"dra.net/ifName\", {}).get(\"string\", \"?\")}')
"
```

**Recover via PCI rebind** (requires a privileged debug pod on each GPU node):

```bash
# Start a debug pod (or use an existing one)
kubectl debug node/<node-ip> --image=busybox -it -- sh

# Inside the debug pod, rebind the orphaned NIC's PCI address:
chroot /host
echo "0002:03:00.1" > /sys/bus/pci/drivers/mlx5_core/unbind
sleep 2
echo "0002:03:00.1" > /sys/bus/pci/drivers/mlx5_core/bind
```

Common PCI addresses on BM.GPU.GB200-v3.4:

| ifName | PCI Address | NUMA |
|---|---|---|
| rdma0 | 0000:03:00.0 | 0 |
| rdma1 | 0000:03:00.1 | 0 |
| rdma2 | 0002:03:00.0 | 0 |
| rdma3 | 0002:03:00.1 | 0 |
| rdma4 | 0010:03:00.0 | 1 |
| rdma5 | 0010:03:00.1 | 1 |
| rdma6 | 0012:03:00.0 | 1 |
| rdma7 | 0012:03:00.1 | 1 |

Repeat the unbind/bind for every orphaned NIC on **every GPU node**. Wait ~15
seconds for dranet to rescan, then verify the NIC reappears in the ResourceSlice.

## Benchmark Results

2-node `all_reduce_perf` (`-b 512M -e 8G -f 2 -g 1`), 1 GPU per worker.
Transport: `NET/IB/GDRDMA(PCI)` for NUMA-aligned, `NET/IB` for cross-NUMA.

| Template | GPU | NIC(s) | NUMA relation | Channels | GDR | Avg busbw |
|---|---|---|---|---|---|---|
| `1nic-aligned` | gpu-0 (NUMA 0) | rdma3 (NUMA 0) | same | 4 | yes | **~46 GB/s** |
| `2nic-aligned` | gpu-0 (NUMA 0) | rdma2 + rdma3 (NUMA 0) | same | 8 | yes | **~96 GB/s** |
| `1nic-unaligned` | gpu-0 (NUMA 0) | rdma4 (NUMA 1) | cross | 2 | no | **~25 GB/s** |

### Key observations

**NUMA alignment enables GDR (~1.7x):**
Cross-NUMA placement degrades performance from ~42 GB/s to ~25 GB/s with the
same NIC count. Two compounding penalties:

1. **GDR disabled** -- NCCL falls back from `GDRDMA(PCI)` to staging through
host memory when the NIC is on a different NUMA node from the GPU. On GB200
this is controlled by `NCCL_NET_GDR_C2C=1` which only enables GDR when NCCL
detects a viable C2C path (same NUMA node).
2. **Fewer channels** -- NCCL allocates 2 channels for cross-NUMA NICs vs 4
for NUMA-local NICs.

**2 NICs doubles bandwidth (~2.2x):**
Adding a second NUMA-aligned NIC increases bandwidth from ~42 GB/s to ~91 GB/s.
NCCL doubles the channel count (8 vs 4) and stripes data across both NICs.
The `count: 2` + CEL-selector pattern in the `2nic-aligned` template is the
idiomatic DRA approach for multi-device allocation.

**Isolation confirmed:**
In all cases, the pod sees only the allocated `/dev/infiniband/uverbs*` and
`/dev/infiniband/umad*` devices -- without `privileged: true`. Isolation is
enforced by the dranet NRI plugin injecting only the char devices that correspond
to the DRA-allocated NIC(s).
92 changes: 92 additions & 0 deletions examples/oke-bm-gpu-gb200-rocev2-dranet/mpi-job.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
name: nccl-test-dra
spec:
slotsPerWorker: 1
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
containers:
- name: nccl
image: iad.ocir.io/idxzjcdglx2s/nccl-tests:cuda-13.1.1-ubuntu-24.04-nccl-2.29.3-020926.1
command: ["/bin/bash", "-c"]
args:
- |
NUM_GPUS=1
NUM_HOSTS=$(sed -n '$=' /etc/mpi/hostfile)
NP=$(($NUM_HOSTS*$NUM_GPUS))
while ! (for host in $(awk '{print $1}' /etc/mpi/hostfile); do
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no $host exit 2>/dev/null || exit 1
done); do
echo "Waiting for workers to be ready..."
sleep 5
done
echo "All workers ready, launching NCCL test across $NUM_HOSTS nodes ($NP ranks)"
mpirun \
--allow-run-as-root \
--bind-to numa \
--mca pml ucx \
--mca coll ^hcoll \
-x LD_LIBRARY_PATH \
-x UCX_NET_DEVICES=eth0 \
-x NCCL_DEBUG=INFO \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_MNNVL_ENABLE=0 \
-x NCCL_NET_GDR_C2C=1 \
-x NCCL_IB_GID_INDEX=3 \
-x NCCL_IB_TC=41 \
-x NCCL_IB_SL=0 \
-x NCCL_IB_TIMEOUT=22 \
-x RX_QUEUE_LEN=8192 \
-x IB_RX_QUEUE_LEN=8192 \
-x NCCL_IB_QPS_PER_CONNECTION=4 \
-x NCCL_IB_SPLIT_DATA_ON_QPS=0 \
-x NCCL_BUFFSIZE=16777216 \
-x NCCL_DMABUF_ENABLE=1 \
-x NCCL_NET_PLUGIN=sys \
-x NCCL_NVLS_ENABLE=0 \
-x HCOLL_ENABLE_MCAST_ALL=0 \
-x coll_hcoll_enable=0 \
-np $NP \
/workspace/nccl-tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1 -c 0
Worker:
replicas: 2
template:
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
training.kubeflow.org/job-name: nccl-test-dra
training.kubeflow.org/job-role: worker
topologyKey: kubernetes.io/hostname
automountServiceAccountToken: false
volumes:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 32Gi
resourceClaims:
- name: gpu-nic
resourceClaimTemplateName: 1nic-aligned
containers:
- name: nccl
image: iad.ocir.io/idxzjcdglx2s/nccl-tests:cuda-13.1.1-ubuntu-24.04-nccl-2.29.3-020926.1
volumeMounts:
- mountPath: /dev/shm
name: shm
resources:
claims:
- name: gpu-nic
securityContext:
capabilities:
add:
- IPC_LOCK
tolerations:
- key: "nvidia.com/gpu"
operator: "Exists"
effect: "NoSchedule"
Loading
Loading