Feat: Add real topology data and Working ipv6. by dkennetzoracle · Pull Request #138 · kubernetes-sigs/dranet

dkennetzoracle · 2026-04-08T05:34:57Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Joint PR (happy to split up, just wanted to get eyes on it and work with you all to do so).

OKE IMDS topology data - (HPC island, network block, local block, rack, GPU memory fabric) is exposed as device attributes for topology-aware scheduling.
OKE Example GB200 - adds an example doc for GB200 with performance numbers and importance of aligned NIC selection
OCI's RA daemon injects IPv6 default routes - onto every RDMA NIC, causing them to be misclassified as uplink interfaces and filtered from discovery. A new NonUplinkChecker interface allows the OKE provider to override this for RDMA devices on GPU fabric shapes.
Pod namespaces have IPv6 disabled (net.ipv6.conf.all.disable_ipv6=1), but RoCEv2 on OKE requires a globally-routable IPv6 address on RDMA interfaces to populate a routable GID (GID index >= 2). A new EnableIPv6 field on InterfaceConfig triggers per-interface IPv6 enablement in the pod namespace via sysctl override, followed by re-application of the IPv6 address.

Which issue(s) this PR is related to:

#111 - doesn't close

Special notes for your reviewer:

Note: Again, happy to split this up, but did want to get it out to start the process if needed. dranet on OKE won't work for all GPU SKUs without this.

Why OKE needs these changes but AKS and GCE do not

The IPv6 and uplink-filtering changes are driven by OKE's bare-metal
infrastructure model, which differs fundamentally from how AKS and GCE expose
RDMA networking:

AKS (Azure): GB300 nodes use InfiniBand VFs, not RoCEv2. IB GIDs are derived
from the port GUID (a hardware identifier burned into the HCA), not from IP
addresses. There is no dependency on IPv6 addressing for GID generation --
ibv_modify_qp resolves paths using GUID-based GIDs and the IB subnet manager.
Because IB VFs have no Ethernet netdev, there is no RA daemon, no IPv6 address,
and no default-route side-effect to filter around. The dranet IB-only path
(char-device injection without netdev movement) handles this cleanly.

GCE (Google Cloud): GPU nodes use SR-IOV VFs for RDMA. The VFs are presented
as standard Ethernet NICs with IPv4 addresses managed by the GCE metadata
service. There is no RA daemon injecting IPv6 routes, and GIDs (when RoCEv2 is
used) are populated from the IPv4 address. The pod namespace does not need IPv6
enabled because the GID is IPv4-derived.

OKE (Oracle Cloud) -- why it is different: BM.GPU.GB200-v3.4 are bare-metal
shapes. The 8 ConnectX-8 NICs are full physical functions (PFs), not SR-IOV VFs.
OCI manages the RDMA fabric at the infrastructure level by running an RA
(Router Advertisement) daemon on the host that:

Assigns a globally-routable IPv6 address to each RDMA NIC (e.g.
fdcd:0:2a29:3032:364e:3a64:6758:428a/64). This address populates a RoCEv2
GID at index >= 2 in the NIC's GID table. The OCI fabric routes traffic using
these IPv6-derived GIDs -- link-local GIDs (fe80::) are not routable
(ENETUNREACH on ibv_modify_qp).
Injects an IPv6 default route via RA (proto ra) into the main routing table
for every RDMA NIC. This is an infrastructure side-effect, not a signal that
the NIC is an uplink/transit interface. Without the NonUplinkChecker
override, dranet's gateway-interface filter removes these NICs from the
ResourceSlice entirely.

The IPv6 problem compounds because Kubernetes single-stack IPv4 clusters (the
default on OKE) configure net.ipv6.conf.all.disable_ipv6=1 in pod network
namespaces. When dranet moves the RDMA NIC into the pod namespace and attempts
to apply the RA-assigned IPv6 address, the kernel returns EACCES. Without the
address, the NIC has no routable GID and NCCL falls back to TCP or fails with
ENETUNREACH.

The EnableIPv6 mechanism solves this with a per-interface sysctl override
(net.ipv6.conf.<ifname>.disable_ipv6=0) that enables IPv6 on just the RDMA
NIC without affecting the pod's primary network interface. This is safe because
the RDMA NIC is isolated in the pod namespace and the IPv6 address is only used
for GID generation, not for IP-level routing.

None of this applies to AKS (IB, GUID-based GIDs, no netdev) or GCE (VFs,
IPv4-based GIDs, no RA daemon).

Architecture context

On GB200, GPUs connect to the Grace CPU via NVLink C2C while NICs connect via
PCIe. nvidia-smi topo -m reports SYS for all GPU-NIC pairs. Despite this,
NCCL enables GDR for NUMA-local NICs via NCCL_NET_GDR_C2C=1. The meaningful
topology distinction is same-NUMA vs cross-NUMA, not PCIe host bridge alignment.

IPv6 flow

The EnableIPv6 mechanism is intentionally outside nsAttachNetdev to keep
that function general-purpose. The sequence is:

nsAttachNetdev moves NIC to pod namespace, soft-fails IPv6 address (EACCES)
attachNetdevToNS enables IPv6 per-interface sysctl, re-applies IPv6 address
RDMA device generates routable GID from the applied IPv6 address

Known issue -- NIC orphaning - #137

RDMA NICs are not reliably returned to the host namespace on pod deletion. This
is a pre-existing CRI-O / dranet bug (not introduced here). The example README documents
the PCI rebind recovery procedure.

Benchmark results

2-node all_reduce_perf, 1 GPU/worker, BM.GPU.GB200-v3.4:

Template	NIC(s)	NUMA	Channels	GDR	Avg busbw
`1nic-aligned`	rdma3 (NUMA 0)	same	4	yes	~46 GB/s
`2nic-aligned`	rdma2+rdma3 (NUMA 0)	same	8	yes	~96 GB/s
`1nic-unaligned`	rdma4 (NUMA 1)	cross	2	no	~25 GB/s

Does this PR introduce a user-facing change?

Add OKE cloud provider support for BM.GPU.GB200 RoCEv2 RDMA networking. dranet
now correctly discovers RDMA NICs on OKE GPU shapes (despite RA-injected default
routes), enables per-interface IPv6 in pod namespaces for routable RoCEv2 GID
generation, and exposes OCI RDMA topology attributes (HPC island, network block,
local block, rack, GPU memory fabric) for topology-aware scheduling.

k8s-ci-robot · 2026-04-08T05:35:08Z

Hi @dkennetzoracle. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

dkennetzoracle · 2026-04-08T05:57:20Z

Ah, I am skipping too much ipv6 traffic. It's going to be hard for me to get access to a gb again. Maybe I split this up and change the example to an H100 until I can get access to another gb?

gauravkghildiyal · 2026-04-08T07:05:22Z

Thanks for the change @dkennetzoracle. I've not looked into the changes themselves, but just skimming through the description, wanted to share that what you're describing sounds exactly like GCPs GB300 bare metal (public docs) which utilizes DRANET. I still expect there to be differences between platforms, but one of the first thoughts that came to my mind was that if OKE supports dual-stack IPs, the disable_ipv6 sysctl should get enabled as part of your dual-stack enablement. Curious if that's not the case

dkennetzoracle · 2026-04-08T14:30:22Z

@gauravkghildiyal thanks for clarifying that GCP runs BMs, I didn't know this! OKE does support dual-stack, the cluster I was testing on was just running single stack. I've reached out internally to see if we could get another one setup with dual stack which may alleviate a lot of this.

todo.md

anson627 · 2026-04-08T16:52:29Z

pkg/cloudprovider/oke/oke_test.go

-		Shape:              "BM.GPU.H100.8",
-		FaultDomain:        "FAULT-DOMAIN-1",
-		AvailabilityDomain: "TrcQ:US-ASHBURN-AD-2",
+		HPCIslandId:    "fake-island-id",


add test case to cover new code path where GpuMemoryFabric != "" and id.RDMA == true should return a config with EnableIPv6: true

anson627 · 2026-04-08T16:53:23Z

pkg/cloudprovider/oke/oke.go

 const (
 	OKEAttrPrefix = "oke.dra.net"

-	AttrOKEShape              = OKEAttrPrefix + "/" + "shape"


removing these fields could cause existing usage to break, right?

They will, but they were placeholders. The API call made there doesn't return anything useful for topology awareness and is a separate api call then the topology data one. To maintain both, I'd need to call out to 2 APIs, which is fine and still provides some value because it gives us the PCIe of the NICs purely from dranet but it doesn't provide any value beyond that (for topology). My preference for someone on OCI to fail loudly if topology awareness is not enabled, rather than just logging it and moving on, which is why I removed the old stuff.

This is still very new on the OCI side, so I don't expect to have any users for our cloud yet so I'm not concerned with the backwards compatibility aspect.

make sense, thanks for the clarification!

anson627 · 2026-04-08T16:55:06Z

pkg/cloudprovider/oke/oke.go

 		if resp.StatusCode != http.StatusOK {
-			klog.Infof("OCI IMDS returned status %d ... retrying", resp.StatusCode)
-			return false, nil
+			return false, fmt.Errorf("Please turn on TopologyData for your Tenancy")


would there be any other error can be retried, e.g. network error?

I modified this to retry rather than return an error. If the context deadline is exceeded, it'll error.

anson627 · 2026-04-08T16:58:18Z

pkg/cloudprovider/oke/oke.go

+// OCIDs are ~90+ characters; the suffix is always 60 characters and is unique
+// per resource within a tenancy, making it safe to use as an attribute value.
+// Non-OCID strings (e.g. the rackId hex hash) are returned unchanged.
+func ocidSuffix(s string) string {


nit: add OCID validation: e.g. 60 characters length, string contains '.'?

Adding the empty string return because GpuMemoryFabric is only present for rack-scale shapes like Grace Blackwell. Basically, if GpuMemoryFabric exists, it should proceed through the validation. If not, it should just get skipped.

dkennetzoracle · 2026-04-09T14:34:16Z

Thanks for the review @anson627 - I am addressing some of the feedback now!

aojea · 2026-04-09T19:57:38Z

/ok-to-test

k8s-ci-robot · 2026-04-09T21:45:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anson627, dkennetzoracle
Once this PR has been reviewed and has the lgtm label, please assign bowei for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

anson627 · 2026-04-09T23:45:17Z

/lgtm

aojea · 2026-04-10T11:34:15Z

pkg/driver/dra_hooks.go

+			// Discard IPv6 RA-assigned routes. These are injected by the cloud
+			// provider's RA daemon as an infrastructure side-effect (e.g. OCI on
+			// RDMA NICs) and must not be propagated into pod namespaces.
+			if route.Protocol == unix.RTPROT_RA {


I'm not sure we can generalize this pattern

aojea · 2026-04-10T11:36:41Z

pkg/cloudprovider/cloud.go

+// NonUplinkChecker is an optional extension to CloudInstance. A provider may
+// implement this to exempt specific device classes from the default-gateway
+// uplink filter, even when those interfaces carry a default route. This is
+// needed on platforms where infrastructure daemons (e.g. RA) inject default
+// routes onto workload RDMA NICs as a side-effect of their setup.
+type NonUplinkChecker interface {
+	// IsNonUplink returns true if the device should be included in the
+	// ResourceSlice regardless of any default gateway route on that interface.
+	IsNonUplink(id DeviceIdentifiers) bool


we have a flag for providers to implement custom filtering based on attributes, can't we use that?

This model of special casting by creating interfaces does not look very sustainable long term

A challenge in this case with using the FilterDevices was that the CEL filter runs after the gwInterfaces exclusion so by the time FilterDevices sees the devices, the RDMA NICs with default RA routes have already been removed so the filter can't include them back..

Should probably relate to your comment about gwInterfaces and improving that.

aojea · 2026-04-10T11:37:43Z

pkg/driver/hostdevice.go

+			// net.ipv6.conf.all.disable_ipv6=1). Soft-fail so that the caller can enable
+			// IPv6 per-interface and re-apply the address after moving the device.
+			if ip.To4() == nil && (errors.Is(err, unix.EACCES) || errors.Is(err, unix.EPERM)) {
+				klog.V(4).Infof("skipping IPv6 address %s on %s in namespace %s: IPv6 is disabled (will retry after per-interface enable)", address, nsLink.Attrs().Name, containerNsPAth)


I can't remember right now the exactl logic, where is this retried?

aojea · 2026-04-10T11:38:47Z

pkg/driver/nri_hooks.go

+	// If EnableIPv6 is set, enable IPv6 per-interface and re-apply any IPv6 addresses
+	// that were skipped in nsAttachNetdev because IPv6 was globally disabled in the pod
+	// namespace. This is needed on platforms such as OKE where RDMA NICs require a
+	// globally-routable IPv6 address to populate a routable RoCEv2 GID.
+	if config.NetworkInterfaceConfigInPod.Interface.EnableIPv6 != nil &&
+		*config.NetworkInterfaceConfigInPod.Interface.EnableIPv6 {
+		if err := enableIPv6ForInterface(ns, ifNameInNs); err != nil {
+			return fmt.Errorf("failed to enable IPv6 for %s in namespace %s: %w", ifNameInNs, ns, err)
+		}
+		if err := reapplyIPv6Addresses(ns, ifNameInNs, config.NetworkInterfaceConfigInPod.Interface.Addresses); err != nil {
+			return fmt.Errorf("failed to re-apply IPv6 addresses for %s in namespace %s: %w", ifNameInNs, ns, err)
+		}
+	}
+


oh, I see, https://github.com/kubernetes-sigs/dranet/pull/138/changes#r3063956282

can't we do better and set the sysctl before the addresses?

Yeah, good point. We should be able to cut out the reapply by just doing:

enableIPv6ForInterface() nsAttachNetdev()

Oh but that has some problems. Currently enableIPv6ForInterface operates inside the pod ns on the interface after it's been moved so we'd need to restructure so the sysctl is set on the interface name inside the pod netns before address application.

nsAttachNetdev does move + address in one call, so we'd either need to:

split nsAttachNetDev into move + configure phases

have it accept a pre-configure callback

enable IPV6 per-instance inside nsAttachNetDev when EnableIPv6 is set before applying addresses

The last one is probably easiest.

agree handling EnableIPv6 inside nsAttachNetdev directly — since it already does move + address in one call, this should avoid the soft-fail + reapply dance entirely

aojea · 2026-04-10T11:42:01Z

pkg/inventory/db.go

+			if hasChecker {
+				id := cloudprovider.DeviceIdentifiers{Name: device.Name}
+				if macAttr, ok := device.Attributes[apis.AttrMac]; ok && macAttr.StringValue != nil {
+					id.MAC = *macAttr.StringValue
+				}
+				if pciAttr, ok := device.Attributes[apis.AttrPCIAddress]; ok && pciAttr.StringValue != nil {
+					id.PCIAddress = *pciAttr.StringValue
+				}
+				if rdmaAttr, ok := device.Attributes[apis.AttrRDMA]; ok && rdmaAttr.BoolValue != nil {
+					id.RDMA = *rdmaAttr.BoolValue
+				}
+				if checker.IsNonUplink(id) {
+					klog.V(4).Infof("Interface %s has a default gateway route but is classified as a non-uplink by the cloud provider; including in discovery", *ifName)
+					filteredDevices = append(filteredDevices, device)
+					continue
+				}
+			}


this is the thing I prefer to think more about it... the gwInterfaces was added based on an assumption that it seems to no longer be true, can we make the gwInterfaces detection logic more resilient? I want us to think in a better architecture that is sustainable long term rather in just solving the problem at hand

aojea · 2026-04-10T11:43:19Z

/hold

I'm +1 on the PR but I want us to think more holistically, holding to avoid an unexpected merge?

dkennetzoracle · 2026-04-10T19:21:20Z

@aojea thanks for the review! To your point:

I'm +1 on the PR but I want us to think more holistically, holding to avoid an unexpected merge?

I agree with this and many of the things added in the PR feel like patches rather than design decisions. I just can't test on other systems outside of OCI, so it is hard for me to abstract some of this stuff. I'll spend some time reviewing and provide some feedback for what could be a bit more robust.

anson627 · 2026-04-10T19:51:51Z

@aojea thanks for the review! To your point:

I'm +1 on the PR but I want us to think more holistically, holding to avoid an unexpected merge?

I agree with this and many of the things added in the PR feel like patches rather than design decisions. I just can't test on other systems outside of OCI, so it is hard for me to abstract some of this stuff. I'll spend some time reviewing and provide some feedback for what could be a bit more robust.

maybe we split OCI specific part out into a separate PR, which is relatively straightforward and can merge quickly, and spend some more time on refactoring & testing changes on the common code path

dkennetzoracle · 2026-04-10T20:06:26Z

@aojea - aligned! I thought that might be the outcome of the PR. Are you okay with me including the OCI + example in this PR even though example isn't reproducible until we get ipv6 stuff in? Or should I put the example in the ipv6 PR?

k8s-ci-robot · 2026-04-13T15:46:56Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

aojea · 2026-04-14T18:32:41Z

@dkennetzoracle we are planning to do a release soon? is this ipv6 something you will want as part of the release?

dkennetzoracle · 2026-04-14T18:35:09Z

@aojea - possibly, how long are the release cycles generally? If I missed this one, how long until the next one?

aojea · 2026-04-14T18:42:22Z

@aojea - possibly, how long are the release cycles generally? If I missed this one, how long until the next one?

right now releases are ad hoc, in this case the only friction is about the mltiple gateway UX, the EnableIPv6 sounds good, is clearly an interface option and we have other sysctl interface options ... I think we can improve the implementation to be more streamlined and not rely on retries

The multiple gateway thing I prefer we put some more thought on it, the original idea is that we do make it sane by default, so no customer breaks their nodes by putting the vm interface on the pod ... but with the situation you are describing is unclear to me if we can be smart about it ... I feel like we should be able to

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 8, 2026

k8s-ci-robot requested review from MikeZappa87 and anson627 April 8, 2026 05:35

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 8, 2026

k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 8, 2026

anson627 reviewed Apr 8, 2026

View reviewed changes

Feat: Add real topology data and Working ipv6.

02a3918

dkennetzoracle force-pushed the oci_topo branch from 36451ef to 02a3918 Compare April 9, 2026 17:06

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2026

anson627 approved these changes Apr 9, 2026

View reviewed changes

k8s-ci-robot assigned anson627 Apr 9, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 9, 2026

aojea reviewed Apr 10, 2026

View reviewed changes

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2026

dkennetzoracle mentioned this pull request Apr 13, 2026

feat: Update oke topology to match customer tenancy data, add example #142

Merged

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 13, 2026

Conversation

dkennetzoracle commented Apr 8, 2026

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR is related to:

Special notes for your reviewer:

Why OKE needs these changes but AKS and GCE do not

Architecture context

IPv6 flow

Known issue -- NIC orphaning - #137

Benchmark results

Does this PR introduce a user-facing change?

Uh oh!

k8s-ci-robot commented Apr 8, 2026

Uh oh!

dkennetzoracle commented Apr 8, 2026

Uh oh!

gauravkghildiyal commented Apr 8, 2026

Uh oh!

dkennetzoracle commented Apr 8, 2026

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dkennetzoracle commented Apr 9, 2026

Uh oh!

aojea commented Apr 9, 2026

Uh oh!

k8s-ci-robot commented Apr 9, 2026

Uh oh!

anson627 commented Apr 9, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anson627 Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aojea commented Apr 10, 2026

Uh oh!

dkennetzoracle commented Apr 10, 2026

Uh oh!

anson627 commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dkennetzoracle commented Apr 10, 2026

Uh oh!

k8s-ci-robot commented Apr 13, 2026

Uh oh!

aojea commented Apr 14, 2026

anson627 Apr 10, 2026 •

edited

Loading

anson627 commented Apr 10, 2026 •

edited

Loading