
Feat: Add real topology data and working IPv6 #138

Open
dkennetzoracle wants to merge 1 commit into kubernetes-sigs:main from dkennetzoracle:oci_topo

Conversation

@dkennetzoracle
Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Joint PR (happy to split up, just wanted to get eyes on it and work with you all to do so).

  1. OKE IMDS topology data - HPC island, network block, local block, rack, and GPU memory fabric are exposed as device attributes for topology-aware scheduling.
  2. OKE example for GB200 - adds an example doc for GB200 with performance numbers and the importance of aligned NIC selection.
  3. OCI's RA daemon injects IPv6 default routes onto every RDMA NIC, causing them to be misclassified as uplink interfaces and filtered from discovery. A new NonUplinkChecker interface allows the OKE provider to override this for RDMA devices on GPU fabric shapes.
  4. Pod namespaces have IPv6 disabled (net.ipv6.conf.all.disable_ipv6=1), but RoCEv2 on OKE requires a globally-routable IPv6 address on RDMA interfaces to populate a routable GID (GID index >= 2). A new EnableIPv6 field on InterfaceConfig triggers per-interface IPv6 enablement in the pod namespace via a sysctl override, followed by re-application of the IPv6 address.

Which issue(s) this PR is related to:

#111 - doesn't close

Special notes for your reviewer:

Note: Again, happy to split this up, but did want to get it out to start the process if needed. dranet on OKE won't work for all GPU SKUs without this.

Why OKE needs these changes but AKS and GCE do not

The IPv6 and uplink-filtering changes are driven by OKE's bare-metal
infrastructure model, which differs fundamentally from how AKS and GCE expose
RDMA networking:

AKS (Azure): GB300 nodes use InfiniBand VFs, not RoCEv2. IB GIDs are derived
from the port GUID (a hardware identifier burned into the HCA), not from IP
addresses. There is no dependency on IPv6 addressing for GID generation --
ibv_modify_qp resolves paths using GUID-based GIDs and the IB subnet manager.
Because IB VFs have no Ethernet netdev, there is no RA daemon, no IPv6 address,
and no default-route side-effect to filter around. The dranet IB-only path
(char-device injection without netdev movement) handles this cleanly.

GCE (Google Cloud): GPU nodes use SR-IOV VFs for RDMA. The VFs are presented
as standard Ethernet NICs with IPv4 addresses managed by the GCE metadata
service. There is no RA daemon injecting IPv6 routes, and GIDs (when RoCEv2 is
used) are populated from the IPv4 address. The pod namespace does not need IPv6
enabled because the GID is IPv4-derived.

OKE (Oracle Cloud) -- why it is different: BM.GPU.GB200-v3.4 is a bare-metal
shape. The 8 ConnectX-8 NICs are full physical functions (PFs), not SR-IOV VFs.
OCI manages the RDMA fabric at the infrastructure level by running an RA
(Router Advertisement) daemon on the host that:

  1. Assigns a globally-routable IPv6 address to each RDMA NIC (e.g.
    fdcd:0:2a29:3032:364e:3a64:6758:428a/64). This address populates a RoCEv2
    GID at index >= 2 in the NIC's GID table. The OCI fabric routes traffic using
    these IPv6-derived GIDs -- link-local GIDs (fe80::) are not routable
    (ENETUNREACH on ibv_modify_qp).

  2. Injects an IPv6 default route via RA (proto ra) into the main routing table
    for every RDMA NIC. This is an infrastructure side-effect, not a signal that
    the NIC is an uplink/transit interface. Without the NonUplinkChecker
    override, dranet's gateway-interface filter removes these NICs from the
    ResourceSlice entirely.

The IPv6 problem compounds because Kubernetes single-stack IPv4 clusters (the
default on OKE) configure net.ipv6.conf.all.disable_ipv6=1 in pod network
namespaces. When dranet moves the RDMA NIC into the pod namespace and attempts
to apply the RA-assigned IPv6 address, the kernel returns EACCES. Without the
address, the NIC has no routable GID and NCCL falls back to TCP or fails with
ENETUNREACH.

The EnableIPv6 mechanism solves this with a per-interface sysctl override
(net.ipv6.conf.<ifname>.disable_ipv6=0) that enables IPv6 on just the RDMA
NIC without affecting the pod's primary network interface. This is safe because
the RDMA NIC is isolated in the pod namespace and the IPv6 address is only used
for GID generation, not for IP-level routing.

None of this applies to AKS (IB, GUID-based GIDs, no netdev) or GCE (VFs,
IPv4-based GIDs, no RA daemon).

Architecture context

On GB200, GPUs connect to the Grace CPU via NVLink C2C while NICs connect via
PCIe. nvidia-smi topo -m reports SYS for all GPU-NIC pairs. Despite this,
NCCL enables GDR for NUMA-local NICs via NCCL_NET_GDR_C2C=1. The meaningful
topology distinction is same-NUMA vs cross-NUMA, not PCIe host bridge alignment.

IPv6 flow

The EnableIPv6 mechanism is intentionally outside nsAttachNetdev to keep
that function general-purpose. The sequence is:

  1. nsAttachNetdev moves NIC to pod namespace, soft-fails IPv6 address (EACCES)
  2. attachNetdevToNS enables IPv6 per-interface sysctl, re-applies IPv6 address
  3. RDMA device generates routable GID from the applied IPv6 address

Known issue: NIC orphaning (#137)

RDMA NICs are not reliably returned to the host namespace on pod deletion. This
is a pre-existing CRI-O / dranet bug (not introduced here). The example README documents
the PCI rebind recovery procedure.

Benchmark results

2-node all_reduce_perf, 1 GPU/worker, BM.GPU.GB200-v3.4:

| Template | NIC(s) | NUMA | Channels | GDR | Avg busbw |
|---|---|---|---|---|---|
| 1nic-aligned | rdma3 (NUMA 0) | same | 4 | yes | ~46 GB/s |
| 2nic-aligned | rdma2+rdma3 (NUMA 0) | same | 8 | yes | ~96 GB/s |
| 1nic-unaligned | rdma4 (NUMA 1) | cross | 2 | no | ~25 GB/s |

Does this PR introduce a user-facing change?

Add OKE cloud provider support for BM.GPU.GB200 RoCEv2 RDMA networking. dranet
now correctly discovers RDMA NICs on OKE GPU shapes (despite RA-injected default
routes), enables per-interface IPv6 in pod namespaces for routable RoCEv2 GID
generation, and exposes OCI RDMA topology attributes (HPC island, network block,
local block, rack, GPU memory fabric) for topology-aware scheduling.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 8, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 8, 2026
@k8s-ci-robot
Contributor

Hi @dkennetzoracle. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 8, 2026
@dkennetzoracle
Contributor Author

Ah, I am skipping too much IPv6 traffic. It's going to be hard for me to get access to a GB200 again. Maybe I split this up and change the example to an H100 until I can get access to another GB200?

@gauravkghildiyal
Member

Thanks for the change @dkennetzoracle. I've not looked into the changes themselves, but just skimming through the description, I wanted to share that what you're describing sounds exactly like GCP's GB300 bare metal (public docs), which utilizes DRANET. I still expect there to be differences between platforms, but one of the first thoughts that came to my mind was that if OKE supports dual-stack IPs, the disable_ipv6 sysctl should get enabled as part of your dual-stack enablement. Curious if that's not the case.

@dkennetzoracle
Contributor Author

@gauravkghildiyal thanks for clarifying that GCP runs BMs, I didn't know this! OKE does support dual-stack; the cluster I was testing on was just running single-stack. I've reached out internally to see if we could get another one set up with dual stack, which may alleviate a lot of this.

Shape: "BM.GPU.H100.8",
FaultDomain: "FAULT-DOMAIN-1",
AvailabilityDomain: "TrcQ:US-ASHBURN-AD-2",
HPCIslandId: "fake-island-id",
Contributor

Add a test case to cover the new code path: GpuMemoryFabric != "" and id.RDMA == true should return a config with EnableIPv6: true.

Contributor Author

Done!

const (
OKEAttrPrefix = "oke.dra.net"

AttrOKEShape = OKEAttrPrefix + "/" + "shape"
Contributor

removing these fields could cause existing usage to break, right?

Contributor Author

They will, but they were placeholders. The API call made there doesn't return anything useful for topology awareness and is a separate API call from the topology-data one. To maintain both, I'd need to call out to 2 APIs, which is fine and still provides some value because it gives us the PCIe of the NICs purely from dranet, but it doesn't provide any value beyond that (for topology). My preference is for someone on OCI to fail loudly if topology awareness is not enabled, rather than just logging it and moving on, which is why I removed the old stuff.

This is still very new on the OCI side, so I don't expect to have any users for our cloud yet so I'm not concerned with the backwards compatibility aspect.

Contributor

makes sense, thanks for the clarification!

if resp.StatusCode != http.StatusOK {
klog.Infof("OCI IMDS returned status %d ... retrying", resp.StatusCode)
return false, nil
return false, fmt.Errorf("Please turn on TopologyData for your Tenancy")
Contributor

would there be any other errors that can be retried, e.g. a network error?

Contributor Author

I modified this to retry rather than return an error. If the context deadline is exceeded, it'll error.

// OCIDs are ~90+ characters; the suffix is always 60 characters and is unique
// per resource within a tenancy, making it safe to use as an attribute value.
// Non-OCID strings (e.g. the rackId hex hash) are returned unchanged.
func ocidSuffix(s string) string {
Contributor

nit: add OCID validation, e.g. check for the 60-character suffix length and that the string contains '.'?

Contributor Author

Adding the empty-string return because GpuMemoryFabric is only present for rack-scale shapes like Grace Blackwell. Basically, if GpuMemoryFabric exists, it should proceed through the validation; if not, it should just get skipped.

@dkennetzoracle
Contributor Author

Thanks for the review @anson627 - I am addressing some of the feedback now!

@aojea
Contributor

aojea commented Apr 9, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 9, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anson627, dkennetzoracle
Once this PR has been reviewed and has the lgtm label, please assign bowei for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@anson627
Contributor

anson627 commented Apr 9, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 9, 2026
Comment on lines +571 to +574
// Discard IPv6 RA-assigned routes. These are injected by the cloud
// provider's RA daemon as an infrastructure side-effect (e.g. OCI on
// RDMA NICs) and must not be propagated into pod namespaces.
if route.Protocol == unix.RTPROT_RA {
Contributor

I'm not sure we can generalize this pattern

Comment on lines +33 to +41
// NonUplinkChecker is an optional extension to CloudInstance. A provider may
// implement this to exempt specific device classes from the default-gateway
// uplink filter, even when those interfaces carry a default route. This is
// needed on platforms where infrastructure daemons (e.g. RA) inject default
// routes onto workload RDMA NICs as a side-effect of their setup.
type NonUplinkChecker interface {
// IsNonUplink returns true if the device should be included in the
// ResourceSlice regardless of any default gateway route on that interface.
IsNonUplink(id DeviceIdentifiers) bool
Contributor

we have a flag for providers to implement custom filtering based on attributes, can't we use that?

This model of special-casing by creating interfaces does not look very sustainable long term

Contributor Author

A challenge with using FilterDevices in this case was that the CEL filter runs after the gwInterfaces exclusion, so by the time FilterDevices sees the devices, the RDMA NICs with default RA routes have already been removed and the filter can't add them back.

This probably relates to your comment about gwInterfaces and improving that.

// net.ipv6.conf.all.disable_ipv6=1). Soft-fail so that the caller can enable
// IPv6 per-interface and re-apply the address after moving the device.
if ip.To4() == nil && (errors.Is(err, unix.EACCES) || errors.Is(err, unix.EPERM)) {
klog.V(4).Infof("skipping IPv6 address %s on %s in namespace %s: IPv6 is disabled (will retry after per-interface enable)", address, nsLink.Attrs().Name, containerNsPAth)
Contributor

I can't remember right now the exact logic, where is this retried?

Comment on lines +278 to +291
// If EnableIPv6 is set, enable IPv6 per-interface and re-apply any IPv6 addresses
// that were skipped in nsAttachNetdev because IPv6 was globally disabled in the pod
// namespace. This is needed on platforms such as OKE where RDMA NICs require a
// globally-routable IPv6 address to populate a routable RoCEv2 GID.
if config.NetworkInterfaceConfigInPod.Interface.EnableIPv6 != nil &&
*config.NetworkInterfaceConfigInPod.Interface.EnableIPv6 {
if err := enableIPv6ForInterface(ns, ifNameInNs); err != nil {
return fmt.Errorf("failed to enable IPv6 for %s in namespace %s: %w", ifNameInNs, ns, err)
}
if err := reapplyIPv6Addresses(ns, ifNameInNs, config.NetworkInterfaceConfigInPod.Interface.Addresses); err != nil {
return fmt.Errorf("failed to re-apply IPv6 addresses for %s in namespace %s: %w", ifNameInNs, ns, err)
}
}

Contributor

oh, I see, https://github.com/kubernetes-sigs/dranet/pull/138/changes#r3063956282

can't we do better and set the sysctl before the addresses?

Contributor Author

Yeah, good point. We should be able to cut out the reapply by just doing:

enableIPv6ForInterface()
nsAttachNetdev()

Contributor Author

Oh, but that has some problems. Currently enableIPv6ForInterface operates inside the pod ns on the interface after it's been moved, so we'd need to restructure so that the sysctl is set on the interface name inside the pod netns before address application.

nsAttachNetdev does move + address in one call, so we'd either need to:

  • split nsAttachNetdev into move + configure phases
  • have it accept a pre-configure callback
  • enable IPv6 per-interface inside nsAttachNetdev when EnableIPv6 is set, before applying addresses

The last one is probably easiest.

Contributor

@anson627 anson627 Apr 10, 2026

agree with handling EnableIPv6 inside nsAttachNetdev directly; since it already does move + address in one call, this should avoid the soft-fail + reapply dance entirely

Comment on lines +237 to +253
if hasChecker {
id := cloudprovider.DeviceIdentifiers{Name: device.Name}
if macAttr, ok := device.Attributes[apis.AttrMac]; ok && macAttr.StringValue != nil {
id.MAC = *macAttr.StringValue
}
if pciAttr, ok := device.Attributes[apis.AttrPCIAddress]; ok && pciAttr.StringValue != nil {
id.PCIAddress = *pciAttr.StringValue
}
if rdmaAttr, ok := device.Attributes[apis.AttrRDMA]; ok && rdmaAttr.BoolValue != nil {
id.RDMA = *rdmaAttr.BoolValue
}
if checker.IsNonUplink(id) {
klog.V(4).Infof("Interface %s has a default gateway route but is classified as a non-uplink by the cloud provider; including in discovery", *ifName)
filteredDevices = append(filteredDevices, device)
continue
}
}
Contributor

this is the thing I'd prefer to think more about... the gwInterfaces filter was added based on an assumption that seems to no longer be true, can we make the gwInterfaces detection logic more resilient? I want us to think about a better architecture that is sustainable long term rather than just solving the problem at hand

@aojea
Contributor

aojea commented Apr 10, 2026

/hold

I'm +1 on the PR but I want us to think more holistically, holding to avoid an unexpected merge?

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 10, 2026
@dkennetzoracle
Contributor Author

@aojea thanks for the review! To your point:

I'm +1 on the PR but I want us to think more holistically, holding to avoid an unexpected merge?

I agree with this and many of the things added in the PR feel like patches rather than design decisions. I just can't test on other systems outside of OCI, so it is hard for me to abstract some of this stuff. I'll spend some time reviewing and provide some feedback for what could be a bit more robust.

@anson627
Contributor

anson627 commented Apr 10, 2026

@aojea thanks for the review! To your point:

I'm +1 on the PR but I want us to think more holistically, holding to avoid an unexpected merge?

I agree with this and many of the things added in the PR feel like patches rather than design decisions. I just can't test on other systems outside of OCI, so it is hard for me to abstract some of this stuff. I'll spend some time reviewing and provide some feedback for what could be a bit more robust.

maybe we split the OCI-specific part out into a separate PR, which is relatively straightforward and can merge quickly, and spend some more time on the refactoring & testing changes on the common code path

@dkennetzoracle
Contributor Author

@aojea - aligned! I thought that might be the outcome of the PR. Are you okay with me including the OCI + example in this PR even though the example isn't reproducible until we get the IPv6 stuff in? Or should I put the example in the IPv6 PR?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 13, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@aojea
Contributor

aojea commented Apr 14, 2026

@dkennetzoracle we are planning to do a release soon; is this IPv6 something you will want as part of the release?

@dkennetzoracle
Contributor Author

@aojea - possibly, how long are the release cycles generally? If I missed this one, how long until the next one?

@aojea
Contributor

aojea commented Apr 14, 2026

@aojea - possibly, how long are the release cycles generally? If I missed this one, how long until the next one?

right now releases are ad hoc, in this case the only friction is about the multiple gateway UX, the EnableIPv6 sounds good, it is clearly an interface option and we have other sysctl interface options ... I think we can improve the implementation to be more streamlined and not rely on retries

The multiple gateway thing I'd prefer we put some more thought into, the original idea is that we make it sane by default, so no customer breaks their nodes by putting the VM interface on the pod ... but with the situation you are describing it is unclear to me if we can be smart about it ... I feel like we should be able to
