Feat: Add real topology data and Working ipv6. #138
dkennetzoracle wants to merge 1 commit into kubernetes-sigs:main
Conversation
Hi @dkennetzoracle. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Ah, I am skipping too much IPv6 traffic. It's going to be hard for me to get access to a GB again. Maybe I split this up and change the example to an H100 until I can get access to another GB?
Thanks for the change @dkennetzoracle. I've not looked into the changes themselves, but just skimming through the description, I wanted to share that what you're describing sounds exactly like GCP's GB300 bare metal (public docs), which utilizes DRANET. I still expect there to be differences between platforms, but one of the first thoughts that came to my mind was that if OKE supports dual-stack IPs, the …
@gauravkghildiyal thanks for clarifying that GCP runs BMs, I didn't know this! OKE does support dual-stack; the cluster I was testing on was just running single stack. I've reached out internally to see if we could get another one set up with dual stack, which may alleviate a lot of this.
pkg/cloudprovider/oke/oke_test.go (outdated)

```go
Shape:              "BM.GPU.H100.8",
FaultDomain:        "FAULT-DOMAIN-1",
AvailabilityDomain: "TrcQ:US-ASHBURN-AD-2",
HPCIslandId:        "fake-island-id",
```
Add a test case to cover the new code path where `GpuMemoryFabric != ""` and `id.RDMA == true` should return a config with `EnableIPv6: true`.
```go
const (
	OKEAttrPrefix = "oke.dra.net"

	AttrOKEShape = OKEAttrPrefix + "/" + "shape"
)
```
Removing these fields could cause existing usage to break, right?
They will, but they were placeholders. The API call made there doesn't return anything useful for topology awareness and is a separate API call than the topology data one. To maintain both, I'd need to call out to 2 APIs, which is fine and still provides some value because it gives us the PCIe of the NICs purely from dranet, but it doesn't provide any value beyond that (for topology). My preference is for someone on OCI to fail loudly if topology awareness is not enabled, rather than just logging it and moving on, which is why I removed the old stuff.

This is still very new on the OCI side, so I don't expect to have any users for our cloud yet, so I'm not concerned with the backwards compatibility aspect.
Makes sense, thanks for the clarification!
pkg/cloudprovider/oke/oke.go (outdated)

```go
if resp.StatusCode != http.StatusOK {
	klog.Infof("OCI IMDS returned status %d ... retrying", resp.StatusCode)
	return false, nil
}
return false, fmt.Errorf("Please turn on TopologyData for your Tenancy")
```
Would there be any other errors that can be retried, e.g. network errors?

I modified this to retry rather than return an error. If the context deadline is exceeded, it'll error.
pkg/cloudprovider/oke/oke.go (outdated)

```go
// OCIDs are ~90+ characters; the suffix is always 60 characters and is unique
// per resource within a tenancy, making it safe to use as an attribute value.
// Non-OCID strings (e.g. the rackId hex hash) are returned unchanged.
func ocidSuffix(s string) string {
```
nit: add OCID validation, e.g. 60-character length, string contains '.'?

Adding the empty-string return because GpuMemoryFabric is only present for rack-scale shapes like Grace Blackwell. Basically, if GpuMemoryFabric exists, it should proceed through the validation. If not, it should just get skipped.
Thanks for the review @anson627 - I am addressing some of the feedback now!
/ok-to-test
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anson627, dkennetzoracle

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/lgtm
```go
// Discard IPv6 RA-assigned routes. These are injected by the cloud
// provider's RA daemon as an infrastructure side-effect (e.g. OCI on
// RDMA NICs) and must not be propagated into pod namespaces.
if route.Protocol == unix.RTPROT_RA {
```
I'm not sure we can generalize this pattern.
```go
// NonUplinkChecker is an optional extension to CloudInstance. A provider may
// implement this to exempt specific device classes from the default-gateway
// uplink filter, even when those interfaces carry a default route. This is
// needed on platforms where infrastructure daemons (e.g. RA) inject default
// routes onto workload RDMA NICs as a side-effect of their setup.
type NonUplinkChecker interface {
	// IsNonUplink returns true if the device should be included in the
	// ResourceSlice regardless of any default gateway route on that interface.
	IsNonUplink(id DeviceIdentifiers) bool
}
```
We have a flag for providers to implement custom filtering based on attributes; can't we use that? This model of special-casing by creating interfaces does not look very sustainable long term.

A challenge in this case with using FilterDevices is that the CEL filter runs after the gwInterfaces exclusion, so by the time FilterDevices sees the devices, the RDMA NICs with default RA routes have already been removed and the filter can't include them back. This probably relates to your comment about gwInterfaces and improving that.
```go
// net.ipv6.conf.all.disable_ipv6=1). Soft-fail so that the caller can enable
// IPv6 per-interface and re-apply the address after moving the device.
if ip.To4() == nil && (errors.Is(err, unix.EACCES) || errors.Is(err, unix.EPERM)) {
	klog.V(4).Infof("skipping IPv6 address %s on %s in namespace %s: IPv6 is disabled (will retry after per-interface enable)", address, nsLink.Attrs().Name, containerNsPAth)
```
I can't remember right now the exact logic; where is this retried?
```go
// If EnableIPv6 is set, enable IPv6 per-interface and re-apply any IPv6 addresses
// that were skipped in nsAttachNetdev because IPv6 was globally disabled in the pod
// namespace. This is needed on platforms such as OKE where RDMA NICs require a
// globally-routable IPv6 address to populate a routable RoCEv2 GID.
if config.NetworkInterfaceConfigInPod.Interface.EnableIPv6 != nil &&
	*config.NetworkInterfaceConfigInPod.Interface.EnableIPv6 {
	if err := enableIPv6ForInterface(ns, ifNameInNs); err != nil {
		return fmt.Errorf("failed to enable IPv6 for %s in namespace %s: %w", ifNameInNs, ns, err)
	}
	if err := reapplyIPv6Addresses(ns, ifNameInNs, config.NetworkInterfaceConfigInPod.Interface.Addresses); err != nil {
		return fmt.Errorf("failed to re-apply IPv6 addresses for %s in namespace %s: %w", ifNameInNs, ns, err)
	}
}
```
Oh, I see: https://github.com/kubernetes-sigs/dranet/pull/138/changes#r3063956282. Can't we do better and set the sysctl before the addresses?
Yeah, good point. We should be able to cut out the reapply by just doing:
`enableIPv6ForInterface()`
`nsAttachNetdev()`
Oh, but that has some problems. Currently enableIPv6ForInterface operates inside the pod ns on the interface after it's been moved, so we'd need to restructure so the sysctl is set on the interface name inside the pod netns before address application.

nsAttachNetdev does move + address in one call, so we'd either need to:
- split nsAttachNetdev into move + configure phases
- have it accept a pre-configure callback
- enable IPv6 per-interface inside nsAttachNetdev when EnableIPv6 is set, before applying addresses

The last one is probably easiest.
Agreed on handling EnableIPv6 inside nsAttachNetdev directly; since it already does move + address in one call, this should avoid the soft-fail + reapply dance entirely.
```go
if hasChecker {
	id := cloudprovider.DeviceIdentifiers{Name: device.Name}
	if macAttr, ok := device.Attributes[apis.AttrMac]; ok && macAttr.StringValue != nil {
		id.MAC = *macAttr.StringValue
	}
	if pciAttr, ok := device.Attributes[apis.AttrPCIAddress]; ok && pciAttr.StringValue != nil {
		id.PCIAddress = *pciAttr.StringValue
	}
	if rdmaAttr, ok := device.Attributes[apis.AttrRDMA]; ok && rdmaAttr.BoolValue != nil {
		id.RDMA = *rdmaAttr.BoolValue
	}
	if checker.IsNonUplink(id) {
		klog.V(4).Infof("Interface %s has a default gateway route but is classified as a non-uplink by the cloud provider; including in discovery", *ifName)
		filteredDevices = append(filteredDevices, device)
		continue
	}
}
```
This is the thing I'd prefer to think more about... gwInterfaces was added based on an assumption that seems to no longer be true. Can we make the gwInterfaces detection logic more resilient? I want us to think about a better architecture that is sustainable long term rather than just solving the problem at hand.
/hold

I'm +1 on the PR, but I want us to think more holistically; holding to avoid an unexpected merge.
@aojea thanks for the review! To your point: I agree with this, and many of the things added in the PR feel like patches rather than design decisions. I just can't test on systems outside of OCI, so it is hard for me to abstract some of this stuff. I'll spend some time reviewing and provide some feedback on what could be a bit more robust.

Maybe we split the OCI-specific part out into a separate PR, which is relatively straightforward and can merge quickly, and spend some more time on refactoring & testing changes on the common code path.
@aojea - aligned! I thought that might be the outcome of the PR. Are you okay with me including the OCI part + example in this PR even though the example isn't reproducible until we get the IPv6 stuff in? Or should I put the example in the IPv6 PR?
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@dkennetzoracle we are planning to do a release soon; is this IPv6 something you will want as part of the release?
@aojea - possibly. How long are the release cycles generally? If I missed this one, how long until the next one?
Right now releases are ad hoc. In this case the only friction is about the multiple-gateway UX. The multiple-gateway thing I prefer we put some more thought on: the original idea is that we make it sane by default, so no customer breaks their nodes by putting the VM interface on the pod ... but with the situation you are describing it is unclear to me if we can be smart about it ... I feel like we should be able to.
What type of PR is this?
/kind feature
What this PR does / why we need it:
Joint PR (happy to split up, just wanted to get eyes on it and work with you all to do so).
- A new `NonUplinkChecker` interface allows the OKE provider to exempt RDMA devices on GPU fabric shapes from the default-gateway uplink filter.
- Single-stack IPv4 pod namespaces disable IPv6 globally (`net.ipv6.conf.all.disable_ipv6=1`), but RoCEv2 on OKE requires a globally-routable IPv6 address on RDMA interfaces to populate a routable GID (GID index >= 2). A new `EnableIPv6` field on `InterfaceConfig` triggers per-interface IPv6 enablement in the pod namespace via sysctl override, followed by re-application of the IPv6 address.

Which issue(s) this PR is related to:
#111 - doesn't close
Special notes for your reviewer:
Note: Again, happy to split this up, but did want to get it out to start the process if needed. dranet on OKE won't work for all GPU SKUs without this.
Why OKE needs these changes but AKS and GCE do not
The IPv6 and uplink-filtering changes are driven by OKE's bare-metal
infrastructure model, which differs fundamentally from how AKS and GCE expose
RDMA networking:
AKS (Azure): GB300 nodes use InfiniBand VFs, not RoCEv2. IB GIDs are derived
from the port GUID (a hardware identifier burned into the HCA), not from IP
addresses. There is no dependency on IPv6 addressing for GID generation --
`ibv_modify_qp` resolves paths using GUID-based GIDs and the IB subnet manager.
Because IB VFs have no Ethernet netdev, there is no RA daemon, no IPv6 address,
and no default-route side-effect to filter around. The dranet IB-only path
(char-device injection without netdev movement) handles this cleanly.
GCE (Google Cloud): GPU nodes use SR-IOV VFs for RDMA. The VFs are presented
as standard Ethernet NICs with IPv4 addresses managed by the GCE metadata
service. There is no RA daemon injecting IPv6 routes, and GIDs (when RoCEv2 is
used) are populated from the IPv4 address. The pod namespace does not need IPv6
enabled because the GID is IPv4-derived.
OKE (Oracle Cloud) -- why it is different: BM.GPU.GB200-v3.4 are bare-metal
shapes. The 8 ConnectX-8 NICs are full physical functions (PFs), not SR-IOV VFs.
OCI manages the RDMA fabric at the infrastructure level by running an RA
(Router Advertisement) daemon on the host that:

- Assigns a globally-routable IPv6 address to each RDMA NIC (e.g.
  `fdcd:0:2a29:3032:364e:3a64:6758:428a/64`). This address populates a RoCEv2
  GID at index >= 2 in the NIC's GID table. The OCI fabric routes traffic using
  these IPv6-derived GIDs -- link-local GIDs (`fe80::`) are not routable
  (`ENETUNREACH` on `ibv_modify_qp`).
- Injects an IPv6 default route via RA (`proto ra`) into the main routing table
  for every RDMA NIC. This is an infrastructure side-effect, not a signal that
  the NIC is an uplink/transit interface. Without the `NonUplinkChecker`
  override, dranet's gateway-interface filter removes these NICs from the
  ResourceSlice entirely.
The IPv6 problem compounds because Kubernetes single-stack IPv4 clusters (the
default on OKE) configure `net.ipv6.conf.all.disable_ipv6=1` in pod network
namespaces. When dranet moves the RDMA NIC into the pod namespace and attempts
to apply the RA-assigned IPv6 address, the kernel returns EACCES. Without the
address, the NIC has no routable GID and NCCL falls back to TCP or fails with
ENETUNREACH.
The `EnableIPv6` mechanism solves this with a per-interface sysctl override
(`net.ipv6.conf.<ifname>.disable_ipv6=0`) that enables IPv6 on just the RDMA
NIC without affecting the pod's primary network interface. This is safe because
the RDMA NIC is isolated in the pod namespace and the IPv6 address is only used
for GID generation, not for IP-level routing.
None of this applies to AKS (IB, GUID-based GIDs, no netdev) or GCE (VFs,
IPv4-based GIDs, no RA daemon).
Architecture context
On GB200, GPUs connect to the Grace CPU via NVLink C2C while NICs connect via
PCIe. `nvidia-smi topo -m` reports SYS for all GPU-NIC pairs. Despite this,
NCCL enables GDR for NUMA-local NICs via `NCCL_NET_GDR_C2C=1`. The meaningful
topology distinction is same-NUMA vs cross-NUMA, not PCIe host bridge alignment.
IPv6 flow
The
EnableIPv6mechanism is intentionally outsidensAttachNetdevto keepthat function general-purpose. The sequence is:
nsAttachNetdevmoves NIC to pod namespace, soft-fails IPv6 address (EACCES)attachNetdevToNSenables IPv6 per-interface sysctl, re-applies IPv6 addressKnown issue -- NIC orphaning - #137
RDMA NICs are not reliably returned to the host namespace on pod deletion. This
is a pre-existing CRI-O / dranet bug (not introduced here). The example README documents
the PCI rebind recovery procedure.
Benchmark results
2-node `all_reduce_perf`, 1 GPU/worker, BM.GPU.GB200-v3.4. Configurations: 1nic-aligned, 2nic-aligned, 1nic-unaligned.

Does this PR introduce a user-facing change?