
accurately detect default gateways#153

Open
aojea wants to merge 1 commit intokubernetes-sigs:mainfrom
aojea:detect_gw

Conversation

@aojea
Contributor

@aojea aojea commented Apr 16, 2026

Addresses several critical edge cases in default interface detection and introduces a robust, rootless networking test framework.

  1. Point-to-Point Interfaces: Removed the r.Gw == nil check. Previously, this caused the agent to completely ignore active VPNs and tunnels (like WireGuard or tun/tap) because P2P links route directly out of the device without a Gateway IP.
  2. Route Metrics: The kernel relies on metrics (Priority) to determine the active route in a multi-WAN setup. The old code ignored this, returning all interfaces. It now correctly isolates the lowest metric for IPv4 and IPv6 independently, while preserving ECMP support.
  3. Multipath Link Lookup: Fixed a bug where multipath routes were queried using the parent route's link index (which is often 0) rather than the nexthop's link index.
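
The three fixes above can be sketched without netlink. This is a minimal illustration, not the PR's code: the `route` struct, its fields, and `activeDefaultIfaces` are hypothetical stand-ins for the real route type and detection function. It assumes every route passed in is already a default route from the main table.

```go
package main

import (
	"fmt"
	"sort"
)

// route is a simplified, hypothetical stand-in for a netlink route entry.
type route struct {
	family   string   // "v4" or "v6"
	iface    string   // outgoing interface; may be unset on a multipath parent
	gateway  string   // empty for point-to-point links such as WireGuard
	priority int      // route metric; lower wins, and 0 is a valid value
	nexthops []string // per-nexthop interfaces of a multipath route
}

// ifaces resolves the interfaces a route sends traffic through. For a
// multipath route the nexthops are used, not the parent's own interface.
func (r route) ifaces() []string {
	if len(r.nexthops) > 0 {
		return r.nexthops
	}
	return []string{r.iface}
}

// activeDefaultIfaces returns the interfaces that carry the lowest-metric
// default route of each address family. Routes that tie on the lowest
// metric are all kept, so equal-cost setups are preserved.
func activeDefaultIfaces(routes []route) []string {
	best := map[string]int{} // lowest metric seen per family
	for _, r := range routes {
		if p, ok := best[r.family]; !ok || r.priority < p {
			best[r.family] = r.priority
		}
	}
	seen := map[string]bool{}
	for _, r := range routes {
		// No gateway check: a gateway-less default route still counts.
		if r.priority != best[r.family] {
			continue
		}
		for _, name := range r.ifaces() {
			seen[name] = true
		}
	}
	out := make([]string, 0, len(seen))
	for name := range seen {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

func main() {
	routes := []route{
		{family: "v4", iface: "wg0", priority: 50},
		{family: "v4", iface: "eth0", gateway: "192.0.2.1", priority: 100},
		{family: "v6", iface: "eth0", gateway: "2001:db8::1", priority: 100},
	}
	fmt.Println(activeDefaultIfaces(routes))
}
```

Tracking the best metric per family keeps a low IPv4 metric from hiding the IPv6 default on another interface, and collecting names into one set means a dual-stack interface is reported once.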


Signed-off-by: Antonio Ojea <aojea@google.com>
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 16, 2026
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 16, 2026
@aojea
Contributor Author

aojea commented Apr 16, 2026

@k8s-ci-robot
Contributor

@aojea: GitHub didn't allow me to assign the following users: dkennetzoracle, tamilmani1989.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/assign @gauravkghildiyal @dkennetzoracle @tamilmani1989

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Comment thread pkg/inventory/net.go
// as default gateways in the main routing table. It identifies these by querying
// the main routing table for routes with an unspecified destination (0.0.0.0/0
// for IPv4 or ::/0 for IPv6).
// as active default gateways in the main routing table, respecting route metrics.
Contributor

Just a note on a side effect: any route matching Dst == 0.0.0.0/0 or ::/0 in the main table now classifies the link as an uplink, which is a new semantic and probably worth a comment?

Contributor Author

is that heuristic not correct?

Contributor Author

It is the same behavior as before, isn't it?
It just expands the logic to take the weights into consideration, but the previous logic about 0.0.0.0/0 and ::/0 remains the same.

Contributor

It is the same logic, but the route no longer has to be a gateway. For example, any route with scope=link and no gateway that still matches 0.0.0.0/0 or ::/0 will also match here.

Contributor

+1 on semantic broadening — "default route with a gateway" to "any default route"

Contributor Author

What is the difference? From the routing perspective, whether a route has a next hop or just sends out via the corresponding interface is not important. What is important is that any packet in the host will go through that specific interface ... that is why we don't want to expose it in the ResourceSlice.

Contributor

I don't feel super strongly about it, (to me) it just adds clarity.

@@ -0,0 +1,102 @@
package testutils
Contributor

do you need a license header? I see them everywhere else.

Contributor Author

yes, good catch

Comment thread pkg/inventory/net_test.go
},
expectedResult: sets.New[string]("eth0", "eth2"),
},
{
Contributor

I'd add a test for Mixed-family default on a single interface. Basically, when you merge sets for ipv4 and ipv6 this asserts that the merge behaves (dual stacking an interface).

{
    name: "Same interface wins both families",
    setupRoutes: func() error {
        if err := netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, Gw: gwIPv4, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN}); err != nil {
            return err
        }
        return netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V6, Dst: defaultIPv6, Gw: gwIPv6, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN})
    },
    expectedResult: sets.New[string]("eth0"),
},

Comment thread pkg/inventory/net_test.go
return netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, Gw: gwIPv4, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN})
},
expectedResult: sets.New[string]("wg0"),
},
Contributor

Would it be worth also having a pinned Priority: 0 for P2P case?

You're using Priority 50 for wg0, and this exercises the GW == nil removal as well as the lowest-metric-wins logic. However, it doesn't exercise the specific case where a P2P route has Priority: 0, which is the IPv4 default for a route installed without an explicit metric. This case would just lock in that 0 is a real comparable value.

OK with not doing it, it just feels a bit more explicit and defensive.

{
    name: "P2P default with Priority 0 wins over gateway'd default",
    setupRoutes: func() error {
        if err := netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, LinkIndex: links["wg0"].Attrs().Index, Priority: 0, Scope: netlink.SCOPE_LINK, Table: unix.RT_TABLE_MAIN}); err != nil {
            return err
        }
        return netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, Gw: gwIPv4, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN})
    },
    expectedResult: sets.New[string]("wg0"),
},
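
The concern that 0 must stay "a real comparable value" can be shown in isolation. The sketch below is illustrative only and is not taken from the PR; `lowestMetric` is a hypothetical helper. It reports "no routes" through a separate boolean instead of treating 0 as "unset", which is the pitfall the suggested test guards against.

```go
package main

import "fmt"

// lowestMetric returns the smallest metric in the slice. The boolean is
// false only when the slice is empty, so a metric of 0 is never mistaken
// for "no metric recorded yet".
func lowestMetric(metrics []int) (int, bool) {
	if len(metrics) == 0 {
		return 0, false
	}
	lowest := metrics[0]
	for _, m := range metrics[1:] {
		if m < lowest {
			lowest = m
		}
	}
	return lowest, true
}

func main() {
	// A P2P default with metric 0 competing with a gateway'd default at 100.
	m, ok := lowestMetric([]int{100, 0})
	fmt.Println(m, ok)
}
```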

// It caches the result after the first check.
func IsSupported() bool {
checkOnce.Do(func() {
cmd := exec.Command("sleep", "1")
Contributor

A small nit: this requires sleep to be on the system in PATH. If it's not, you'd get an error and it would be because of a missing binary not because user namespaces are not supported. Could check it?

Contributor

maybe exec.Command(os.Args[0], "-test.run=^$") to avoid the external sleep dependency
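
If the external binary is kept, one way to tell "binary missing" apart from "user namespaces unsupported" is to look it up first. This is a minimal sketch rather than the PR's code; `binaryMissing` is a hypothetical helper built on the standard `exec.LookPath` and `exec.ErrNotFound`.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// binaryMissing reports whether the named executable cannot be found in
// PATH. Other lookup errors (for example permission problems) return false
// so that the caller can handle them separately.
func binaryMissing(name string) bool {
	_, err := exec.LookPath(name)
	return errors.Is(err, exec.ErrNotFound)
}

func main() {
	if binaryMissing("sleep") {
		fmt.Println("sleep not found in PATH; cannot probe user namespaces")
		return
	}
	fmt.Println("sleep is available; the probe result reflects namespace support")
}
```

Re-executing `os.Args[0]`, as suggested above, avoids the lookup entirely because the test binary is known to exist.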

cmd.Args = []string{os.Args[0], "-test.run=" + t.Name() + "$", "-test.v=true"}

for _, arg := range os.Args {
if strings.HasPrefix(arg, "-test.testlogfile=") {
Contributor

would there be any reason to add -test.timeout here? Any of these long?
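
One possible answer is to forward the parent's timeout to the re-executed child. The sketch below is an assumption about how that could look, not the PR's implementation; `childArgs` is a hypothetical helper. It relies on the fact that `go test` passes the timeout to the test binary as `-test.timeout=<duration>`.

```go
package main

import (
	"fmt"
	"strings"
)

// childArgs builds the argument list for re-executing the test binary so
// that only the named test runs. If the parent was started with a
// -test.timeout flag, it is forwarded unchanged to the child.
func childArgs(parentArgs []string, testName string) []string {
	if len(parentArgs) == 0 {
		return nil
	}
	args := []string{parentArgs[0], "-test.run=^" + testName + "$", "-test.v=true"}
	for _, arg := range parentArgs[1:] {
		if strings.HasPrefix(arg, "-test.timeout=") {
			args = append(args, arg)
		}
	}
	return args
}

func main() {
	parent := []string{"/tmp/inventory.test", "-test.timeout=30s", "-test.testlogfile=/tmp/log"}
	fmt.Println(childArgs(parent, "TestDefaultGateways"))
}
```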

Contributor

@dkennetzoracle dkennetzoracle left a comment

Some nits / questions, but I think it's a great add and a correctness fix. I'm not 100% sure it fixes the NonUplinkChecker situation on #138 on OKE, I'd need to double check. RA injects IPv6 defaults for us. If they share a metric (which I think they do) I still need something like NonUplinkChecker. If they don't I can drop NonUplinkChecker.

This definitely addresses one of the sub-issues in that PR, and is correct. LGTM!

@gauravkghildiyal
Member

I'm not 100% sure it fixes the NonUplinkChecker situation on #138 on OKE, I'd need to double check.

@dkennetzoracle I'm very much interested in hearing about this after you've had the opportunity to try this out

@dkennetzoracle
Contributor

@gauravkghildiyal - I will let you know! It's very hard for me to get access to GB200+ so I need to make the most of my time on them haha. I'm pretty sure each NIC gets a default route for IPv6 and these use the kernel's RA priority (so 1024), so we wouldn't be able to differentiate the 8 RDMA NICs from each other by metric. They all look like peers from the kernel's POV.

So they'd all get excluded still here, haha. But it should work for the Azure case

@anson627
Contributor

anson627 commented Apr 16, 2026

I just verified this PR on AWS p4d.24xlarge: the primary NIC with the default gateway/route (e.g. ens32) is excluded, and the 4 EFA/ENA device pairs are properly included in the resource slice:

| Device | Type | Interface | PCI Address | NUMA | IP |
|---|---|---|---|---|---|
| pci-0000-10-01-0 | ENA | ens33 | 0000:10:01.0 | 0 | 192.168.157.168/19 |
| pci-0000-10-1b-0 | EFA | | 0000:10:1b.0 | 0 | rdmap16s27 |
| pci-0000-20-01-0 | ENA | ens65 | 0000:20:01.0 | 0 | 192.168.155.210/19 |
| pci-0000-20-1b-0 | EFA | | 0000:20:1b.0 | 0 | rdmap32s27 |
| pci-0000-90-01-0 | ENA | ens129 | 0000:90:01.0 | 1 | 192.168.145.129/19 |
| pci-0000-90-1b-0 | EFA | | 0000:90:1b.0 | 1 | rdmap144s27 |
| pci-0000-a0-01-0 | ENA | ens161 | 0000:a0:01.0 | 1 | 192.168.132.181/19 |
| pci-0000-a0-1b-0 | EFA | | 0000:a0:1b.0 | 1 | rdmap160s27 |

@anson627
Contributor

anson627 commented Apr 16, 2026

Verified this PR on Azure GB300: the primary NIC (eth0) with the default gateway is excluded, while the Azure accelerated networking VF (e.g. enP8051s1) together with the other mlx5_* NICs are included:

| # | Name | RDMA Device | PCI Address | NUMA | Vendor | RDMA |
|---|---|---|---|---|---|---|
| 1 | pci-0101-00-00-0 | mlx5_0 | 0101:00:00.0 | 0 | Mellanox | true |
| 2 | pci-0102-00-00-0 | mlx5_1 | 0102:00:00.0 | 0 | Mellanox | true |
| 3 | pci-0103-00-00-0 | mlx5_2 | 0103:00:00.0 | 1 | Mellanox | true |
| 4 | pci-0104-00-00-0 | mlx5_3 | 0104:00:00.0 | 1 | Mellanox | true |
| 5 | pci-1f73-00-02-0 | | 1f73:00:02.0 | 0 | Mellanox | true |

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 16, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: anson627, aojea, dkennetzoracle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gauravkghildiyal
Member

Holding to have time to go through this and allow resolution of a few comments.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 16, 2026
@aojea
Contributor Author

aojea commented Apr 17, 2026

RA injects IPv6 defaults for us. If they share a metric (which I think they do) I still need something like NonUplinkChecker. If they don't I can drop NonUplinkChecker.

From a Kubernetes architectural and security perspective, we like to have certain guardrails so users cannot shoot themselves in the foot. Any logic that allows an interface with a default route to be unmounted or modified based on an opportunistic RA injection is very risky:

  • In the Kubernetes networking model, there is typically no difference between control plane and dataplane traffic. If an interface intended for the dataplane receives a default route via RA, it will be 'hijacking' control plane traffic, so the kubelet and pods trying to connect to the apiserver will fail (unless this secondary interface has access too, but that looks like a very convoluted setup with routing loops). This will effectively disconnect the entire node from the cluster.

  • We have already been hit by security issues "abusing" RA, so RA processing is now disabled by default in all container interfaces. The same attack can be performed at the node level.

A vulnerability was found in all versions of containernetworking/plugins before version 0.8.6, that allows malicious containers in Kubernetes clusters to perform man-in-the-middle (MitM) attacks. A malicious container can exploit this flaw by sending rogue IPv6 router advertisements to the host or other containers, to redirect traffic to the malicious container. (CVE-2020-10749)

I still think we need a better mechanism for improving filtering, so I encourage you to check @gauravkghildiyal's proposal in #152, which I think will be able to accommodate the NonUplinkChecker functionality perfectly.
