accurately detect default gateways#153
Conversation
Address several critical edge cases in default interface detection and introduces a robust, rootless networking test framework. 1. Point-to-Point Interfaces: Removed the `r.Gw == nil` check. Previously, this caused the agent to completely ignore active VPNs and tunnels (like Wireguard or tun/tap) because P2P links route directly out of the device without a Gateway IP. 2. Route Metrics: The kernel relies on metrics (Priority) to determine the active route in a multi-WAN setup. The old code ignored this, returning all interfaces. It now correctly isolates the lowest metric for IPv4 and IPv6 independently, while preserving ECMP support. 3. Multipath Link Lookup: Fixed a bug where multipath routes were queried using the parent route's link index (which is often 0) rather than the nexthop's link index. Signed-off-by: Antonio Ojea <aojea@google.com>
|
@aojea: GitHub didn't allow me to assign the following users: dkennetzoracle, tamilmani1989. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| // as default gateways in the main routing table. It identifies these by querying | ||
| // the main routing table for routes with an unspecified destination (0.0.0.0/0 | ||
| // for IPv4 or ::/0 for IPv6). | ||
| // as active default gateways in the main routing table, respecting route metrics. |
There was a problem hiding this comment.
just a note / side-effect, any route matching Dst == 0.0.0.0/0 or ::/0 in the main table now classifies the link as an uplink, which is a new semantic and probably worth a comment?
There was a problem hiding this comment.
is that heuristic not correct?
There was a problem hiding this comment.
it is the same behavior as before, isn;t it?
it just expands the logic to take into consideration the weights, but previous logic about the 0.0.0.0 and ::/0 remains the same
There was a problem hiding this comment.
It is the same logic, but the route no longer has to be a gateway. Any route with scope=link for example but with no gateway but still matches 0.0.0.0/0 or ::/0 will also match here.
There was a problem hiding this comment.
+1 on semantic broadening — "default route with a gateway" to "any default route"
There was a problem hiding this comment.
what is the difference, from the routing perspective the fact of having a next hop direction or just send to via the corresponding interface is not important, what is important is that any packet in the host will go through that specific interface ... that is why we don't want to expose it in the ResourceSlice
There was a problem hiding this comment.
I don't feel super strongly about it, (to me) it just adds clarity.
| @@ -0,0 +1,102 @@ | |||
| package testutils | |||
There was a problem hiding this comment.
do you need a license header? I see them everywhere else.
| }, | ||
| expectedResult: sets.New[string]("eth0", "eth2"), | ||
| }, | ||
| { |
There was a problem hiding this comment.
I'd add a test for Mixed-family default on a single interface. Basically, when you merge sets for ipv4 and ipv6 this asserts that the merge behaves (dual stacking an interface).
{
name: "Same interface wins both families",
setupRoutes: func() error {
if err := netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, Gw: gwIPv4, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN}); err != nil {
return err
}
return netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V6, Dst: defaultIPv6, Gw: gwIPv6, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN})
},
expectedResult: sets.New[string]("eth0"),
},| return netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, Gw: gwIPv4, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN}) | ||
| }, | ||
| expectedResult: sets.New[string]("wg0"), | ||
| }, |
There was a problem hiding this comment.
Would it be worth also having a pinned Priority: 0 for P2P case?
You re using Prio 50 for wg0 and this exercises the GW == nil removal as well as the the lowest metric wins logic. However, it doesn't exercise a specific case where a P2P route has Priority: 0 which is the ipv4 default for a route installed without an explicit metric. This case would just lock in that 0 is a real comparable value.
OK with not doing it, it just feels a bit more explicit and defensive.
{
name: "P2P default with Priority 0 wins over gateway'd default",
setupRoutes: func() error {
if err := netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, LinkIndex: links["wg0"].Attrs().Index, Priority: 0, Scope: netlink.SCOPE_LINK, Table: unix.RT_TABLE_MAIN}); err != nil {
return err
}
return netlink.RouteAdd(&netlink.Route{Family: netlink.FAMILY_V4, Dst: defaultIPv4, Gw: gwIPv4, LinkIndex: links["eth0"].Attrs().Index, Priority: 100, Table: unix.RT_TABLE_MAIN})
},
expectedResult: sets.New[string]("wg0"),
},| // It caches the result after the first check. | ||
| func IsSupported() bool { | ||
| checkOnce.Do(func() { | ||
| cmd := exec.Command("sleep", "1") |
There was a problem hiding this comment.
A small nit: this requires sleep to be on the system in PATH. If it's not, you'd get an error and it would be because of a missing binary not because user namespaces are not supported. Could check it?
There was a problem hiding this comment.
maybe exec.Command(os.Args[0], "-test.run=^$") to avoid the external sleep dependency
| cmd.Args = []string{os.Args[0], "-test.run=" + t.Name() + "$", "-test.v=true"} | ||
|
|
||
| for _, arg := range os.Args { | ||
| if strings.HasPrefix(arg, "-test.testlogfile=") { |
There was a problem hiding this comment.
would there be any reason to add -test.timeout here? Any of these long?
dkennetzoracle
left a comment
There was a problem hiding this comment.
Some nits / questions, but I think it's a great add and a correctness fix. I'm not 100% sure it fixes the NonUplinkChecker situation on #138 on OKE, I'd need to double check. RA injects IPv6 defaults for us. If they share a metric (which I think they do) I still need something like NonUplinkChecker. If they don't I can drop NonUplinkChecker.
This definitely addresses one of the sub-issues in that PR, and is correct. LGTM!
@dkennetzoracle I'm very much interested in hearing about this after you've had the opportunity to try this out |
|
@gauravkghildiyal - I will let you know! It's very hard for me to get access to GB200+ so I need to make the most of my time on them haha. I'm pretty sure each NIC gets a default route for ipv6 and these use the kernel's RA priority (so 1024) so we wouldn't be able to differentiate the 8 RDMA NICs from each other by metric. They all look like peer's from he kernel's pov. So they'd all get excluded still here, haha. But it should work for the Azure case |
|
I just verified this PR on AWS p4d.24large, the primary NiC with default gateway/route (e.g. ens32) is excluded, and 4 EFA/ENA devices are properly included in the resource slice:
|
|
verified this PR on Azure GB300, the primary NiC (eth0) with default gateway is excluded, while Azure accelerated networking VF (e.g. enP8051s1) together with other mlx5_* NiCs are included
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: anson627, aojea, dkennetzoracle The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Holding to have time to go through this and allow resolution of a few comments. /hold |
From a kubernetes architectural and security perspective, we like to have certain guardrails so users can not shoot themselves on the feet. Any logic that allows an interface with a default route to be unmounted or modified based on an opportunistic RA injection is very risky.:
I still think we need a better mechanism for improving filtering, so I encourage you to check @gauravkghildiyal proposal in #152 , that I think will be able to accomodate the |
Address several critical edge cases in default interface detection and introduces a robust, rootless networking test framework.
r.Gw == nilcheck. Previously, this caused the agent to completely ignore active VPNs and tunnels (like Wireguard or tun/tap) because P2P links route directly out of the device without a Gateway IP.