feat: default network policy per SandboxTemplate #287
vicentefb wants to merge 1 commit into kubernetes-sigs:main
Conversation
/ok-to-test
```diff
 const (
 	poolLabel              = "agents.x-k8s.io/pool"
-	sandboxTemplateRefHash = "agents.x-k8s.io/sandbox-template-ref-hash"
+	sandboxTemplateRefHash = "agents.x-k8s.io/template-name-hash"
```
I renamed the label key to agents.x-k8s.io/template-name-hash to match the SandboxClaim controller. This fixes a bug where warm pool pods were not being selected by the default "Deny-All" NetworkPolicy, leaving them unsecured.
It's not safe to rename labels. How are old labels on existing resources handled?
Sorry, I have reverted the label string back to its original value ("agents.x-k8s.io/sandbox-template-ref-hash") across both the warmpool and claim controllers.
The new template-level NetworkPolicy will now use this original label for its PodSelector. This ensures that any existing pods sitting in the warm pool prior to this upgrade will be seamlessly selected and secured by the new default NetworkPolicy without being orphaned. I also ensured we no longer delete() this label during pod adoption, so the policy remains attached for the pod's entire lifecycle.
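As an aside, the selector only needs a stable, label-safe hash of the template name so that pods created before and after an upgrade select consistently. A minimal stdlib-only sketch (hypothetical — the hashing in the actual controllers may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// templateHashLabel derives a deterministic, label-value-safe hash from a
// template name. This is an illustrative sketch, not the project's actual
// implementation.
func templateHashLabel(templateName string) string {
	h := fnv.New32a()
	h.Write([]byte(templateName))
	return fmt.Sprintf("%x", h.Sum32())
}

func main() {
	// The same template name always yields the same label value, so the shared
	// NetworkPolicy's podSelector keeps matching across controller restarts.
	fmt.Println(templateHashLabel("python-sandbox"))
	fmt.Println(templateHashLabel("python-sandbox") == templateHashLabel("python-sandbox"))
}
```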
```go
// created from this template. A single shared NetworkPolicy is created per Template.
// If this field is omitted (nil), the controller applies a SECURE DEFAULT policy:
//   - Ingress: Allow traffic ONLY from the Sandbox Router. All other ingress is denied.
//   - Egress: Allow DNS Only (UDP/TCP Port 53). All other traffic (including Internet and Metadata Server) is blocked.
```
Comment needs update now that we allow internet egress but not internal DNS
+1, and PR description needs to be updated as well
Done. I've updated the comments in sandboxtemplate_types.go and the PR description to accurately reflect the new egress posture (Public Internet allowed, Internal/Private CIDRs blocked).
/lgtm
/assign @janetkuo
```diff
@@ -549,65 +554,109 @@ func (r *SandboxClaimReconciler) SetupWithManager(mgr ctrl.Manager) error {
 func (r *SandboxClaimReconciler) reconcileNetworkPolicy(ctx context.Context, claim *extensionsv1alpha1.SandboxClaim, template *extensionsv1alpha1.SandboxTemplate) error {
```
After this change, network policy becomes a shared resource, and having every individual SandboxClaim reconciliation attempt to CreateOrUpdate this shared object can lead to significant 409 conflicts from the API server, and unnecessary Get calls when scaling to thousands of sandboxes.
Could this be cached locally in an informer to avoid the contention?
I've replaced CreateOrUpdate with a cached Get -> Create flow. The controller now checks the local informer cache first. If the policy exists, it assumes it's valid and moves on. If it's missing, it attempts to Create it, safely catching and ignoring IsAlreadyExists errors to handle race conditions between claims. This eliminates the 409 conflicts and unnecessary Update calls.
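A self-contained sketch of that `Get -> Create` pattern, using in-memory stand-ins for the informer cache and API server (the real controller would use the controller-runtime client and `apierrors.IsAlreadyExists`; names here are illustrative):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var (
	errNotFound      = errors.New("not found")
	errAlreadyExists = errors.New("already exists")
)

// fakeAPI stands in for the API server / informer cache.
type fakeAPI struct {
	mu    sync.Mutex
	store map[string]bool
}

func (a *fakeAPI) Get(name string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.store[name] {
		return nil
	}
	return errNotFound
}

func (a *fakeAPI) Create(name string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.store[name] {
		return errAlreadyExists
	}
	a.store[name] = true
	return nil
}

// ensureNetworkPolicy checks the cache first; if the shared policy exists it
// assumes it is valid and moves on. Only if it is missing does it Create,
// treating a concurrent creation by another claim's reconcile as success.
func ensureNetworkPolicy(api *fakeAPI, name string) error {
	if err := api.Get(name); err == nil {
		return nil // present; no Update, no 409s
	} else if !errors.Is(err, errNotFound) {
		return err
	}
	if err := api.Create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil
}

func main() {
	api := &fakeAPI{store: map[string]bool{}}
	// Many claims referencing the same template race to ensure one policy.
	for i := 0; i < 3; i++ {
		if err := ensureNetworkPolicy(api, "tmpl-a-default"); err != nil {
			fmt.Println("error:", err)
		}
	}
	fmt.Println("policies:", len(api.store)) // exactly one shared object
}
```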
```go
	"172.16.0.0/12",  // Block Private Class B
	"192.168.0.0/16", // Block Private Class C
	"169.254.0.0/16", // Block Link-Local (Metadata Server)
},
```
Some Kubernetes environments use non-standard ranges for Pod or Service CIDRs. Consider making them configurable or dynamically detecting the cluster's private ranges to avoid bypasses.
These are all already configurable via the sandbox template, aren't they?
Yes, our strategy here is to provide a "Secure by Default" baseline targeting standard RFC1918 private ranges. If a cluster administrator is running non-standard IP ranges (like public IPs for pods), they can easily override this default by explicitly defining the networkPolicy block in their SandboxTemplate. I've added a note to the README mentioning this exact scenario.
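For illustration, the blocked-range semantics can be checked with Go's `net/netip` (a sketch only — in the cluster the CNI enforces the NetworkPolicy, not application code; the `10.0.0.0/8` entry is assumed from the RFC1918 baseline described here):

```go
package main

import (
	"fmt"
	"net/netip"
)

// blockedCIDRs mirrors the default-deny egress exceptions: RFC1918 private
// ranges plus the link-local range that hosts the metadata server.
var blockedCIDRs = []string{
	"10.0.0.0/8",     // Private Class A (assumed from context)
	"172.16.0.0/12",  // Private Class B
	"192.168.0.0/16", // Private Class C
	"169.254.0.0/16", // Link-Local (Metadata Server)
}

// egressAllowed reports whether a destination IP falls outside every
// blocked range, i.e. is treated as public Internet by the default policy.
func egressAllowed(ip string) bool {
	addr, err := netip.ParseAddr(ip)
	if err != nil {
		return false
	}
	for _, c := range blockedCIDRs {
		if netip.MustParsePrefix(c).Contains(addr) {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(egressAllowed("8.8.8.8"))         // public resolver: allowed
	fmt.Println(egressAllowed("10.12.0.5"))       // typical pod CIDR: blocked
	fmt.Println(egressAllowed("169.254.169.254")) // metadata server: blocked
}
```

This also makes the reviewer's point concrete: a cluster whose pod CIDR is outside these ranges would not be covered by the baseline and would need an explicit `networkPolicy` in the template.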
```go
// Public Internet Access (Strict Isolation)
// This rule allows all ports to PUBLIC IPs, but explicitly blocks private LAN ranges.
// NOTE: This intentionally blocks internal cluster DNS (CoreDNS) by default to prevent
// agents from probing for service discovery and leaking internal service names.
```
This will cause sandboxes to fail any domain resolution by default and potentially breaking most agent workloads. Being so strict by default could potentially have the risk of users just copy-pasting allow-all policies just to make their agents work.
This should still allow resolution of public domains by default.
I'm much more worried about all agents ending up in an allow-all posture by default than I am about users overriding with overly-permissive policy in their sandbox templates, though it's a valid concern. With the latter at least they had to explicitly open the door.
To fix this without allowing internal DNS enumeration, I've added a secure default to createSandbox. If the user hasn't specified a DNSPolicy, we automatically inject DNSPolicy: None and set the Nameservers to ["8.8.8.8", "1.1.1.1"].
This completely bypasses CoreDNS. The agent can instantly resolve public internet domains out of the box, but it is physically incapable of resolving internal .cluster.local addresses.
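A minimal sketch of that default, using local stand-ins for the Kubernetes PodSpec DNS fields (the real controller would use `k8s.io/api/core/v1` types; the function name is illustrative):

```go
package main

import "fmt"

// Local stand-ins for the corev1 PodSpec DNS fields.
type PodDNSConfig struct {
	Nameservers []string
}

type PodSpec struct {
	DNSPolicy string
	DNSConfig *PodDNSConfig
}

// applyDefaultDNS sketches the described behavior: if the user has not set a
// DNSPolicy, bypass CoreDNS entirely by pinning public resolvers, so the pod
// can resolve public domains but not internal .cluster.local names.
func applyDefaultDNS(spec *PodSpec) {
	if spec.DNSPolicy != "" {
		return // respect an explicit user choice
	}
	spec.DNSPolicy = "None"
	spec.DNSConfig = &PodDNSConfig{Nameservers: []string{"8.8.8.8", "1.1.1.1"}}
}

func main() {
	s := &PodSpec{}
	applyDefaultDNS(s)
	fmt.Println(s.DNSPolicy, s.DNSConfig.Nameservers)

	// An explicit user setting is left untouched.
	u := &PodSpec{DNSPolicy: "ClusterFirst"}
	applyDefaultDNS(u)
	fmt.Println(u.DNSPolicy)
}
```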
```diff
 // NetworkPolicy defines the network policy to be applied to the sandboxes
-// created from this template.
+// created from this template. A single shared NetworkPolicy is created per Template.
+// If this field is omitted (nil), the controller applies a SECURE DEFAULT policy:
```
This is a breaking change. Changing the default requires a release note.
What's the change policy in this repo? AIUI this is all currently alpha/experimental?
Even though it's still alpha, we should still communicate breaking changes to users to avoid surprises.
Ack, sounds good. What's the process for adding release notes?
I've added the release-note-action-required label. This can follow #191 to add a block that will be published in the release notes.
New changes are detected. LGTM label has been removed.

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aditya-shantanu, mtaufen, vicentefb. Needs approval from an approver in each of these files.
```go
if template.Spec.NetworkPolicy == nil {
	np.Spec = buildDefaultNetworkPolicySpec(template.Name)
```
suggestion: Add a feature toggle
This change forces people to use a shared network policy for the whole pool, which might not be desirable.
For example: we are currently using a network policy per sandbox to be able to control something like internet connectivity for each sandbox individually. So having a shared policy that is generated by default will cause some interesting side effects in our system.
If we want to stay with building default specs, I would at least recommend making it a feature toggle, so that people can opt out if they want/need to.
fixes #263
This PR updates the `sandboxclaim_controller` to enforce a "Secure by Default" network posture and introduces a highly scalable NetworkPolicy architecture.

**Previous Behavior:**

If a `SandboxTemplate` did not specify a `NetworkPolicy`, the controller created no policy, effectively granting the `Sandbox` unrestricted network access (depending on the cluster CNI default). This allowed untrusted workloads to potentially access the Node Metadata Server, the host network, or laterally scan other pods in the cluster. Furthermore, policies were envisioned as 1:1 per `Sandbox`, which would cause significant overhead at scale.

**New Behavior:**

- Shared Network Policies & Reduced API Contention: Network Policies are now created per `SandboxTemplate`, rather than per `SandboxClaim`. When 1,000 sandboxes are spawned from the same template, they share a single `NetworkPolicy` object. To prevent API server 409 conflicts and reduce load during mass reconciliation, the controller uses a cached `Get -> Create` flow (ignoring `AlreadyExists`) rather than `CreateOrUpdate`.
- Label Management & Backward Compatibility: The controller injects a template-hash label into `Sandbox` pods to ensure the shared `NetworkPolicy` correctly selects them (supporting both Warm Pool adoption and cold starts). Label strings have been carefully maintained to ensure existing WarmPool pods are not orphaned during controller upgrades.
- Secure By Default Baseline: If `spec.networkPolicy` is omitted in the `SandboxTemplate`, the controller automatically creates a strict default NetworkPolicy:
  - Ingress is allowed only from the `sandbox-router`; all other pod-to-pod ingress is strictly denied.
  - Egress to the public Internet is allowed, while private/internal CIDR ranges (including the metadata server's link-local range) are blocked.
  - Public resolvers (`8.8.8.8`, `1.1.1.1`) are injected into the Sandbox pod spec. This allows agents to resolve public domains out-of-the-box without exposing the cluster's internal DNS architecture.
- If the user defines `spec.networkPolicy`, the controller respects it as-is, automatically wrapping it with the correct PolicyTypes and Pod Selectors. Users can utilize hostAliases in their templates for local LLM routing without opening DNS holes (example added to docs).

example: