feat: default network policy per SandboxTemplate #287

Open
vicentefb wants to merge 1 commit into kubernetes-sigs:main from vicentefb:defaultNetworkPolicy

@vicentefb (Member) commented Feb 5, 2026

fixes #263

This PR updates the sandboxclaim_controller to enforce a "Secure by Default" network posture and introduces a highly scalable NetworkPolicy architecture.

Previous Behavior:
If a SandboxTemplate did not specify a NetworkPolicy, the controller created no policy, effectively granting the Sandbox unrestricted network access (depending on the cluster CNI default). This allowed untrusted workloads to potentially access the Node Metadata Server, the host network, or laterally scan other pods in the cluster. Furthermore, policies were envisioned as 1:1 per Sandbox, which would cause significant overhead at scale.

New Behavior:

  • Shared Network Policies & Reduced API Contention: Network Policies are now created per SandboxTemplate, rather than per SandboxClaim. When 1,000 sandboxes are spawned from the same template, they now share a single NetworkPolicy object. To prevent API Server 409 conflicts and reduce load during mass-reconciliation, the controller utilizes a cached Get -> Create flow (ignoring AlreadyExists) rather than CreateOrUpdate.

  • Label Management & Backward Compatibility: The controller injects a template-hash label into Sandbox pods to ensure the shared NetworkPolicy correctly selects them (supporting both Warm Pool adoption and cold starts). Label strings have been carefully maintained to ensure existing WarmPool pods are not orphaned during controller upgrades.

Secure By Default Baseline:
If spec.networkPolicy is omitted in the SandboxTemplate, the controller automatically creates a strict default NetworkPolicy with:

  • Ingress: Allows traffic only from the sandbox-router. All other pod-to-pod ingress is strictly denied.
  • Egress (Strict Isolation): Allows egress to the Public Internet, but strictly blocks traffic to Private LAN ranges (RFC1918), the Node Metadata Server, and Internal Cluster DNS (CoreDNS). This prevents lateral movement, credential theft, and enumeration of internal services.
  • DNS Resolution (CoreDNS Bypass): Because internal DNS is blocked, the controller automatically injects dnsPolicy: None and public nameservers (8.8.8.8, 1.1.1.1) into the Sandbox pod spec. This allows agents to resolve public domains out-of-the-box without exposing the cluster's internal DNS architecture.
  • Pass-Through Customization: If a user does provide a spec.networkPolicy, the controller respects it as-is, automatically wrapping it with the correct PolicyTypes and Pod Selectors. Users can utilize hostAliases in their templates for local LLM routing without opening DNS holes (example added to docs).

example:

kubectl get networkpolicy -o yaml
apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: NetworkPolicy
  metadata:
    name: limited-egress-template-network-policy
    namespace: default
    ownerReferences:
    - apiVersion: extensions.agents.x-k8s.io/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: SandboxTemplate
      name: limited-egress-template
      uid: 332fc874-157f-4f78-b040-64e3fa02a646
  spec:
    egress:
    - to:
      - ipBlock:
          cidr: 0.0.0.0/0
          except:
          - 10.0.0.0/8
          - 172.16.0.0/12
          - 192.168.0.0/16
          - 169.254.0.0/16
      - ipBlock:
          cidr: ::/0
          except:
          - fc00::/7
    ingress:
    - from:
      - podSelector:
          matchLabels:
            app: sandbox-router
    podSelector:
      matchLabels:
        agents.x-k8s.io/template-name-hash: 181aaabd
    policyTypes:
    - Ingress
    - Egress
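
For reviewers who want to sanity-check the egress rule above, here is a minimal sketch of how the two ipBlock entries can be read: a destination is permitted if it falls inside a block's cidr and outside all of its except ranges. The ipBlock type and the egressAllowed helper are local stand-ins written for this illustration, not the Kubernetes API types or the controller's code, and actual enforcement depends on the cluster's CNI.

```go
package main

import (
	"fmt"
	"net/netip"
)

// ipBlock mirrors the shape of a NetworkPolicy ipBlock (a CIDR plus carve-outs).
// It is a local stand-in for illustration, not the Kubernetes API type.
type ipBlock struct {
	cidr   netip.Prefix
	except []netip.Prefix
}

// allows reports whether dst is inside the block's CIDR and outside every exception.
func (b ipBlock) allows(dst netip.Addr) bool {
	if !b.cidr.Contains(dst) {
		return false
	}
	for _, ex := range b.except {
		if ex.Contains(dst) {
			return false
		}
	}
	return true
}

// defaultEgress reproduces the two ipBlock entries from the example policy.
func defaultEgress() []ipBlock {
	return []ipBlock{
		{
			cidr: netip.MustParsePrefix("0.0.0.0/0"),
			except: []netip.Prefix{
				netip.MustParsePrefix("10.0.0.0/8"),
				netip.MustParsePrefix("172.16.0.0/12"),
				netip.MustParsePrefix("192.168.0.0/16"),
				netip.MustParsePrefix("169.254.0.0/16"),
			},
		},
		{
			cidr:   netip.MustParsePrefix("::/0"),
			except: []netip.Prefix{netip.MustParsePrefix("fc00::/7")},
		},
	}
}

// egressAllowed reports whether any block permits traffic to dst.
// Unparseable input is treated as denied.
func egressAllowed(dst string) bool {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return false
	}
	for _, b := range defaultEgress() {
		if b.allows(addr) {
			return true
		}
	}
	return false
}

func main() {
	for _, dst := range []string{"8.8.8.8", "10.96.0.10", "169.254.169.254", "2001:4860:4860::8888", "fd00::1"} {
		fmt.Printf("%-22s allowed=%v\n", dst, egressAllowed(dst))
	}
}
```

Under this reading, public resolvers such as 8.8.8.8 pass, while an in-cluster address such as 10.96.0.10 (a commonly used cluster DNS service IP, assumed here for illustration) and the 169.254.169.254 metadata address are excluded.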

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Feb 5, 2026
@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch from bd0846b to 2d4716d Compare February 5, 2026 06:49
@vicentefb (Member Author):

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Feb 5, 2026
@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch 2 times, most recently from 6fe4b6e to 2b17f78 Compare February 11, 2026 02:52
@vicentefb vicentefb requested a review from acsoto February 11, 2026 03:50
@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch 3 times, most recently from 0c92531 to 10b3d5b Compare February 12, 2026 01:03
@vicentefb vicentefb changed the title feat: default network policy feat: default network policy per SandboxTemplate Feb 12, 2026
@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch 4 times, most recently from 39461fb to 081c27f Compare February 18, 2026 00:39
 const (
 	poolLabel = "agents.x-k8s.io/pool"
-	sandboxTemplateRefHash = "agents.x-k8s.io/sandbox-template-ref-hash"
+	sandboxTemplateRefHash = "agents.x-k8s.io/template-name-hash"
Member Author (@vicentefb, Feb 18, 2026):

I renamed the label key to agents.x-k8s.io/template-name-hash to match the SandboxClaim controller. This fixes a bug where warm pool pods were not being selected by the default "Deny-All" NetworkPolicy, leaving them unsecured.

Member:

It's not safe to rename labels. How are old labels on existing resources handled?

Member Author:

Sorry, I have reverted the label string to its original value ("agents.x-k8s.io/sandbox-template-ref-hash") in both the warmpool and claim controllers.

The new template-level NetworkPolicy will now use this original label for its PodSelector. This ensures that any existing pods sitting in the warm pool prior to this upgrade will be seamlessly selected and secured by the new default NetworkPolicy without being orphaned. I also ensured we no longer delete() this label during pod adoption, so the policy remains attached for the pod's entire lifecycle.
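
To illustrate why the label must survive adoption, here is a small sketch of matchLabels selection before and after a warm pool pod is adopted. The selects and adopt helpers, the pool label value, and the assumption that adoption removes the pool label are all hypothetical and only model the behavior described above; the hash value is borrowed from the example policy in the PR description.

```go
package main

import "fmt"

const (
	poolLabel              = "agents.x-k8s.io/pool"
	sandboxTemplateRefHash = "agents.x-k8s.io/sandbox-template-ref-hash"
)

// selects reports whether every key/value pair in matchLabels is present on the pod.
func selects(matchLabels, podLabels map[string]string) bool {
	for k, v := range matchLabels {
		if got, ok := podLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// adopt models a claim taking over a warm pool pod and returns the pod's new labels.
// Dropping the pool label is an assumption made for this sketch. keepHash=false
// models the earlier behavior of deleting the template hash label on adoption.
func adopt(podLabels map[string]string, keepHash bool) map[string]string {
	out := make(map[string]string, len(podLabels))
	for k, v := range podLabels {
		out[k] = v
	}
	delete(out, poolLabel)
	if !keepHash {
		delete(out, sandboxTemplateRefHash)
	}
	return out
}

func main() {
	// Hypothetical label values for a pod sitting in the warm pool.
	warm := map[string]string{poolLabel: "pool-a", sandboxTemplateRefHash: "181aaabd"}
	// The shared policy's podSelector.matchLabels.
	selector := map[string]string{sandboxTemplateRefHash: "181aaabd"}

	fmt.Println("warm pool pod selected:      ", selects(selector, warm))
	fmt.Println("adopted, label kept:         ", selects(selector, adopt(warm, true)))
	fmt.Println("adopted, label deleted (old):", selects(selector, adopt(warm, false)))
}
```

Once the hash label is deleted, the shared policy's selector no longer matches the pod, so the pod would fall outside the policy for the rest of its lifetime.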

@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch from 081c27f to b335c0c Compare February 18, 2026 00:47
@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch 2 times, most recently from 8f7b17f to b5d6e64 Compare February 23, 2026 19:03
// created from this template. A single shared NetworkPolicy is created per Template.
// If this field is omitted (nil), the controller applies a SECURE DEFAULT policy:
// - Ingress: Allow traffic ONLY from the Sandbox Router. All other ingress is denied.
// - Egress: Allow DNS Only (UDP/TCP Port 53). All other traffic (including Internet and Metadata Server) is blocked.
Reviewer:

Comment needs update now that we allow internet egress but not internal DNS

Member:

+1, and PR description needs to be updated as well

Member Author:

Done. I've updated the comments in sandboxtemplate_types.go and the PR description to accurately reflect the new egress posture (Public Internet allowed, Internal/Private CIDRs blocked).

mtaufen commented Feb 23, 2026

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 23, 2026
@vicentefb (Member Author):

/assign @janetkuo

@@ -549,65 +554,109 @@ func (r *SandboxClaimReconciler) SetupWithManager(mgr ctrl.Manager) error {
func (r *SandboxClaimReconciler) reconcileNetworkPolicy(ctx context.Context, claim *extensionsv1alpha1.SandboxClaim, template *extensionsv1alpha1.SandboxTemplate) error {
Member:

After this change, network policy becomes a shared resource, and having every individual SandboxClaim reconciliation attempt to CreateOrUpdate this shared object can lead to significant 409 conflicts from the API server, and unnecessary Get calls when scaling to thousands of sandboxes.

Reviewer:

Could this be cached locally in an informer to avoid the contention?

Member Author:

I've replaced CreateOrUpdate with a cached Get -> Create flow. The controller now checks the local informer cache first. If the policy exists, it assumes it's valid and moves on. If it's missing, it attempts to Create it, safely catching and ignoring IsAlreadyExists errors to handle race conditions between claims. This eliminates the 409 conflicts and unnecessary Update calls.
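
The flow described above can be sketched roughly as follows. To keep the example self-contained it uses in-memory stand-ins (fakeAPI, fakeCache, errAlreadyExists) rather than the real controller-runtime client and informer cache, so every name here is an assumption made for illustration and not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errAlreadyExists stands in for the API server's AlreadyExists error.
var errAlreadyExists = errors.New("already exists")

// fakeAPI stands in for the API server: Create fails if the name is taken.
type fakeAPI struct {
	mu      sync.Mutex
	objects map[string]bool
	creates int // number of Create calls received
}

func (a *fakeAPI) Create(name string) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.creates++
	if a.objects[name] {
		return errAlreadyExists
	}
	a.objects[name] = true
	return nil
}

// fakeCache stands in for the informer cache, which may lag behind the API server.
type fakeCache struct {
	mu    sync.RWMutex
	known map[string]bool
}

func (c *fakeCache) Has(name string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.known[name]
}

// Observe simulates the cache catching up with an object created on the server.
func (c *fakeCache) Observe(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.known[name] = true
}

// ensurePolicy implements cached Get -> Create, tolerating AlreadyExists.
func ensurePolicy(cache *fakeCache, api *fakeAPI, name string) error {
	if cache.Has(name) {
		return nil // cache hit: assume the existing policy is valid, no API write
	}
	if err := api.Create(name); err != nil && !errors.Is(err, errAlreadyExists) {
		return err
	}
	return nil // created, or another claim won the race
}

func main() {
	api := &fakeAPI{objects: map[string]bool{}}
	cache := &fakeCache{known: map[string]bool{}}
	name := "limited-egress-template-network-policy"

	// Two claims reconcile before the cache has caught up: both call Create,
	// the second gets AlreadyExists, and neither surfaces an error.
	fmt.Println("first claim: ", ensurePolicy(cache, api, name))
	fmt.Println("second claim:", ensurePolicy(cache, api, name))

	// After the cache observes the object, later claims make no API writes.
	cache.Observe(name)
	fmt.Println("third claim: ", ensurePolicy(cache, api, name))
	fmt.Println("Create calls:", api.creates)
}
```

One consequence of this shape, as the comment above notes, is that an existing policy is assumed to be valid: the flow never issues an Update, so it does not by itself reconcile drift in an already-created policy.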

"172.16.0.0/12", // Block Private Class B
"192.168.0.0/16", // Block Private Class C
"169.254.0.0/16", // Block Link-Local (Metadata Server)
},
Member:

Some Kubernetes environments use non-standard ranges for Pod or Service CIDRs. Consider making them configurable or dynamically detecting the cluster's private ranges to avoid bypasses.

Reviewer:

These are all already configurable via the sandbox template, aren't they?

Member Author:

Yes, our strategy here is to provide a "Secure by Default" baseline targeting standard RFC1918 private ranges. If a cluster administrator is running non-standard IP ranges (like public IPs for pods), they can easily override this default by explicitly defining the networkPolicy block in their SandboxTemplate. I've added a note to the README mentioning this exact scenario.

// Public Internet Access (Strict Isolation)
// This rule allows all ports to PUBLIC IPs, but explicitly blocks private LAN ranges.
// NOTE: This intentionally blocks internal cluster DNS (CoreDNS) by default to prevent
// agents from probing for service discovery and leaking internal service names.
Member:

This will cause sandboxes to fail all domain resolution by default, potentially breaking most agent workloads. Being this strict by default risks users copy-pasting allow-all policies just to make their agents work.

Reviewer:

This should still allow resolution of public domains by default.

Reviewer:

I'm much more worried about all agents ending up in an allow-all posture by default than I am about users overriding with overly-permissive policy in their sandbox templates, though it's a valid concern. With the latter at least they had to explicitly open the door.

Member Author:

To fix this without allowing internal DNS enumeration, I've added a secure default to createSandbox. If the user hasn't specified a DNSPolicy, we automatically inject DNSPolicy: None and set the Nameservers to ["8.8.8.8", "1.1.1.1"].

This completely bypasses CoreDNS. The agent can instantly resolve public internet domains out of the box, but it is physically incapable of resolving internal .cluster.local addresses.
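
A minimal sketch of that defaulting step, under the assumption that it only fires when the user left the DNS policy unset. The podSpec and dnsConfig types are simplified local stand-ins for the corresponding pod fields, and applyDefaultDNS is a hypothetical helper name, not the function in createSandbox.

```go
package main

import "fmt"

// dnsConfig and podSpec are simplified local stand-ins for the pod fields involved.
type dnsConfig struct {
	Nameservers []string
}

type podSpec struct {
	DNSPolicy string // empty means the user did not specify one
	DNSConfig *dnsConfig
}

// applyDefaultDNS injects dnsPolicy None and public nameservers, but only when
// the user has not chosen a DNS policy. Overwriting DNSConfig in that case is
// an assumption of this sketch.
func applyDefaultDNS(spec *podSpec) {
	if spec.DNSPolicy != "" {
		return // respect an explicit user choice
	}
	spec.DNSPolicy = "None"
	spec.DNSConfig = &dnsConfig{Nameservers: []string{"8.8.8.8", "1.1.1.1"}}
}

func main() {
	unset := podSpec{}
	applyDefaultDNS(&unset)
	fmt.Println(unset.DNSPolicy, unset.DNSConfig.Nameservers)

	explicit := podSpec{DNSPolicy: "ClusterFirst"}
	applyDefaultDNS(&explicit)
	fmt.Println(explicit.DNSPolicy, explicit.DNSConfig == nil)
}
```

With dnsPolicy None, the pod's resolver configuration comes only from the supplied nameservers, which is why cluster-internal names stop resolving while public names continue to work.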

 // NetworkPolicy defines the network policy to be applied to the sandboxes
-// created from this template.
+// created from this template. A single shared NetworkPolicy is created per Template.
+// If this field is omitted (nil), the controller applies a SECURE DEFAULT policy:
Member:

This is a breaking change. It requires a release note if we need to change the default.

Reviewer:

What's the change policy in this repo? AIUI this is all currently alpha/experimental?

Member:

Even though it is still alpha, we should communicate breaking changes to users to avoid surprises.

Reviewer:

Ack, sounds good. What's the process for adding release notes?

Member:

I've added the release-note-action-required label. This can follow #191 to add a block that will be published in the release notes.

@janetkuo janetkuo added the release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. label Feb 24, 2026
nit

update

update to fix e2e tests

policy per template

nit

updated api comments

update

update

update

nit

updated

update

update

update
@vicentefb vicentefb force-pushed the defaultNetworkPolicy branch from b5d6e64 to 2725b31 Compare February 25, 2026 20:49
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 25, 2026
@k8s-ci-robot (Contributor):

New changes are detected. LGTM label has been removed.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aditya-shantanu, mtaufen, vicentefb
Once this PR has been reviewed and has the lgtm label, please ask for approval from janetkuo. For more information see the Code Review Process.

Comment on lines +596 to +597
if template.Spec.NetworkPolicy == nil {
np.Spec = buildDefaultNetworkPolicySpec(template.Name)

Reviewer:

suggestion: Add a feature toggle

This change forces people to use a shared network policy for the whole pool, which might not be desirable.
For example: we are currently using a network policy per sandbox to be able to control something like internet connectivity for each sandbox individually. So having a shared policy that is generated by default will cause some interesting side effects in our system.

If we want to stay with building default specs, I would at least recommend making it a feature toggle, so that people can opt out if they want/need to.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-action-required Denotes a PR that introduces potentially breaking changes that require user action. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

feat: Implement default Secure Network Policy

7 participants