feat: Allow Fileownership change through FSGroup and VOLUME_MOUNT_GROUP by mytreya-rh · Pull Request #1841 · kubernetes-sigs/secrets-store-csi-driver

mytreya-rh · 2025-06-09T17:49:45Z

What type of PR is this?
/kind feature

What this PR does / why we need it:
(For now, please treat this as a draft PR for discussing the solution further.)
Allows secrets to be mounted with the FSGroup specified in the pod spec.
Thus, a pod running as a non-root user should be able to read a secret, and that secret need not be world-readable.

Which issue(s) this PR fixes :
Fixes #858

Is this a chart or deployment yaml update?
There is a YAML update for secrets-store.csi.x-k8s.io_secretproviderclasspodstatuses.yaml (generated through make manifests).
It is added in manifest_staging/deploy.
But if this PR merges after #1622, the change in SecretProviderClassPodStatusStatus won't be required anymore and we can revert the changes related to the reconciler.

Special notes for your reviewer:

Problem:

As of now, all secrets mounted by the ss-csi driver have root:root ownership. This is unwieldy because it requires either much wider permissions on the secret or elevated privileges on the workload.
The standard way to set file ownership of volume mounts is through FSGroup.
However, that does not work with SS-CSI Driver even if FSGroupPolicy is set to "File" because:
- https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/csi/csi_mounter.go#L454
  - Here, ‘readonly’ volumes are skipped from being chown’d by csi_mounter even if the CSIDriver CRD is created with FSGroupPolicy of “File”
  - This part executes in the kubelet while it tries to provision the volume for the POD.
- https://github.com/kubernetes-sigs/secrets-store-csi-driver/blob/main/pkg/secrets-store/nodeserver.go#L205
  - It is not possible to create a read-write volume with the ss-csi driver, and rightly so, because secrets are only supposed to be read.
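
The skip described in the first bullet can be pictured with the sketch below. It is a simplified stand-in and not the kubelet's actual code: the function name `kubeletAppliesFSGroup` and the policy type are hypothetical, and the real check in csi_mounter.go looks at more inputs (for example the filesystem type and access modes).

```go
package main

import "fmt"

// FSGroupPolicy is a hypothetical stand-in for the CSIDriver fsGroupPolicy field.
type FSGroupPolicy string

const (
	PolicyNone FSGroupPolicy = "None"
	PolicyFile FSGroupPolicy = "File"
)

// kubeletAppliesFSGroup approximates the decision described above:
// a read-only volume is never chown'd by the kubelet, even with policy "File".
// Other policies are treated as "no" here purely to keep the sketch short.
func kubeletAppliesFSGroup(fsGroup *int64, policy FSGroupPolicy, readOnly bool) bool {
	if fsGroup == nil || policy == PolicyNone || readOnly {
		return false
	}
	return policy == PolicyFile
}

func main() {
	gid := int64(1000)
	fmt.Println("read-write, File policy:", kubeletAppliesFSGroup(&gid, PolicyFile, false))
	fmt.Println("read-only,  File policy:", kubeletAppliesFSGroup(&gid, PolicyFile, true))
}
```

Because the secrets-store volume is always read-only, the second case is the one that applies, which is why the ownership change has to happen inside the driver.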

Solution:

outline
Do the ownership change from within the driver by advertising the VOLUME_MOUNT_GROUP capability.

Notes:

The changes also include secret rotation based on SecretProviderClassPodStatusStatus, but this will be reverted if "feat: Use RequiresRepublish for secret rotation" (#1622) merges earlier.
In addition, pulled up some common, repetitive code from the unit and e2e tests to make them a bit more terse.

tests added in e2e-provider:

(leaving in the test status and runtime just for reference)

ok 16 Non-root POD with no FSGroup - create in 871ms
ok 17 Non-root POD with no FSGroup - Should fail to read non world readable secret in 186ms
ok 18 Non-root POD with no FSGroup - unmount succeeds in 10143ms
ok 19 Non-root POD with FSGroup - create in 1439ms
ok 20 Non-root POD with FSGroup - should read non world readable secret in 202ms
ok 21 Non-root POD with FSGroup - rotated secret should also be readable in 37119ms
ok 22 Non-root POD with FSGroup - unmount succeeds in 10177ms

unit tests:

nodeserver_test
- TestNodePublishVolume_Errors/Invalid_FSGroup
- TestNodePublishVolume/volume_mount_with_valid_FSGroup
reconciler_test
- TestReconcileError/failed_to_parse_FSGroup
- TestReconcileNoError/reconcile_with_FSGroup

TODOs:

squashed commits
includes documentation
adds unit tests

linux-foundation-easycla · 2025-06-09T17:49:49Z

The committers listed above are authorized under a signed CLA.

✅ login: mytreya-rh / name: Mytreya Kasturi (b4862da)

k8s-ci-robot · 2025-06-09T17:49:54Z

Welcome @mytreya-rh!

It looks like this is your first PR to kubernetes-sigs/secrets-store-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/secrets-store-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

k8s-ci-robot · 2025-06-09T17:49:55Z

Hi @mytreya-rh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

dobsonj · 2025-06-27T13:04:21Z

/ok-to-test

mytreya-rh · 2025-06-30T05:56:05Z

/retest

aramase

The windows job failures are related to this PR.

E0630 16:56:06.282028   10108 atomic_writer.go:419] "unable to change file with owner" err="chown c:\\var\\lib\\kubelet\\pods\\ff425598-c3fa-480d-a6af-814831673629\\volumes\\kubernetes.io~csi\\secrets-store-inline\\mount\\..2025_06_30_16_56_06.1168464543\\secretalias: not supported by windows" logContext="secrets-store-csi-driver" fullPath="c:\\var\\lib\\kubelet\\pods\\ff425598-c3fa-480d-a6af-814831673629\\volumes\\kubernetes.io~csi\\secrets-store-inline\\mount\\..2025_06_30_16_56_06.1168464543\\secretalias" owner=-1

ref: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-windows/1939725228552753152/artifacts/2025-06-30T170106/secrets-store.log

mytreya-rh · 2025-06-30T19:49:06Z

Thanks @aramase !
Pushed a commit to skip Chown on Windows. I believe this is in line with FSGroup behavior on Windows nodes as well.

mytreya-rh · 2025-07-01T01:28:11Z

/retest

dobsonj

I have one comment on test/bats/e2e-provider.bats, but otherwise LGTM. It's a useful fix, the implementation looks correct, test coverage is good, and CI tests are passing. Netlify is warning about a line unrelated to your changes.

dobsonj · 2025-07-11T17:59:29Z

test/bats/e2e-provider.bats

+  kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name}
+  local pod_ip=$(kubectl get pod -n kube-system -l app=csi-secrets-store-e2e-provider -o jsonpath="{.items[0].status.podIP}")
+  run kubectl exec ${curl_pod_name} -n rotation -- curl http://${pod_ip}:8080/rotation?rotated=true
+  sleep 35 # 30 is poll interval, 5 second grace should be enough


I worry that 35 seconds may not be enough to prevent flakes. In @test "Test auto rotation of mount contents and K8s secrets" (line 472) it used to sleep 60 seconds, but now it only sleeps 35 seconds? Is it possible for a reconcile loop to be delayed for some reason that would cause this to take longer than 35?

I would probably not reduce this below 60, we had one similar case in vault.bats waiting on secret rotation where we had to increase it to 120 to improve the pass rate.

Thanks @dobsonj
Reverted the sleep back to 60s

mytreya-rh · 2025-07-21T04:51:05Z

/retest

dobsonj · 2025-07-21T19:50:34Z

/lgtm

/sig storage
/triage accepted
/priority important-soon

/assign @aramase
for approval and to decide which PR should merge first between #1841 and #1622

mytreya-rh · 2025-12-12T07:04:02Z

/retest

Looks like infra issues:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-azure/1999365595329466368#1:build-log.txt%3A1495

helm.go:92: 2025-12-12 06:37:38.2738949 +0000 UTC m=+0.187530371 [debug] Get "https://sscsi-e2e--sscsi-e2e-9e5f-46678f-9lqficf8.hcp.uksouth.azmk8s.io:443/version": dial tcp: lookup sscsi-e2e--sscsi-e2e-9e5f-46678f-9lqficf8.hcp.uksouth.azmk8s.io on 172.20.0.10:53: no such host
Kubernetes cluster unreachable

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-windows/1999365595853754368#1:build-log.txt%3A18

ERROR: (Canceled) Operation is being canceled by user

mytreya-rh · 2026-02-24T15:31:18Z

/retest

as failure in windows job seems to be infra related:

tcp: lookup sscsi-e2e--sscsi-e2e-8ab1-46678f-qgavjfpt.hcp.eastus.azmk8s.io on 172.20.0.10:53: no such host
Unable to connect to the server: dial tcp: lookup sscsi-e2e--sscsi-e2e-8ab1-46678f-qgavjfpt.hcp.eastus.azmk8s.io on 172.20.0.10:53: no such host

mytreya-rh · 2026-02-25T04:32:59Z

/retest
looks like an infra error:

E0224 15:46:58.618759 33163 reflector.go:204] "Failed to watch" err="failed to list *unstructured.Unstructured: Get "https://sscsi-e2e--sscsi-e2e-25c2-46678f-ync6prxe.hcp.eastus2.azmk8s.io:443/apis/batch/v1/namespaces/kube-system/jobs?fieldSelector=metadata.name%3Dsecrets-store-csi-driver-upgrade-crds&resourceVersion=4266\": dial tcp: lookup sscsi-e2e--sscsi-e2e-25c2-46678f-ync6prxe.hcp.eastus2.azmk8s.io on 172.20.0.10:53: no such host" logger="UnhandledError" reflector="k8s.io/client-go@v0.35.0/tools/cache/reflector.go:289" type="*unstructured.Unstructured"

mytreya-rh · 2026-02-27T14:14:17Z

/retest
looks like an infra issue during helm based installation of the driver:

E0225 04:47:54.485408 33865 reflector.go:204] "Failed to watch" err="failed to list *unstructured.Unstructured: Get "https://sscsi-e2e--sscsi-e2e-b1fa-46678f-3x5rjnjr.hcp.eastus2.azmk8s.io:443/apis/batch/v1/namespaces/kube-system/jobs?fieldSelector=metadata.name%3Dsecrets-store-csi-driver-upgrade-crds&resourceVersion=4137": dial tcp: lookup sscsi-e2e--sscsi-e2e-b1fa-46678f-3x5rjnjr.hcp.eastus2.azmk8s.io on 172.20.0.10:53: no such host" logger="UnhandledError" reflector="k8s.io/client-go@v0.35.0/tools/cache/reflector.go:289" type="*unstructured.Unstructured"

aramase

Another pass.

aramase · 2026-03-16T20:42:27Z

pkg/constants/constants.go

+/*
+Copyright 2025 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+	http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/


Let's use the new boilerplate without the year.

got rid of this package altogether as per later suggestion.

aramase · 2026-03-16T20:49:39Z

pkg/util/fileutil/atomic_writer.go

 		}

-		if fileProjection.FsUser == nil {
+		if fileProjection.FsGroup == nil || runtimeutil.IsRuntimeWindows() {


FsGroup is always set to &gid even when gid == constants.NoGID (-1). This means FsGroup is never nil, so the nil check never triggers on Linux -- every mount without FSGroup will still call os.Chown(path, -1, -1) on every file. This is unnecessary syscall overhead.

maybe something like:

fp := FileProjection{
	Data: payload.GetContents(),
	Mode: payload.GetMode(),
}
if gid != constants.NoGID {
	fp.FsGroup = &gid
}
files[payload.GetPath()] = fp

Yes, updated the FSGroup population so that only valid GIDs get assigned.
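
A minimal sketch of the agreed behavior follows, assuming a trimmed-down copy of `FileProjection`. The helpers `newProjection` and `chownGroup`, and the injectable `chown` parameter, are hypothetical names introduced only for illustration; in the driver the call would be `os.Chown`.

```go
package main

import (
	"fmt"
	"runtime"
)

// NoGID is the sentinel for "no FSGroup requested" (-1, as discussed in the review).
const NoGID = -1

// FileProjection is a simplified copy of the struct under review.
type FileProjection struct {
	Data    []byte
	Mode    int32
	FsGroup *int // nil means: leave group ownership untouched
}

// newProjection sets FsGroup only for a real GID, so nil reliably means "skip chown".
func newProjection(data []byte, mode int32, gid int) FileProjection {
	fp := FileProjection{Data: data, Mode: mode}
	if gid != NoGID {
		fp.FsGroup = &gid
	}
	return fp
}

// chownGroup changes only the group of path (uid stays -1, i.e. unchanged).
// The chown function is passed in so the sketch can run without touching files;
// it has the same signature as os.Chown.
func chownGroup(path string, fp FileProjection, isWindows bool, chown func(string, int, int) error) error {
	if fp.FsGroup == nil || isWindows {
		return nil // no syscall is issued
	}
	return chown(path, -1, *fp.FsGroup)
}

func main() {
	record := func(name string, uid, gid int) error {
		fmt.Printf("chown %s uid=%d gid=%d\n", name, uid, gid)
		return nil
	}
	isWindows := runtime.GOOS == "windows"
	_ = chownGroup("secretalias", newProjection(nil, 0o440, NoGID), isWindows, record) // skipped
	_ = chownGroup("secretalias", newProjection(nil, 0o440, 1000), isWindows, record)  // chown on non-Windows
}
```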

aramase · 2026-03-16T20:52:52Z

pkg/util/fileutil/filesystem.go

+		return constants.NoGID, nil
+	}
+	// Non-sentinel negative GID is invalid and thus we use ParseUint here.
+	gid, err := strconv.ParseUint(fsGroupStr, 10, 63)


strconv.ParseUint(fsGroupStr, 10, 63) accepts GIDs up to 2^63 - 1, but valid Linux GIDs max out at 2^32 - 1. Consider using bit size 32 for tighter validation or add a comment explaining why 63 was chosen.

Changed the type for FSGroup/GID to a plain 'int' throughout.
Didn't switch to 32-bit, as the kubelet allows 64-bit values: https://github.com/kubernetes/kubernetes/blob/b910026535af2d8a64d45efefeb8d9efb75a4817/pkg/volume/csi/csi_client.go#L64
This way, we do not assume anything about the valid size, since the chown API also takes the gid as a plain int.
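
To make the bit-size point concrete, here is a small sketch of how the `bitSize` argument of `strconv.ParseUint` changes which strings are accepted. The helper name `fitsBitSize` is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// fitsBitSize reports whether s parses as a base-10 unsigned integer
// that fits in bitSize bits.
func fitsBitSize(s string, bitSize int) bool {
	_, err := strconv.ParseUint(s, 10, bitSize)
	return err == nil
}

func main() {
	inputs := []string{
		"4294967295",          // 2^32 - 1
		"4294967296",          // 2^32
		"9223372036854775807", // 2^63 - 1
		"9223372036854775808", // 2^63
		"-1",                  // negative values never parse as unsigned
	}
	for _, s := range inputs {
		fmt.Printf("%-20s fits 32 bits: %-5t fits 63 bits: %t\n", s, fitsBitSize(s, 32), fitsBitSize(s, 63))
	}
}
```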

aramase · 2026-03-16T20:54:16Z

pkg/secrets-store/nodeserver_test.go

+	}
+	customize(request)
+	return request
+}


Suggested change

}

}

nit

aramase · 2026-03-16T20:55:15Z

pkg/secrets-store/nodeserver_test.go

 			},
 		},
 		{
-			name: "volume mount with rotation but skipped",


why are we removing this test?

Thanks for catching this. I guess I lost it during a rebase.
Re-included the test.

aramase · 2026-03-16T20:57:16Z

test/bats/e2e-provider.bats

+  # On Windows, the failed unmount calls from: https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/545
+  # do not prevent the pod from being deleted. Search through the driver logs
+  # for the error.
+  run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"


Suggested change

run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"

run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"

the second -n wins, so -n $NAMESPACE would be dead code.

aramase · 2026-03-16T20:57:33Z

test/bats/e2e-provider.bats

 # default key value returned by mock provider.
 # base64 encoded content comparision is easier in case of very long multiline string.
 export KEY_VALUE_CONTAINS=${KEY_VALUE:-"LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KVGhpcyBpcyBtb2NrIGtleQotLS0tLUVORCBQVUJMSUMgS0VZLS0tLS0K"}
+# defualt version value returned by mock provider


Suggested change

# defualt version value returned by mock provider

# default version value returned by mock provider

corrected all the three occurrences of this typo, thanks!

aramase · 2026-03-16T20:58:36Z

pkg/constants/constants.go

This package contains a single constant (NoGID) used only in the context of file operations. Consider placing it in pkg/util/fileutil instead to avoid a new package for one constant.

yes, moved the lone constant to fileutil

aramase · 2026-03-16T21:01:18Z

test/bats/e2e-provider.bats

+  assert_failure
+}
+
+function enable_secret_rotation() {


This function creates a local curl_pod_name but never echos it. Callers do curl_pod_name=$(enable_secret_rotation), which captures all stdout from kubectl run, kubectl wait, and curl -- not just the pod name. Then disable_secret_rotation $curl_pod_name receives garbage.

Add echo "$curl_pod_name" at the end of the function and suppress stdout on the intermediate commands.

Thanks for catching this.
Redirected the stdout of other commands to /dev/null
echoed the curl_pod_name
(Verified with a local test that the return value is just the pod name. Didn't check it in though)

mytreya-rh · 2026-03-20T02:48:57Z

/retest

looks like a transient error on the failed jobs:

#3 [internal] load metadata for registry.k8s.io/build-image/debian-base:bookworm-v1.0.6
#3 ERROR: unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/build-image/debian-base/manifests/bookworm-v1.0.6: 429 Too Many Requests

pkg/util/fileutil/atomic_writer.go

-		if err := os.Chown(fullPath, int(*fileProjection.FsUser), -1); err != nil {
-			klog.ErrorS(err, "unable to change file with owner", "logContext", w.logContext, "fullPath", fullPath, "owner", int(*fileProjection.FsUser))
+		if err := os.Chown(fullPath, -1, int(*fileProjection.FsGroup)); err != nil {
+			klog.ErrorS(err, "unable to change file with owner", "logContext", w.logContext, "fullPath", fullPath, "owner", int(*fileProjection.FsGroup))


codecov-commenter · 2026-03-20T05:06:49Z

Codecov Report

❌ Patch coverage is 66.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 22.31%. Comparing base (19f9876) to head (8618850).
⚠️ Report is 36 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/secrets-store/nodeserver.go	60.00%	6 Missing ⚠️
pkg/util/fileutil/atomic_writer.go	33.33%	2 Missing ⚠️
pkg/util/fileutil/writer.go	71.42%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1841      +/-   ##
==========================================
+ Coverage   21.47%   22.31%   +0.83%     
==========================================
  Files          57       57              
  Lines        3269     3218      -51     
==========================================
+ Hits          702      718      +16     
+ Misses       2476     2407      -69     
- Partials       91       93       +2

mytreya-rh · 2026-03-20T07:19:50Z

/retest
looks like transient error:

error: Internal error occurred: error sending request: Post "https://10.224.0.33:10250/exec/test-ns/busybox-deployment-78b5c7bdf9-22sf4/busybox?command=cat&command=%!F(MISSING)mnt%!F(MISSING)secrets-store%!F(MISSING)secretalias&error=1&output=1": proxy error from localhost:9443 while dialing 10.224.0.33:10250, code 500: 500 Internal Server Error

aramase

This is close.

aramase · 2026-03-25T20:15:03Z

pkg/secrets-store/nodeserver.go

 	"k8s.io/klog/v2"
 	mount "k8s.io/mount-utils"
 	"sigs.k8s.io/controller-runtime/pkg/client"
+	internalerrors "sigs.k8s.io/secrets-store-csi-driver/pkg/errors"


nit: group imports

import (
	stdlib
	internal
	external
)

Done, makes it better organized. Thanks

aramase · 2026-03-25T20:16:29Z

pkg/util/fileutil/filesystem.go

+	if len(fsGroupStr) == 0 {
+		return NoGID, nil
+	}
+	return strconv.Atoi(fsGroupStr)


strconv.Atoi accepts negative values. The test even validates -23 as a valid GID. A negative GID other than -1 passed to os.Chown is undefined behavior on Linux. Kubelet should never send a negative value, but we should still reject it here.

func ParseFSGroup(fsGroupStr string) (int, error) {
	if len(fsGroupStr) == 0 {
		return NoGID, nil
	}
	gid, err := strconv.Atoi(fsGroupStr)
	if err != nil {
		return NoGID, err
	}
	if gid < 0 {
		return NoGID, fmt.Errorf("invalid FSGroup: %d must be non-negative", gid)
	}
	return gid, nil
}

Update the negative gid test case to expect an error.

Agreed, and done.
Earlier, my intention was to keep full type compatibility and extensibility and let the implementation (os.Chown) handle the full range it supports. But since fsGroup is validated at the API level to be in the range 0 to 2147483647, negative values are now disallowed as you suggested.
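
As a rough illustration of the expected behavior after this change, the sketch below reproduces the suggested `ParseFSGroup` and runs it over a few inputs. The value of `NoGID` (-1) follows the discussion above; the input list is made up for the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// NoGID is the sentinel for "no FSGroup requested".
const NoGID = -1

// ParseFSGroup follows the reviewer's suggestion: an empty string means
// no FSGroup, and any negative value is rejected.
func ParseFSGroup(fsGroupStr string) (int, error) {
	if len(fsGroupStr) == 0 {
		return NoGID, nil
	}
	gid, err := strconv.Atoi(fsGroupStr)
	if err != nil {
		return NoGID, err
	}
	if gid < 0 {
		return NoGID, fmt.Errorf("invalid FSGroup: %d must be non-negative", gid)
	}
	return gid, nil
}

func main() {
	// Illustrative inputs: unset, valid, negative, non-numeric.
	for _, in := range []string{"", "1000", "-23", "abc"} {
		gid, err := ParseFSGroup(in)
		fmt.Printf("input=%q gid=%d err=%v\n", in, gid, err)
	}
}
```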

aramase · 2026-03-25T20:40:23Z

test/bats/e2e-provider.bats

+  # On Windows, the failed unmount calls from: https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/545
+  # do not prevent the pod from being deleted. Search through the driver logs
+  # for the error.
+  run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"


-l app=$POD_NAME doesn't match anything — the test pods don't have that label. This means kubectl logs returns empty, grep always fails, and assert_failure always passes. The unmount error check is effectively a no-op.

Should be the driver DaemonSet pods:

Suggested change

run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"

run bash -c "kubectl logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"

oops, that was such bad refactoring. Thanks for catching it, corrected.

aramase · 2026-03-25T20:41:42Z

pkg/secrets-store/nodeserver_test.go

+			csiPodName:            "pod1",
+			csiPodNamespace:       "default",
+			csiPodUID:             "poduid1",
+		},


nit: default VolumeCapability has nil AccessType. Works today because GetMount() on nil returns nil and GetVolumeMountGroup() on nil returns "". Fragile if we ever add a nil guard.

VolumeCapability: &csi.VolumeCapability{
	AccessType: &csi.VolumeCapability_Mount{
		Mount: &csi.VolumeCapability_MountVolume{},
	},
},
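
The reason the nil case works today can be seen in a small sketch of nil-safe getters. The types below are hypothetical stand-ins that only mimic the style of generated protobuf getters; the real CSI types model the access type as a oneof and are more involved.

```go
package main

import "fmt"

// MountVolume is a stand-in for the generated mount-volume message.
type MountVolume struct {
	VolumeMountGroup string
}

// GetVolumeMountGroup is safe to call on a nil receiver.
func (m *MountVolume) GetVolumeMountGroup() string {
	if m == nil {
		return ""
	}
	return m.VolumeMountGroup
}

// VolumeCapability is a stand-in for the generated capability message.
type VolumeCapability struct {
	Mount *MountVolume
}

// GetMount is safe to call on a nil receiver.
func (c *VolumeCapability) GetMount() *MountVolume {
	if c == nil {
		return nil
	}
	return c.Mount
}

func main() {
	var unset *VolumeCapability
	// The chain does not panic; it yields the empty string.
	fmt.Printf("unset: %q\n", unset.GetMount().GetVolumeMountGroup())

	set := &VolumeCapability{Mount: &MountVolume{VolumeMountGroup: "1000"}}
	fmt.Printf("set:   %q\n", set.GetMount().GetVolumeMountGroup())
}
```

Setting the mount explicitly in the test fixture, as suggested, keeps the test independent of this nil-chaining behavior.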

aramase · 2026-03-25T20:43:13Z

test/bats/e2e-provider.bats

+  # enable rotation response in mock server
+  local curl_pod_name=curl-$(openssl rand -hex 5)
+  kubectl run ${curl_pod_name} -n rotation --image=curlimages/curl:7.75.0 --labels="test=rotation" -- tail -f /dev/null > /dev/null
+  kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null


Suggested change

kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null

kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name}

I think we need the redirection to /dev/null so that the function returns just the curl_pod_name, right?
e.g.:

$ k wait -n kube-system --for=condition=Ready --timeout=60s pod coredns-6f6b679f8f-6vqqg
pod/coredns-6f6b679f8f-6vqqg condition met
$

$ k wait -n kube-system --for=condition=Ready --timeout=60s pod coredns-6f6b679f8f-6vqqg >/dev/null
$

aramase · 2026-03-25T20:43:27Z

test/bats/e2e-provider.bats

+  kubectl run ${curl_pod_name} -n rotation --image=curlimages/curl:7.75.0 --labels="test=rotation" -- tail -f /dev/null > /dev/null
+  kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null


If kubectl run or kubectl wait fails, the function silently continues and you get a confusing downstream failure. Add || return 1 after the critical commands.

Agree, now returning 1 on kubectl errors, so that the function call results in error in the caller's scope

aramase · 2026-03-25T20:44:11Z

pkg/util/fileutil/atomic_writer.go

-	FsUser *int64
+	Data    []byte
+	Mode    int32
+	FsGroup *int


nit: upstream uses FsUser *int64. This changes both the name and type — both intentional. Add a short comment noting the divergence so future readers don't think it drifted by accident.

I had actually added the comment in the file header, and have now added the reasoning for the type change as well.
Shall I move it to the struct definition instead?

mytreya-rh · 2026-03-27T03:05:11Z

/retest
looks like infra issue

Error: INSTALLATION FAILED: Kubernetes cluster unreachable: Get "https://sscsi-e2e--sscsi-e2e-86ed-46678f-t4lfap2y.hcp.uksouth.azmk8s.io:443/version": dial tcp: lookup sscsi-e2e--sscsi-e2e-86ed-46678f-t4lfap2y.hcp.uksouth.azmk8s.io on 172.20.0.10:53: no such host

aramase

Hopefully last set of comments.

aramase · 2026-03-27T20:47:27Z

test/bats/e2e-provider.bats

+  # On Windows, the failed unmount calls from: https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/545
+  # do not prevent the pod from being deleted. Search through the driver logs
+  # for the error.
+  run bash -c "kubectl -n $NAMESPACE logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"


-n $NAMESPACE queries the test namespace (e.g. default, test-v1alpha1), but the driver DaemonSet pods run in kube-system. This means kubectl logs finds no pods in the test namespace, grep fails, and assert_failure always passes, making this check a no-op.

The original code had -n kube-system:

Suggested change

run bash -c "kubectl -n $NAMESPACE logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"

run bash -c "kubectl logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"

done, thanks again for catching this oversight.
Also simplified the semantics of passing file permissions to create_spc function in this file.

aramase · 2026-03-27T20:48:22Z

test/e2eprovider/server/server.go

+			if err != nil || mode > 511 {
+				return nil, fmt.Errorf("invalid filePermission: %s, error: %w for file: %s", mockSecretsStoreObject.FilePermission, err, mockSecretsStoreObject.ObjectName)
+			}


When mode > 511 but err == nil, this wraps a nil error with %w which prints <nil> in the message. Split the conditions:

if err != nil {
	return nil, fmt.Errorf("invalid filePermission: %s, error: %w for file: %s", mockSecretsStoreObject.FilePermission, err, mockSecretsStoreObject.ObjectName)
}
if mode > 511 {
	return nil, fmt.Errorf("invalid filePermission: %s exceeds 0777 for file: %s", mockSecretsStoreObject.FilePermission, mockSecretsStoreObject.ObjectName)
}

Done, also changed mode > 511 to mode > 0o777 for better readability
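
A small sketch of the split validation is shown below. The helper `parseFilePermission` and the assumption that the permission arrives as an octal string such as "0440" are hypothetical; the e2e provider's actual parsing may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFilePermission is a hypothetical helper: it parses an octal
// permission string and reports parse errors and range errors separately.
func parseFilePermission(perm, name string) (uint32, error) {
	mode, err := strconv.ParseUint(perm, 8, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid filePermission: %s, error: %w for file: %s", perm, err, name)
	}
	if mode > 0o777 {
		return 0, fmt.Errorf("invalid filePermission: %s exceeds 0777 for file: %s", perm, name)
	}
	return uint32(mode), nil
}

func main() {
	// Combined check: wrapping a nil error with %w leaves a nil marker
	// in the message instead of a useful explanation.
	var nilErr error
	fmt.Println(fmt.Errorf("invalid filePermission: %s, error: %w", "01000", nilErr))

	// Split checks: each failure gets its own message.
	for _, perm := range []string{"0440", "01000", "abc"} {
		mode, err := parseFilePermission(perm, "secretalias")
		fmt.Printf("perm=%q mode=%o err=%v\n", perm, mode, err)
	}
}
```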

aramase · 2026-03-27T20:48:59Z

pkg/secrets-store/nodeserver.go

 	}

-	klog.V(2).InfoS("node publish volume", "target", targetPath, "volumeId", volumeID, "mount flags", mountFlags)
+	klog.V(2).InfoS("node publish volume", "target", targetPath, "volumeId", volumeID, "mount flags", mountFlags, "volumeCapabilities", req.VolumeCapability.String())


nit: req.VolumeCapability.String() dumps the entire proto including mount flags, access mode, etc. If the intent is just to log the FSGroup, consider logging mountVol.GetVolumeMountGroup() after parsing it instead. The full capability proto can be noisy in production logs.

Done, now only logging VolumeMountGroup.
However, not using the parsed value but the value obtained in the NodePublishVolume arguments, as it could help in better debugging if for some reason the parse function is not working as expected.

aramase · 2026-03-27T20:49:48Z

pkg/util/fileutil/atomic_writer.go

 //  * tag: v1.20.6,
 //  * commit: 8a62859e515889f07e3e3be6a1080413f17cf2c3
 //  * link: https://github.com/kubernetes/kubernetes/blob/8a62859e515889f07e3e3be6a1080413f17cf2c3/pkg/volume/util/atomic_writer.go
+// In addition, FileProjection::FSUser has been changed to FileProjection::FSGroup


The header comment is fine but could you also add a one-liner at the struct itself? That's where people will look when they see FsGroup *int and wonder why it doesn't match upstream's FsUser *int64:

// FileProjection contains file Data and access Mode.
// FsGroup diverges from upstream's FsUser (*int64); see file header for rationale.
type FileProjection struct {

mytreya-rh · 2026-03-30T11:50:46Z

/retest
looks like an environment issue in the AWS Provider:

E0330 11:11:23.049879 1 nodeserver.go:253] "failed to mount secrets store object content" err="rpc error: code = Unknown desc = Failed to fetch parameters from all regions." pod="kube-system/basic-test-mount" isRemountRequest=false
I0330 11:11:23.049919 1 nodeserver.go:86] "unmounting target path as node publish volume failed" targetPath="/var/lib/kubelet/pods/e021597b-0a10-48cf-811f-c98114044231/volumes/kubernetes.io~csi/secrets-store-inline/mount" pod="kube-system/basic-test-mount"

mytreya-rh · 2026-03-31T04:01:55Z

/retest
seems to be an infra issue or IRSA configuration error. If this fails, will add some debug in next commit

0330 12:07:02.492849 1 nodeserver.go:253] "failed to mount secrets store object content" err="rpc error: code = Unknown desc = Failed to fetch secret from all regions. Verify secret exists and required permissions are granted for: SecretsManagerRotationTest-secret-d37d9220da61" pod="kube-system/basic-test-mount" isRemountRequest=false

mytreya-rh · 2026-03-31T16:36:46Z

/hold
debugging the failing aws e2e job

mytreya-rh · 2026-04-01T04:46:49Z

/unhold
aws test was failing with

E0331 16:58:41.618697 1 nodeserver.go:253] "failed to mount secrets store object content" err="rpc error: code = Unknown desc = IRSA token extraction failed: token for audience "sts.amazonaws.com" not found - ensure tokenRequests includes this audience in CSIDriver" pod="kube-system/basic-test-mount" isRemountRequest=false
Included the fix in "Add sts.amazonaws.com audience to tokenRequests as needed by the newer version of AWS Provider".

Implements the CSI NodeServiceCapability RPC_VOLUME_MOUNT_GROUP so that mounted secret files are chown'd to the pod's FSGroup. This allows secrets to be not world-readable in non-root containers.
1. nodeServer::NodeGetCapabilities() advertise VOLUME_MOUNT_GROUP
2. nodeServer::NodePublishVolume() get POD's FSGroup if any from: req.VolumeCapability.GetMount().GetVolumeMountGroup()
3. pass the fsgroup onto (writer.go) WritePayloads()
4. include the FSGroup in the FileProjection struct (rename FileProjection::FSUser as FSGroup)
5. change AtomicWriter::writePayloadToDir() to chown the group based on FSGroup
6. Add relevant Unit tests, and e2eprovider tests
7. Bit of refactoring in the unit and e2eprovider tests to make them more terse

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 9, 2025

k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 9, 2025

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 9, 2025

k8s-ci-robot requested review from aramase and ritazh June 9, 2025 17:49

k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 9, 2025

enj moved this to Subprojects - Needs Triage in SIG Auth Jun 10, 2025

enj added this to SIG Auth Jun 10, 2025

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2025

mytreya-rh force-pushed the allow_file_ownership branch from 898943d to b797f9d Compare June 29, 2025 16:08

enj moved this from Subprojects - Needs Triage to In Review in SIG Auth Jun 30, 2025

aramase requested changes Jun 30, 2025

github-project-automation bot moved this from In Review to Changes Requested in SIG Auth Jun 30, 2025

mytreya-rh force-pushed the allow_file_ownership branch from b797f9d to cbac857 Compare June 30, 2025 19:43

dobsonj reviewed Jul 11, 2025

k8s-ci-robot assigned aramase Jul 21, 2025

k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jul 21, 2025

mytreya-rh force-pushed the allow_file_ownership branch from e47407a to 9495080 Compare December 12, 2025 06:27

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2025

mytreya-rh force-pushed the allow_file_ownership branch from 9495080 to da34a62 Compare February 24, 2026 14:23

aramase requested changes Mar 16, 2026

mytreya-rh force-pushed the allow_file_ownership branch 2 times, most recently from 871e2b8 to 8618850 March 19, 2026 19:26

github-advanced-security bot found potential problems Mar 20, 2026

mytreya-rh force-pushed the allow_file_ownership branch from 8618850 to 40f47db March 20, 2026 06:27

aramase requested changes Mar 25, 2026

aramase requested changes Mar 27, 2026

k8s-ci-robot added the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") Mar 31, 2026

mytreya-rh force-pushed the allow_file_ownership branch 2 times, most recently from 43f6852 to e08b33d March 31, 2026 17:51

k8s-ci-robot removed the do-not-merge/hold label ("Indicates that a PR should not merge because someone has issued a /hold command.") Apr 1, 2026

mytreya-rh force-pushed the allow_file_ownership branch 2 times, most recently from 41dd89b to 740df77 April 3, 2026 08:07

mytreya-rh force-pushed the allow_file_ownership branch from 740df77 to b4862da April 3, 2026 08:12

	run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"
	run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"

	# defualt version value returned by mock provider
	# default version value returned by mock provider

	run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"
	run bash -c "kubectl logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"

	kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null
	kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name}

		kubectl run ${curl_pod_name} -n rotation --image=curlimages/curl:7.75.0 --labels="test=rotation" -- tail -f /dev/null > /dev/null
		kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null

Conversation

mytreya-rh commented Jun 9, 2025

linux-foundation-easycla bot commented Jun 9, 2025

k8s-ci-robot commented Jun 9, 2025

k8s-ci-robot commented Jun 9, 2025

dobsonj commented Jun 27, 2025

mytreya-rh commented Jun 30, 2025

aramase left a comment

mytreya-rh commented Jun 30, 2025

mytreya-rh commented Jul 1, 2025

dobsonj left a comment

mytreya-rh commented Jul 21, 2025

dobsonj commented Jul 21, 2025

mytreya-rh commented Dec 12, 2025

mytreya-rh commented Feb 24, 2026

mytreya-rh commented Feb 25, 2026

mytreya-rh commented Feb 27, 2026

aramase left a comment

mytreya-rh commented Mar 20, 2026