feat: Allow Fileownership change through FSGroup and VOLUME_MOUNT_GROUP #1841

Open
mytreya-rh wants to merge 1 commit into kubernetes-sigs:main from mytreya-rh:allow_file_ownership

@mytreya-rh mytreya-rh commented Jun 9, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
(For now, please consider this a draft PR to discuss the solution further.)
Allows secrets to be mounted with the FSGroup specified in the Pod spec.
Thus, a pod running as a non-root user can read a secret without that secret having to be world-readable.

Which issue(s) this PR fixes :
Fixes #858

Is this a chart or deployment yaml update?
There is a yaml update for secrets-store.csi.x-k8s.io_secretproviderclasspodstatuses.yaml (generated through make manifests).
It is added under manifest_staging/deploy.
If this PR merges after #1622, the change in SecretProviderClassPodStatusStatus will no longer be required and we can revert the reconciler-related changes.

Special notes for your reviewer:

Problem:

Solution:

  • outline
  • Do the ownership change from within the driver by advertising the VOLUME_MOUNT_GROUP capability.

Notes:

  • The changes also include secret rotation based on SecretProviderClassPodStatusStatus, but will be reverted if feat: Use RequiresRepublish for secret rotation #1622 merges earlier
  • In addition, pulled out some common, repetitive code from the unit and e2e tests to make them a bit more terse

tests added in e2e-provider:

(leaving in the test status and runtime just for reference)
  • ok 16 Non-root POD with no FSGroup - create in 871ms
  • ok 17 Non-root POD with no FSGroup - Should fail to read non world readable secret in 186ms
  • ok 18 Non-root POD with no FSGroup - unmount succeeds in 10143ms
  • ok 19 Non-root POD with FSGroup - create in 1439ms
  • ok 20 Non-root POD with FSGroup - should read non world readable secret in 202ms
  • ok 21 Non-root POD with FSGroup - rotated secret should also be readable in 37119ms
  • ok 22 Non-root POD with FSGroup - unmount succeeds in 10177ms

unit tests:

  • nodeserver_test
    • TestNodePublishVolume_Errors/Invalid_FSGroup
    • TestNodePublishVolume/volume_mount_with_valid_FSGroup
  • reconciler_test
    • TestReconcileError/failed_to_parse_FSGroup
    • TestReconcileNoError/reconcile_with_FSGroup

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 9, 2025

linux-foundation-easycla bot commented Jun 9, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: mytreya-rh / name: Mytreya Kasturi (b4862da)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 9, 2025
@k8s-ci-robot
Contributor

Welcome @mytreya-rh!

It looks like this is your first PR to kubernetes-sigs/secrets-store-csi-driver 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/secrets-store-csi-driver has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 9, 2025
@k8s-ci-robot
Contributor

Hi @mytreya-rh. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested review from aramase and ritazh June 9, 2025 17:49
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 9, 2025
@enj enj moved this to Subprojects - Needs Triage in SIG Auth Jun 10, 2025
@enj enj added this to SIG Auth Jun 10, 2025
@dobsonj
Member

dobsonj commented Jun 27, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 27, 2025
@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch from 898943d to b797f9d Compare June 29, 2025 16:08
@mytreya-rh
Author

/retest

@enj enj moved this from Subprojects - Needs Triage to In Review in SIG Auth Jun 30, 2025
Member

@aramase aramase left a comment

The windows job failures are related to this PR.

E0630 16:56:06.282028   10108 atomic_writer.go:419] "unable to change file with owner" err="chown c:\\var\\lib\\kubelet\\pods\\ff425598-c3fa-480d-a6af-814831673629\\volumes\\kubernetes.io~csi\\secrets-store-inline\\mount\\..2025_06_30_16_56_06.1168464543\\secretalias: not supported by windows" logContext="secrets-store-csi-driver" fullPath="c:\\var\\lib\\kubelet\\pods\\ff425598-c3fa-480d-a6af-814831673629\\volumes\\kubernetes.io~csi\\secrets-store-inline\\mount\\..2025_06_30_16_56_06.1168464543\\secretalias" owner=-1

ref: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-windows/1939725228552753152/artifacts/2025-06-30T170106/secrets-store.log

@github-project-automation github-project-automation bot moved this from In Review to Changes Requested in SIG Auth Jun 30, 2025
@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch from b797f9d to cbac857 Compare June 30, 2025 19:43
@mytreya-rh
Author

The windows job failures are related to this PR.

E0630 16:56:06.282028   10108 atomic_writer.go:419] "unable to change file with owner" err="chown c:\\var\\lib\\kubelet\\pods\\ff425598-c3fa-480d-a6af-814831673629\\volumes\\kubernetes.io~csi\\secrets-store-inline\\mount\\..2025_06_30_16_56_06.1168464543\\secretalias: not supported by windows" logContext="secrets-store-csi-driver" fullPath="c:\\var\\lib\\kubelet\\pods\\ff425598-c3fa-480d-a6af-814831673629\\volumes\\kubernetes.io~csi\\secrets-store-inline\\mount\\..2025_06_30_16_56_06.1168464543\\secretalias" owner=-1

ref: https://storage.googleapis.com/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-windows/1939725228552753152/artifacts/2025-06-30T170106/secrets-store.log

Thanks @aramase !
Pushed a commit to skip Chown on Windows. I believe this is in line with FSGroup behavior on Windows nodes as well.
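
The skip described above could look roughly like the following. This is a minimal sketch, not the PR's code: `chownGroup` is a hypothetical helper, and `runtime.GOOS` stands in for the driver's own `runtimeutil.IsRuntimeWindows()` check.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// chownGroup is a hypothetical helper, not the driver's actual code.
// It changes only the group of path, and does nothing when no group was
// requested or when running on Windows, where os.Chown is not supported.
func chownGroup(path string, gid *int) error {
	if gid == nil || runtime.GOOS == "windows" {
		return nil
	}
	// A uid of -1 leaves the file's owner unchanged.
	return os.Chown(path, -1, *gid)
}

func main() {
	f, err := os.CreateTemp("", "secret-*")
	if err != nil {
		fmt.Println("create temp file:", err)
		return
	}
	defer os.Remove(f.Name())
	f.Close()

	// Use the caller's own group so no extra privileges are needed.
	gid := os.Getgid()
	fmt.Println("chown error:", chownGroup(f.Name(), &gid))
}
```

Returning early keeps the Windows path free of error logs like the one quoted above, at the cost of silently ignoring a requested group there.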

@mytreya-rh
Author

/retest

Member

@dobsonj dobsonj left a comment

I have one comment on test/bats/e2e-provider.bats, but otherwise LGTM. It's a useful fix, the implementation looks correct, test coverage is good, and CI tests are passing. Netlify is warning about a line unrelated to your changes.

kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name}
local pod_ip=$(kubectl get pod -n kube-system -l app=csi-secrets-store-e2e-provider -o jsonpath="{.items[0].status.podIP}")
run kubectl exec ${curl_pod_name} -n rotation -- curl http://${pod_ip}:8080/rotation?rotated=true
sleep 35 # 30 is poll interval, 5 second grace should be enough
Member

I worry that 35 seconds may not be enough to prevent flakes. In @test "Test auto rotation of mount contents and K8s secrets" (line 472) it used to sleep 60 seconds, but now it only sleeps 35 seconds? Is it possible for a reconcile loop to be delayed for some reason that would cause this to take longer than 35?

I would probably not reduce this below 60, we had one similar case in vault.bats waiting on secret rotation where we had to increase it to 120 to improve the pass rate.

Author

Thanks @dobsonj
Reverted the sleep back to 60s

@mytreya-rh
Author

/retest

@dobsonj
Member

dobsonj commented Jul 21, 2025

/lgtm

/sig storage
/triage accepted
/priority important-soon

/assign @aramase
for approval and to decide which PR should merge first between #1841 and #1622

@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jul 21, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2025
@mytreya-rh
Author

/retest

Looks like infra issues:
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-azure/1999365595329466368#1:build-log.txt%3A1495

helm.go:92: 2025-12-12 06:37:38.2738949 +0000 UTC m=+0.187530371 [debug] Get "https://sscsi-e2e--sscsi-e2e-9e5f-46678f-9lqficf8.hcp.uksouth.azmk8s.io:443/version": dial tcp: lookup sscsi-e2e--sscsi-e2e-9e5f-46678f-9lqficf8.hcp.uksouth.azmk8s.io on 172.20.0.10:53: no such host
Kubernetes cluster unreachable

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/1841/pull-secrets-store-csi-driver-e2e-windows/1999365595853754368#1:build-log.txt%3A18

ERROR: (Canceled) Operation is being canceled by user

@mytreya-rh
Author

/retest

as failure in windows job seems to be infra related:

tcp: lookup sscsi-e2e--sscsi-e2e-8ab1-46678f-qgavjfpt.hcp.eastus.azmk8s.io on 172.20.0.10:53: no such host
Unable to connect to the server: dial tcp: lookup sscsi-e2e--sscsi-e2e-8ab1-46678f-qgavjfpt.hcp.eastus.azmk8s.io on 172.20.0.10:53: no such host

@mytreya-rh
Author

/retest
looks like an infra error:

E0224 15:46:58.618759 33163 reflector.go:204] "Failed to watch" err="failed to list *unstructured.Unstructured: Get "https://sscsi-e2e--sscsi-e2e-25c2-46678f-ync6prxe.hcp.eastus2.azmk8s.io:443/apis/batch/v1/namespaces/kube-system/jobs?fieldSelector=metadata.name%3Dsecrets-store-csi-driver-upgrade-crds&resourceVersion=4266\": dial tcp: lookup sscsi-e2e--sscsi-e2e-25c2-46678f-ync6prxe.hcp.eastus2.azmk8s.io on 172.20.0.10:53: no such host" logger="UnhandledError" reflector="k8s.io/client-go@v0.35.0/tools/cache/reflector.go:289" type="*unstructured.Unstructured"

@mytreya-rh
Author

/retest
looks like an infra issue during helm based installation of the driver:

E0225 04:47:54.485408 33865 reflector.go:204] "Failed to watch" err="failed to list *unstructured.Unstructured: Get "https://sscsi-e2e--sscsi-e2e-b1fa-46678f-3x5rjnjr.hcp.eastus2.azmk8s.io:443/apis/batch/v1/namespaces/kube-system/jobs?fieldSelector=metadata.name%3Dsecrets-store-csi-driver-upgrade-crds&resourceVersion=4137": dial tcp: lookup sscsi-e2e--sscsi-e2e-b1fa-46678f-3x5rjnjr.hcp.eastus2.azmk8s.io on 172.20.0.10:53: no such host" logger="UnhandledError" reflector="k8s.io/client-go@v0.35.0/tools/cache/reflector.go:289" type="*unstructured.Unstructured"

Member

@aramase aramase left a comment

Another pass.

Comment on lines +1 to +15
/*
Copyright 2025 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
Member

Let's use the new boilerplate without the year.

Author

got rid of this package altogether as per later suggestion.

}

if fileProjection.FsUser == nil {
if fileProjection.FsGroup == nil || runtimeutil.IsRuntimeWindows() {
Member

FsGroup is always set to &gid even when gid == constants.NoGID (-1). This means FsGroup is never nil, so the nil check never triggers on Linux -- every mount without FSGroup will still call os.Chown(path, -1, -1) on every file. This is unnecessary syscall overhead.

maybe something like:

 fp := FileProjection{
     Data: payload.GetContents(),
     Mode: payload.GetMode(),
 }
 if gid != constants.NoGID {
     fp.FsGroup = &gid
 }
 files[payload.GetPath()] = fp

Author

Yes, I have now updated the FSGroup population so that only valid GIDs get assigned.

return constants.NoGID, nil
}
// Non-sentinel negative GID is invalid and thus we use ParseUint here.
gid, err := strconv.ParseUint(fsGroupStr, 10, 63)
Member

strconv.ParseUint(fsGroupStr, 10, 63) accepts GIDs up to 2^63 - 1, but valid Linux GIDs max out at 2^32 - 1. Consider using bit size 32 for tighter validation or add a comment explaining why 63 was chosen.

Author

Changed the type for FSGroup/GID to be a generic 'int' throughout.
Didn't switch to 32 bit as the kubelet makes allowance for 64bit values: https://github.com/kubernetes/kubernetes/blob/b910026535af2d8a64d45efefeb8d9efb75a4817/pkg/volume/csi/csi_client.go#L64
This way, we are not assuming anything about the valid size as the chown API also considers a gid to be generic int type.
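
To make the bit-size discussion concrete, here is a small sketch of how the `bitSize` argument of `strconv.ParseUint` changes what is accepted. `parseGID` is a hypothetical wrapper used only for illustration; it is not a function from this PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGID is a hypothetical helper for illustration. bitSize is passed
// straight to strconv.ParseUint, which rejects values that do not fit.
func parseGID(s string, bitSize int) (uint64, error) {
	return strconv.ParseUint(s, 10, bitSize)
}

func main() {
	const pastUint32 = "4294967296" // 2^32, one past the largest 32-bit value

	_, err32 := parseGID(pastUint32, 32)
	_, err63 := parseGID(pastUint32, 63)
	fmt.Println("bitSize 32:", err32) // out of range
	fmt.Println("bitSize 63:", err63) // accepted, so err63 is nil

	// ParseUint does not accept a sign, so negative input is a syntax error.
	_, errNeg := parseGID("-23", 63)
	fmt.Println("negative:", errNeg)
}
```

With bit size 32 the parser itself enforces the tighter limit; with 63 any range restriction has to be applied separately.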

}
customize(request)
return request
}
Member

Suggested change
}
}

nit

Author

done

},
},
{
name: "volume mount with rotation but skipped",
Member

why are we removing this test?

Author

Thanks for catching this. I guess I lost it during the rebase.
Re-included the test.

# On Windows, the failed unmount calls from: https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/545
# do not prevent the pod from being deleted. Search through the driver logs
# for the error.
run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"
Member

Suggested change
run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"
run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"

the second -n wins, so -n $NAMESPACE would be dead code.

Author

done

# default key value returned by mock provider.
# base64 encoded content comparision is easier in case of very long multiline string.
export KEY_VALUE_CONTAINS=${KEY_VALUE:-"LS0tLS1CRUdJTiBQVUJMSUMgS0VZLS0tLS0KVGhpcyBpcyBtb2NrIGtleQotLS0tLUVORCBQVUJMSUMgS0VZLS0tLS0K"}
# defualt version value returned by mock provider
Member

Suggested change
# defualt version value returned by mock provider
# default version value returned by mock provider

Author

corrected all the three occurrences of this typo, thanks!

Member

This package contains a single constant (NoGID) used only in the context of file operations. Consider placing it in pkg/util/fileutil instead to avoid a new package for one constant.

Author

yes, moved the lone constant to fileutil

assert_failure
}

function enable_secret_rotation() {
Member

This function creates a local curl_pod_name but never echoes it. Callers do curl_pod_name=$(enable_secret_rotation), which captures all stdout from kubectl run, kubectl wait, and curl -- not just the pod name. Then disable_secret_rotation $curl_pod_name receives garbage.

Add echo "$curl_pod_name" at the end of the function and suppress stdout on the intermediate commands.

Author

Thanks for catching this.
Redirected the stdout of other commands to /dev/null
echoed the curl_pod_name
(Verified with a local test that the return value is just the pod name. Didn't check it in though)

@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch 2 times, most recently from 871e2b8 to 8618850 Compare March 19, 2026 19:26
@mytreya-rh
Author

/retest

looks like a transient error on the failed jobs:

#3 [internal] load metadata for registry.k8s.io/build-image/debian-base:bookworm-v1.0.6
#3 ERROR: unexpected status from HEAD request to https://us-central1-docker.pkg.dev/v2/k8s-artifacts-prod/images/build-image/debian-base/manifests/bookworm-v1.0.6: 429 Too Many Requests

}
if err := os.Chown(fullPath, int(*fileProjection.FsUser), -1); err != nil {
klog.ErrorS(err, "unable to change file with owner", "logContext", w.logContext, "fullPath", fullPath, "owner", int(*fileProjection.FsUser))
if err := os.Chown(fullPath, -1, int(*fileProjection.FsGroup)); err != nil {

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types

Incorrect conversion of an unsigned 63-bit integer from [strconv.ParseUint](1) to a lower bit size type int without an upper bound check.
if err := os.Chown(fullPath, int(*fileProjection.FsUser), -1); err != nil {
klog.ErrorS(err, "unable to change file with owner", "logContext", w.logContext, "fullPath", fullPath, "owner", int(*fileProjection.FsUser))
if err := os.Chown(fullPath, -1, int(*fileProjection.FsGroup)); err != nil {
klog.ErrorS(err, "unable to change file with owner", "logContext", w.logContext, "fullPath", fullPath, "owner", int(*fileProjection.FsGroup))

Check failure

Code scanning / CodeQL

Incorrect conversion between integer types

Incorrect conversion of an unsigned 63-bit integer from [strconv.ParseUint](1) to a lower bit size type int without an upper bound check.
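
One common way to address this class of alert is to check the upper bound explicitly before narrowing to `int`. The sketch below is an assumption about how that could look, not the change made in this PR; `parseGIDAsInt` is a hypothetical name, and the `math.MaxInt32` limit is chosen to match the 0 to 2147483647 fsGroup range mentioned later in this conversation.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseGIDAsInt is a hypothetical helper. It parses an unsigned value and
// checks the upper bound explicitly before converting to int, so the
// conversion cannot overflow even where int is 32 bits wide.
func parseGIDAsInt(s string) (int, error) {
	v, err := strconv.ParseUint(s, 10, 64)
	if err != nil {
		return 0, err
	}
	if v > math.MaxInt32 {
		return 0, fmt.Errorf("gid %d exceeds %d", v, math.MaxInt32)
	}
	return int(v), nil
}

func main() {
	for _, s := range []string{"1000", "2147483647", "2147483648"} {
		gid, err := parseGIDAsInt(s)
		fmt.Println(s, gid, err)
	}
}
```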
@codecov-commenter

Codecov Report

❌ Patch coverage is 66.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 22.31%. Comparing base (19f9876) to head (8618850).
⚠️ Report is 36 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/secrets-store/nodeserver.go | 60.00% | 6 Missing ⚠️ |
| pkg/util/fileutil/atomic_writer.go | 33.33% | 2 Missing ⚠️ |
| pkg/util/fileutil/writer.go | 71.42% | 1 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1841      +/-   ##
==========================================
+ Coverage   21.47%   22.31%   +0.83%     
==========================================
  Files          57       57              
  Lines        3269     3218      -51     
==========================================
+ Hits          702      718      +16     
+ Misses       2476     2407      -69     
- Partials       91       93       +2     

@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch from 8618850 to 40f47db Compare March 20, 2026 06:27
@mytreya-rh
Author

mytreya-rh commented Mar 20, 2026

/retest
looks like transient error:

error: Internal error occurred: error sending request: Post "https://10.224.0.33:10250/exec/test-ns/busybox-deployment-78b5c7bdf9-22sf4/busybox?command=cat&command=%!F(MISSING)mnt%!F(MISSING)secrets-store%!F(MISSING)secretalias&error=1&output=1": proxy error from localhost:9443 while dialing 10.224.0.33:10250, code 500: 500 Internal Server Error

Member

@aramase aramase left a comment

This is close.

"k8s.io/klog/v2"
mount "k8s.io/mount-utils"
"sigs.k8s.io/controller-runtime/pkg/client"
internalerrors "sigs.k8s.io/secrets-store-csi-driver/pkg/errors"
Member

nit: group imports

import (
  stdlib

  internal

  external
)

Author

Done, makes it better organized. Thanks

if len(fsGroupStr) == 0 {
return NoGID, nil
}
return strconv.Atoi(fsGroupStr)
Member

strconv.Atoi accepts negative values. The test even validates -23 as a valid GID. A negative GID other than -1 passed to os.Chown is undefined behavior on Linux. Kubelet should never send a negative value, but we should still reject it here.

func ParseFSGroup(fsGroupStr string) (int, error) {
      if len(fsGroupStr) == 0 {
              return NoGID, nil
      }
      gid, err := strconv.Atoi(fsGroupStr)
      if err != nil {
              return NoGID, err
      }
      if gid < 0 {
              return NoGID, fmt.Errorf("invalid FSGroup: %d must be non-negative", gid)
      }
      return gid, nil
}

Update the negative gid test case to expect an error.

Author

Agree, and done.
Earlier, my intention was to keep complete type compatibility/extensibility and let the implementation (os.Chown) handle the full range it supports. But since fsGroup is validated at the API level to be in the range 0 to 2147483647, I am now disallowing negative values as you suggested.

# On Windows, the failed unmount calls from: https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/545
# do not prevent the pod from being deleted. Search through the driver logs
# for the error.
run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"
Member

-l app=$POD_NAME doesn't match anything — the test pods don't have that label. This means kubectl logs returns empty, grep always fails, and assert_failure always passes. The unmount error check is effectively a no-op.

Should be the driver DaemonSet pods:

Suggested change
run bash -c "kubectl -n $NAMESPACE logs -l app=$POD_NAME --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"
run bash -c "kubectl logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"

Author

oops, that was such bad refactoring. Thanks for catching it, corrected.

csiPodName: "pod1",
csiPodNamespace: "default",
csiPodUID: "poduid1",
},
Member

nit: default VolumeCapability has nil AccessType. Works today because GetMount() on nil returns nil and GetVolumeMountGroup() on nil returns "". Fragile if we ever add a nil guard.

              VolumeCapability: &csi.VolumeCapability{
                      AccessType: &csi.VolumeCapability_Mount{
                              Mount: &csi.VolumeCapability_MountVolume{},
                      },
              },

Author

Done
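
The nil behavior this thread relies on can be illustrated with a short sketch. The types below are simplified stand-ins written for this example (the real CSI types wrap the mount in an AccessType oneof); they only mimic the nil-receiver pattern of generated getters and are not the csi package.

```go
package main

import "fmt"

// mountVolume and volumeCapability are simplified, hypothetical stand-ins
// for the generated CSI types; they are not the real csi package.
type mountVolume struct {
	VolumeMountGroup string
}

// GetVolumeMountGroup tolerates a nil receiver, as generated getters do.
func (m *mountVolume) GetVolumeMountGroup() string {
	if m == nil {
		return ""
	}
	return m.VolumeMountGroup
}

type volumeCapability struct {
	Mount *mountVolume
}

// GetMount also tolerates a nil receiver.
func (c *volumeCapability) GetMount() *mountVolume {
	if c == nil {
		return nil
	}
	return c.Mount
}

func main() {
	var unset *volumeCapability // nothing populated at all
	fmt.Printf("%q\n", unset.GetMount().GetVolumeMountGroup())

	set := &volumeCapability{Mount: &mountVolume{VolumeMountGroup: "1000"}}
	fmt.Printf("%q\n", set.GetMount().GetVolumeMountGroup())
}
```

Chained getters make the unset case look identical to an empty group, which is why populating the capability explicitly in test fixtures keeps the tests meaningful if a nil guard is added later.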

# enable rotation response in mock server
local curl_pod_name=curl-$(openssl rand -hex 5)
kubectl run ${curl_pod_name} -n rotation --image=curlimages/curl:7.75.0 --labels="test=rotation" -- tail -f /dev/null > /dev/null
kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null
Member

Suggested change
kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null
kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name}

Author

I think we need the redirection to /dev/null so that the function returns just the curl_pod_name, right?
For example:

$ k wait -n kube-system --for=condition=Ready --timeout=60s pod coredns-6f6b679f8f-6vqqg
pod/coredns-6f6b679f8f-6vqqg condition met
$
$ k wait -n kube-system --for=condition=Ready --timeout=60s pod coredns-6f6b679f8f-6vqqg >/dev/null
$

Comment on lines +108 to +109
kubectl run ${curl_pod_name} -n rotation --image=curlimages/curl:7.75.0 --labels="test=rotation" -- tail -f /dev/null > /dev/null
kubectl wait -n rotation --for=condition=Ready --timeout=60s pod ${curl_pod_name} > /dev/null
Member

If kubectl run or kubectl wait fails, the function silently continues and you get a confusing downstream failure. Add || return 1 after the critical commands.

Author

Agree, now returning 1 on kubectl errors, so that the function call results in error in the caller's scope

FsUser *int64
Data []byte
Mode int32
FsGroup *int
Member

nit: upstream uses FsUser *int64. This changes both the name and type — both intentional. Add a short comment noting the divergence so future readers don't think it drifted by accident.

Author

Actually, I had added the comment in the file header, and have now added the reasoning for the type change as well.
Shall I move it to the struct definition instead?

@mytreya-rh
Author

/retest
looks like infra issue

Error: INSTALLATION FAILED: Kubernetes cluster unreachable: Get "https://sscsi-e2e--sscsi-e2e-86ed-46678f-t4lfap2y.hcp.uksouth.azmk8s.io:443/version": dial tcp: lookup sscsi-e2e--sscsi-e2e-86ed-46678f-t4lfap2y.hcp.uksouth.azmk8s.io on 172.20.0.10:53: no such host

Member

@aramase aramase left a comment

Hopefully last set of comments.

# On Windows, the failed unmount calls from: https://github.com/kubernetes-sigs/secrets-store-csi-driver/pull/545
# do not prevent the pod from being deleted. Search through the driver logs
# for the error.
run bash -c "kubectl -n $NAMESPACE logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"
Member

-n $NAMESPACE queries the test namespace (e.g. default, test-v1alpha1), but the driver DaemonSet pods run in kube-system. This means kubectl logs finds no pods in the test namespace, grep fails, and assert_failure always passes, making this check a no-op.

The original code had -n kube-system:

Suggested change
run bash -c "kubectl -n $NAMESPACE logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store | grep '^E.*failed to clean and unmount target path.*$'"
run bash -c "kubectl logs -l app=secrets-store-csi-driver --tail -1 -c secrets-store -n kube-system | grep '^E.*failed to clean and unmount target path.*$'"

Author

Done, thanks again for catching this oversight.
I also simplified the semantics of passing file permissions to the create_spc function in this file.

Comment on lines +183 to +185
if err != nil || mode > 511 {
return nil, fmt.Errorf("invalid filePermission: %s, error: %w for file: %s", mockSecretsStoreObject.FilePermission, err, mockSecretsStoreObject.ObjectName)
}
Member

When mode > 511 but err == nil, this wraps a nil error with %w which prints <nil> in the message. Split the conditions:

if err != nil {
    return nil, fmt.Errorf("invalid filePermission: %s, error: %w for file: %s", mockSecretsStoreObject.FilePermission, err, mockSecretsStoreObject.ObjectName)
}
if mode > 511 {
    return nil, fmt.Errorf("invalid filePermission: %s exceeds 0777 for file: %s", mockSecretsStoreObject.FilePermission, mockSecretsStoreObject.ObjectName)
}

Author

Done. I also changed mode > 511 to mode > 0o777 for better readability.
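
As an illustration of the split suggested above, here is a minimal sketch. It assumes the permission arrives as an octal string and is parsed with strconv.ParseInt; the helper name parseFilePermission and its parameters are hypothetical and are not taken from the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFilePermission is a hypothetical helper, not the PR's code.
// It assumes the permission is given as an octal string such as "0640".
func parseFilePermission(perm, objectName string) (int32, error) {
	mode, err := strconv.ParseInt(perm, 8, 32)
	if err != nil {
		// Wrap with %w only when there is a real error to wrap.
		return 0, fmt.Errorf("invalid filePermission: %s, error: %w for file: %s", perm, err, objectName)
	}
	if mode < 0 || mode > 0o777 {
		// Range failure: no underlying error, so nothing is wrapped and
		// the message cannot contain "<nil>".
		return 0, fmt.Errorf("invalid filePermission: %s is outside 0-0777 for file: %s", perm, objectName)
	}
	return int32(mode), nil
}

func main() {
	for _, perm := range []string{"0640", "1777", "rw-r--"} {
		mode, err := parseFilePermission(perm, "secret.txt")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %#o\n", perm, mode)
	}
}
```

Keeping the two checks separate means each failure mode gets its own message, and errors.Is / errors.As only see a wrapped error when one actually exists.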

}

klog.V(2).InfoS("node publish volume", "target", targetPath, "volumeId", volumeID, "mount flags", mountFlags)
klog.V(2).InfoS("node publish volume", "target", targetPath, "volumeId", volumeID, "mount flags", mountFlags, "volumeCapabilities", req.VolumeCapability.String())
Member

nit: req.VolumeCapability.String() dumps the entire proto including mount flags, access mode, etc. If the intent is just to log the FSGroup, consider logging mountVol.GetVolumeMountGroup() after parsing it instead. The full capability proto can be noisy in production logs.

Author

Done, now only logging VolumeMountGroup.
However, it logs the raw value received in the NodePublishVolume request rather than the parsed value, since the raw value helps with debugging if the parse function does not behave as expected.
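
To make the logging choice concrete, here is a small sketch of reading only the mount group through nil-safe getters. The VolumeCapability and MountVolume types below are simplified stand-ins written for this example; they only mirror the getter style of the generated CSI types and are not the real definitions. mountGroupForLog is a hypothetical name.

```go
package main

import "fmt"

// MountVolume and VolumeCapability are simplified stand-ins for the
// generated CSI types, kept here so the sketch is self-contained.
type MountVolume struct {
	FsType           string
	MountFlags       []string
	VolumeMountGroup string
}

type VolumeCapability struct {
	Mount *MountVolume
}

// GetMount follows the nil-safe getter style of generated protobuf code.
func (c *VolumeCapability) GetMount() *MountVolume {
	if c == nil {
		return nil
	}
	return c.Mount
}

// GetVolumeMountGroup returns "" when no mount volume is set.
func (m *MountVolume) GetVolumeMountGroup() string {
	if m == nil {
		return ""
	}
	return m.VolumeMountGroup
}

// mountGroupForLog (hypothetical) extracts just the raw, unparsed group
// string, so the log line omits mount flags, access mode, and the rest.
func mountGroupForLog(c *VolumeCapability) string {
	return c.GetMount().GetVolumeMountGroup()
}

func main() {
	withGroup := &VolumeCapability{
		Mount: &MountVolume{MountFlags: []string{"ro"}, VolumeMountGroup: "2000"},
	}
	fmt.Printf("volumeMountGroup=%q\n", mountGroupForLog(withGroup))
	fmt.Printf("volumeMountGroup=%q\n", mountGroupForLog(nil))
}
```

Because the getters tolerate nil receivers, the chained call is safe even when the request carries no mount capability, and the raw string is logged before any parsing takes place.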

// * tag: v1.20.6,
// * commit: 8a62859e515889f07e3e3be6a1080413f17cf2c3
// * link: https://github.com/kubernetes/kubernetes/blob/8a62859e515889f07e3e3be6a1080413f17cf2c3/pkg/volume/util/atomic_writer.go
// In addition, FileProjection::FSUser has been changed to FileProjection::FSGroup
Member

The header comment is fine but could you also add a one-liner at the struct itself? That's where people will look when they see FsGroup *int and wonder why it doesn't match upstream's FsUser *int64:

// FileProjection contains file Data and access Mode.
// FsGroup diverges from upstream's FsUser (*int64) — see file header for rationale.
type FileProjection struct {

Author

Done

@mytreya-rh
Author

/retest
looks like an environment issue in the AWS Provider:

E0330 11:11:23.049879 1 nodeserver.go:253] "failed to mount secrets store object content" err="rpc error: code = Unknown desc = Failed to fetch parameters from all regions." pod="kube-system/basic-test-mount" isRemountRequest=false
I0330 11:11:23.049919 1 nodeserver.go:86] "unmounting target path as node publish volume failed" targetPath="/var/lib/kubelet/pods/e021597b-0a10-48cf-811f-c98114044231/volumes/kubernetes.io~csi/secrets-store-inline/mount" pod="kube-system/basic-test-mount"

@mytreya-rh
Author

/retest
Seems to be an infra issue or an IRSA configuration error. If this fails again, I will add some debugging in the next commit.

0330 12:07:02.492849 1 nodeserver.go:253] "failed to mount secrets store object content" err="rpc error: code = Unknown desc = Failed to fetch secret from all regions. Verify secret exists and required permissions are granted for: SecretsManagerRotationTest-secret-d37d9220da61" pod="kube-system/basic-test-mount" isRemountRequest=false

@mytreya-rh
Author

/hold
Debugging the failing aws e2e job.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2026
@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch 2 times, most recently from 43f6852 to e08b33d Compare March 31, 2026 17:51
@mytreya-rh
Author

/unhold
The aws test was failing with:

E0331 16:58:41.618697 1 nodeserver.go:253] "failed to mount secrets store object content" err="rpc error: code = Unknown desc = IRSA token extraction failed: token for audience "sts.amazonaws.com" not found - ensure tokenRequests includes this audience in CSIDriver" pod="kube-system/basic-test-mount" isRemountRequest=false
Included the fix in: Add sts.amazonaws.com audience to tokenRequests as needed by the newer version of AWS Provider

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 1, 2026
@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch 2 times, most recently from 41dd89b to 740df77 Compare April 3, 2026 08:07
Implements the CSI NodeServiceCapability RPC_VOLUME_MOUNT_GROUP so that
mounted secret files are chown'd to the pod's FSGroup.
This allows non-root containers to read secrets that are not world-readable.

1. nodeServer::NodeGetCapabilities()
   advertise VOLUME_MOUNT_GROUP
2. nodeServer::NodePublishVolume()
   get the pod's FSGroup, if any, from: req.VolumeCapability.GetMount().GetVolumeMountGroup()
3. pass the FSGroup on to WritePayloads() (writer.go)
4. include the FSGroup in the FileProjection struct (rename FileProjection::FsUser to FsGroup)
5. change AtomicWriter::writePayloadToDir() to chown the group based on FSGroup
6. add relevant unit tests and e2e-provider tests
7. a bit of refactoring in the unit and e2e-provider tests to make them more terse
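
The flow in steps 2 to 5 can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: FileProjection is a cut-down stand-in, parseMountGroup and writeProjection are hypothetical helpers, and the sketch assumes a Unix-like system where the process may chown a file to its own group.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// FileProjection is a simplified stand-in for the struct discussed in the PR.
type FileProjection struct {
	Data    []byte
	Mode    int32
	FsGroup *int
}

// parseMountGroup (hypothetical) turns the volume mount group string into
// an optional GID; an empty string means the pod set no FSGroup.
func parseMountGroup(s string) (*int, error) {
	if s == "" {
		return nil, nil
	}
	gid, err := strconv.Atoi(s)
	if err != nil {
		return nil, fmt.Errorf("invalid volume mount group %q: %w", s, err)
	}
	if gid < 0 {
		return nil, fmt.Errorf("invalid volume mount group %q: must not be negative", s)
	}
	return &gid, nil
}

// writeProjection (hypothetical) writes the payload and, when an FSGroup is
// present, changes only the group ownership of the file.
func writeProjection(path string, p FileProjection) error {
	if err := os.WriteFile(path, p.Data, os.FileMode(p.Mode)); err != nil {
		return err
	}
	// WriteFile applies the umask, so set the requested mode explicitly.
	if err := os.Chmod(path, os.FileMode(p.Mode)); err != nil {
		return err
	}
	if p.FsGroup == nil {
		return nil
	}
	// A uid of -1 leaves the file owner unchanged.
	return os.Chown(path, -1, *p.FsGroup)
}

func main() {
	dir, err := os.MkdirTemp("", "fsgroup-sketch")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	// Use the process's own GID so the sketch runs without root.
	fsGroup, err := parseMountGroup(strconv.Itoa(os.Getgid()))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	path := filepath.Join(dir, "secret.txt")
	p := FileProjection{Data: []byte("s3cr3t"), Mode: 0o640, FsGroup: fsGroup}
	if err := writeProjection(path, p); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("wrote", path)
}
```

With mode 0640 and the group set to the pod's FSGroup, a non-root container whose process belongs to that group can read the file without it being world-readable.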
@mytreya-rh mytreya-rh force-pushed the allow_file_ownership branch from 740df77 to b4862da Compare April 3, 2026 08:12
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
kind/feature Categorizes issue or PR as related to a new feature.
ok-to-test Indicates a non-member PR verified by an org member that is safe to test.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/storage Categorizes an issue or PR as relevant to SIG Storage.
size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
triage/accepted Indicates an issue or PR is ready to be actively worked on.

Development

Successfully merging this pull request may close these issues.

Allow file ownership to be set for secrets

8 participants