Add Pod informer cache selector to reduce memory usage #4761
AndySung320 wants to merge 5 commits into ray-project:master
Conversation
Future-Outlier
left a comment
can we not introduce a new file?
```go
// CacheSelectors returns ByObject options that restrict which Job and Pod objects
// the manager's informers watch and store in the local cache.
func CacheSelectors() (map[client.Object]cache.ByObject, error) {
```
This function needs to be a public function because main.go (package main) accesses it as ray.CacheSelectors(), and cross-package access in Go requires an exported function.
Since it needs to be accessible from both main.go and suite_test.go, it should live in the ray package. We could place it in an existing file like raycluster_controller.go to avoid adding a new file, but it may affect readability since that file is focused on reconcile logic. WDYT?
This doesn't convince me.
You can write the test in main_test.go.
main_test.go only contains simple unit tests and has no envtest setup. The integration test needs access to the reconciler internals and the envtest environment already established in suite_test.go, which is only available within package ray in controllers/ray/.
The test validates cache behavior by using k8sClient (direct API access) and mgr.GetClient() (cache-backed client) together with the running reconciler. I think this setup is not feasible in main_test.go without duplicating the entire envtest infrastructure, which is both complex and unconventional.
The purpose of the test is to validate whether the informer cache only caches labeled Pods. That's why I placed it in raycluster_controller_test.go, alongside the other controller integration tests.
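A minimal sketch of that check, assuming the envtest fixtures from suite_test.go (ctx, namespace, k8sClient, and the manager mgr) plus the usual corev1, client, and apierrors (k8s.io/apimachinery/pkg/api/errors) imports; the spec description and polling intervals are illustrative:

```go
It("keeps the unrelated Pod out of the manager cache", func() {
	key := client.ObjectKey{Namespace: namespace, Name: unrelatedPodName}

	// The direct API client sees the Pod, so it exists in the cluster.
	Expect(k8sClient.Get(ctx, key, &corev1.Pod{})).Should(Succeed())

	// The cache-backed manager client should keep reporting NotFound, because
	// the informer label selector filters the Pod out of the local cache.
	Consistently(func() bool {
		err := mgr.GetClient().Get(ctx, key, &corev1.Pod{})
		return apierrors.IsNotFound(err)
	}, "3s", "250ms").Should(BeTrue())
})
```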
Making it a public function is okay, but we should avoid exposing it outside the module by moving it into the internal directory (ref).
And we also need a better name for it. CacheSelectors is too vague.
I would place this in ray-operator/internal/managercache/cache.go with the function renamed to CacheByObject(). I avoided naming it internal/cache to prevent import conflicts with sigs.k8s.io/controller-runtime/pkg/cache.
WDYT?
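For illustration, a sketch of what ray-operator/internal/managercache/cache.go could look like under that proposal; the label values come from the test discussed below (head, worker, redis-cleanup), while the Job scoping and exact selector construction are assumptions, not the PR's final code:

```go
package managercache

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CacheByObject returns per-type options that restrict which objects the
// manager's informers list/watch and keep in the local cache.
func CacheByObject() (map[client.Object]cache.ByObject, error) {
	// Only cache Pods whose ray.io/node-type label marks them as Ray node Pods.
	nodeType, err := labels.NewRequirement(
		"ray.io/node-type", selection.In, []string{"head", "worker", "redis-cleanup"})
	if err != nil {
		return nil, err
	}
	return map[client.Object]cache.ByObject{
		&corev1.Pod{}: {Label: labels.NewSelector().Add(*nodeType)},
		// The PR also scopes batchv1.Job watches; that selector is omitted here.
		&batchv1.Job{}: {},
	}, nil
}
```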
Thanks Rueian, golang master.
@codex review
Codex Review: Didn't find any major issues. 🚀
```go
Expect(k8sClient.Get(ctx, client.ObjectKey{Namespace: namespace, Name: unrelatedPodName}, unrelatedPod)).Should(Succeed(), "unrelated pod visible to API")
})

It("The manager cache should only include Ray node Pods (ray.io/node-type in head|worker|redis-cleanup), not the unrelated Pod", func() {
```
Should we include submitter pods?
I think submitter pods don't need to be included in the cache.
In K8sJobMode, the reconciler only checks the job status by r.Client.Get(job) in checkSubmitterAndUpdateStatusIfNeeded. It never directly queries the submitter pod.
In SidecarMode, there is no separate submitter pod; the submitter runs as a sidecar container inside the head pod, which is already cached via the ray.io/node-type=head label.
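A small standalone check of that reasoning, using the same kind of label selector sketched above; the submitter Pod's labels here (a job-name label only, no ray.io/node-type) are an assumed example:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/selection"
)

func main() {
	req, _ := labels.NewRequirement(
		"ray.io/node-type", selection.In, []string{"head", "worker", "redis-cleanup"})
	selector := labels.NewSelector().Add(*req)

	headPod := labels.Set{"ray.io/node-type": "head"}       // a Ray node Pod
	submitterPod := labels.Set{"job-name": "rayjob-sample"} // illustrative submitter labels

	fmt.Println(selector.Matches(headPod))      // true  -> kept in the cache
	fmt.Println(selector.Matches(submitterPod)) // false -> filtered out of the cache
}
```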
```go
)

// CacheByObject returns cache.ByObject entries that scope the manager client's Job and Pod watches.
func CacheByObject() (map[client.Object]cache.ByObject, error) {
```
Suggested change:
```diff
- func CacheByObject() (map[client.Object]cache.ByObject, error) {
+ func K8sControllerRuntimeCacheSelectors() (map[client.Object]cache.ByObject, error) {
```
Signed-off-by: AndySung320 <andysung0320@gmail.com>
@codex review
Codex Review: Didn't find any major issues. 🎉
Why are these changes needed?
In the current implementation, the KubeRay operator watches and caches all Pod resources in the cluster when watching all namespaces, while it only needs to manage Pods labeled with ray.io/node-type. Caching all Pods causes unnecessary memory consumption, especially in large-scale clusters with thousands of unrelated Pods.

This PR adds a Pod cache selector using the ray.io/node-type label, which is protected from user override in labelPod(), to filter the informer cache so it only includes KubeRay-managed Pods. This reduces the operator's memory footprint without affecting reconciliation behavior.
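A minimal sketch of how such a selector map could be wired into the manager in main.go, assuming the CacheByObject helper discussed in the review thread; the import path, logger setup, and error handling follow common kubebuilder scaffolding and are not the PR's exact code:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"

	"github.com/ray-project/kuberay/ray-operator/internal/managercache"
)

func main() {
	ctrl.SetLogger(zap.New())
	setupLog := ctrl.Log.WithName("setup")

	byObject, err := managercache.CacheByObject()
	if err != nil {
		setupLog.Error(err, "unable to build cache selectors")
		os.Exit(1)
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Informers only list/watch and store objects matching these
			// selectors, so unrelated Pods never enter the operator's memory.
			ByObject: byObject,
		},
	})
	if err != nil {
		setupLog.Error(err, "unable to create manager")
		os.Exit(1)
	}
	_ = mgr // controller registration and mgr.Start(ctrl.SetupSignalHandler()) omitted
}
```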
Related issue number
Closes #4625
Checks