Add Pod informer cache selector to reduce memory usage #4761

Open

AndySung320 wants to merge 5 commits into ray-project:master from AndySung320:pod-cache

Conversation

@AndySung320 (Contributor)

Why are these changes needed?

In the current implementation, the KubeRay operator watches and caches all Pod resources in the cluster when watching all namespaces, even though it only needs to manage Pods labeled with ray.io/node-type. Caching all Pods causes unnecessary memory consumption, especially in large-scale clusters with thousands of unrelated Pods.

This PR adds a Pod cache selector using the ray.io/node-type label, which is protected from user override in labelPod(), to filter the informer cache to only include KubeRay-managed Pods. This reduces the operator's memory footprint without affecting reconciliation behavior.
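For illustration, here is a minimal sketch of how a label-scoped Pod cache can be wired into a controller-runtime manager; the names and wiring below are illustrative, not the exact code in this PR:

package main

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/cache"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
    // Only Pods that carry the ray.io/node-type label (any value) are admitted
    // into the informer cache; unrelated Pods are never stored in memory.
    rayPodSelector, err := labels.Parse("ray.io/node-type")
    if err != nil {
        panic(err)
    }

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Cache: cache.Options{
            ByObject: map[client.Object]cache.ByObject{
                &corev1.Pod{}: {Label: rayPodSelector},
            },
        },
    })
    if err != nil {
        panic(err)
    }
    _ = mgr // reconcilers registered with this manager read Pods from the filtered cache
}

Because the filter is applied at the informer level, Pods without the label never reach the local cache, which is where the memory savings come from.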

Related issue number

Closes #4625

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier (Member) left a comment

can we not introduce a new file?


// CacheSelectors returns ByObject options that restrict which Job and Pod objects
// the manager's informers watch and store in the local cache.
func CacheSelectors() (map[client.Object]cache.ByObject, error) {
Member:

why public function?

Contributor Author:

This function needs to be public because main.go (package main) accesses it as ray.CacheSelectors(), and cross-package access in Go requires an exported function.
Since it needs to be accessible from both main.go and suite_test.go, it should live in the ray package. We could place it in an existing file like raycluster_controller.go to avoid adding a new file, but it may affect readability since that file is focused on reconcile logic. WDYT?

Member:

This doesn't convince me.
You can write the test in main_test.go.

Contributor Author:

main_test.go only contains simple unit tests and has no envtest setup. The integration test needs access to the reconciler internals and the envtest environment already established in suite_test.go, which is only available within package ray in controllers/ray/.
The test validates cache behavior by using k8sClient (direct API access) and mgr.GetClient() (cache-backed client) together with the running reconciler. I think this setup is not feasible in main_test.go without duplicating the entire envtest infrastructure, which is both complex and unconventional.

The purpose of the test is to validate whether the informer cache only caches labeled Pods. That's why I placed it in raycluster_controller_test.go, alongside the other controller integration tests.
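To make the check concrete, a rough sketch of that assertion, assuming the envtest globals from suite_test.go (ctx, k8sClient, the manager mgr) and the namespace/unrelatedPodName variables used in the test setup:

It("only caches Pods labeled with ray.io/node-type", func() {
    key := client.ObjectKey{Namespace: namespace, Name: unrelatedPodName}

    // The direct API client sees the unrelated Pod...
    Expect(k8sClient.Get(ctx, key, &corev1.Pod{})).To(Succeed())

    // ...but the manager's cache-backed client should never surface it,
    // because the informer filters Pods on the ray.io/node-type label.
    Consistently(func() error {
        return mgr.GetClient().Get(ctx, key, &corev1.Pod{})
    }, "3s", "250ms").ShouldNot(Succeed())
})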

Collaborator:

Making it a public function is okay, but we should avoid exposing it outside the module by moving it into the internal directory (ref).

And we also need a better name for it. CacheSelectors is too vague.

Contributor Author:

I would place this in ray-operator/internal/managercache/cache.go with the function renamed to CacheByObject(). I avoided naming it internal/cache to prevent import conflicts with sigs.k8s.io/controller-runtime/pkg/cache.
WDYT?
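For reference, a rough sketch of what ray-operator/internal/managercache/cache.go could look like; the module path, label-key handling, and exact ByObject entries here are assumptions rather than the final code:

// ray-operator/internal/managercache/cache.go (hypothetical sketch)
package managercache

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    "sigs.k8s.io/controller-runtime/pkg/cache" // package name "cache" does not clash with "managercache"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// CacheByObject scopes the manager's Pod informer to Pods that carry the
// ray.io/node-type label; objects not listed here keep the default, unfiltered cache.
func CacheByObject() (map[client.Object]cache.ByObject, error) {
    sel, err := labels.Parse("ray.io/node-type")
    if err != nil {
        return nil, err
    }
    return map[client.Object]cache.ByObject{
        &corev1.Pod{}: {Label: sel},
    }, nil
}

main.go would then pass the returned map into cache.Options{ByObject: ...} on the manager options, as in the sketch near the top of this thread.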

Member:

tks Rueian, golang master.

@Future-Outlier (Member)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🚀

Expect(k8sClient.Get(ctx, client.ObjectKey{Namespace: namespace, Name: unrelatedPodName}, unrelatedPod)).Should(Succeed(), "unrelated pod visible to API")
})

It("The manager cache should only include Ray node Pods (ray.io/node-type in head|worker|redis-cleanup), not the unrelated Pod", func() {
Collaborator:

Should we include submitter pods?

Contributor Author:

I think submitter pods don't need to be included in the cache.
In K8sJobMode, the reconciler only checks the job status by r.Client.Get(job) in checkSubmitterAndUpdateStatusIfNeeded. It never directly queries the submitter pod.
In SidecarMode, there is no separate submitter pod; the submitter runs as a sidecar container inside the head pod, which is already cached via the ray.io/node-type=head label.

)

// CacheByObject returns cache.ByObject entries that scope the manager client's Job and Pod watches.
func CacheByObject() (map[client.Object]cache.ByObject, error) {
Collaborator:

Suggested change:
-func CacheByObject() (map[client.Object]cache.ByObject, error) {
+func K8sControllerRuntimeCacheSelectors() (map[client.Object]cache.ByObject, error) {

@Future-Outlier (Member)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉


Development

Successfully merging this pull request may close these issues.

[perf] [raycluster] Add cache selector to limit Pod caching to KubeRay-managed Pods only
