Add Pod informer cache selector to reduce memory usage #4761

Open

AndySung320 wants to merge 5 commits into ray-project:master from AndySung320:pod-cache

Conversation

@AndySung320 (Contributor)

Why are these changes needed?

In the current implementation, the KubeRay operator watches and caches all Pod resources in the cluster when watching all namespaces, even though it only needs to manage Pods labeled with ray.io/node-type. Caching all Pods causes unnecessary memory consumption, especially in large-scale clusters with thousands of unrelated Pods.

This PR adds a Pod cache selector using the ray.io/node-type label, which is protected from user override in labelPod(), to filter the informer cache to only include KubeRay-managed Pods. This reduces the operator's memory footprint without affecting reconciliation behavior.
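For illustration, here is a minimal sketch of how a label-scoped Pod cache can be wired into a controller-runtime manager; the names and wiring below are illustrative, not the exact code in this PR:

package main

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/cache"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
    // Only Pods that carry the ray.io/node-type label (any value) are admitted
    // into the informer cache; unrelated Pods are never stored in memory.
    rayPodSelector, err := labels.Parse("ray.io/node-type")
    if err != nil {
        panic(err)
    }

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Cache: cache.Options{
            ByObject: map[client.Object]cache.ByObject{
                &corev1.Pod{}: {Label: rayPodSelector},
            },
        },
    })
    if err != nil {
        panic(err)
    }
    _ = mgr // reconcilers registered with this manager read Pods from the filtered cache
}

Because the filter is applied at the informer level, Pods without the label never reach the local cache, which is where the memory savings come from.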

Related issue number

Closes #4625

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@Future-Outlier (Member) left a comment

can we not introduce a new file?


// CacheSelectors returns ByObject options that restrict which Job and Pod objects
// the manager's informers watch and store in the local cache.
func CacheSelectors() (map[client.Object]cache.ByObject, error) {
Member:

why public function?

Contributor Author:

This function needs to be public because main.go (package main) accesses it as ray.CacheSelectors(), and cross-package access in Go requires an exported function.
Since it needs to be accessible from both main.go and suite_test.go, it should live in the ray package. We could place it in an existing file like raycluster_controller.go to avoid adding a new file, but it may affect readability since that file is focused on reconcile logic. WDYT?

Member:

This doesn't convince me.
You can write the test in main_test.go.

Contributor Author:

main_test.go only contains simple unit tests and has no envtest setup. The integration test needs access to the reconciler internals and the envtest environment already established in suite_test.go, which is only available within package ray in controllers/ray/.
The test validates cache behavior by using k8sClient (direct API access) and mgr.GetClient() (cache-backed client) together with the running reconciler. I think this setup is not feasible in main_test.go without duplicating the entire envtest infrastructure, which is both complex and unconventional.

The purpose of the test is to validate whether the informer cache only caches labeled Pods. That's why I placed it in raycluster_controller_test.go, alongside the other controller integration tests.
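To make the check concrete, a rough sketch of that assertion, assuming the envtest globals from suite_test.go (ctx, k8sClient, the manager mgr) and the namespace/unrelatedPodName variables used in the test setup:

It("only caches Pods labeled with ray.io/node-type", func() {
    key := client.ObjectKey{Namespace: namespace, Name: unrelatedPodName}

    // The direct API client sees the unrelated Pod...
    Expect(k8sClient.Get(ctx, key, &corev1.Pod{})).To(Succeed())

    // ...but the manager's cache-backed client should never surface it,
    // because the informer filters Pods on the ray.io/node-type label.
    Consistently(func() error {
        return mgr.GetClient().Get(ctx, key, &corev1.Pod{})
    }, "3s", "250ms").ShouldNot(Succeed())
})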

Collaborator:

Making it a public function is okay, but we should avoid exposing it outside the module by moving it into the internal directory (ref).

And we also need a better name for it. CacheSelectors is too vague.

Contributor Author:

I would place this in ray-operator/internal/managercache/cache.go with the function renamed to CacheByObject(). I avoided naming it internal/cache to prevent import conflicts with sigs.k8s.io/controller-runtime/pkg/cache.
WDYT?
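For reference, a rough sketch of what ray-operator/internal/managercache/cache.go could look like; the module path, label-key handling, and exact ByObject entries here are assumptions rather than the final code:

// ray-operator/internal/managercache/cache.go (hypothetical sketch)
package managercache

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/labels"
    "sigs.k8s.io/controller-runtime/pkg/cache" // package name "cache" does not clash with "managercache"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// CacheByObject scopes the manager's Pod informer to Pods that carry the
// ray.io/node-type label; objects not listed here keep the default, unfiltered cache.
func CacheByObject() (map[client.Object]cache.ByObject, error) {
    sel, err := labels.Parse("ray.io/node-type")
    if err != nil {
        return nil, err
    }
    return map[client.Object]cache.ByObject{
        &corev1.Pod{}: {Label: sel},
    }, nil
}

main.go would then pass the returned map into cache.Options{ByObject: ...} on the manager options, as in the sketch near the top of this thread.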

Member:

tks Rueian, golang master.

@Future-Outlier (Member)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🚀

Expect(k8sClient.Get(ctx, client.ObjectKey{Namespace: namespace, Name: unrelatedPodName}, unrelatedPod)).Should(Succeed(), "unrelated pod visible to API")
})

It("The manager cache should only include Ray node Pods (ray.io/node-type in head|worker|redis-cleanup), not the unrelated Pod", func() {
Collaborator:

Should we include submitter pods?

Contributor Author:

I think submitter pods don't need to be included in the cache.
In K8sJobMode, the reconciler only checks the job status by r.Client.Get(job) in checkSubmitterAndUpdateStatusIfNeeded. It never directly queries the submitter pod.
In SidecarMode, there is no separate submitter pod; the submitter runs as a sidecar container inside the head pod, which is already cached via the ray.io/node-type=head label.

)

// CacheByObject returns cache.ByObject entries that scope the manager client's Job and Pod watches.
func CacheByObject() (map[client.Object]cache.ByObject, error) {
Collaborator:

Suggested change:
-func CacheByObject() (map[client.Object]cache.ByObject, error) {
+func K8sControllerRuntimeCacheSelectors() (map[client.Object]cache.ByObject, error) {

@Future-Outlier (Member)

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉


Development

Successfully merging this pull request may close these issues.

[perf] [raycluster] Add cache selector to limit Pod caching to KubeRay-managed Pods only
