⚠️ fix goroutine leak in cache#13487

Open
sivchari wants to merge 1 commit into kubernetes-sigs:main from sivchari:fix-goroutine-leak

@sivchari
Member

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign sbueringer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from AndiDog March 19, 2026 04:40
@k8s-ci-robot k8s-ci-robot added the do-not-merge/needs-area label (PR is missing an area label) Mar 19, 2026
@k8s-ci-robot k8s-ci-robot requested a review from JoelSpeed March 19, 2026 04:40
@k8s-ci-robot
Contributor

This PR is currently missing an area label, which is used to identify the modified component when generating release notes.

Area labels can be added by org members by writing /area ${COMPONENT} in a comment

Please see the labels list for possible areas.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) labels Mar 19, 2026
@sivchari sivchari force-pushed the fix-goroutine-leak branch 2 times, most recently from 45abe5e to 50b15e6 Compare March 19, 2026 12:28
Signed-off-by: sivchari <shibuuuu5@gmail.com>
@sivchari sivchari force-pushed the fix-goroutine-leak branch from 50b15e6 to 568b11e Compare March 19, 2026 12:41
@k8s-ci-robot
Contributor

k8s-ci-robot commented Mar 19, 2026

@sivchari: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-apidiff-main | 568b11e | link | false | /test pull-cluster-api-apidiff-main |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@zjs
zjs commented Mar 22, 2026

The proposed fix seems to be correct, but I wanted to raise a concern about the approach: this changes the signatures of several public APIs (cache.New, Builder.Complete, Builder.Build, internalruntimeclient.New), which is why pull-cluster-api-apidiff-main is failing. It also requires threading ctx through ~45 call sites that don't otherwise need it.

An alternative that avoids all of the public API changes would be to add a Close() method to the Cache interface instead. The New signature stays the same, and callers that own the cache's lifecycle call Close() when they're done:

type Cache[E Entry] interface {
	Add(entry E)
	Has(key string) (E, bool)
	Len() int
	DeleteAll()
	Close()
}

func New[E Entry](ttl time.Duration) Cache[E] {
	done := make(chan struct{})
	r := &cache[E]{
		Store: kcache.NewTTLStore(func(obj any) (string, error) {
			return obj.(E).Key(), nil
		}, ttl),
		done: done,
	}
	go func() {
		for {
			r.List()
			select {
			case <-done:
				return
			case <-time.After(expirationInterval):
			}
		}
	}()
	return r
}

type cache[E Entry] struct {
	kcache.Store
	done   chan struct{}
	closed sync.Once
}

func (r *cache[E]) Close() {
	r.closed.Do(func() { close(r.done) })
}
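
To make the lifecycle above easier to check in isolation, here is a minimal standalone sketch of the same done-channel plus sync.Once pattern. It is an illustration under assumptions, not the project's code: expiringSet, sweepEvery, stopped, and closeStopsSweeper are hypothetical names, a plain map stands in for the kcache TTL store, and a time.Ticker replaces the time.After loop.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// expiringSet is a hypothetical stand-in for the cache in the comment above.
// Assumption: the real type wraps kcache.NewTTLStore instead of a map.
type expiringSet struct {
	mu      sync.Mutex
	entries map[string]time.Time // key -> expiry time
	ttl     time.Duration

	done    chan struct{} // closed by Close to stop the sweeper
	stopped chan struct{} // closed by the sweeper when it returns (demo only)
	closed  sync.Once     // makes Close safe to call more than once
}

func newExpiringSet(ttl, sweepEvery time.Duration) *expiringSet {
	s := &expiringSet{
		entries: make(map[string]time.Time),
		ttl:     ttl,
		done:    make(chan struct{}),
		stopped: make(chan struct{}),
	}
	go func() {
		defer close(s.stopped)
		ticker := time.NewTicker(sweepEvery)
		defer ticker.Stop()
		for {
			select {
			case <-s.done:
				return // Close was called: end the goroutine
			case now := <-ticker.C:
				s.sweep(now)
			}
		}
	}()
	return s
}

// Add stores key with an expiry of now + ttl.
func (s *expiringSet) Add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[key] = time.Now().Add(s.ttl)
}

// sweep deletes every entry whose expiry is at or before now.
func (s *expiringSet) sweep(now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for key, expiry := range s.entries {
		if !now.Before(expiry) {
			delete(s.entries, key)
		}
	}
}

// Close stops the sweeper goroutine. Repeated calls are no-ops.
func (s *expiringSet) Close() {
	s.closed.Do(func() { close(s.done) })
}

// closeStopsSweeper reports whether the sweeper exits within one second of Close.
func closeStopsSweeper() bool {
	s := newExpiringSet(time.Minute, 10*time.Millisecond)
	s.Add("a")
	s.Close()
	s.Close() // second call must not panic on an already-closed channel
	select {
	case <-s.stopped:
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("sweeper stopped after Close:", closeStopsSweeper())
}
```

The stopped channel exists only so the demo can observe the goroutine exiting; a production cache would not necessarily expose it.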

For the Builder, which creates a reconcileCache internally, it can register a Runnable with the manager to close it on shutdown:

func (blder *Builder) Build(r reconcile.TypedReconciler[reconcile.Request]) (Controller, error) {
	// ... existing setup ...

	reconcileCache := cache.New[reconcileCacheEntry](cache.DefaultTTL)

	// Clean up the cache goroutine when the manager stops.
	_ = blder.mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done()
		reconcileCache.Close()
		return nil
	}))

	// ... rest of Build unchanged ...
}

For controllers that create caches in SetupWithManager, the same pattern works, so there is no need to change the SetupWithManager or Build/Complete signatures:

func (r *Reconciler) SetupWithManager(ctx context.Context, mgr ctrl.Manager, options controller.Options) error {
	// ... existing builder chain ...
	c, err := /* ... */.Build(r)
	if err != nil {
		return errors.Wrap(err, "failed setting up with a controller manager")
	}

	r.hookCache = cache.New[cache.HookEntry](cache.HookCacheDefaultTTL)

	// Clean up when the manager stops.
	_ = mgr.Add(manager.RunnableFunc(func(ctx context.Context) error {
		<-ctx.Done()
		r.hookCache.Close()
		return nil
	}))

	// ... rest unchanged ...
}

The same approach could work for internalruntimeclient.New as well.
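
Since the shutdown hook is identical in each case, it could plausibly be factored into a small helper. The sketch below is a standalone illustration that uses only the standard library: closer, closeOnShutdown, and fakeCache are hypothetical names, and the returned function merely has the func(context.Context) error shape that the snippets above pass to manager.RunnableFunc.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// closer is a hypothetical minimal interface: anything with a Close method.
type closer interface {
	Close()
}

// closeOnShutdown returns a function that blocks until ctx is cancelled and
// then closes c. Assumption: in real code the result would be wrapped in
// manager.RunnableFunc and registered with mgr.Add, as in the snippets above.
func closeOnShutdown(c closer) func(context.Context) error {
	return func(ctx context.Context) error {
		<-ctx.Done()
		c.Close()
		return nil
	}
}

// fakeCache records whether Close was called (demo only).
type fakeCache struct {
	closed atomic.Bool
}

func (f *fakeCache) Close() { f.closed.Store(true) }

// closedAfterCancel simulates a manager shutdown by cancelling the context
// and reports whether the hook returned nil after closing the cache.
func closedAfterCancel() bool {
	ctx, cancel := context.WithCancel(context.Background())
	fc := &fakeCache{}
	errCh := make(chan error, 1)
	go func() { errCh <- closeOnShutdown(fc)(ctx) }()

	cancel() // stands in for the manager stopping
	select {
	case err := <-errCh:
		return err == nil && fc.closed.Load()
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("cache closed on shutdown:", closedAfterCancel())
}
```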

One downside is that callers need to remember to call Close(), but that's the same trade-off as io.Closer everywhere in the standard library, and the mgr.Add(RunnableFunc(...)) pattern makes it straightforward.

It is also potentially breaking for implementors of Cache. However, there are likely few external implementors of this interface, and even for them this is less disruptive than changing the cache.New, Builder.Build, and Builder.Complete signatures.

But maybe there are reasons to prefer the current PR to this approach. What do you think?

@sivchari sivchari changed the title from "🐛 fix goroutine leak in cache" to "⚠️ fix goroutine leak in cache" Mar 22, 2026
@sivchari
Member Author

Thanks for the detailed suggestion.

The apidiff failure is expected, as the API change is intentional. Regarding the alternative Close() approach, adding Close() to the Cache interface would also be a breaking change for any external implementors, so neither approach avoids breaking compatibility on that front. Furthermore, with the ctx-based approach, goroutine cleanup happens automatically via context cancellation. In contrast, with Close(), every caller must remember to call it or wire up mgr.Add, and forgetting to do so would silently re-introduce the leak. For these reasons, I believe tying the goroutine lifecycle to the context is the safer design.
