Fix unbounded memory growth in exemplar storage #7937
raftar200197 wants to merge 4 commits into open-telemetry:main
This commit addresses a memory leak where exemplar storage slices grow unbounded under high metric cardinality and never shrink.

The reset() function now implements a capacity shrinking strategy:
- Grows when capacity is insufficient
- Reuses when capacity is reasonable (≤2× required)
- Shrinks when capacity is excessive (>2× required)

This prevents memory accumulation while maintaining performance by avoiding excessive reallocations.

Under high cardinality scenarios, exemplar slices would grow to accommodate spikes but never release memory after the spike ended. With many metric series, this multiplicative effect led to continuous heap growth and potential OOM conditions.

The fix uses a 2× threshold as a balance between memory efficiency and performance, avoiding frequent reallocations while preventing unbounded growth.

Changes:
- Enhanced reset() in sdk/metric/internal/aggregate/aggregate.go
- Enhanced reset() in sdk/metric/exemplar/storage.go
- Added comprehensive tests for capacity management
- Updated CHANGELOG.md
Make comments more concise as suggested by maintainers.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #7937   +/-   ##
=====================================
  Coverage   81.7%   81.7%
=====================================
  Files        304     304
  Lines      23283   23289    +6
=====================================
+ Hits       19032   19038    +6
  Misses      3864    3864
  Partials     387     387
A few questions:
The capacity of the exemplar reservoir is determined by the reservoir provider. There shouldn't be "spikes" in the sizes of exemplar reservoirs. Some reservoirs may be larger than others (e.g. if you have fixed-bucket histograms with a lot of buckets), so it's possible that is playing a part here as well. The only downside to this change is more allocations when we downsize after collection. That may cause CPU spikes right after collection if it puts pressure on the garbage collector.
Summary
This PR fixes a critical memory leak in the OpenTelemetry Go SDK's exemplar storage mechanism. Under high metric cardinality scenarios, exemplar-related slices grow unbounded and never shrink, leading to sustained memory pressure and potential OOM conditions.
Problem Description
Root Cause
The reset() function in both sdk/metric/internal/aggregate/aggregate.go and sdk/metric/exemplar/storage.go manages slice capacity for exemplar storage. The current implementation grows slices when needed but never shrinks them.
Issue: Once a slice grows to accommodate a spike in exemplar volume, it never shrinks back, even when subsequent collections require much less capacity.
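The grow-only behavior described here can be illustrated with a minimal sketch. The function name `resetGrowOnly` and the sizes are assumptions made for this example; it is not a copy of the SDK source.

```go
package main

import "fmt"

// resetGrowOnly is a hypothetical stand-in for a grow-only reset: it
// allocates a new slice when capacity is insufficient and otherwise
// reslices, so the capacity of the backing array never decreases.
func resetGrowOnly[T any](s []T, length, capacity int) []T {
	if cap(s) < capacity {
		return make([]T, length, capacity)
	}
	return s[:length]
}

func main() {
	var s []int64
	s = resetGrowOnly(s, 0, 8) // steady-state requirement
	fmt.Println(cap(s))        // 8

	s = resetGrowOnly(s, 0, 1024) // one large requirement
	fmt.Println(cap(s))           // 1024

	s = resetGrowOnly(s, 0, 8) // back to the steady-state requirement
	fmt.Println(cap(s))        // 1024: the larger backing array is kept
}
```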
Impact
Under high cardinality scenarios:
Evidence
From production heap profiles:
- reset[int64]: 17.11 MB → 25.67 MB
- reset[float64]: 14.09 MB → 18.12 MB
Solution
Implement a capacity shrinking strategy in the reset() function:
This approach balances:
Implementation
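The grow, reuse, and shrink rules stated in this PR could be sketched roughly as follows. This is an illustration under assumptions, not the code from the diff: the name `resetBounded` is hypothetical, and the sketch assumes `length <= capacity`.

```go
package main

import "fmt"

// resetBounded is a hypothetical sketch of the described policy.
// It assumes length <= capacity; make panics otherwise.
func resetBounded[T any](s []T, length, capacity int) []T {
	switch {
	case cap(s) < capacity:
		// Grow: the existing backing array is too small.
		return make([]T, length, capacity)
	case cap(s) > 2*capacity:
		// Shrink: the existing backing array is more than 2x the requirement.
		return make([]T, length, capacity)
	default:
		// Reuse: capacity is within [capacity, 2*capacity].
		return s[:length]
	}
}

func main() {
	var s []int64
	s = resetBounded(s, 0, 1024)
	fmt.Println(cap(s)) // 1024 (grown)

	s = resetBounded(s, 0, 600)
	fmt.Println(cap(s)) // 1024 (reused, since 1024 <= 2*600)

	s = resetBounded(s, 0, 8)
	fmt.Println(cap(s)) // 8 (shrunk, since 1024 > 2*8)
}
```

The grow and shrink branches perform the same allocation; they are kept separate here only to mirror the three cases named in the description.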
Changes Made
Modified Files
- sdk/metric/internal/aggregate/aggregate.go: reset() function with capacity shrinking logic
- sdk/metric/exemplar/storage.go: reset() function with capacity shrinking logic
- CHANGELOG.md
New Test Files
- sdk/metric/internal/aggregate/aggregate_test.go: TestReset() with 8 comprehensive test cases
- sdk/metric/exemplar/storage_test.go (new file): TestReset() with 9 comprehensive test cases, including a memory leak simulation
Testing
Test Coverage
All test cases pass successfully:
Regression Testing
- sdk/metric/internal/aggregate: pass
- sdk/metric/exemplar: pass
- make precommit passes (formatting, linting, validation)
Performance Considerations
Memory Impact
Before: Unbounded growth under high cardinality
After: Controlled growth with automatic shrinking
CPU Impact
Backward Compatibility
✅ Fully backward compatible
Verification
To verify this fix in production:
- reset[int64] and reset[float64] allocations remain bounded
Additional Notes
The 2× capacity multiplier threshold was chosen as a balance between: