
Fix unbounded memory growth in exemplar storage#7937

Open
raftar200197 wants to merge 4 commits into open-telemetry:main from raftar200197:fix/exemplar-memory-leak

Conversation


@raftar200197 raftar200197 commented Feb 20, 2026


Summary

This PR fixes a critical memory leak in the OpenTelemetry Go SDK's exemplar storage mechanism. Under high metric cardinality scenarios, exemplar-related slices grow unbounded and never shrink, leading to sustained memory pressure and potential OOM conditions.

Problem Description

Root Cause

The reset() function in both sdk/metric/internal/aggregate/aggregate.go and sdk/metric/exemplar/storage.go manages slice capacity for exemplar storage. The current implementation grows slices when needed but never shrinks them:

func reset[T any](s []T, length, capacity int) []T {
    if cap(s) < capacity {
        return make([]T, length, capacity)
    }
    return s[:length]  // ← Never shrinks, retains capacity forever
}

Issue: Once a slice grows to accommodate a spike in exemplar volume, it never shrinks back, even when subsequent collections require much less capacity.
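This retention behavior follows directly from Go's slicing semantics: re-slicing with `s[:length]` changes only the length, never the capacity, so the backing array from the largest spike is pinned for the lifetime of the slice. A minimal standalone sketch (not SDK code) demonstrating the effect:

```go
package main

import "fmt"

// spikeAndReset grows a slice to simulate an exemplar spike, then
// "resets" it the way the current code does (re-slice to zero length).
// It returns the capacity after the spike and after the reset.
func spikeAndReset() (grown, afterReset int) {
	s := make([]int64, 0, 8)
	// Simulate a spike: append enough elements to force growth.
	for i := 0; i < 1000; i++ {
		s = append(s, int64(i))
	}
	grown = cap(s)

	// Re-slicing to zero length keeps the spike-sized backing array.
	s = s[:0]
	afterReset = cap(s)
	return grown, afterReset
}

func main() {
	grown, afterReset := spikeAndReset()
	// The capacity acquired during the spike is fully retained.
	fmt.Println(afterReset == grown) // true
}
```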

Impact

Under high cardinality scenarios:

  • Memory Growth: Slice capacity increases during spikes but never decreases
  • Multiplicative Effect: Many metric series × independently growing slices = significant memory growth
  • No Recovery: Memory remains allocated even after metric flush intervals
  • Production Risk: Can lead to OOM crashes and pod restarts under sustained load

Evidence

From production heap profiles:

  • reset[int64]: 17.11 MB → 25.67 MB
  • reset[float64]: 14.09 MB → 18.12 MB
  • Memory growth correlated with metric cardinality
  • Memory did not decrease after metric flush intervals

Solution

Implement a capacity shrinking strategy in the reset() function:

  1. Grow when needed: Allocate new slice if capacity is insufficient
  2. Reuse when reasonable: Keep existing slice if capacity is ≤ 2× required
  3. Shrink when excessive: Allocate new slice if capacity > 2× required

This approach balances:

  • Memory efficiency: Prevents unbounded growth
  • Performance: Avoids excessive reallocations for minor fluctuations
  • Simplicity: Uses a fixed 2× threshold that's easy to understand

Implementation

func reset[T any](s []T, length, capacity int) []T {
    if cap(s) < capacity {
        return make([]T, length, capacity)
    }
    // If the current capacity is more than 2x what we need, shrink it
    const maxCapacityMultiplier = 2
    if cap(s) > capacity*maxCapacityMultiplier && capacity > 0 {
        return make([]T, length, capacity)
    }
    return s[:length]
}
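To see all three branches in action, here is a self-contained sketch that reproduces the proposed `reset()` and walks one slice through a grow, a reuse, and a shrink (the surrounding `main` is illustrative, not part of the PR):

```go
package main

import "fmt"

// reset mirrors the proposed implementation from this PR.
func reset[T any](s []T, length, capacity int) []T {
	if cap(s) < capacity {
		return make([]T, length, capacity)
	}
	// If the current capacity is more than 2x what we need, shrink it.
	const maxCapacityMultiplier = 2
	if cap(s) > capacity*maxCapacityMultiplier && capacity > 0 {
		return make([]T, length, capacity)
	}
	return s[:length]
}

func main() {
	s := make([]int64, 0, 4)

	s = reset(s, 0, 10) // grow: cap 4 < 10, reallocate
	fmt.Println(cap(s)) // 10

	s = reset(s, 0, 8)  // reuse: 10 ≤ 2×8, keep existing slice
	fmt.Println(cap(s)) // 10

	s = reset(s, 0, 4)  // shrink: 10 > 2×4, reallocate smaller
	fmt.Println(cap(s)) // 4
}
```

Note the `capacity > 0` guard: a zero required capacity would otherwise make the `>` comparison shrink every non-empty slice down to nothing on each call.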

Changes Made

Modified Files

  1. sdk/metric/internal/aggregate/aggregate.go

    • Enhanced reset() function with capacity shrinking logic
    • Added comprehensive documentation
  2. sdk/metric/exemplar/storage.go

    • Enhanced reset() function with capacity shrinking logic
    • Added comprehensive documentation
  3. CHANGELOG.md

    • Added entry under "Fixed" section

New Test Files

  1. sdk/metric/internal/aggregate/aggregate_test.go

    • Added TestReset() with 8 comprehensive test cases
  2. sdk/metric/exemplar/storage_test.go (new file)

    • Added TestReset() with 9 comprehensive test cases including memory leak simulation

Testing

Test Coverage

All test cases pass successfully:

✓ TestReset/AllocatesWhenCapacityTooSmall
✓ TestReset/ReusesSliceWhenCapacitySufficient
✓ TestReset/ShrinksWhenCapacityExcessive
✓ TestReset/DoesNotShrinkWhenCapacityReasonable
✓ TestReset/ShrinksWhenCapacityJustAboveThreshold
✓ TestReset/HandlesZeroCapacity
✓ TestReset/PreservesDataWhenShrinking
✓ TestReset/SimulatesHighCardinalityScenario
✓ TestReset/PreventsMemoryLeakOverMultipleFlushes
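As an illustration of what the multi-flush case exercises (a sketch, not the actual test file — the scenario shape is assumed from the test name), a spike followed by steady-state flushes should settle back to the steady-state capacity:

```go
package main

import "fmt"

// reset is a copy of the proposed implementation, included so this
// sketch is self-contained.
func reset[T any](s []T, length, capacity int) []T {
	if cap(s) < capacity {
		return make([]T, length, capacity)
	}
	const maxCapacityMultiplier = 2
	if cap(s) > capacity*maxCapacityMultiplier && capacity > 0 {
		return make([]T, length, capacity)
	}
	return s[:length]
}

// simulateFlushes runs one spike-sized collection followed by n
// steady-state collections and returns the final capacity.
func simulateFlushes(spike, steady, n int) int {
	var s []float64
	s = reset(s, 0, spike)
	for i := 0; i < n; i++ {
		s = reset(s, 0, steady)
	}
	return cap(s)
}

func main() {
	// With the fix, capacity returns to the steady-state size after
	// the spike instead of staying pinned at the spike size.
	fmt.Println(simulateFlushes(500, 10, 5)) // 10
}
```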

Regression Testing

  • ✅ All existing tests in sdk/metric/internal/aggregate pass
  • ✅ All existing tests in sdk/metric/exemplar pass
  • make precommit passes (formatting, linting, validation)

Performance Considerations

Memory Impact

Before: Unbounded growth under high cardinality

Flush 1 → capacity 10
Flush 2 → capacity 50 (spike)
Flush 3 → capacity 50 (retained, wasted memory)
Flush 4 → capacity 50 (retained, wasted memory)

After: Controlled growth with automatic shrinking

Flush 1 → capacity 10
Flush 2 → capacity 50 (spike)
Flush 3 → capacity 10 (shrunk back)
Flush 4 → capacity 10 (stable)

CPU Impact

  • Minimal overhead: Only reallocates when capacity > 2× required
  • Amortized efficiency: Tolerates 2× overhead to avoid frequent reallocations
  • No hot path impact: Only affects collection/flush path, not measurement recording

Backward Compatibility

Fully backward compatible

  • No API changes
  • No behavioral changes for normal operations
  • Only affects internal memory management
  • Existing code continues to work unchanged

Verification

To verify this fix in production:

  1. Deploy to environment with high metric cardinality
  2. Monitor heap memory usage over multiple flush cycles
  3. Verify memory stabilizes instead of growing unbounded
  4. Check that reset[int64] and reset[float64] allocations remain bounded
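For step 2, one lightweight way to watch the heap between flush cycles without a full profiler is `runtime.ReadMemStats`. The sketch below is an assumption about how a reader might instrument their service; it is not part of the PR:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAlloc returns the live heap in bytes after forcing a GC, a rough
// stand-in for the heap-profile numbers cited in the Evidence section.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	before := heapAlloc()
	// ... run several collection/flush cycles against the SDK here ...
	after := heapAlloc()
	// With the fix, this delta should stabilize across repeated cycles
	// rather than growing with every spike.
	fmt.Printf("heap delta: %d bytes\n", int64(after)-int64(before))
}
```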

Additional Notes

The 2× capacity multiplier threshold was chosen as a balance between:

  • Memory efficiency: Prevents significant waste
  • Performance: Avoids excessive reallocations
  • Simplicity: Easy to understand and maintain

This commit addresses a memory leak where exemplar storage slices
grow unbounded under high metric cardinality and never shrink.

The reset() function now implements a capacity shrinking strategy:
- Grows when capacity is insufficient
- Reuses when capacity is reasonable (≤2× required)
- Shrinks when capacity is excessive (>2× required)

This prevents memory accumulation while maintaining performance
by avoiding excessive reallocations.

Under high cardinality scenarios, exemplar slices would grow to
accommodate spikes but never release memory after the spike ended.
With many metric series, this multiplicative effect led to
continuous heap growth and potential OOM conditions.

The fix uses a 2× threshold as a balance between memory efficiency
and performance, avoiding frequent reallocations while preventing
unbounded growth.

Changes:
- Enhanced reset() in sdk/metric/internal/aggregate/aggregate.go
- Enhanced reset() in sdk/metric/exemplar/storage.go
- Added comprehensive tests for capacity management
- Updated CHANGELOG.md

linux-foundation-easycla bot commented Feb 20, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: raftar200197 / name: raftar200197 (a880c28)


codecov bot commented Feb 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.7%. Comparing base (64f28b0) to head (96ae6d6).


@@          Coverage Diff          @@
##            main   #7937   +/-   ##
=====================================
  Coverage   81.7%   81.7%           
=====================================
  Files        304     304           
  Lines      23283   23289    +6     
=====================================
+ Hits       19032   19038    +6     
  Misses      3864    3864           
  Partials     387     387           
Files with missing lines Coverage Δ
sdk/metric/exemplar/storage.go 100.0% <100.0%> (ø)
sdk/metric/internal/aggregate/aggregate.go 100.0% <100.0%> (ø)

... and 3 files with indirect coverage changes


@dashpole
Contributor

> Evidence
> From production heap profiles:
>
> reset[int64]: 17.11 MB → 25.67 MB
> reset[float64]: 14.09 MB → 18.12 MB
> Memory growth correlated with metric cardinality
> Memory did not decrease after metric flush intervals

A few questions:

  • Can you share more about your production workload? In particular, what are your sampling interval, QPS (or an estimate of the number of metric observations), and export interval?
  • Are you exporting deltas or cumulatives?
  • Is there anything else interesting that you are doing that might be related?
  • How are you measuring memory usage?

The capacity of the exemplar reservoir is determined by the reservoir provider. There shouldn't be "spikes" in the sizes of exemplar reservoirs. There may be some reservoirs that are larger than others (e.g. if you have fixed-bucket histograms with a lot of buckets), so it's possible that is playing a part here as well.

The only downside to this change is more allocations when we downsize after collection. That may cause CPU spikes right after collection if it puts pressure on the garbage collector.

