
Fix unbounded memory growth in exemplar storage#7937

Open
raftar200197 wants to merge 4 commits into open-telemetry:main from raftar200197:fix/exemplar-memory-leak

Conversation


@raftar200197 raftar200197 commented Feb 20, 2026


Summary

This PR fixes a critical memory leak in the OpenTelemetry Go SDK's exemplar storage mechanism. Under high metric cardinality scenarios, exemplar-related slices grow unbounded and never shrink, leading to sustained memory pressure and potential OOM conditions.

Problem Description

Root Cause

The reset() function in both sdk/metric/internal/aggregate/aggregate.go and sdk/metric/exemplar/storage.go manages slice capacity for exemplar storage. The current implementation grows slices when needed but never shrinks them:

func reset[T any](s []T, length, capacity int) []T {
    if cap(s) < capacity {
        return make([]T, length, capacity)
    }
    return s[:length]  // ← Never shrinks, retains capacity forever
}

Issue: Once a slice grows to accommodate a spike in exemplar volume, it never shrinks back, even when subsequent collections require much less capacity.
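This retention behavior follows directly from Go's slicing semantics: re-slicing with `s[:length]` changes only the length, never the capacity, so the backing array from the largest spike is pinned for the lifetime of the slice. A minimal standalone sketch (not SDK code) demonstrating the effect:

```go
package main

import "fmt"

// spikeAndReset grows a slice to simulate an exemplar spike, then
// "resets" it the way the current code does (re-slice to zero length).
// It returns the capacity after the spike and after the reset.
func spikeAndReset() (grown, afterReset int) {
	s := make([]int64, 0, 8)
	// Simulate a spike: append enough elements to force growth.
	for i := 0; i < 1000; i++ {
		s = append(s, int64(i))
	}
	grown = cap(s)

	// Re-slicing to zero length keeps the spike-sized backing array.
	s = s[:0]
	afterReset = cap(s)
	return grown, afterReset
}

func main() {
	grown, afterReset := spikeAndReset()
	// The capacity acquired during the spike is fully retained.
	fmt.Println(afterReset == grown) // true
}
```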

Impact

Under high cardinality scenarios:

  • Memory Growth: Slice capacity increases during spikes but never decreases
  • Multiplicative Effect: Many metric series × independently growing slices = significant memory growth
  • No Recovery: Memory remains allocated even after metric flush intervals
  • Production Risk: Can lead to OOM crashes and pod restarts under sustained load

Evidence

From production heap profiles:

  • reset[int64]: 17.11 MB → 25.67 MB
  • reset[float64]: 14.09 MB → 18.12 MB
  • Memory growth correlated with metric cardinality
  • Memory did not decrease after metric flush intervals

Solution

Implement a capacity shrinking strategy in the reset() function:

  1. Grow when needed: Allocate new slice if capacity is insufficient
  2. Reuse when reasonable: Keep existing slice if capacity is ≤ 2× required
  3. Shrink when excessive: Allocate new slice if capacity > 2× required

This approach balances:

  • Memory efficiency: Prevents unbounded growth
  • Performance: Avoids excessive reallocations for minor fluctuations
  • Simplicity: Uses a fixed 2× threshold that's easy to understand

Implementation

func reset[T any](s []T, length, capacity int) []T {
    if cap(s) < capacity {
        return make([]T, length, capacity)
    }
    // If the current capacity is more than 2x what we need, shrink it
    const maxCapacityMultiplier = 2
    if cap(s) > capacity*maxCapacityMultiplier && capacity > 0 {
        return make([]T, length, capacity)
    }
    return s[:length]
}
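To see all three branches in action, here is a self-contained sketch that reproduces the proposed `reset()` and walks one slice through a grow, a reuse, and a shrink (the surrounding `main` is illustrative, not part of the PR):

```go
package main

import "fmt"

// reset mirrors the proposed implementation from this PR.
func reset[T any](s []T, length, capacity int) []T {
	if cap(s) < capacity {
		return make([]T, length, capacity)
	}
	// If the current capacity is more than 2x what we need, shrink it.
	const maxCapacityMultiplier = 2
	if cap(s) > capacity*maxCapacityMultiplier && capacity > 0 {
		return make([]T, length, capacity)
	}
	return s[:length]
}

func main() {
	s := make([]int64, 0, 4)

	s = reset(s, 0, 10) // grow: cap 4 < 10, reallocate
	fmt.Println(cap(s)) // 10

	s = reset(s, 0, 8)  // reuse: 10 ≤ 2×8, keep existing slice
	fmt.Println(cap(s)) // 10

	s = reset(s, 0, 4)  // shrink: 10 > 2×4, reallocate smaller
	fmt.Println(cap(s)) // 4
}
```

Note the `capacity > 0` guard: a zero required capacity would otherwise make the `>` comparison shrink every non-empty slice down to nothing on each call.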

Changes Made

Modified Files

  1. sdk/metric/internal/aggregate/aggregate.go

    • Enhanced reset() function with capacity shrinking logic
    • Added comprehensive documentation
  2. sdk/metric/exemplar/storage.go

    • Enhanced reset() function with capacity shrinking logic
    • Added comprehensive documentation
  3. CHANGELOG.md

    • Added entry under "Fixed" section

New Test Files

  1. sdk/metric/internal/aggregate/aggregate_test.go

    • Added TestReset() with 8 comprehensive test cases
  2. sdk/metric/exemplar/storage_test.go (new file)

    • Added TestReset() with 9 comprehensive test cases including memory leak simulation

Testing

Test Coverage

All test cases pass successfully:

✓ TestReset/AllocatesWhenCapacityTooSmall
✓ TestReset/ReusesSliceWhenCapacitySufficient
✓ TestReset/ShrinksWhenCapacityExcessive
✓ TestReset/DoesNotShrinkWhenCapacityReasonable
✓ TestReset/ShrinksWhenCapacityJustAboveThreshold
✓ TestReset/HandlesZeroCapacity
✓ TestReset/PreservesDataWhenShrinking
✓ TestReset/SimulatesHighCardinalityScenario
✓ TestReset/PreventsMemoryLeakOverMultipleFlushes
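As an illustration of what the multi-flush case exercises (a sketch, not the actual test file — the scenario shape is assumed from the test name), a spike followed by steady-state flushes should settle back to the steady-state capacity:

```go
package main

import "fmt"

// reset is a copy of the proposed implementation, included so this
// sketch is self-contained.
func reset[T any](s []T, length, capacity int) []T {
	if cap(s) < capacity {
		return make([]T, length, capacity)
	}
	const maxCapacityMultiplier = 2
	if cap(s) > capacity*maxCapacityMultiplier && capacity > 0 {
		return make([]T, length, capacity)
	}
	return s[:length]
}

// simulateFlushes runs one spike-sized collection followed by n
// steady-state collections and returns the final capacity.
func simulateFlushes(spike, steady, n int) int {
	var s []float64
	s = reset(s, 0, spike)
	for i := 0; i < n; i++ {
		s = reset(s, 0, steady)
	}
	return cap(s)
}

func main() {
	// With the fix, capacity returns to the steady-state size after
	// the spike instead of staying pinned at the spike size.
	fmt.Println(simulateFlushes(500, 10, 5)) // 10
}
```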

Regression Testing

  • ✅ All existing tests in sdk/metric/internal/aggregate pass
  • ✅ All existing tests in sdk/metric/exemplar pass
  • make precommit passes (formatting, linting, validation)

Performance Considerations

Memory Impact

Before: Unbounded growth under high cardinality

Flush 1 → capacity 10
Flush 2 → capacity 50 (spike)
Flush 3 → capacity 50 (retained, wasted memory)
Flush 4 → capacity 50 (retained, wasted memory)

After: Controlled growth with automatic shrinking

Flush 1 → capacity 10
Flush 2 → capacity 50 (spike)
Flush 3 → capacity 10 (shrunk back)
Flush 4 → capacity 10 (stable)

CPU Impact

  • Minimal overhead: Only reallocates when capacity > 2× required
  • Amortized efficiency: Tolerates 2× overhead to avoid frequent reallocations
  • No hot path impact: Only affects collection/flush path, not measurement recording

Backward Compatibility

Fully backward compatible

  • No API changes
  • No behavioral changes for normal operations
  • Only affects internal memory management
  • Existing code continues to work unchanged

Verification

To verify this fix in production:

  1. Deploy to environment with high metric cardinality
  2. Monitor heap memory usage over multiple flush cycles
  3. Verify memory stabilizes instead of growing unbounded
  4. Check that reset[int64] and reset[float64] allocations remain bounded
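For step 2, one lightweight way to watch the heap between flush cycles without a full profiler is `runtime.ReadMemStats`. The sketch below is an assumption about how a reader might instrument their service; it is not part of the PR:

```go
package main

import (
	"fmt"
	"runtime"
)

// heapAlloc returns the live heap in bytes after forcing a GC, a rough
// stand-in for the heap-profile numbers cited in the Evidence section.
func heapAlloc() uint64 {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc
}

func main() {
	before := heapAlloc()
	// ... run several collection/flush cycles against the SDK here ...
	after := heapAlloc()
	// With the fix, this delta should stabilize across repeated cycles
	// rather than growing with every spike.
	fmt.Printf("heap delta: %d bytes\n", int64(after)-int64(before))
}
```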

Additional Notes

The 2× capacity multiplier threshold was chosen as a balance between:

  • Memory efficiency: Prevents significant waste
  • Performance: Avoids excessive reallocations
  • Simplicity: Easy to understand and maintain

This commit addresses a memory leak where exemplar storage slices
grow unbounded under high metric cardinality and never shrink.

The reset() function now implements a capacity shrinking strategy:
- Grows when capacity is insufficient
- Reuses when capacity is reasonable (≤2× required)
- Shrinks when capacity is excessive (>2× required)

This prevents memory accumulation while maintaining performance
by avoiding excessive reallocations.

Under high cardinality scenarios, exemplar slices would grow to
accommodate spikes but never release memory after the spike ended.
With many metric series, this multiplicative effect led to
continuous heap growth and potential OOM conditions.

The fix uses a 2× threshold as a balance between memory efficiency
and performance, avoiding frequent reallocations while preventing
unbounded growth.

Changes:
- Enhanced reset() in sdk/metric/internal/aggregate/aggregate.go
- Enhanced reset() in sdk/metric/exemplar/storage.go
- Added comprehensive tests for capacity management
- Updated CHANGELOG.md

linux-foundation-easycla bot commented Feb 20, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: raftar200197 / name: raftar200197 (a880c28)


codecov bot commented Feb 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.7%. Comparing base (64f28b0) to head (96ae6d6).


@@          Coverage Diff          @@
##            main   #7937   +/-   ##
=====================================
  Coverage   81.7%   81.7%           
=====================================
  Files        304     304           
  Lines      23283   23289    +6     
=====================================
+ Hits       19032   19038    +6     
  Misses      3864    3864           
  Partials     387     387           
Files with missing lines Coverage Δ
sdk/metric/exemplar/storage.go 100.0% <100.0%> (ø)
sdk/metric/internal/aggregate/aggregate.go 100.0% <100.0%> (ø)

... and 3 files with indirect coverage changes


@dashpole
Contributor

> Evidence
> From production heap profiles:
>
> reset[int64]: 17.11 MB → 25.67 MB
> reset[float64]: 14.09 MB → 18.12 MB
> Memory growth correlated with metric cardinality
> Memory did not decrease after metric flush intervals

A few questions:

  • Can you share more about your production workload? In particular, what are your sampling interval, QPS (or an estimate of the number of metric observations), and export interval?
  • Are you exporting deltas or cumulatives?
  • Is there anything else interesting that you are doing that might be related?
  • How are you measuring memory usage?

The capacity of the exemplar reservoir is determined by the reservoir provider. There shouldn't be "spikes" in the sizes of exemplar reservoirs. There may be some reservoirs that are larger than others (e.g. if you have fixed-bucket histograms with a lot of buckets), so it's possible that is playing a part here as well.

The only downside to this change is more allocations when we downsize after collection. That may cause CPU spikes right after collection if it puts pressure on the garbage collector.

