[runtime] Handle notifyCheckpointAborted to stop leaking checkpoint entries#667
Open
weiqingy wants to merge 1 commit into
Open
[runtime] Handle notifyCheckpointAborted to stop leaking checkpoint entries#667weiqingy wants to merge 1 commit into
weiqingy wants to merge 1 commit into
Conversation
…ntries Issue apache#665. When Flink aborts a checkpoint, it calls notifyCheckpointAborted instead of notifyCheckpointComplete. The DurableExecutionManager only handled the complete path, so the per-checkpoint sequence-number entry recorded by snapshotLastCompletedSequenceNumbers was never released for aborted checkpoints. Under sustained abort pressure (timeouts, alignment failures, backend pressure), checkpointIdToSeqNums grew unboundedly. Changes: - Add DurableExecutionManager.notifyCheckpointAborted(long): removes the entry from checkpointIdToSeqNums, guarded by the same actionStateStore != null check as notifyCheckpointComplete. Does NOT prune durable action state — the aborted checkpoint's writes were never committed, so the prior committed checkpoint's recovery state is still load-bearing and must not be pruned. - Add ActionExecutionOperator.notifyCheckpointAborted(long): thin override that delegates to the manager and then calls super, mirroring the existing notifyCheckpointComplete override. - Extend the symmetric-guard invariant javadoc on snapshotLastCompletedSequenceNumbers and notifyCheckpointComplete to name both release paths (complete OR abort). The actionStateStore != null guard now lives on three methods; the cross-linked javadoc makes that explicit and cites issues apache#645 and apache#665. - Add @VisibleForTesting accessor for checkpointIdToSeqNums to enable the new regression tests. Mirrors the existing getActionStateStore() pattern. - Three new DurableExecutionManagerTest cases: * notifyAbortedRemovesEntryWithoutPruning — entry released, durable state untouched (verified against a real InMemoryActionStateStore so wrongful pruning would be observable). * completedAndAbortedInterleavedKeepsInFlightEntries — three in-flight checkpoints, one completes (state pruned), one aborts (state preserved), one remains. * noStoreModeNotifyCheckpointAbortedIsNoOp — symmetric null-store no-op coverage matching the existing notifyCheckpointComplete null-store case.
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #665.
What this PR does
DurableExecutionManager.checkpointIdToSeqNumsleaked entries on aborted checkpoints. Flink callsnotifyCheckpointAborted(...)when a checkpoint is aborted (timeout, alignment failure, backend pressure); the existing code only handled the complete path. Under sustained abort pressure the map grew unboundedly.Production changes
DurableExecutionManager.notifyCheckpointAborted(long)(new, package-private) — removes the entry fromcheckpointIdToSeqNums. NopruneStatecall — durable writes for an aborted checkpoint were never committed, so the prior committed checkpoint's recovery state is still load-bearing and must not be pruned. Guarded byactionStateStore != nullto mirror the symmetric guard from [Bug][runtime] Fix memory leak in DurableExecutionManager.checkpointIdToSeqNums #645.ActionExecutionOperator.notifyCheckpointAborted(long)(new@Override) — thin delegate to the manager, thensuper.notifyCheckpointAborted(...). Mirrors the existingnotifyCheckpointCompleteoverride exactly.snapshotLastCompletedSequenceNumbersandnotifyCheckpointCompletestrengthened to name BOTH release paths (complete OR abort). TheactionStateStore != nullguard now lives on three methods; the javadoc makes the three-way symmetry explicit and cross-links [Bug][runtime] Fix memory leak in DurableExecutionManager.checkpointIdToSeqNums #645 + [Bug][runtime] DurableExecutionManager leaks checkpointIdToSeqNums entries on aborted checkpoints #665.@VisibleForTesting getCheckpointIdToSeqNums()accessor — mirrorsgetActionStateStore()precedent. (Same addition as in [runtime] Lock null-store symmetry invariant in DurableExecutionManager #666; the second PR to land drops the duplicate on rebase.)Tests (DEM-level, three new)
notifyAbortedRemovesEntryWithoutPruning— entry released, durable state untouched. Uses a realInMemoryActionStateStore(not a mock) so wrongful pruning would be observable.completedAndAbortedInterleavedKeepsInFlightEntries— three in-flight checkpoints; one completes (state pruned), one aborts (state preserved), one remains.noStoreModeNotifyCheckpointAbortedIsNoOp— symmetric null-store no-op coverage.Sanity-mutation verified locally:
actionStateStore.pruneState(...)call → 2 of 3 new tests fail (state was incorrectly pruned).Operator-level harness test deferred to #646 — the new operator override is a one-line delegate; the logic is in the manager.
Test plan
mvn test -Dtest=DurableExecutionManagerTest -pl runtime— 5/5 pass (2 existing + 3 new)mvn test -Dtest=ActionExecutionOperatorTest -pl runtime— 28/28 passmvn test -pl runtime— 307/307 pass (no regressions)./tools/lint.sh -c— 0 violations./tools/check-license.sh— clean (no new tracked files)pruneStatecall on abort → expected tests failDocumentation
doc-neededdoc-not-neededdoc-included