Harden pipeline interpreter state reload against transient MongoDB errors #25894
Draft
patrickmann wants to merge 1 commit into master from
…rors Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #25750
Description
PipelineInterpreterStateUpdater.reloadAndSave() could silently replace a valid pipeline state with an empty one when MongoDB hit a transient error during an event-triggered reload. Messages processed in that window bypassed all pipeline rules and landed in the default stream.
This PR implements three of the four fixes from #25750:
1. Let MongoException propagate from MongoDbRuleService and MongoDbPipelineService loadAll() and friends, instead of swallowing it and returning an empty set. Transient MongoDB failures now fail loudly, so callers can react. Other callers (REST resources, content packs, migrations) get a 500 on transient failure, which is correct.
2. Migrate state reload to a new PipelineInterpreterStateReloadJob (SystemJob) submitted via SystemJobManager. On failure the job retries with a 1 second delay. The constructor of PipelineInterpreterStateUpdater now performs the synchronous initial state load before registering on the event bus, closing the startup race window described in "Pipeline rules not applied during multi-node restart due to async state reload race" (#25745). Pattern follows the existing PipelineMetadataUpdateJob. PipelineInterpreterStateUpdater.updateState() refuses to replace a non-empty state with an empty one and logs at WARN. Defense in depth.
3. Null safety in PipelineInterpreter.process(). If getLatestState() returns null, messages pass through unchanged with a warning log instead of NPE. The companion change for IlluminateMessageProcessor.process() is in Graylog2/graylog-plugin-enterprise#14157.
Note on retry policy: SystemJobResult.withRetry requires maxRetries == Integer.MAX_VALUE until per-trigger retry tracking lands in the system scheduler.
How Tested
Failed to reload pipeline interpreter state, retrying) and pipeline state is eventually rebuilt with no empty-state interval observed in message processing.
Types of changes
Checklist:
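Illustrative sketch for the Description: the three behaviors listed there (loader failures propagate and trigger a retry, a non-empty state is never replaced by an empty one, and a null state lets messages pass through) can be modeled in a few lines of plain Java. This is not the code in this PR. The class StateHolderSketch, its methods, the Set<String> stand-in for the interpreter state, and the bounded maxAttempts are all assumptions made for the example; the PR itself relies on SystemJobManager and SystemJobResult.withRetry, which are not reproduced here.

```java
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for the state updater; not Graylog's actual class.
class StateHolderSketch {
    // The state is modeled as a set of rule names (an assumption for this sketch).
    private final AtomicReference<Set<String>> latest = new AtomicReference<>();

    /** Stores the candidate unless it would replace a non-empty state with an empty one. */
    boolean updateState(Set<String> candidate) {
        Set<String> current = latest.get();
        if (current != null && !current.isEmpty() && candidate.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty pipeline state with an empty one");
            return false;
        }
        latest.set(candidate);
        return true;
    }

    Set<String> getLatestState() {
        return latest.get();
    }

    /**
     * Calls the loader and stores its result. A loader failure surfaces as an
     * exception (instead of an empty set) and leads to another attempt after a delay.
     * The attempt limit is a simplification; the PR describes unbounded retries.
     */
    boolean reloadWithRetry(Supplier<Set<String>> loader, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return updateState(loader.get());
            } catch (RuntimeException e) {
                System.err.println("WARN: Failed to reload pipeline interpreter state, retrying");
                if (attempt == maxAttempts) {
                    break; // no point in sleeping after the final attempt
                }
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    /** Null-safe processing: with no state loaded, the message passes through unchanged. */
    String process(String message) {
        Set<String> state = latest.get();
        if (state == null) {
            System.err.println("WARN: no pipeline state available, passing message through");
            return message;
        }
        return message + " [rules=" + state.size() + "]";
    }
}
```

In this model the previous valid state stays in place for the whole time the loader is failing, which is the property the How Tested section checks for (no empty-state interval).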