
Harden pipeline interpreter state reload against transient MongoDB errors #25894

Draft
patrickmann wants to merge 1 commit into master from fix/harden-pipeline-state-reload


@patrickmann patrickmann commented May 5, 2026

Closes #25750

Description

PipelineInterpreterStateUpdater.reloadAndSave() could silently replace a valid pipeline state with an empty one when MongoDB hit a transient error during an event-triggered reload. Messages processed in that window bypassed all pipeline rules and landed in the default stream.
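To illustrate the failure mode, here is a minimal sketch, assuming the loader caught the database error and returned an empty set. `LoadAllSketch`, `TransientStoreException`, and both method names are hypothetical stand-ins and not taken from the Graylog code base; `TransientStoreException` plays the role of a transient `MongoException`.

```java
import java.util.Collections;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical stand-ins; none of these names come from the Graylog code base.
final class LoadAllSketch {

    // Plays the role of a transient MongoException in this sketch.
    static final class TransientStoreException extends RuntimeException {
        TransientStoreException(String message) {
            super(message);
        }
    }

    // Swallowing variant: the error is hidden and an empty set is returned, so the
    // caller cannot tell "no rules exist" from "the database was unreachable".
    static Set<String> loadAllSwallowing(Supplier<Set<String>> query) {
        try {
            return query.get();
        } catch (TransientStoreException e) {
            return Collections.emptySet();
        }
    }

    // Propagating variant: the exception reaches the caller, which can retry
    // or keep its previous state.
    static Set<String> loadAllPropagating(Supplier<Set<String>> query) {
        return query.get();
    }

    public static void main(String[] args) {
        Supplier<Set<String>> failing = () -> {
            throw new TransientStoreException("connection reset");
        };
        System.out.println("swallowing returned: " + loadAllSwallowing(failing));
        try {
            loadAllPropagating(failing);
        } catch (TransientStoreException e) {
            System.out.println("propagating threw: " + e.getMessage());
        }
    }
}
```

With the swallowing variant, a reload that runs during the outage builds its state from an empty rule set, which matches the empty-state symptom described above.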

This PR implements the following fixes from #25750:

  1. Let MongoException propagate from MongoDbRuleService and MongoDbPipelineService loadAll() and friends, instead of swallowing it and returning an empty set. Transient MongoDB failures now fail loudly, so callers can react. Other callers (REST resources, content packs, migrations) get a 500 on transient failure, which is correct.

  2. Migrate state reload to a new PipelineInterpreterStateReloadJob (SystemJob) submitted via SystemJobManager. On failure the job retries with a 1-second delay. The constructor of PipelineInterpreterStateUpdater now performs the initial state load synchronously before registering on the event bus, closing the startup race window described in #25745 (Pipeline rules not applied during multi-node restart due to async state reload race). The pattern follows the existing PipelineMetadataUpdateJob.

  3. PipelineInterpreterStateUpdater.updateState() refuses to replace a non-empty state with an empty one and logs at WARN. Defense in depth.

  4. Null safety in PipelineInterpreter.process(). If getLatestState() returns null, messages pass through unchanged with a warning log instead of throwing a NullPointerException. The companion change for IlluminateMessageProcessor.process() is in Graylog2/graylog-plugin-enterprise#14157.
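The empty-state guard (item 3) and the null-safe processing path (item 4) can be sketched together as follows. This is a simplified model under stated assumptions, not the code in this PR: `StateGuardSketch` is a hypothetical class, pipeline state is reduced to a map from pipeline id to rule names, and a message is a plain string.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model; the real state and message types are richer.
final class StateGuardSketch {
    // Pipeline state reduced to: pipeline id -> rule names. Null until first load.
    private final AtomicReference<Map<String, List<String>>> latestState = new AtomicReference<>();

    // Installs newState unless that would replace a non-empty state with an empty one.
    // Returns true if the state was installed, false if it was rejected.
    synchronized boolean updateState(Map<String, List<String>> newState) {
        Map<String, List<String>> current = latestState.get();
        boolean currentHasPipelines = current != null && !current.isEmpty();
        if (currentHasPipelines && newState.isEmpty()) {
            System.err.println("WARN: refusing to replace non-empty pipeline state with an empty one");
            return false;
        }
        latestState.set(newState);
        return true;
    }

    // Null-safe processing: with no state loaded, the message passes through unchanged.
    String process(String message) {
        Map<String, List<String>> state = latestState.get();
        if (state == null) {
            System.err.println("WARN: no pipeline state available, passing message through");
            return message;
        }
        // Placeholder for rule evaluation: tag the message with the pipeline count.
        return message + " [pipelines=" + state.size() + "]";
    }

    public static void main(String[] args) {
        StateGuardSketch sketch = new StateGuardSketch();
        System.out.println(sketch.process("before load"));
        sketch.updateState(Map.of("pipeline-1", List.of("rule-a")));
        sketch.updateState(Map.of()); // rejected by the guard
        System.out.println(sketch.process("after rejected update"));
    }
}
```

One consequence of the guard as modeled here is that a legitimate transition to zero pipelines is rejected as well; how the actual change treats that case is not covered by this description.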

Note on retry policy: SystemJobResult.withRetry requires maxRetries == Integer.MAX_VALUE until per-trigger retry tracking lands in the system scheduler.
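The retry behaviour can be modeled roughly as below. This sketch deliberately does not use Graylog's SystemJob or SystemJobResult API; `ReloadRetrySketch`, `runWithRetry`, and the constants are hypothetical names, and the loop only illustrates a fixed 1-second delay combined with an effectively unbounded retry count.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical model of a retrying reload; it does not use Graylog's SystemJob API.
final class ReloadRetrySketch {
    static final Duration RETRY_DELAY = Duration.ofSeconds(1);
    // Effectively "retry until success", mirroring the note on the retry policy.
    static final int MAX_RETRIES = Integer.MAX_VALUE;

    // Runs reload until it succeeds or maxRetries retries have been used.
    // Returns true on success, false if retries ran out or the thread was interrupted.
    static boolean runWithRetry(Callable<Void> reload, Duration delay, int maxRetries) {
        long attempts = 0;
        while (true) {
            attempts++;
            try {
                reload.call();
                return true;
            } catch (Exception e) {
                System.err.println("Failed to reload pipeline interpreter state, retrying: " + e.getMessage());
            }
            if (attempts > maxRetries) {
                return false; // initial attempt plus maxRetries retries all failed
            }
            try {
                Thread.sleep(delay.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = runWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("store unavailable");
            }
            return null;
        }, Duration.ofMillis(10), MAX_RETRIES);
        System.out.println("reloaded=" + ok + " after " + calls[0] + " attempts");
    }
}
```

In the real scheduler the delay and retry bookkeeping live outside the job; the loop here only makes the policy visible in one place.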

How Tested

  • Manual: start a single-node Graylog with one pipeline attached to a stream and verify message processing. Edit the pipeline rule via the UI and confirm the new rule takes effect within a few seconds. Stop MongoDB briefly while editing another rule, then restart MongoDB. Verify that the system job retries (the server log shows "Failed to reload pipeline interpreter state, retrying") and that pipeline state is eventually rebuilt with no empty-state interval observed in message processing.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactoring (non-breaking change)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have requested a documentation update.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.

…rors

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
