Add monitoring for config digest missmatch#2068
Conversation
|
👋 emate, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks! |
There was a problem hiding this comment.
Pull request overview
Adds per-round monitoring of config-digest mismatches between the home chain and the offramp for both the exec and commit plugins. A shared helper is introduced and used to refactor existing digest checks, and a new TrackConfigDigestMismatch metric (Prometheus + Beholder gauge) is emitted from each plugin's Observation path.
Changes:
- New
plugincommon.ConfigDigestsMatch/FormatConfigDigesthelpers, reused byexecute/observation.go(checkConfigDigest) andcommit/report.go(validateReport). - New
TrackConfigDigestMismatch(bool)reporter method with Prom + Beholder gauges (ccip_exec_config_digest_mismatch,ccip_commit_config_digest_mismatch); called every round inPlugin.Observation. - Tests updated to mock
GetOffRampConfigDigestand to wire aNoopobserver/reporter.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/plugincommon/config_digest.go | New shared digest comparison and hex-format helpers. |
| execute/observation.go | Emits mismatch metric per round; refactors checkConfigDigest to use the helper. |
| execute/metrics/reporter.go | Adds TrackConfigDigestMismatch to the Reporter interface and Noop. |
| execute/metrics/prom.go | Registers Prom + Beholder gauges and implements the tracker. |
| execute/observation_test.go, execute/plugin_test.go | Wire observer and mock GetOffRampConfigDigest. |
| commit/plugin.go | Emits mismatch metric per round in Observation. |
| commit/report.go | Refactors digest check in validateReport to use the helper. |
| commit/metrics/reporter.go | Adds TrackConfigDigestMismatch to interfaces and Noop. |
| commit/metrics/prom.go | Registers Prom + Beholder gauges and implements the tracker. |
| commit/plugin_test.go, commit/plugin_roledon_e2e_test.go | Mock GetOffRampConfigDigest for new path. |
Comments suppressed due to low confidence (1)
commit/metrics/prom.go:175
- Two struct fields were collapsed onto a single line, breaking the formatting and grouping with the unrelated
processorLatencyHistogramblock. PlaceconfigDigestMismatch: promCommitConfigDigestMismatch,on its own line (preferably alongside the other Prom gauge fields, e.g., right afterlooppProviderSupported) and restore the blank line separator between the Prom and beholder field groups.
looppProviderSupported: promLooppCCIPProviderSupported, configDigestMismatch: promCommitConfigDigestMismatch,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| commitLatestRound: promCommitLatestRoundID, | ||
| looppProviderSupported: promLooppCCIPProviderSupported, | ||
|
|
||
| looppProviderSupported: promLooppCCIPProviderSupported, configDigestMismatch: promCommitConfigDigestMismatch, |
There was a problem hiding this comment.
nit: I think adding configDigestMismatch on its own line (nearer to the bottom) would be more in line with expected formatting.
| p.bhConfigDigestMismatch.Record(context.Background(), int64(value), metric.WithAttributes( | ||
| attribute.String("chainFamily", p.chainFamily), | ||
| attribute.String("chainID", p.chainID), | ||
| )) |
There was a problem hiding this comment.
question: I see the use of context.Background() a lot in these Track* methods. I'm wondering what these methods actually do: is it blocking network I/O or is it some kind of IPC (to a statsd daemon or similar)?
There was a problem hiding this comment.
The beholder Record/Add calls are non-blocking ops - they buffer metrics for async export
| ctx, p.ccipReader, consts.PluginTypeCommit, p.reportingCfg.ConfigDigest, | ||
| ) | ||
| if err != nil { | ||
| lggr.Errorw("failed to check config digest", "err", err) |
There was a problem hiding this comment.
nit: more descriptive error message.
| lggr.Errorw("failed to check config digest", "err", err) | |
| lggr.Errorw("failed to check for config digest mismatch", "err", err) |
There was a problem hiding this comment.
fixed, added additional labels
"homeChainConfigDigest", p.reportingCfg.ConfigDigest,
"pluginType", consts.PluginTypeExecute,
)
| promCommitConfigDigestMismatch = promauto.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "ccip_commit_config_digest_mismatch", | ||
| Help: "Reports whether the home chain config digest differs from the offramp config digest (1 = mismatch, 0 = match)", | ||
| }, []string{"chainFamily", "chainID"}) |
There was a problem hiding this comment.
question: this metric has "chainFamily" in the string slice, whereas the previous metric has "chain_family" (promLooppCCIPProviderSupported) and one pretty far up this list also has "chainFamily". Should we settle on one instead and preferably use a const to refer to this label?
There was a problem hiding this comment.
fair, adjusted the label names to what we currently have
| Help: "Tracks whether LOOPP CCIP provider is supported for each chain family (1 = supported, 0 = not supported)", | ||
| }, []string{"chain_family"}) | ||
| promCommitConfigDigestMismatch = promauto.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "ccip_commit_config_digest_mismatch", |
There was a problem hiding this comment.
question: should these metric names be const? They are referred to in multiple places by the string literal, which as we see in this file can quickly go out of sync.
There was a problem hiding this comment.
yeah, it can be the way to go. I'm just wondering if we should start with having one metric as const leaving the rest for the other PR is. WDYT @makramkd ?
There was a problem hiding this comment.
I think its fine if we do the cleanup in a follow-up.
| promExecConfigDigestMismatch = promauto.NewGaugeVec(prometheus.GaugeOpts{ | ||
| Name: "ccip_exec_config_digest_mismatch", | ||
| Help: "Reports whether the home chain config digest differs from the offramp config digest (1 = mismatch, 0 = match)", | ||
| }, []string{"chainFamily", "chainID"}) |
There was a problem hiding this comment.
question: same question here with "chainFamily" vs. "chain_family".
|
Add per-round config digest mismatch beholder metric for exec and commit plugins
Start emmiting
ccip_{exec|commit}_config_digest_mismatchgauge (1=mismatch, 0=match) every round by comparing the home chain config digest against the offramp's on-chain config digest. Both reads are cached, so no additional RPC load.Changes:
ConfigDigestsMatchhelperTrackConfigDigestMismatchto exec and commit metrics reporters (checkConfigDigest(exec) andvalidateReport(commit) to use the shared helper