Track superseded mempool errors separately by tilacog · Pull Request #4385 · cowprotocol/services

tilacog · 2026-05-05T19:37:57Z

Description

Mempools::execute() races configured mempools concurrently and returns the first one that succeeds. Previously, errors from mempools that lost the race were counted as real failures, even though the overall submission was successful. Dropped mempools were never recorded.

This skewed mempool counts by both:

counting errors alongside a successful submission, and
omitting counts for dropped mempools.

This PR keeps the racing behavior but changes how outcomes interact with the metrics.

Changes

Behavior

On first success, score a Success metric for the winning mempool and Superseded for every other configured mempool.
On all-failed, score a Failed metric for each one of them.

Code specific (all in `crates/driver`)

domain/mempools.rs:
- Introduce a private Outcome enum, representing the bare outcome of a mempool submission.
- Mempools::execute defers updating the metrics after select_ok resolves (once it has all the relevant Outcomes).
infra/observe/mod.rs:
- Logs emission and metric scoring functions were split, effectivelly replacing mempool_executed with two narrower helpers:
  - mempool_log: emits only the per-attempt log line, no metrics.
  - mempool_submission_result: increases the metric driver_mempool_submission counter (and the ..._blocks_passed and counter when timing is known) using the final, reclassified label produced by update_metrics.

How to test

Existing driver unit tests cover the race semantics; this PR does not change the externally observable submission outcome, only how observation is sequenced and labeled. To verify manually:

Run the driver against a config with at least two mempools.
Trigger a settlement that succeeds via the public mempool.
Confirm Prometheus shows one result="Success" increment for the winner and one result="Superseded" increment for the loser; no Revert/Expired/Other from the loser.
Trigger a settlement that fails on every mempool and confirm each mempool gets its own non-Superseded failure label.

Alert query update needed when deploying

Per-mempool success counts both wins and races-lost (so happy and failure paths both emit N events for N configured mempools, keeping the ratio symmetric). Superseded stays a separate label so dashboards can still distinguish wins from race-losses (and race-losses-that-errored) per mempool.

sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result=~"Success|Superseded"}[2h]))
/
sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result!="Disabled"}[2h])) < 0.6

When `Mempools::execute()` runs mempools in parallel, errors from mempools whose results were discarded after another mempool succeeded were still recorded against `driver_mempool_submission`, biasing the per-mempool success ratio with timing-dependent shadowed failures. Replace `select_ok` with `FuturesUnordered` + manual loop so observation runs in the consuming context. Errors that occur before another mempool succeeds are now recorded under a new `Superseded` label via `observe::mempool_superseded`, which also records the winning mempool in the trace fields. Errors in the all-failed case keep their existing labels (Revert / Expired / Other / Disabled). Alert query update needed when deploying: sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result="Success"}[2h])) / sum by (network) (increase(driver_mempool_submission{cow_fi_environment="prod",result!~"Disabled|Superseded"}[2h])) < 0.6

`mempool_executed` took a `Result<&SubmissionSuccess, &mempools::Error>` and re-matched the same discriminant several times to pick the log level, metric label, and block-passed labels. Replace it with two functions, `mempool_succeeded(&SubmissionSuccess)` and `mempool_failed(&mempools::Error)`, so each branch is straight-line and call sites pick the correct observer directly. Behavior and emitted metrics are unchanged.

fleupold

Is there a reason you are not using the PR template for the description?

I agree with the change, however I'd like to suggest that we interpret "superseeded" events as success wrt. how you envision to change the metric. A superseeded submission should be considered a successful one.

This way we receive N (# of mempool) events in the happy case, and N events in the failure case allowing us to keep our alert metric as a ratio of successful to failed ones (otherwise failed events would be weighted N times more than successful ones).

Every loser in a mempool race is now marked Superseded, whether it failed before the winner finished or was still in flight when the winner landed. The old code only labelled already-failed losers as superseded and quietly dropped ones still in flight; the shadowed_errors accumulator that carried their errors across is gone. Minor cleanup: - Error::blocks_passed on the domain type returns the block delta from submission to the terminal event for variants that carry block-level timing. This replaces the inline match in mempool_failed. - error_label is shared between mempool_failed and the per-attempt counter so the Prometheus labels stay in sync. The all-failed path also swaps the expect for an explicit Error::Other fallback instead of panicking on the (currently unreachable) empty-errors case.

tilacog · 2026-05-08T19:33:23Z

Is there a reason you are not using the PR template for the description?

Apologies, I was in a rush and didn't account for that. I've updated the description to match the template.

I agree with the change, however I'd like to suggest that we interpret "superseeded" events as success wrt. how you envision to change the metric. A superseeded submission should be considered a successful one.

This way we receive N (# of mempool) events in the happy case, and N events in the failure case allowing us to keep our alert metric as a ratio of successful to failed ones (otherwise failed events would be weighted N times more than successful ones).

Agree. I've adjusted the suggested metric.

gemini-code-assist

Code Review

This pull request refactors the mempool execution logic to use FuturesUnordered, enabling more detailed tracking of success, failure, and superseded states. It also adds a blocks_passed method to the Error enum for improved block-level timing metrics. A high-severity logic error was identified where disabled mempools are incorrectly reported as 'Superseded' if another mempool wins the race, which would artificially inflate success rate metrics. A correction was suggested to preserve the 'Disabled' status during the racing process.

Disabled is a configuration skip, not a submission failure. Split it into its own observer so failure-rate metrics aren't polluted.

Co-authored-by: José Duarte <15343819+jmg-duarte@users.noreply.github.com>

We've established this during the review, fixing this small regression

Replace the pre-race disabled filter with an Outcome enum that records per-mempool state (Pending/Success/Failed/Disabled) as futures settle. update_metrics observes per-mempool labels after the race, and reconstruct_result collapses the stats into the caller's Result — first Success in config order wins, else first non-Disabled error, else Error::Disabled. This makes the surfaced error deterministic and keeps Disabled out of the Superseded bucket.

tilacog · 2026-05-15T18:45:28Z

Ok, here's what I landed on after addressing @MartinquaXD's comments:

I didn't use a Dashmap or Hashmap because hashing Mempool isn't straightforward (due to an float field in Config), so I kept a vector of pairs instead (Mempool, Outcome) and handled them by position.

I had to use FuturesUnordered instead of select_ok because sending submit's results out of the async closure wasn't possible due to lifetime constraints, plus mempool::Error isn't Clone since anyhow::Error isn't either... and I didn't want to add more complexity there.

So I followed the suggestions as closely as I could:

I used an Outcome enum to capture (and own) errors and submissions results by consuming the results from submit calls, and processed them in their metrics pipeline, finally recreating the returning result to avoid breaking the contract with execute callers (currently just Competition::process_settle_request). I added a few unit tests for that part since they were easy to write.

I think this is in good shape to merge. If we don't see any correctness issues, I'd recommend merging right away and improving the code later, because I think this will help with our current alerts on high mempool submission errors.

Short-circuits on the first Success instead of carrying it through the remaining outcomes. Drops the manual_try_fold allow since fold_while expresses the state machine directly.

MartinquaXD

It seems like logging the submission outcome and updating the labels in a single function lands you in hot water. I think the least invasive solution would be:

keep using select_ok()
log result in select_ok()
have 1 generic function that increments the metric for a given mempool based on the given Outcome.

MartinquaXD · 2026-05-18T09:00:05Z

+        while let Some((idx, outcome)) = submission_futures.next().await {
+            stats[idx] = outcome
+        }


AFAICS there is no need to introduce the FuturesUnordered. Using select_ok() protects us from the case where there is already some future that returned success but some other futures gets stuck and doesn't finish.

let mut stats: Vec<Outcome> = std::iter::repeat_with(|| Outcome::Pending) .take(self.mempools.len()) .collect(); let res = select_ok( self.mempools .iter() .zip(stats.iter_mut()) .map(|(mempool, stat)| { async move { let result = self .submit(mempool, settlement, submission_deadline, mode) .instrument(tracing::info_span!("mempool", kind = mempool.to_string())) .await; *stat = Outcome::from(&result); result } .boxed() }), ) .await; self.update_metrics(settlement, &stats); Ok(res?.0.tx_hash)

Good call, thanks!
I really missed the short-circuiting logic between the last commits.

(I'll revisit this note after considering your suggestion for splitting logs and metrics, so what follows is just the initial part of the fix)

I'm afraid we really need FuturesUnordered here, under the premise that Outcome owns submit results. The reason is we can’t create an Outcome from a &result inside an async task sent to select_ok, because:

It’s short-lived, so we can’t borrow it.

We can’t clone it, since anyhow::Error isn’t Clone.

The current approach moves error values back and forth. Using FuturesUnordered lets us pull results into the caller’s scope, move them into Outcome, and then extract them again to build the expected return for execute.

To preserve the safety you mentioned, I added a small follow-up in 1cf8033 to exit the while let loop early, reproducing the previous select_ok behavior.

Let me know if I’m missing anything!

Martin and I had a quick sync, and we decided on the this next set of changes:

We will split the code paths for logs and metrics. This will simplify the creation of Outcome values, which can then become stateless enums (meaning they can be assigned from within the select_ok! task scope. (So no need for FuturesUnordered anymore)

We should log all errors, even if their submissions turn out to be superseded in the round.

It's OK to use mempool names as their IDs (we already do this for the metric labels)

After all that, we can simplify the stats vector creation.

MartinquaXD · 2026-05-18T09:11:50Z

+    /// two sets of values: 1) a few errors, a success and some pending,
+    /// disabled; or 2) all errors plus disabled.
+    fn update_metrics(&self, settlement: &Settlement, stats: &[Outcome]) {
+        // SAFETY: using `zip_eq` instead of `zip` here to catch regressions early on


This function looks very complicated. Wouldn't something like this also work?

let some_submission_successful = stats .iter() .any(|outcome| matches!(outcome, Outcome::Success(_))); for (mempool, outcome) in self.mempools.iter().zip_eq(stats.iter()) { match outcome { Outcome::Failed(error) if some_submission_successful => { // if there was a successful submission we assume the errors are // false-positivies due to race conditions during the // submission. observe::mempool_outcome(mempool, Outcome::Superseeded), } _ => observe::mempool_outcome(mempool, outcome), } }

Thanks for the suggestion. Addressed in 3551bb6

MartinquaXD · 2026-05-18T09:20:08Z

+/// Label used for the `mempool_submission` metric when submission succeeds.
+const MEMPOOL_SUBMISSION_SUCCESS_LABEL: &str = "Success";


You could have a single function like metric_label(&Outcome) which contains those strings inside and calls err.metric_label().

Addressed in 5eb4b93 & f6cb0a3

Collapses the winner/no-winner branches into a single loop. Moves per-variant dispatch to `Outcome::observe`, reusing `Pending` as the "lost the race" marker so no synthetic enum variant is needed. `mempool_superseded` drops its now-unused `winner` argument.

Previously the loop awaited every submission future before returning. A stuck mempool future would hang `execute` indefinitely even after another mempool already succeeded. Break on the first Success; remaining futures stop being polled and their slots stay `Pending`, which `update_metrics` already reports as `Superseded`.

The `'static` bound was unnecessary — these functions only read the slice to pass it as a metric label. Drop it so callers aren't forced to provide a static lifetime. The label producer (`Outcome::metric_label`) still returns `&'static str`.

Drop `FuturesUnordered` + manual break-on-success in favor of `select_ok`, which already protects against a stuck future hanging the race, returns the first success or last error directly, and writes each outcome into its pre-allocated `stats` slot via `iter_mut`. `reconstruct_result` is gone. Split observability so errors from mempools that later get superseded still surface in logs. `observe::mempool_log` runs inline from each racing task and only logs. `observe::mempool_submission_result` runs once per mempool from `update_metrics` after the race resolves and emits both Prometheus counters (`mempool_submission` and `mempool_submission_results_blocks_passed`) with one shared label, so the two cannot drift on the `Failed`->`Superseded` reclassification. `Outcome` carries the data each state has: `Success` and `Failed` own `blocks_passed`; `Failed` also carries a stateless `FailureReason` enum for the `Revert`/`Expired`/`Other` subtype label. The exhaustive `From<&Result<...>>` impl makes the compiler reject any new `Error` variant that hasn't been classified.

MartinquaXD

The diff is still larger than anticipated but that seems due to some existing code being moved into separate functions and some re-formatting which is alright.
The actual logic change is now very small.

Already approving as the last comments are mostly nits.

MartinquaXD · 2026-05-19T15:54:26Z

+        // practice).
+        for (mempool, &outcome) in self.mempools.iter().zip_eq(stats.iter()) {
+            let label = match outcome {
+                Outcome::Failed { .. } if winner_exists => "Superseded",


This string literal has the risk of drifting. I think the less error prone option would be to rename Pending to Superseded and do Outcome::Superseded.metric_label() instead of the string literal here.

Renamed in 807c325

MartinquaXD · 2026-05-19T15:57:59Z

+#[derive(Clone, Copy)]
+enum FailureReason {


Given that we only capture the FailureReason as a label to increase metrics with having a full enum here seems a little much. Directly storing a String or &'static str in the Failed variant seems to be sufficient.

Addressed in bf24685

MartinquaXD · 2026-05-19T15:58:56Z

+#[derive(Clone, Copy)]
+enum Outcome {


nit: pet peeve of Jose's is that there should not be a bunch of other stuff between struct and impl blocks (in this case of Mempools).

Addressed in fb4d44e

Eliminates a drifting string literal: the metric layer reads Outcome::Superseded.metric_label() instead of hardcoding "Superseded".

The FailureReason enum existed only to feed a metric label; storing the label string directly removes a layer of indirection.

Keep struct and its impl block adjacent.

tilacog force-pushed the mempool-metric-superseded branch 2 times, most recently from a798fc3 to 9905e6e Compare May 6, 2026 15:39

This comment was marked as outdated.

Sign in to view

tilacog added 7 commits May 6, 2026 12:54

Add comments to mempool race logic

fe5207d

minor adjustments

844a156

minor adjustments

d4d060d

fmt

0ed1753

Document dropped futures in mempool race

d9fb0cb

tilacog force-pushed the mempool-metric-superseded branch from 9905e6e to d9fb0cb Compare May 6, 2026 15:55

cowprotocol deleted a comment from github-actions Bot May 6, 2026

fleupold reviewed May 7, 2026

View reviewed changes

tilacog marked this pull request as ready for review May 8, 2026 19:33

tilacog requested a review from a team as a code owner May 8, 2026 19:33

gemini-code-assist Bot reviewed May 8, 2026

View reviewed changes

Comment thread crates/driver/src/domain/mempools.rs Outdated

tilacog added 3 commits May 8, 2026 16:56

extract is_disabled as a function

4a27081

Skip disabled mempools before race

80e7539

Merge branch 'main' into mempool-metric-superseded

7d2b845

tilacog enabled auto-merge May 11, 2026 11:45

jmg-duarte reviewed May 11, 2026

View reviewed changes

tilacog and others added 7 commits May 11, 2026 09:59

Filter disabled mempools inline

bc8b4c2

Clarify doc comments on mempools::is_disabled

605f650

Tag skipped mempools as Disabled, not Failed

d61152e

Disabled is a configuration skip, not a submission failure. Split it into its own observer so failure-rate metrics aren't polluted.

Add is_enabled helper for readability

da45a12

Rename variable: other -> mempool

013e884

Co-authored-by: José Duarte <15343819+jmg-duarte@users.noreply.github.com>

Fix stale variable names in mempool superseded loop

06eaad1

Reword doc-comment

5fbfefc

tilacog added 6 commits May 12, 2026 19:38

filter disabled mempools and score them in the same pass

75042e4

We've established this during the review, fixing this small regression

Refactor reconstruct_result via fold; doc update_metrics

58da54c

Allow manual_try_fold in reconstruct_result

79db6cf

Remove unused PartialEq impl for Mempool

20bec4b

Add unit tests for reconstruct_result

b55296a

tilacog requested review from MartinquaXD and extrawurst May 15, 2026 18:45

tilacog added 3 commits May 15, 2026 15:50

Remove unused PartialEq derives on Config and RevertProtection

70896f4

Merge branch 'main' into mempool-metric-superseded

e71710d

Use fold_while in reconstruct_result for early exit on Success

f45823b

Short-circuits on the first Success instead of carrying it through the remaining outcomes. Drops the manual_try_fold allow since fold_while expresses the state machine directly.

MartinquaXD reviewed May 18, 2026

View reviewed changes

tilacog added 4 commits May 18, 2026 08:45

Centralize mempool outcome label via Outcome::metric_label

5eb4b93

Relax mempool observe label args to &str

f6cb0a3

The `'static` bound was unnecessary — these functions only read the slice to pass it as a metric label. Drop it so callers aren't forced to provide a static lifetime. The label producer (`Outcome::metric_label`) still returns `&'static str`.

tilacog requested a review from MartinquaXD May 18, 2026 18:17

tilacog added 3 commits May 19, 2026 12:25

Use vec! macro for Pending outcomes init

b38eb86

Merge branch 'main' into mempool-metric-superseded

aaf9a57

MartinquaXD approved these changes May 19, 2026

View reviewed changes

tilacog added 3 commits May 19, 2026 13:12

Rename Pending outcome to Superseded

807c325

Eliminates a drifting string literal: the metric layer reads Outcome::Superseded.metric_label() instead of hardcoding "Superseded".

Inline FailureReason as &'static str on Outcome::Failed

bf24685

The FailureReason enum existed only to feed a metric label; storing the label string directly removes a layer of indirection.

Move Outcome enum below impl Mempools

fb4d44e

Keep struct and its impl block adjacent.

tilacog added this pull request to the merge queue May 19, 2026

Merged via the queue into main with commit 3bc9afe May 19, 2026
20 checks passed

tilacog deleted the mempool-metric-superseded branch May 19, 2026 16:41

github-actions Bot locked and limited conversation to collaborators May 19, 2026

		/// Label used for the `mempool_submission` metric when submission succeeds.
		const MEMPOOL_SUBMISSION_SUCCESS_LABEL: &str = "Success";

Conversation

tilacog commented May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Behavior

Code specific (all in crates/driver)

How to test

Alert query update needed when deploying

Uh oh!

This comment was marked as outdated.

fleupold left a comment

Choose a reason for hiding this comment

Uh oh!

tilacog commented May 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tilacog commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MartinquaXD left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tilacog May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tilacog May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tilacog May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

MartinquaXD left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tilacog commented May 5, 2026 •

edited

Loading

Code specific (all in `crates/driver`)

tilacog commented May 15, 2026 •

edited

Loading

tilacog May 18, 2026 •

edited

Loading

tilacog May 19, 2026 •

edited

Loading

tilacog May 18, 2026 •

edited

Loading