Summary
RewindToTimestamp runs a retry-verify loop after each forkchoice update (FCU) to work around an op-reth race condition: two back-to-back FCUs for sibling blocks at the same height can cause a safe head mismatch panic (reth#23205).
This issue tracks whether to keep this workaround long term or remove it once the race is fixed upstream.
Current implementation (PR #19773)
After each FCU, verifyRewindState checks that all three engine heads (unsafe, safe, finalized) match the expected values. If they have not converged, the FCU is retried after a 500ms delay, for up to 20 attempts (about 10s in total).
In practice on op-reth, the verify query itself provides enough delay for reth to flush state — most rewinds need 0 retries, with occasional 1-retry cases observed.
Failure mode
If the retry loop exhausts all attempts, RewindToTimestamp returns ErrRewindFCUHeadMismatch. Without this workaround, the rewind step fails and the node instead issues an FCU to a synthetic block, which can lead to errors such as "panic: superAuthority supplied an identifier for the safe head which is not known to the engine".
Additional follow-up
- Medium: /tmp/optimism-pr19773.LB4KGV/op-supernode/supernode/chain_container/engine_controller/rewind.go:205 retries on any verifyRewindState failure, not just the intended “heads haven’t converged yet” case. That means a plain L2BlockRefByLabel read failure gets retried 20 times and is finally reclassified as ErrRewindFCUHeadMismatch, even though the FCU itself may have succeeded. In /tmp/optimism-pr19773.LB4KGV/op-supernode/supernode/chain_container/chain_container.go:517, that wrapped error is treated as a temporary rewind error, so the caller will retry the whole rewind instead of surfacing the real EL/RPC failure. I’d split “label read failed” from “label hash mismatched” and only retry the latter.
Originally posted by @karlfloersch in #19773 (review)