This repository was archived by the owner on Jan 16, 2026. It is now read-only.

@RealiCZ RealiCZ commented Dec 29, 2025

Problem

The get_preimage method in OnlineHostBackend retried in an unbounded loop, with no timeout, retry limit, or backoff. This caused several issues:

  1. Infinite retries: If the Beacon node is unreachable or blob data is unavailable, the loop would retry forever
  2. Resource exhaustion: Continuous failed attempts consume CPU and network resources
  3. No backoff: Immediate retries create unnecessary load on external services
  4. Busy waiting: When no hint is available, the loop would spin with no delay between iterations

This became apparent when encountering the error:
Failed to fetch blobs: Preimage oracle error: Channel is closed.

Solution

Added a timeout, a retry limit, and exponential backoff to get_preimage, following the project's existing retry pattern in crates/supervisor/service/src/actors/utils.rs:

Key Changes

  1. Timeout Protection: 30-second overall timeout per preimage fetch
  2. Retry Limit: Maximum 100 retry attempts before failing
  3. Exponential Backoff: Delays start at 100ms and double on each retry, which suits high-frequency operations
    • Progression: 100ms → 200ms → 400ms → 800ms → 1.6s → 3.2s, then capped at 5s for all later retries
    • Uses saturating_pow and saturating_mul to prevent overflow
  4. Improved Logging:
    • Error logs show the current attempt number and the maximum number of retries
    • Warn logs display the actual backoff delay
  5. Graceful Degradation: Handles missing hints with appropriate delays
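
The backoff progression in item 3 can be sketched as follows. This is only an illustration of the described behavior, not the code from this PR: the constant names are hypothetical, and the function signature is an assumption (the description names preimage_backoff_delay() but does not show it).

```rust
use std::time::Duration;

// Hypothetical constant names; the description gives the values only.
const BASE_DELAY_MS: u64 = 100;
const MAX_DELAY_MS: u64 = 5_000;

/// Delay before retry number `attempt` (0-based):
/// 100ms * 2^attempt, capped at 5s.
fn preimage_backoff_delay(attempt: u32) -> Duration {
    // saturating_pow clamps to u64::MAX instead of overflowing for large attempts.
    let factor = 2u64.saturating_pow(attempt);
    // saturating_mul clamps as well, so the cap below always applies cleanly.
    let millis = BASE_DELAY_MS.saturating_mul(factor).min(MAX_DELAY_MS);
    Duration::from_millis(millis)
}
```

With these values, attempts 0 through 5 yield 100ms, 200ms, 400ms, 800ms, 1.6s, and 3.2s, and every later attempt yields the 5s cap.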

Implementation Details

  • Created preimage_backoff_delay() function mirroring the project's backoff_delay() pattern
  • Backoff starts at the millisecond level (vs. second-level), which is more appropriate for preimage operations
  • Added early return for already-cached preimages
  • Structured logging with ?delay format for better observability

Testing

  • Code compiles without errors
  • Follows existing project patterns and style
  • No linter errors

Behavior Changes

  • Success case: No change - cached preimages return immediately
  • Retry case: Now includes exponential backoff delays (max 100 retries)
  • Timeout case: Returns PreimageOracleError::Timeout after 30 seconds
  • Failure case: Returns error after 100 attempts with detailed logging
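
The retry, timeout, and failure cases above can be sketched together in one loop. This is a simplified, synchronous, std-only illustration, not the PR's implementation (which presumably runs inside the host's async runtime). FetchError, RetryPolicy, delay_for, and fetch_with_retry are hypothetical names; only the Timeout variant name is taken from the description.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum FetchError {
    Timeout,
    RetriesExhausted { attempts: u32 },
}

/// Hypothetical bundle of the limits described above.
struct RetryPolicy {
    timeout: Duration,
    max_retries: u32,
    base_delay: Duration,
    max_delay: Duration,
}

/// Exponential backoff: base_delay * 2^attempt, capped at max_delay.
fn delay_for(policy: &RetryPolicy, attempt: u32) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    policy.base_delay.saturating_mul(factor).min(policy.max_delay)
}

/// Calls `attempt_fetch` until it succeeds, the attempt limit is reached,
/// or the overall timeout elapses.
fn fetch_with_retry<T, E>(
    policy: &RetryPolicy,
    mut attempt_fetch: impl FnMut() -> Result<T, E>,
) -> Result<T, FetchError> {
    let start = Instant::now();
    for attempt in 0..policy.max_retries {
        // Timeout case: stop once the overall budget is spent.
        if start.elapsed() >= policy.timeout {
            return Err(FetchError::Timeout);
        }
        // Success case: return immediately, with no delay.
        if let Ok(value) = attempt_fetch() {
            return Ok(value);
        }
        // Retry case: back off, but never sleep past the deadline
        // and do not sleep after the final attempt.
        if attempt + 1 < policy.max_retries {
            let remaining = policy.timeout.saturating_sub(start.elapsed());
            sleep(delay_for(policy, attempt).min(remaining));
        }
    }
    // Failure case: every attempt failed within the time budget.
    Err(FetchError::RetriesExhausted { attempts: policy.max_retries })
}
```

One consequence of the stated numbers, assuming every attempt fails and backs off: the cumulative delay reaches about 31.3s after the eleventh backoff, so the 30-second timeout would generally trigger well before the 100-attempt limit. The limit mainly bounds cases where attempts fail quickly without the full backoff.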

Related Issues

Addresses scenarios where Beacon node connectivity issues or missing blob data would cause the host to hang indefinitely.
