resumable and progress-visible zpool import recovery scans (-F / -FX) #18383
Description
Describe the feature you would like to see added to OpenZFS
I would like OpenZFS to support resumable recovery scans for zpool import -F and especially zpool import -FX, or at minimum to provide a safe intermediate checkpoint / resume mechanism for the metadata scan phase.
Right now, zpool import -FX -n <pool> can run for many hours when trying to locate a viable rewind point. If the process is interrupted for any reason, all progress is lost and the scan must start again from zero. There also appears to be no built-in user-facing progress indicator that tells the operator whether the command is still making meaningful progress, what stage it is in, how far back it has scanned, or whether it is stuck retrying problematic reads.
This creates a very poor recovery experience in exactly the situations where users are already under stress and trying to avoid data loss.
Requested improvements, in order of usefulness:
- resumable zpool import -F / -FX recovery scans
  - ability to stop and later continue the scan without losing all work
  - stored checkpoint state that is safe to reuse only for the same pool / same device set / same import parameters
- progress reporting for long-running recovery scans
  - current phase
  - approximate TXG / rewind search position
  - approximate amount of metadata scanned
  - per-device read retry / error counters if relevant
  - indication whether the command is progressing or repeatedly stalling on the same region
- explicit dry-run output improvements for -n
  - clearer explanation of what has already been scanned
  - clearer statement of whether a candidate rewind point has been found yet
  - clearer distinction between "still searching", "search exhausted", and "blocked by I/O errors"
- optional safer UX for destructive recovery paths
  - clearer warnings when running recovery import commands without -n
  - clearer warnings that a successful rewind import run without -n can discard newer TXGs
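To make the "safe to reuse only for the same pool / same device set / same import parameters" requirement concrete, here is a minimal sketch of how a checkpoint could be tied to the scan it came from. Nothing here is existing OpenZFS code: `ScanIdentity`, `Checkpoint`, `resume_txg`, and the choice of a SHA-256 fingerprint are all hypothetical names and assumptions used only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScanIdentity:
    """Hypothetical description of what a recovery scan was run against."""
    pool_guid: int
    device_guids: Tuple[int, ...]
    import_flags: Tuple[str, ...]

    def fingerprint(self) -> str:
        # Sort devices and flags so that ordering differences on the
        # command line or in device discovery do not change the result.
        payload = json.dumps(
            {
                "pool_guid": self.pool_guid,
                "device_guids": sorted(self.device_guids),
                "import_flags": sorted(self.import_flags),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical saved state: where the scan stopped, and for what."""
    fingerprint: str
    last_txg_examined: int


def resume_txg(identity: ScanIdentity,
               checkpoint: Optional[Checkpoint]) -> Optional[int]:
    """Return the TXG to resume from, or None if the scan must restart."""
    if checkpoint is None:
        return None
    if checkpoint.fingerprint != identity.fingerprint():
        # Different pool, device set, or parameters: refuse to reuse.
        return None
    return checkpoint.last_txg_examined
```

The point of the sketch is the refusal path: a checkpoint that does not match the current pool, devices, and flags is ignored rather than trusted, so a stale or mismatched checkpoint can never steer a rewind.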
How will this feature improve OpenZFS?
This would improve OpenZFS in several important ways:
- It would make data recovery much safer and less frustrating.
- It would reduce wasted time when long-running recovery scans are interrupted.
- It would reduce the temptation for users to abort and retry commands blindly.
- It would make troubleshooting easier by separating "slow but progressing" from "effectively stuck" behavior.
- It would improve operator confidence during recovery situations involving interrupted resilvers, import failures, or possible media read issues.
- It would reduce avoidable stress and mistakes in one of the highest-risk parts of ZFS administration.
Currently, a user can wait many hours for zpool import -FX -n with no resumability and little visibility. If the command is interrupted, the user loses the entire scan effort. That is especially painful when the pool may still be recoverable and the user is intentionally trying to proceed cautiously with -n first.
Even if full resumability is not feasible due to the structure of the recovery scan, significantly better progress visibility would still be a major improvement.
Additional context
This request comes from a real recovery scenario where:
- a pool was discoverable via zpool import
- normal import failed with I/O error
- the pool had previously been in a resilvering state
- zpool import -FX -n then ran for hours
- there was no clear indication whether it was close to success, still making progress, or effectively wasting time
- there was no way to pause and resume the scan later
The current behavior is particularly unfriendly because the safe workflow is to run -n first, but the dry-run path can still consume many hours with no resumability. In practice, this means the safe workflow is heavily penalized compared to riskier direct execution.
If full resumability is architecturally impossible, then even a documented explanation of why, combined with substantially improved progress reporting and explicit stuck/progress diagnostics, would still be very valuable.
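As an illustration of what "explicit stuck/progress diagnostics" could mean, the sketch below classifies two successive progress samples into the three states requested above: "still searching", "search exhausted", and "blocked by I/O errors". The sample fields, the `classify` function, and the retry threshold are hypothetical; they do not correspond to any counters OpenZFS currently exposes.

```python
from dataclasses import dataclass
from enum import Enum


class ScanState(Enum):
    SEARCHING = "still searching"
    EXHAUSTED = "search exhausted"
    BLOCKED = "blocked by I/O errors"


@dataclass(frozen=True)
class ProgressSample:
    """Hypothetical snapshot of a recovery scan at one point in time."""
    txg: int              # TXG currently being examined
    bytes_scanned: int    # metadata bytes read so far
    read_retries: int     # cumulative read retries across devices
    candidates_left: int  # rewind candidates not yet examined


def classify(prev: ProgressSample, cur: ProgressSample,
             retry_threshold: int = 100) -> ScanState:
    """Classify scan state from two consecutive samples."""
    if cur.candidates_left == 0:
        return ScanState.EXHAUSTED
    no_advance = (cur.txg == prev.txg
                  and cur.bytes_scanned == prev.bytes_scanned)
    new_retries = cur.read_retries - prev.read_retries
    # No forward movement plus many new retries suggests the scan is
    # hammering the same unreadable region rather than merely being slow.
    if no_advance and new_retries >= retry_threshold:
        return ScanState.BLOCKED
    return ScanState.SEARCHING
```

Even a coarse classification like this, printed periodically, would let an operator tell "slow but progressing" apart from "effectively stuck" without guessing.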