read_storage: content-based checkpoint hash from file etags #1639

@ilongin

Description

Currently read_storage() uses the listing dataset name as the starting checkpoint hash. This has two problems:

  1. Bug: the hash is computed before the listing step runs, so it uses only the listing dataset name (not the versioned URI). With update=True and new files arriving, the hash stays the same and stale checkpoints are reused. (Extracted to #1655: "Fix read_storage hash computed before listing resolves".)
  2. Unnecessary invalidation: if we fix (1) by using the versioned URI, every new listing creates a new version which changes the hash — invalidating all downstream checkpoints even when no files actually changed.

The fix is to compute a deterministic hash from the file etags of the filtered listing output: stream etags sorted by file__path from the listing table and feed them into an incremental SHA256, using O(1) Python memory.

This way:

  • New files arrive → etags change → hash changes → checkpoints re-run
  • Same files, new listing → same etags → same hash → checkpoints reused
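A minimal sketch of the proposed hash. The helper name `listing_checkpoint_hash` and the idea of a generator yielding etags are illustrative assumptions, not DataChain API; the point is that an incremental SHA256 over a sorted etag stream needs only constant Python memory:

```python
import hashlib
from typing import Iterable


def listing_checkpoint_hash(etags_sorted_by_path: Iterable[str]) -> str:
    """Deterministic content hash of a listing from its file etags.

    Caller is assumed to stream etags ORDER BY file__path, so the
    same set of files always produces the same hash.
    """
    h = hashlib.sha256()
    for etag in etags_sorted_by_path:
        h.update(etag.encode("utf-8"))
        h.update(b"\x00")  # separator, so ("ab", "c") != ("a", "bc")
    return h.hexdigest()
```

Because the hash object is updated row by row, the listing never has to be materialized in Python: a new listing over unchanged files yields the same etags and therefore the same checkpoint hash.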

Open questions:

  • file__path is not indexed (sorting key is sys__id), so ORDER BY file__path on very large listings could be expensive. May need an index or an order-independent aggregation approach.
  • Mid-chain read_storage via .union() has the same problem but can be deferred.
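One possible shape for the order-independent aggregation mentioned above (a sketch, not a committed design): hash each etag individually and combine the digests with modular addition, so the result is independent of row order and no ORDER BY over the unindexed file__path is needed.

```python
import hashlib

# 256-bit modulus so the accumulator stays the size of one SHA256 digest.
_MOD = 1 << 256


def order_independent_hash(etags) -> str:
    """Combine per-etag SHA256 digests by modular addition.

    The same multiset of etags gives the same result in any order,
    trading the sort for a commutative combine step.
    """
    acc = 0
    for etag in etags:
        digest = hashlib.sha256(etag.encode("utf-8")).digest()
        acc = (acc + int.from_bytes(digest, "big")) % _MOD
    return f"{acc:064x}"
```

The trade-off is weaker collision resistance than hashing a canonical ordering, which is part of why this is left as an open question.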
