Currently `read_storage()` uses the listing dataset name as the starting checkpoint hash. This has two problems:
- Bug: the hash is computed before the listing step runs, so it uses just the listing dataset name (not the versioned URI). With `update=True` and new files arriving, the hash stays the same and stale checkpoints are reused. (Extracted to "Fix `read_storage` hash computed before listing", resolves #1655.)
- Unnecessary invalidation: if we fix (1) by using the versioned URI, every new listing creates a new version, which changes the hash and invalidates all downstream checkpoints even when no files actually changed.
The fix is to compute a deterministic hash from the file etags of the filtered listing output: stream the etags from the listing table, sorted by `file__path`, and feed them into an incremental SHA-256, which keeps Python memory usage O(1).
This way:
- New files arrive → etags change → hash changes → checkpoints re-run
- Same files, new listing → same etags → same hash → checkpoints reused
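The streaming hash described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name `listing_etag_hash` and the assumption that rows arrive as `(file__path, etag)` tuples already sorted by path are hypothetical.

```python
import hashlib


def listing_etag_hash(rows):
    """Deterministic hash over a listing's (path, etag) pairs.

    `rows` is a hypothetical iterable of (file__path, etag) tuples
    streamed from the listing table, already sorted by file__path.
    Only the running SHA-256 state is held, so memory stays O(1)
    regardless of listing size.
    """
    h = hashlib.sha256()
    for path, etag in rows:
        # Length-prefix each field so that ("ab", "c") and ("a", "bc")
        # cannot collide into the same byte stream.
        for field in (path, etag):
            data = field.encode("utf-8")
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()
```

With this, re-listing the same files reproduces the same digest, while any changed etag (or added/removed file) produces a different one.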
Open questions:
- `file__path` is not indexed (the sorting key is `sys__id`), so `ORDER BY file__path` on very large listings could be expensive. May need an index or an order-independent aggregation approach.
- Mid-chain `read_storage` via `.union()` has the same problem but can be deferred.
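One order-independent aggregation, sketched below under stated assumptions, is to hash each row independently and combine the per-row digests with addition mod 2**256, so no `ORDER BY` is needed at all. The name `order_independent_hash` and the `(path, etag)` tuple interface are hypothetical, not part of the codebase.

```python
import hashlib

# 256-bit mask so the accumulator wraps like fixed-width addition.
MASK = (1 << 256) - 1


def order_independent_hash(rows):
    """Combine per-row SHA-256 digests with addition mod 2**256.

    Addition is commutative, so the result does not depend on the
    order rows are streamed in; the database can return them in any
    order (e.g. by sys__id) without an expensive sort.
    """
    acc = 0
    for path, etag in rows:
        # Length-prefix the path so field boundaries are unambiguous.
        row_bytes = f"{len(path)}:{path}|{etag}".encode("utf-8")
        digest = hashlib.sha256(row_bytes).digest()
        acc = (acc + int.from_bytes(digest, "big")) & MASK
    return format(acc, "064x")
```

The trade-off is weaker collision resistance than a single sequential SHA-256 (sums of digests are easier to attack than a chained hash), which is likely acceptable for checkpoint invalidation but worth noting.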