read_storage: content-based checkpoint hash from file etags #1639

@ilongin

Description

Currently read_storage() uses the listing dataset name as the starting checkpoint hash. This has two problems:

  1. Bug: the hash is computed before the listing step runs, so it uses only the listing dataset name (not the versioned URI). With update=True and new files arriving, the hash stays the same and stale checkpoints are reused. (Extracted to #1655: "Fix read_storage hash computed before listing resolves".)
  2. Unnecessary invalidation: if we fix (1) by using the versioned URI, every new listing creates a new version which changes the hash — invalidating all downstream checkpoints even when no files actually changed.

The fix is to compute a deterministic hash from the file etags of the filtered listing output: stream etags sorted by file__path from the listing table and feed them into an incremental SHA256, using O(1) Python memory.

This way:

  • New files arrive → etags change → hash changes → checkpoints re-run
  • Same files, new listing → same etags → same hash → checkpoints reused
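A minimal sketch of the proposed hash. The helper name `listing_checkpoint_hash` and the idea of a generator yielding etags are illustrative assumptions, not DataChain API; the point is that an incremental SHA256 over a sorted etag stream needs only constant Python memory:

```python
import hashlib
from typing import Iterable


def listing_checkpoint_hash(etags_sorted_by_path: Iterable[str]) -> str:
    """Deterministic content hash of a listing from its file etags.

    Caller is assumed to stream etags ORDER BY file__path, so the
    same set of files always produces the same hash.
    """
    h = hashlib.sha256()
    for etag in etags_sorted_by_path:
        h.update(etag.encode("utf-8"))
        h.update(b"\x00")  # separator, so ("ab", "c") != ("a", "bc")
    return h.hexdigest()
```

Because the hash object is updated row by row, the listing never has to be materialized in Python: a new listing over unchanged files yields the same etags and therefore the same checkpoint hash.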

Open questions:

  • file__path is not indexed (sorting key is sys__id), so ORDER BY file__path on very large listings could be expensive. May need an index or an order-independent aggregation approach.
  • Mid-chain read_storage via .union() has the same problem but can be deferred.
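One possible shape for the order-independent aggregation mentioned above (a sketch, not a committed design): hash each etag individually and combine the digests with modular addition, so the result is independent of row order and no ORDER BY over the unindexed file__path is needed.

```python
import hashlib

# 256-bit modulus so the accumulator stays the size of one SHA256 digest.
_MOD = 1 << 256


def order_independent_hash(etags) -> str:
    """Combine per-etag SHA256 digests by modular addition.

    The same multiset of etags gives the same result in any order,
    trading the sort for a commutative combine step.
    """
    acc = 0
    for etag in etags:
        digest = hashlib.sha256(etag.encode("utf-8")).digest()
        acc = (acc + int.from_bytes(digest, "big")) % _MOD
    return f"{acc:064x}"
```

The trade-off is weaker collision resistance than hashing a canonical ordering, which is part of why this is left as an open question.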
