Description
Currently read_dataset() uses sha256(dataset.name + version) as the starting checkpoint hash. This is unsafe — someone can delete a version and recreate it with different data, and the hash stays the same. It also doesn't survive checkpoint cleanup, so global checkpoint reuse is impossible.
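To make the failure mode concrete, here is a minimal sketch of the current URI-based scheme (a hypothetical reconstruction, not the actual implementation): because the hash is derived only from the name and version string, deleting a version and recreating it with different data produces the same hash, so stale checkpoints can still match.

```python
import hashlib

def uri_hash(name: str, version: str) -> str:
    # Current scheme (illustrative): hash of name + version only,
    # with no dependency on the data or the chain that produced it.
    return hashlib.sha256((name + version).encode()).hexdigest()

# Delete "my_dataset" v1.0.0, recreate it with completely different rows:
# the starting checkpoint hash is unchanged, so old checkpoints still match.
before = uri_hash("my_dataset", "1.0.0")
after = uri_hash("my_dataset", "1.0.0")  # recreated with different data
assert before == after  # unsafe collision
```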
Proposal: store the chain hash that produced the dataset in a new hash field on the dataset version record. When .save() creates a dataset version, populate this field with the computed chain hash. When read_dataset() is used in another chain, use dataset_version.hash as the starting hash instead of the URI-based one.
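The proposed flow could look roughly like this (a sketch with hypothetical names; the real record and function signatures will differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetVersion:
    name: str
    version: str
    hash: Optional[str] = None  # proposed field: chain hash that produced it

def save_version(name: str, version: str, chain_hash: str) -> DatasetVersion:
    # .save() records the computed chain hash on the new version.
    return DatasetVersion(name=name, version=version, hash=chain_hash)

def starting_hash_for_read(dv: DatasetVersion) -> Optional[str]:
    # read_dataset() uses the stored chain hash as the starting
    # checkpoint hash instead of deriving one from name + version.
    return dv.hash
```

With this, recreating a version with different data yields a different chain hash, so old checkpoints can no longer match by accident.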
This also means .save() no longer needs to create checkpoints. Instead of looking up the checkpoint model, .save() searches dataset versions for a matching hash — if found, the dataset already exists with this exact computation, skip the chain. The dataset version itself becomes the checkpoint for .save().
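The lookup that replaces checkpoint creation might be sketched as follows (hypothetical helper, assuming an indexed scan over dataset versions):

```python
from collections import namedtuple
from typing import Iterable, Optional

# Minimal stand-in for a dataset version record (illustrative only).
Version = namedtuple("Version", ["name", "version", "hash"])

def find_matching_version(
    versions: Iterable[Version], chain_hash: str
) -> Optional[Version]:
    # Instead of consulting the checkpoint model, .save() scans dataset
    # versions for one whose hash matches the current chain hash.
    for dv in versions:
        if dv.hash == chain_hash:
            return dv  # this exact computation already ran: skip the chain
    return None
```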
Checkpoints then become UDF-specific — only needed for UDF step skip/resume (including partial completion recovery). .save() uses dataset_version.hash.
This enables global checkpoints for .save() — if the same chain with the same inputs was run months ago in a different job, the dataset version still exists with the matching hash, so the chain is skipped. No dependency on checkpoint expiry.
Notes:
- Add an index on the new `hash` field for fast lookups
- If `dataset_version.hash` is `None` (dataset created externally, not through a chain), fall back to a safe default
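The issue does not specify what the safe default is; one possible choice (an assumption, shown only to illustrate the fallback path) is a unique per-run value, so an externally created dataset never falsely matches an old checkpoint and its chain is always recomputed:

```python
import hashlib
import uuid
from typing import Optional

def starting_hash(stored_hash: Optional[str]) -> str:
    if stored_hash is not None:
        return stored_hash
    # Hypothetical "safe default": a fresh hash per run, which can
    # never collide with a previously recorded chain hash, so the
    # chain for an externally created dataset always recomputes.
    return hashlib.sha256(uuid.uuid4().bytes).hexdigest()
```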