Currently each chain's checkpoint hash includes the hash of the last checkpoint in the current job. This means if a script has two independent chains:
dc.read_storage("s3://a/").filter(...).save("ds1")
dc.read_storage("s3://b/").map(...).save("ds2")
ds2's hash depends on ds1's checkpoint. Changing ds1's logic, reordering the chains, or wrapping one in an if statement invalidates ds2's checkpoint even though its own inputs and steps haven't changed.
This transient dependency was meant to handle cases like read_dataset("d1") depending on a previously saved d1, but that dependency is already captured naturally: read_dataset("d1") includes the dataset name/version in its QueryStep.hash().
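
To illustrate why the name/version is enough, here is a minimal sketch. `ReadDatasetStep` is a hypothetical stand-in, not the real QueryStep class, and the payload format is an assumption made for illustration only:

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadDatasetStep:
    """Hypothetical stand-in for a read_dataset step (not the library's class)."""

    name: str
    version: str

    def hash(self) -> str:
        # Assumed payload: the dataset name and version identify the input.
        payload = f"read_dataset:{self.name}@{self.version}"
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


d1_v1 = ReadDatasetStep("d1", "1.0.0").hash()
d1_v1_again = ReadDatasetStep("d1", "1.0.0").hash()
d1_v2 = ReadDatasetStep("d1", "1.0.1").hash()

# Re-saving d1 under a new version changes the reader's hash on its own,
# without any reference to the previous chain's checkpoint.
versions_differ = d1_v1 != d1_v2
same_version_stable = d1_v1 == d1_v1_again
```
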
Proposal: remove _last_checkpoint_hash from the hash calculation. Each chain's hash should depend solely on its own starting data and steps.
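
A rough sketch of the difference, under stated assumptions: `chain_hash` and `script_hashes` are hypothetical helpers, steps are modeled as plain strings, and the `chained` flag only imitates the effect of including _last_checkpoint_hash. It is not the actual hashing code.

```python
import hashlib


def chain_hash(source, steps, last_checkpoint_hash=None):
    """Hash a chain from its source and steps, optionally mixing in the
    previous chain's checkpoint hash (imitating the current behavior)."""
    h = hashlib.sha256()
    for part in [last_checkpoint_hash or "", source, *steps]:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent parts cannot run together
    return h.hexdigest()


def script_hashes(chains, chained):
    """Return {dataset name: hash} for chains executed in order."""
    hashes, last = {}, None
    for name, source, steps in chains:
        hashes[name] = chain_hash(source, steps, last if chained else None)
        last = hashes[name]
    return hashes


original = [
    ("ds1", "s3://a/", ["filter:v1"]),
    ("ds2", "s3://b/", ["map:v1"]),
]
# Same script, but only ds1's logic has changed.
edited = [
    ("ds1", "s3://a/", ["filter:v2"]),
    ("ds2", "s3://b/", ["map:v1"]),
]

# Current behavior: ds2's hash changes although ds2 itself did not.
current_invalidated = (
    script_hashes(original, chained=True)["ds2"]
    != script_hashes(edited, chained=True)["ds2"]
)
# Proposed behavior: ds2's hash depends only on its own source and steps.
proposed_stable = (
    script_hashes(original, chained=False)["ds2"]
    == script_hashes(edited, chained=False)["ds2"]
)
```

Reordering the two chains gives the same outcome in this sketch: with `chained=False` the order of execution never enters the hash.
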
Benefits:
- Easier to reason about: no surprising invalidations from unrelated chains
- Robust to code reordering, conditionals, and script restructuring
- Enables global checkpoints: if the same chain with the same inputs ran in another job months ago, the checkpoint can be reused
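
The global-checkpoint benefit could look roughly like this. `CheckpointStore` is a hypothetical in-memory stand-in for wherever checkpoints would be persisted, and steps are again modeled as strings:

```python
import hashlib


class CheckpointStore:
    """Hypothetical job-independent store keyed only by the chain's own hash."""

    def __init__(self):
        self._by_hash = {}
        self.computed = 0  # how many times a chain was actually executed

    def get_or_compute(self, source, steps, compute):
        # The key depends only on the chain's source and steps, not on the job
        # or on any other chain that ran before it.
        key = hashlib.sha256("\x00".join([source, *steps]).encode("utf-8")).hexdigest()
        if key not in self._by_hash:
            self._by_hash[key] = compute()
            self.computed += 1
        return self._by_hash[key]


store = CheckpointStore()
# Two separate "jobs" run the same chain with the same inputs.
job1_result = store.get_or_compute("s3://b/", ["map:v1"], lambda: "ds2 rows")
job2_result = store.get_or_compute("s3://b/", ["map:v1"], lambda: "ds2 rows")
```

In this sketch the second job finds the existing entry, so `store.computed` stays at 1.
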