checkpoints: remove transient dependency between chains #1641

@ilongin

Currently, each chain's checkpoint hash includes the hash of the last checkpoint in the current job. This means that if a script has two independent chains:

dc.read_storage("s3://a/").filter(...).save("ds1")
dc.read_storage("s3://b/").map(...).save("ds2")

ds2's hash depends on ds1's checkpoint. Changing ds1's logic, reordering the chains, or wrapping one in an if invalidates ds2's checkpoint even though its own inputs and steps haven't changed.

This transient dependency was meant to handle cases like read_dataset("d1") depending on a previously saved d1, but that dependency is already captured naturally: read_dataset("d1") includes the dataset name/version in its QueryStep.hash().

Proposal: remove _last_checkpoint_hash from the hash calculation. Each chain's hash should depend solely on its own starting data and steps.
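
To illustrate the difference, here is a minimal sketch that compares the two schemes. It is not DataChain's actual implementation: step_hash, current_scheme, and proposed_scheme are hypothetical names, SHA-256 is an assumed hash function, and plain strings stand in for the real per-step hashes.

```python
import hashlib


def step_hash(*parts: str) -> str:
    """Hash a sequence of strings (hypothetical stand-in for per-step hashes)."""
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()


def current_scheme(chains):
    """Assumed model of today's behavior: each chain's hash also folds in
    the hash of the previous checkpoint in the same job."""
    hashes = {}
    last_checkpoint_hash = ""
    for name, parts in chains:
        last_checkpoint_hash = step_hash(last_checkpoint_hash, *parts)
        hashes[name] = last_checkpoint_hash
    return hashes


def proposed_scheme(chains):
    """Proposed behavior: each chain's hash depends only on its own
    starting data and steps."""
    return {name: step_hash(*parts) for name, parts in chains}


# Two independent chains, plus an edited variant of the first one.
ds1 = ("ds1", ["s3://a/", "filter"])
ds1_edited = ("ds1", ["s3://a/", "filter-v2"])
ds2 = ("ds2", ["s3://b/", "map"])

# Does ds2's hash survive an edit to ds1, or a reordering of the chains?
current_stable = (
    current_scheme([ds1, ds2])["ds2"] == current_scheme([ds1_edited, ds2])["ds2"]
)
proposed_stable_on_edit = (
    proposed_scheme([ds1, ds2])["ds2"] == proposed_scheme([ds1_edited, ds2])["ds2"]
)
proposed_stable_on_reorder = (
    proposed_scheme([ds1, ds2])["ds2"] == proposed_scheme([ds2, ds1])["ds2"]
)

print("current scheme keeps ds2 stable when ds1 changes:", current_stable)
print("proposed scheme keeps ds2 stable when ds1 changes:", proposed_stable_on_edit)
print("proposed scheme keeps ds2 stable when reordered:", proposed_stable_on_reorder)
```

In this model, a chain's hash under the proposed scheme is a pure function of its own inputs, so the same chain produces the same hash in any job. That property is what would make reuse across jobs possible.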

Benefits:

  • Easier to reason about: no surprising invalidations from unrelated chains
  • Robust to code reordering, conditionals, and script restructuring
  • Enables global checkpoints: if the same chain with the same inputs ran in another job months ago, the checkpoint can be reused
