
read_dataset: store chain hash in dataset version #1642

@ilongin

Description


Currently read_dataset() uses sha256(dataset.name + version) as the starting checkpoint hash. This is unsafe: someone can delete a version and recreate it with different data, and the hash stays the same, so stale checkpoints are silently reused. The hash also doesn't survive checkpoint cleanup, which makes global checkpoint reuse impossible.
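To make the problem concrete, here is a minimal sketch of the current URI-based scheme (function name and the exact string fed to sha256 are illustrative, not DataChain's actual code):

```python
import hashlib

def uri_based_hash(name: str, version: str) -> str:
    # Simplified form of the current scheme: the starting checkpoint hash
    # depends only on the dataset name and version, never on the data.
    return hashlib.sha256(f"{name}@{version}".encode()).hexdigest()

# Delete version 1.0.0 and recreate it with completely different data:
# the starting hash is identical, so nothing detects the staleness.
h_original = uri_based_hash("my_dataset", "1.0.0")
h_recreated = uri_based_hash("my_dataset", "1.0.0")
assert h_original == h_recreated
```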

Proposal: store the chain hash that produced the dataset in a new hash field on the dataset version record. When .save() creates a dataset version, populate this field with the computed chain hash. When read_dataset() is used in another chain, use dataset_version.hash as the starting hash instead of the URI-based one.
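A rough sketch of the two sides of the proposal (the in-memory `dataset_versions` store, record layout, and function names are assumptions for illustration, not DataChain's schema):

```python
# .save() stamps the computed chain hash onto the dataset version record;
# read_dataset() then uses that stored hash as the starting checkpoint hash.

dataset_versions: dict[tuple[str, str], dict] = {}

def save(name: str, version: str, chain_hash: str) -> None:
    # Populate the new `hash` field when the version is created.
    dataset_versions[(name, version)] = {"hash": chain_hash}

def starting_hash(name: str, version: str) -> str:
    # read_dataset() keys off the stored chain hash, not the URI.
    return dataset_versions[(name, version)]["hash"]

save("my_dataset", "1.0.0", "abc123")
assert starting_hash("my_dataset", "1.0.0") == "abc123"
```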

This also means .save() no longer needs to create checkpoints. Instead of looking up the checkpoint model, .save() searches dataset versions for a matching hash — if found, the dataset already exists with this exact computation, skip the chain. The dataset version itself becomes the checkpoint for .save().
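The lookup .save() would do instead of consulting the checkpoint model could look roughly like this (a hedged sketch; `dataset_versions`, the record shape, and `run_chain` are stand-ins):

```python
def save_or_skip(dataset_versions: dict, name: str, chain_hash: str, run_chain):
    # If any version of this dataset was produced by the exact same
    # computation, the version itself is the checkpoint: skip the chain.
    for (ds_name, version), record in dataset_versions.items():
        if ds_name == name and record.get("hash") == chain_hash:
            return version
    # No match: run the chain and record the hash on the new version.
    version = run_chain()
    dataset_versions[(name, version)] = {"hash": chain_hash}
    return version
```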

Checkpoints then become UDF-specific — only needed for UDF step skip/resume (including partial completion recovery). .save() uses dataset_version.hash.

This enables global checkpoints for .save() — if the same chain with the same inputs was run months ago in a different job, the dataset version still exists with the matching hash, so the chain is skipped. No dependency on checkpoint expiry.

Notes:

  • Add an index on the new hash field for fast lookups
  • If dataset_version.hash is None (dataset created externally, not through a chain), fall back to a safe default
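One possible shape for the None fallback (an assumption, not decided behavior: mixing in a per-version identifier such as a creation UUID keeps a deleted-and-recreated version from colliding with stale checkpoints):

```python
import hashlib

def starting_hash(record: dict, name: str, version: str) -> str:
    # Preferred path: the chain hash stored on the dataset version.
    if record.get("hash") is not None:
        return record["hash"]
    # Fallback for hash=None (dataset created outside a chain): include a
    # per-version identifier so the hash changes if the version is ever
    # deleted and recreated. The `uuid` field here is hypothetical.
    return hashlib.sha256(
        f"{name}@{version}:{record['uuid']}".encode()
    ).hexdigest()
```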
