
read_dataset: store chain hash in dataset version #1642

@ilongin

Description


Currently read_dataset() uses sha256(dataset.name + version) as the starting checkpoint hash. This is unsafe: someone can delete a version and recreate it with different data, and the hash stays the same, so stale checkpoints are silently reused. The hash also doesn't survive checkpoint cleanup, which makes global checkpoint reuse impossible.
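To make the problem concrete, here is a minimal sketch of the current URI-based scheme (function name and the exact string fed to sha256 are illustrative, not DataChain's actual code):

```python
import hashlib

def uri_based_hash(name: str, version: str) -> str:
    # Simplified form of the current scheme: the starting checkpoint hash
    # depends only on the dataset name and version, never on the data.
    return hashlib.sha256(f"{name}@{version}".encode()).hexdigest()

# Delete version 1.0.0 and recreate it with completely different data:
# the starting hash is identical, so nothing detects the staleness.
h_original = uri_based_hash("my_dataset", "1.0.0")
h_recreated = uri_based_hash("my_dataset", "1.0.0")
assert h_original == h_recreated
```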

Proposal: store the chain hash that produced the dataset in a new hash field on the dataset version record. When .save() creates a dataset version, populate this field with the computed chain hash. When read_dataset() is used in another chain, use dataset_version.hash as the starting hash instead of the URI-based one.
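A rough sketch of the two sides of the proposal (the in-memory `dataset_versions` store, record layout, and function names are assumptions for illustration, not DataChain's schema):

```python
# .save() stamps the computed chain hash onto the dataset version record;
# read_dataset() then uses that stored hash as the starting checkpoint hash.

dataset_versions: dict[tuple[str, str], dict] = {}

def save(name: str, version: str, chain_hash: str) -> None:
    # Populate the new `hash` field when the version is created.
    dataset_versions[(name, version)] = {"hash": chain_hash}

def starting_hash(name: str, version: str) -> str:
    # read_dataset() keys off the stored chain hash, not the URI.
    return dataset_versions[(name, version)]["hash"]

save("my_dataset", "1.0.0", "abc123")
assert starting_hash("my_dataset", "1.0.0") == "abc123"
```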

This also means .save() no longer needs to create checkpoints. Instead of looking up the checkpoint model, .save() searches dataset versions for a matching hash — if found, the dataset already exists with this exact computation, skip the chain. The dataset version itself becomes the checkpoint for .save().
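The lookup .save() would do instead of consulting the checkpoint model could look roughly like this (a hedged sketch; `dataset_versions`, the record shape, and `run_chain` are stand-ins):

```python
def save_or_skip(dataset_versions: dict, name: str, chain_hash: str, run_chain):
    # If any version of this dataset was produced by the exact same
    # computation, the version itself is the checkpoint: skip the chain.
    for (ds_name, version), record in dataset_versions.items():
        if ds_name == name and record.get("hash") == chain_hash:
            return version
    # No match: run the chain and record the hash on the new version.
    version = run_chain()
    dataset_versions[(name, version)] = {"hash": chain_hash}
    return version
```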

Checkpoints then become UDF-specific — only needed for UDF step skip/resume (including partial completion recovery). .save() uses dataset_version.hash.

This enables global checkpoints for .save() — if the same chain with the same inputs was run months ago in a different job, the dataset version still exists with the matching hash, so the chain is skipped. No dependency on checkpoint expiry.

Notes:

  • Add an index on the new hash field for fast lookups
  • If dataset_version.hash is None (dataset created externally, not through a chain), fall back to a safe default
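One possible shape for the None fallback (an assumption, not decided behavior: mixing in a per-version identifier such as a creation UUID keeps a deleted-and-recreated version from colliding with stale checkpoints):

```python
import hashlib

def starting_hash(record: dict, name: str, version: str) -> str:
    # Preferred path: the chain hash stored on the dataset version.
    if record.get("hash") is not None:
        return record["hash"]
    # Fallback for hash=None (dataset created outside a chain): include a
    # per-version identifier so the hash changes if the version is ever
    # deleted and recreated. The `uuid` field here is hypothetical.
    return hashlib.sha256(
        f"{name}@{version}:{record['uuid']}".encode()
    ).hexdigest()
```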
