checkpoints: global reuse across jobs #1643

@ilongin

Description

Currently, UDF checkpoints are scoped to a job's run group: _find_udf_checkpoint searches only by rerun_from_job_id. Combined with checkpoint cleanup, this means previously computed results can't be reused after a checkpoint expires, even when the exact same computation has been done before.

With #1641 (removing transient dependencies) and #1642 (storing chain hash in dataset version), chain hashes become self-contained and stable across jobs. This enables global checkpoint reuse.

Two levels:

.save() — use dataset_version.hash (#1642)
No checkpoint needed. .save() searches dataset versions for a matching hash — if found, the dataset already exists, skip the chain. The dataset version itself is the checkpoint and never expires.
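A minimal sketch of the .save() lookup described above. The DatasetVersion model and find_version_by_hash helper are hypothetical stand-ins, not DataChain's real internals; the point is just that the stored chain hash (#1642) is enough to decide whether the chain can be skipped.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical in-memory model; field names are assumptions.
@dataclass
class DatasetVersion:
    dataset_name: str
    version: str
    hash: str  # self-contained chain hash stored per #1642


def find_version_by_hash(
    versions: list[DatasetVersion], chain_hash: str
) -> Optional[DatasetVersion]:
    """Search existing dataset versions for a matching chain hash.

    If a match is found, .save() can skip executing the chain: the
    dataset version itself acts as a checkpoint that never expires.
    """
    return next((v for v in versions if v.hash == chain_hash), None)


versions = [
    DatasetVersion("cats", "1.0.0", "abc123"),
    DatasetVersion("cats", "1.0.1", "def456"),
]
```

If no version matches, .save() runs the chain and records the new version's hash, so the next identical run skips.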

UDF steps — global checkpoint search
The UDF skip path (_skip_udf) already copies output tables from previous jobs. The change is to search globally for a matching hash_output across all jobs, not just rerun_from_job_id. If a matching checkpoint + output table exists from any previous job, reuse it.

The UDF continue/resume path (_continue_udf) stays job-scoped — it depends on a specific parent job's partial output table for crash recovery.
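The two search scopes above could be sketched as one lookup with an optional job filter; the Checkpoint model and the find_udf_checkpoint signature below are illustrative assumptions, not the actual code.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical checkpoint record; field names are assumptions.
@dataclass
class Checkpoint:
    job_id: str
    hash_output: str
    output_table: str


def find_udf_checkpoint(
    checkpoints: list[Checkpoint],
    hash_output: str,
    rerun_from_job_id: Optional[str] = None,
) -> Optional[Checkpoint]:
    """Job-scoped search when rerun_from_job_id is given (continue/resume
    needs a specific parent job's partial output table); global search
    across all previous jobs otherwise (skip path)."""
    for cp in checkpoints:
        if cp.hash_output != hash_output:
            continue
        if rerun_from_job_id is None or cp.job_id == rerun_from_job_id:
            return cp
    return None


checkpoints = [
    Checkpoint("job-1", "h1", "tbl_job1_h1"),
    Checkpoint("job-2", "h2", "tbl_job2_h2"),
]
```

With rerun_from_job_id left unset, a checkpoint from any previous job qualifies for the skip path; passing it restores today's job-scoped behavior for resume.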

This means:

  • .save() stops creating checkpoints entirely, uses dataset_version.hash instead
  • UDF skip becomes global — any previous job's output can be reused
  • UDF continue/resume stays per-job (partial output is job-specific)
  • Checkpoint cleanup no longer causes unnecessary re-computation for .save() (dataset versions persist)
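The reuse rules in the list above can be summarized as a small routing table. This is only a restatement of the proposal; the ReusePolicy enum and step names are invented for illustration.

```python
from enum import Enum


class ReusePolicy(Enum):
    DATASET_HASH = "dataset_hash"  # .save(): match on dataset_version.hash
    GLOBAL = "global"              # UDF skip: any previous job's output
    JOB_SCOPED = "job_scoped"      # UDF continue/resume: parent job only


def reuse_policy(step: str) -> ReusePolicy:
    """Hypothetical routing of each execution step to its reuse scope."""
    return {
        "save": ReusePolicy.DATASET_HASH,
        "udf_skip": ReusePolicy.GLOBAL,
        "udf_continue": ReusePolicy.JOB_SCOPED,
    }[step]
```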
