read_storage: non-deterministic checkpoint hash for single-file paths #1640

@ilongin

Description

When read_storage() or read_csv() is called with a single file path (e.g. read_csv("file.csv") or read_storage("s3://bucket/data.csv")), get_listing() detects that the path is a single file and returns list_ds_name=None. This skips the listing path entirely and falls through to read_values, which creates a temp dataset with a random UUID name. The starting checkpoint hash is derived from this random name, so it differs on every run — checkpoints can never be reused.
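A minimal sketch of why the hash is non-deterministic — checkpoint_hash here is a hypothetical stand-in for however the starting hash is actually derived from the dataset name; the point is only that hashing a random UUID name yields a different value per run:

```python
import hashlib
import uuid

def checkpoint_hash(dataset_name: str) -> str:
    # Hypothetical stand-in: the starting checkpoint hash is
    # derived from the dataset name.
    return hashlib.sha256(dataset_name.encode()).hexdigest()

# read_values path: the temp dataset gets a random UUID name,
# so the derived hash differs on every run and no previous
# checkpoint can ever match.
run1 = checkpoint_hash(f"tmp_{uuid.uuid4().hex}")
run2 = checkpoint_hash(f"tmp_{uuid.uuid4().hex}")
assert run1 != run2  # different every run
```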

Directory paths and globs work fine since they go through the listing path which produces a deterministic hash.

Proposed fix: derive the starting hash from the file URI + etag. A single HEAD request (S3/GCS/Azure) or stat call (local) fetches the etag before the hash is computed. This detects content changes: if the file was edited, the etag differs and checkpoints re-run; if the file is unchanged, the etag matches and checkpoints are reused.
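A rough sketch of the proposed behavior for the local case — single_file_start_hash is a hypothetical helper, and size + mtime from a stat call stands in for the etag (for s3://, gs://, etc. a single HEAD request would supply the real etag instead):

```python
import hashlib
import os

def single_file_start_hash(uri: str) -> str:
    # Local-file sketch: one stat call yields an etag-like value
    # (size + mtime); the starting checkpoint hash is then derived
    # from the file URI plus that value, so it is stable across
    # runs as long as the file content is unchanged.
    st = os.stat(uri)
    etag = f"{st.st_size}-{st.st_mtime_ns}"
    return hashlib.sha256(f"{uri}:{etag}".encode()).hexdigest()
```

Same file, same etag, same hash — so a rerun finds the existing checkpoint; editing the file changes the etag and forces the pipeline to re-run.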

This is consistent with the approach in #1639 where directory listings also use etags as the source of truth for "did the data change". Single file = one stat call, directory = streaming query over listing table.
