read_storage: non-deterministic checkpoint hash for single-file paths #1640

@ilongin

Description

When read_storage() or read_csv() is called with a single file path (e.g. read_csv("file.csv") or read_storage("s3://bucket/data.csv")), get_listing() detects that the path is a single file and returns list_ds_name=None. This skips the listing path entirely and falls through to read_values, which creates a temp dataset with a random UUID name. The starting checkpoint hash is derived from this random name, so it differs on every run — checkpoints can never be reused.
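A minimal sketch of why the hash is non-deterministic — checkpoint_hash here is a hypothetical stand-in for however the starting hash is actually derived from the dataset name; the point is only that hashing a random UUID name yields a different value per run:

```python
import hashlib
import uuid

def checkpoint_hash(dataset_name: str) -> str:
    # Hypothetical stand-in: the starting checkpoint hash is
    # derived from the dataset name.
    return hashlib.sha256(dataset_name.encode()).hexdigest()

# read_values path: the temp dataset gets a random UUID name,
# so the derived hash differs on every run and no previous
# checkpoint can ever match.
run1 = checkpoint_hash(f"tmp_{uuid.uuid4().hex}")
run2 = checkpoint_hash(f"tmp_{uuid.uuid4().hex}")
assert run1 != run2  # different every run
```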

Directory paths and globs work fine since they go through the listing path which produces a deterministic hash.

Proposed fix: derive the starting hash from the file URI + etag. A single HEAD request (S3/GCS/Azure) or stat call (local) fetches the etag before the hash is computed. This detects content changes: if the file was edited, the etag differs and checkpoints re-run; if the file is unchanged, the etag matches and checkpoints are reused.
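A rough sketch of the proposed behavior for the local case — single_file_start_hash is a hypothetical helper, and size + mtime from a stat call stands in for the etag (for s3://, gs://, etc. a single HEAD request would supply the real etag instead):

```python
import hashlib
import os

def single_file_start_hash(uri: str) -> str:
    # Local-file sketch: one stat call yields an etag-like value
    # (size + mtime); the starting checkpoint hash is then derived
    # from the file URI plus that value, so it is stable across
    # runs as long as the file content is unchanged.
    st = os.stat(uri)
    etag = f"{st.st_size}-{st.st_mtime_ns}"
    return hashlib.sha256(f"{uri}:{etag}".encode()).hexdigest()
```

Same file, same etag, same hash — so a rerun finds the existing checkpoint; editing the file changes the etag and forces the pipeline to re-run.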

This is consistent with the approach in #1639 where directory listings also use etags as the source of truth for "did the data change". Single file = one stat call, directory = streaming query over listing table.
