Description
When read_storage() or read_csv() is called with a single file path (e.g. read_csv("file.csv") or read_storage("s3://bucket/data.csv")), get_listing() detects that it is a single file and returns list_ds_name=None. This skips the listing path entirely and falls through to read_values, which creates a temp dataset with a random UUID name. The starting checkpoint hash is derived from this random name, so it is different on every run and checkpoints can never be reused.
Directory paths and globs work fine, since they go through the listing path, which produces a deterministic hash.
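The core of the problem can be sketched in a few lines. This is an illustration, not the actual datachain code; the `checkpoint_hash` helper and the dataset-name formats are hypothetical stand-ins for how the starting hash ends up depending on the dataset name:

```python
import hashlib
import uuid

def checkpoint_hash(ds_name: str) -> str:
    # Hypothetical stand-in: the starting checkpoint hash is derived
    # from the dataset name the read resolves to.
    return hashlib.sha256(ds_name.encode()).hexdigest()

# Listing path (directory/glob): deterministic name -> same hash every run.
listing_name = "lst__s3://bucket/data/"
assert checkpoint_hash(listing_name) == checkpoint_hash(listing_name)

# read_values fallback (single file): random UUID name -> new hash every run,
# so a checkpoint from a previous run can never match.
run1 = checkpoint_hash(f"tmp_{uuid.uuid4().hex}")
run2 = checkpoint_hash(f"tmp_{uuid.uuid4().hex}")
assert run1 != run2
```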
Proposed fix: use the file URI plus its etag as the starting hash. A single HEAD request (S3/GCS/Azure) or stat call (local) fetches the etag before the hash is computed. This detects content changes: if the file was edited, the etag differs and checkpoints re-run; if the file is unchanged, the etag is the same and checkpoints are reused.
This is consistent with the approach in #1639, where directory listings also use etags as the source of truth for "did the data change?". Single file = one stat call; directory = streaming query over the listing table.