Replace percentile heuristic with MIN(flowtime) for backfill chunk selection #222

Merged
dmarulli merged 3 commits into main from dmarulli/crescent-backfill-unstick on May 8, 2026

Conversation


@dmarulli dmarulli commented May 8, 2026

Summary

SnowflakeStorageSink.calculate_end_of_backfill_range used a percentile-threshold heuristic to identify the leading edge of already-backfilled data: it returned MIN(date) over days whose read count exceeds 50% of the 75th percentile of daily counts. That heuristic gets stuck when an org's leading edge has below-threshold per-day volume, typically the legitimate ramp-up period at the start of vendor data.
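
For reference, the old selection reduces to roughly this (an illustrative sketch using the table/column names this PR already references, not the exact production SQL):

```sql
-- Illustrative reconstruction of the old heuristic, not the production query.
WITH daily AS (
    SELECT flowtime::DATE AS day, COUNT(*) AS reads
    FROM READINGS
    WHERE org_id = 'cadc_crescent'
    GROUP BY 1
),
cutoff AS (
    SELECT 0.5 * PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY reads) AS min_reads
    FROM daily
)
-- "Leading edge" = earliest day that clears the volume threshold. A legitimate
-- low-volume ramp-up at the start of the data never clears it, so this MIN
-- never advances past the ramp-up period.
SELECT MIN(day)
FROM daily, cutoff
WHERE daily.reads > cutoff.min_reads;
```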

cadc_crescent's backfill DAG has been silently re-processing the same (2024-11-29, 2024-12-29) chunk every run for ~2 weeks. Xylem has data back to 2023-01-01 (the configured backfill min_date), but our MIN(flowtime) hasn't moved because the chunker keeps targeting data we already have.

Change

Replace the percentile heuristic with raw MIN(flowtime) over the configured range. After each successful chunk, MIN(flowtime) shifts earlier, so the next run picks an older chunk automatically. No state tracking, no per-org branches.
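
The replacement is essentially a single aggregate (again a sketch; the date bound is the configured backfill min_date mentioned above):

```sql
-- Illustrative sketch of the replacement: the leading edge is simply the
-- earliest timestamp already loaded within the configured backfill window.
SELECT MIN(flowtime) AS end_of_backfill_range
FROM READINGS
WHERE org_id = 'cadc_crescent'
  AND flowtime >= '2023-01-01';  -- configured backfill min_date
```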

What was considered and rejected

  • Per-org Crescent override following the cadc_thousand_oaks pattern: works but adds another per-org hack on top of an existing one.
  • Explicit backfill_progress state table: more invasive, schema change, not needed for the immediate fix.
  • Keeping the heuristic with a smarter threshold: complex, doesn't solve the underlying conflation of "ramp-up" with "partial load."

Trade-off

The new approach gives up one narrow case the old heuristic handled: a transient ETL failure that loaded a partial day at the leading edge would self-heal on retry under the old logic. With raw MIN, a partial leading-edge day persists, and recovery requires a manual DAG run over that range. This case is narrow in practice (most partial loads land mid-range, where neither approach helps), and the data-gap quality check's notify_on_failure() is False anyway, so the old behavior was self-healing but without operator visibility.

Affected orgs

Only cadc_crescent has an active backfill DAG (verified via configuration_backfills). Spot-checked the READINGS shape per org; the only stray-shape candidates were cadc_thousand_oaks (handled by its own branch, untouched here) and cadc_moulton_niguel (no backfill DAG). No regression risk for any other org.

Follow-up

The cadc_thousand_oaks branch + orphan backfills table row will be cleaned up in a separate PR.

Test plan

  • Existing unit tests pass (239/239)
  • Deploy via `./deploy.sh cadc`
  • Watch next scheduled Crescent backfill run (every 2h, `0 */2 * * *`)
  • Confirm `Extracting data for range` log line shows older chunk (e.g. `2024-10-30 → 2024-11-29`) instead of `2024-11-29 → 2024-12-29`
  • Re-query Snowflake for `MIN(flowtime) WHERE org_id = 'cadc_crescent'` after a few runs; it should march backward toward 2023-01-01 (see the query sketch below)
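
For that last check, something along these lines (illustrative):

```sql
-- Verification (illustrative): run after successive scheduled backfills; the
-- value should drop by roughly one 30-day chunk per run toward 2023-01-01.
SELECT MIN(flowtime) AS leading_edge
FROM READINGS
WHERE org_id = 'cadc_crescent';
```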

Replace percentile heuristic with MIN(flowtime) for backfill chunk selection

The heuristic-based chunk selector got stuck on orgs whose leading edge of
backfilled data had below-threshold per-day read volume (legitimate ramp-up
at the start of vendor data). cadc_crescent's backfill stopped advancing
at 2024-11-29 even though Xylem has data back to 2023-01-01.

Replace with raw MIN(flowtime) -- each successful chunk pushes MIN earlier,
so the next chunk advances naturally. No state tracking, no per-org
heuristics. The cadc_thousand_oaks branch is left untouched (separate PR).
@dmarulli dmarulli merged commit e91d898 into main May 8, 2026
2 checks passed

christophertull commented May 13, 2026

@dmarulli this should be fine if it fixes the immediate need. I want to say the more complex min() logic was added because we encountered a situation with Valley County where the source data legitimately had a full ~2-day gap, with no data at all in the source. So in that case, grabbing the min datetime would get stuck in a way analogous to what happened here: the same day with zero reads would get fetched again and again.

