Replace percentile heuristic with MIN(flowtime) for backfill chunk selection #222

Merged
dmarulli merged 3 commits into main from dmarulli/crescent-backfill-unstick on May 8, 2026

Conversation


@dmarulli dmarulli commented May 8, 2026

Summary

SnowflakeStorageSink.calculate_end_of_backfill_range used a percentile-threshold heuristic to identify the leading edge of already-backfilled data: it returned MIN(date) over days whose read count exceeds 50% of the 75th percentile of daily counts. That heuristic gets stuck when an org's leading edge has below-threshold per-day volume, typically the legitimate ramp-up period at the start of vendor data.
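
For reference, the old selection reduces to roughly this (an illustrative sketch using the table/column names this PR already references, not the exact production SQL):

```sql
-- Illustrative reconstruction of the old heuristic, not the production query.
WITH daily AS (
    SELECT flowtime::DATE AS day, COUNT(*) AS reads
    FROM READINGS
    WHERE org_id = 'cadc_crescent'
    GROUP BY 1
),
cutoff AS (
    SELECT 0.5 * PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY reads) AS min_reads
    FROM daily
)
-- "Leading edge" = earliest day that clears the volume threshold. A legitimate
-- low-volume ramp-up at the start of the data never clears it, so this MIN
-- never advances past the ramp-up period.
SELECT MIN(day)
FROM daily, cutoff
WHERE daily.reads > cutoff.min_reads;
```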

cadc_crescent's backfill DAG has been silently re-processing the same (2024-11-29, 2024-12-29) chunk every run for ~2 weeks. Xylem has data back to 2023-01-01 (the configured backfill min_date), but our MIN(flowtime) hasn't moved because the chunker keeps targeting data we already have.

Change

Replace the percentile heuristic with raw MIN(flowtime) over the configured range. After each successful chunk, MIN(flowtime) shifts earlier, so the next run picks an older chunk automatically. No state tracking, no per-org branches.
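
The replacement is essentially a single aggregate (again a sketch; the date bound is the configured backfill min_date mentioned above):

```sql
-- Illustrative sketch of the replacement: the leading edge is simply the
-- earliest timestamp already loaded within the configured backfill window.
SELECT MIN(flowtime) AS end_of_backfill_range
FROM READINGS
WHERE org_id = 'cadc_crescent'
  AND flowtime >= '2023-01-01';  -- configured backfill min_date
```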

What was considered and rejected

  • Per-org Crescent override following the cadc_thousand_oaks pattern: works but adds another per-org hack on top of an existing one.
  • Explicit backfill_progress state table: more invasive, schema change, not needed for the immediate fix.
  • Keeping the heuristic with a smarter threshold: complex, doesn't solve the underlying conflation of "ramp-up" with "partial load."

Trade-off

The new approach gives up one narrow case the old heuristic handled: a transient ETL failure that loaded a partial day at the leading edge would self-heal on retry under the old logic. With raw MIN, a partial leading-edge day persists, and recovery requires a manual DAG run over that range. This case is narrow in practice (most partial loads land mid-range, where neither approach helps), and the data-gap quality check's notify_on_failure() is False anyway, so the old behavior was self-healing but without operator visibility.

Affected orgs

Only cadc_crescent has an active backfill DAG (verified via configuration_backfills). Spot-checked the READINGS shape per org; the only stray-shape candidates were cadc_thousand_oaks (handled by its own branch, untouched here) and cadc_moulton_niguel (no backfill DAG). No regression risk for any other org.

Follow-up

The cadc_thousand_oaks branch + orphan backfills table row will be cleaned up in a separate PR.

Test plan

  • Existing unit tests pass (239/239)
  • Deploy via `./deploy.sh cadc`
  • Watch next scheduled Crescent backfill run (every 2h, `0 */2 * * *`)
  • Confirm `Extracting data for range` log line shows older chunk (e.g. `2024-10-30 → 2024-11-29`) instead of `2024-11-29 → 2024-12-29`
  • Re-query Snowflake for `MIN(flowtime) WHERE org_id = 'cadc_crescent'` after a few runs; it should march backward toward 2023-01-01 (see the query sketch below)
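
For that last check, something along these lines (illustrative):

```sql
-- Verification (illustrative): run after successive scheduled backfills; the
-- value should drop by roughly one 30-day chunk per run toward 2023-01-01.
SELECT MIN(flowtime) AS leading_edge
FROM READINGS
WHERE org_id = 'cadc_crescent';
```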

Replace percentile heuristic with MIN(flowtime) for backfill chunk selection

The heuristic-based chunk selector got stuck on orgs whose leading edge of
backfilled data had below-threshold per-day read volume (legitimate ramp-up
at the start of vendor data). cadc_crescent's backfill stopped advancing
at 2024-11-29 even though Xylem has data back to 2023-01-01.

Replace with raw MIN(flowtime) -- each successful chunk pushes MIN earlier,
so the next chunk advances naturally. No state tracking, no per-org
heuristics. The cadc_thousand_oaks branch is left untouched (separate PR).
@dmarulli dmarulli merged commit e91d898 into main May 8, 2026
2 checks passed

christophertull commented May 13, 2026

@dmarulli this should be fine if it fixes the immediate need. I want to say the more complex min() logic was added because we encountered a situation with Valley County where the source data legitimately had a full ~2-day gap, with no data at all in the source. So in that case, grabbing the min datetime would get stuck in a way analogous to what happened here: the same day with zero reads would get fetched again and again.

