Replace percentile heuristic with MIN(flowtime) for backfill chunk selection#222
Merged
Conversation
The heuristic-based chunk selector got stuck on orgs whose leading edge of backfilled data had below-threshold per-day read volume (legitimate ramp-up at the start of vendor data). `cadc_crescent`'s backfill stopped advancing at 2024-11-29 even though Xylem has data back to 2023-01-01. Replace it with raw `MIN(flowtime)`: each successful chunk pushes MIN earlier, so the next chunk advances naturally. No state tracking, no per-org heuristics. The `cadc_thousand_oaks` branch is left untouched (separate PR).
Member
@dmarulli this should be fine if it fixes the immediate need. I want to say the more complex …
## Summary
`SnowflakeStorageSink.calculate_end_of_backfill_range` used a percentile-threshold heuristic to identify the leading edge of already-backfilled data: it returns `MIN(date)` filtered to days whose read count exceeds 50% of the 75th percentile of daily counts. This gets stuck when an org's leading edge has below-threshold per-day volume, typically the legitimate ramp-up period at the start of vendor data. `cadc_crescent`'s backfill DAG has been silently re-processing the same `(2024-11-29, 2024-12-29)` chunk every run for ~2 weeks. Xylem has data back to 2023-01-01 (the configured backfill `min_date`), but our `MIN(flowtime)` hasn't moved because the chunker keeps targeting data we already have.

## Change
Replace the percentile heuristic with raw `MIN(flowtime)` over the configured range. After each successful chunk, `MIN(flowtime)` shifts earlier, so the next run picks an older chunk automatically. No state tracking, no per-org branches.

## What was considered and rejected
- `cadc_thousand_oaks` pattern: works, but adds another per-org hack on top of an existing one.
- `backfill_progress` state table: more invasive, requires a schema change, and not needed for the immediate fix.

## Trade-off
The new approach gives up one narrow case the old heuristic handled: a transient ETL failure that loaded a partial day at the leading edge would self-heal on retry under the old logic, because the below-threshold day was excluded from the frontier and re-fetched. With raw MIN, a partial leading-edge day persists; recovery requires a manual DAG run over that range. This case is narrow in practice (most partial loads land mid-range, where neither approach helps), and the data-gap quality check's `notify_on_failure()` is `False` anyway, so the old behavior was self-healing without operator visibility.

## Affected orgs
Only `cadc_crescent` has an active backfill DAG (verified via `configuration_backfills`). Spot-checked `READINGS` shape per org: the only stray-shape candidates were `cadc_thousand_oaks` (handled by its own branch, untouched here) and `cadc_moulton_niguel` (no backfill DAG). No regression risk for any other org.

## Follow-up
The `cadc_thousand_oaks` branch and the orphan `backfills` table row will be cleaned up in a separate PR.

## Test plan