Add recent_days parameter to crawler API for backfills #75

Merged
prasjaiswal merged 1 commit into main from feat/crawler-recent-days-param
Feb 18, 2026

Conversation

@prasjaiswal
Collaborator

Summary

  • Adds optional recent_days parameter to the POST /api/sites/{site}/schema-files endpoint
  • Currently only affects aajtak.in, where the crawler keeps only recent files so that a refresh does not reprocess 1,500+ historical files every time
  • Pass "recent_days": 1501 for a one-time full historical backfill; omit it for normal scheduled crawling (default 2 days)
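
To illustrate the behaviour described in the summary, here is a minimal sketch of a recent-days filter with an optional override. It is not the code from this PR: `SchemaFile`, `filter_recent`, `DEFAULT_RECENT_DAYS` and the `last_modified` field are hypothetical names, and comparing against a last-modified timestamp is an assumption about how "recent" is decided.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

DEFAULT_RECENT_DAYS = 2  # default stated in the PR summary


@dataclass(frozen=True)
class SchemaFile:
    # Hypothetical record; the crawler's real types are not shown in this PR.
    url: str
    last_modified: datetime


def filter_recent(
    files: List[SchemaFile],
    recent_days: Optional[int] = None,
    now: Optional[datetime] = None,
) -> List[SchemaFile]:
    """Keep files modified within the last `recent_days` days.

    When `recent_days` is None the default window applies, which mirrors
    omitting the parameter in the API request.
    """
    days = DEFAULT_RECENT_DAYS if recent_days is None else recent_days
    if days < 1:
        raise ValueError("recent_days must be a positive integer")
    current = now if now is not None else datetime.now(timezone.utc)
    cutoff = current - timedelta(days=days)
    return [f for f in files if f.last_modified >= cutoff]
```

With this shape, a scheduled crawl calls `filter_recent(files)` and a backfill calls `filter_recent(files, recent_days=1501)`; the caller never needs a separate "backfill mode" flag.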

Usage

# One-time full backfill
curl -X POST ".../api/sites/aajtak.in/schema-files" \
  -d '{"schema_map_url": "...", "recent_days": 1501}'

# Normal scheduled crawl (uses default 2 days)
curl -X POST ".../api/sites/aajtak.in/schema-files" \
  -d '{"schema_map_url": "..."}'
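
For callers that script these requests, the two bodies above can be built by one helper that leaves `recent_days` out unless a backfill is requested. This is a sketch under assumptions: `build_schema_files_payload` is a hypothetical client-side helper, and the positive-integer check is a guess at sensible validation, not something the PR specifies.

```python
import json
from typing import Any, Dict, Optional


def build_schema_files_payload(
    schema_map_url: str, recent_days: Optional[int] = None
) -> Dict[str, Any]:
    """Build the JSON body for POST /api/sites/{site}/schema-files.

    `recent_days` is omitted entirely when None so the server-side
    default (2 days, per the PR summary) applies.
    """
    payload: Dict[str, Any] = {"schema_map_url": schema_map_url}
    if recent_days is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(recent_days, bool) or not isinstance(recent_days, int) or recent_days < 1:
            raise ValueError("recent_days must be a positive integer")
        payload["recent_days"] = recent_days
    return payload


# Body for a one-time full backfill, serialized for use with an HTTP client.
backfill_body = json.dumps(build_schema_files_payload("...", recent_days=1501))
# Body for a normal scheduled crawl: no recent_days key at all.
scheduled_body = json.dumps(build_schema_files_payload("..."))
```

Omitting the key, instead of sending a null or the default value, keeps scheduled jobs unaffected if the server default changes later.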

Test plan

  • make check passes (ruff, pyright 0 errors, 23 tests pass)
  • Tested locally: recent_days=1501 queued 1,501 jobs; default queued 3

🤖 Generated with Claude Code

Allows overriding the aajtak.in recent-days filter via the schema-files
API endpoint. Use a large value (e.g. 1501) for one-time full historical
backfills; scheduled jobs keep using the default 2 days.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prasjaiswal prasjaiswal merged commit f73af29 into main Feb 18, 2026
5 checks passed