Is your feature request related to a problem? Please describe.
We are running R2R with Hatchet. Under high load from a client, we see repeated "THE TIME TO START THE STEP RUN IS TOO LONG, THE MAIN THREAD MAY BE BLOCKED" messages.
- The Hatchet scheduler is getting swamped by too many CPU-bound parse steps at once;
- When the system tips over, R2R cancels/retries en masse (“on_failure” churn), which makes the queue grow faster and stalls everything (including embeddings/KG).
Describe the solution you'd like
It would be useful to have an API call that reports the health of the ingestion queue. A client could query it before uploading further documents and back off when the queue is congested.
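To illustrate how a client might use such an endpoint, here is a minimal sketch. The payload shape (`waiting_steps`) and the threshold are assumptions for illustration, not an existing R2R API:

```python
import time


def should_upload(health: dict, max_waiting: int = 80) -> bool:
    # Decide whether to submit another document based on a
    # hypothetical queue-health payload (field name is an assumption).
    return health.get("waiting_steps", 0) < max_waiting


def upload_with_backoff(get_health, upload, doc, max_retries: int = 5):
    # Poll the (hypothetical) health endpoint and back off
    # exponentially while the queue is congested.
    delay = 2.0
    for _ in range(max_retries):
        if should_upload(get_health()):
            return upload(doc)
        time.sleep(delay)
        delay = min(delay * 2, 60.0)
    raise RuntimeError("queue stayed congested; giving up")
```

With this in place, a batch client naturally throttles itself instead of piling more parse steps onto an already saturated scheduler.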
Describe alternatives you've considered
- Circuit breaker inside R2R: a 60-second poller that reads the "Waiting Steps" count from Hatchet; if it exceeds N (e.g., 80), flip a process-wide flag that makes the ingest endpoints return 429 with a Retry-After header. This could be done in roughly 30 lines of FastAPI middleware.
- Observability: Prometheus counters for step queue length and step durations; alert when the waiting-step count stays above 60 for 2 minutes.
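The circuit-breaker alternative above could be sketched as a small framework-agnostic state holder that a poller updates and the middleware consults. Thresholds and hysteresis values here are illustrative assumptions, not R2R or Hatchet defaults:

```python
import threading


class QueueCircuitBreaker:
    # Trips when the observed waiting-step count exceeds `trip_at`,
    # and resets once it falls back below `reset_at` (hysteresis so
    # the flag does not flap around a single threshold).

    def __init__(self, trip_at: int = 80, reset_at: int = 40):
        self.trip_at = trip_at
        self.reset_at = reset_at
        self._open = False
        self._lock = threading.Lock()

    def observe(self, waiting_steps: int) -> None:
        # Called every ~60 s by the poller that reads the count from Hatchet.
        with self._lock:
            if waiting_steps > self.trip_at:
                self._open = True
            elif waiting_steps < self.reset_at:
                self._open = False

    @property
    def is_open(self) -> bool:
        # Middleware consults this; when True, ingest endpoints
        # should short-circuit with 429 and a Retry-After header.
        with self._lock:
            return self._open
```

A FastAPI/ASGI middleware would then check `breaker.is_open` on ingest routes only, returning 429 immediately so mass on_failure retries never reach the scheduler.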