Is your feature request related to a problem? Please describe.
We are running R2R with Hatchet. Under high load from a client, we see repeated "THE TIME TO START THE STEP RUN IS TOO LONG, THE MAIN THREAD MAY BE BLOCKED" messages.
- The Hatchet scheduler is getting swamped by too many CPU-bound parse steps at once;
- When the system tips over, R2R cancels/retries en masse (“on_failure” churn), which makes the queue grow faster and stalls everything (including embeddings/KG).
Describe the solution you'd like
It would be useful to have an API call that reports the health of the ingestion queue. A client could query it before uploading further documents and back off when the queue is congested.
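To illustrate how a client might use such an endpoint, here is a minimal sketch. The payload shape (`waiting_steps`) and the threshold are assumptions for illustration, not an existing R2R API:

```python
import time


def should_upload(health: dict, max_waiting: int = 80) -> bool:
    # Decide whether to submit another document based on a
    # hypothetical queue-health payload (field name is an assumption).
    return health.get("waiting_steps", 0) < max_waiting


def upload_with_backoff(get_health, upload, doc, max_retries: int = 5):
    # Poll the (hypothetical) health endpoint and back off
    # exponentially while the queue is congested.
    delay = 2.0
    for _ in range(max_retries):
        if should_upload(get_health()):
            return upload(doc)
        time.sleep(delay)
        delay = min(delay * 2, 60.0)
    raise RuntimeError("queue stayed congested; giving up")
```

With this in place, a batch client naturally throttles itself instead of piling more parse steps onto an already saturated scheduler.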
Describe alternatives you've considered
- Circuit breaker inside R2R: a 60-second poller that reads the "Waiting Steps" count from Hatchet; if it exceeds N (e.g., 80), flip a process-wide flag that makes the ingest endpoints return 429 with a Retry-After header. This could be done in roughly 30 lines of FastAPI middleware.
- Observability: Prometheus counters for step queue length and step durations; alert when the waiting-step count stays above 60 for 2 minutes.
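The circuit-breaker alternative above could be sketched as a small framework-agnostic state holder that a poller updates and the middleware consults. Thresholds and hysteresis values here are illustrative assumptions, not R2R or Hatchet defaults:

```python
import threading


class QueueCircuitBreaker:
    # Trips when the observed waiting-step count exceeds `trip_at`,
    # and resets once it falls back below `reset_at` (hysteresis so
    # the flag does not flap around a single threshold).

    def __init__(self, trip_at: int = 80, reset_at: int = 40):
        self.trip_at = trip_at
        self.reset_at = reset_at
        self._open = False
        self._lock = threading.Lock()

    def observe(self, waiting_steps: int) -> None:
        # Called every ~60 s by the poller that reads the count from Hatchet.
        with self._lock:
            if waiting_steps > self.trip_at:
                self._open = True
            elif waiting_steps < self.reset_at:
                self._open = False

    @property
    def is_open(self) -> bool:
        # Middleware consults this; when True, ingest endpoints
        # should short-circuit with 429 and a Retry-After header.
        with self._lock:
            return self._open
```

A FastAPI/ASGI middleware would then check `breaker.is_open` on ingest routes only, returning 429 immediately so mass on_failure retries never reach the scheduler.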