chore(iorails): Increase work queue concurrency and depth #1674
Merged
tgasser-nv merged 3 commits into develop on Mar 3, 2026
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Greptile Summary: Increased AsyncWorkQueue concurrency from 10 to 256 and queue depth from 100 to 256 to improve IORails performance under high load, with supporting benchmark configuration files.
| Filename | Overview |
|---|---|
| nemoguardrails/guardrails/guardrails.py | Increased MAX_QUEUE_SIZE from 100 to 256 and MAX_CONCURRENCY from 10 to 256 for improved throughput under high load |
| benchmark/aiperf/configs/sweep_concurrency_benchmark.yaml | New benchmark configuration file for sweeping concurrency values [1-256] to test Guardrails performance |
| benchmark/Procfile | Added MAIN_MODEL_ENGINE and MAIN_MODEL_BASE_URL environment variables to configure local mock LLM for benchmarking |
| benchmark/mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env | Changed UNSAFE_PROBABILITY from 0.03 to 0.0 (100% safe) to ensure all requests go through full pipeline for benchmarking |
Last reviewed commit: aa56e77
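The AsyncWorkQueue internals are not shown in this PR, but the effect of the two constants is straightforward: the queue depth bounds how many requests can wait, and the concurrency bounds how many run at once. A minimal sketch, assuming an asyncio-style design (class shape and method names are hypothetical, only the constant names and values come from the PR):

```python
import asyncio

# New tuning values from this PR (nemoguardrails/guardrails/guardrails.py)
MAX_QUEUE_SIZE = 256   # was 100
MAX_CONCURRENCY = 256  # was 10


class AsyncWorkQueue:
    """Hypothetical sketch: a bounded queue drained by a pool of workers."""

    def __init__(self, max_queue_size=MAX_QUEUE_SIZE, max_concurrency=MAX_CONCURRENCY):
        self._queue = asyncio.Queue(maxsize=max_queue_size)
        self._max_concurrency = max_concurrency
        self._workers = []

    async def submit(self, coro_fn, *args):
        """Enqueue a coroutine call; blocks when the queue is full."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((coro_fn, args, fut))
        return fut

    async def _worker(self):
        while True:
            coro_fn, args, fut = await self._queue.get()
            try:
                fut.set_result(await coro_fn(*args))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                self._queue.task_done()

    def start(self):
        """Spawn one worker task per unit of allowed concurrency."""
        self._workers = [
            asyncio.create_task(self._worker())
            for _ in range(self._max_concurrency)
        ]


async def main():
    q = AsyncWorkQueue()
    q.start()

    async def double(x):
        return 2 * x

    futs = [await q.submit(double, i) for i in range(5)]
    return [await f for f in futs]


print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```

With the old values, the eleventh in-flight request had to wait for a worker and the 101st enqueue blocked; raising both limits to 256 defers those stalls to much higher load.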
Tagging @Pouyanpi, @cparisien, @trebedea for review.
cparisien approved these changes on Mar 3, 2026.
Pouyanpi pushed a commit that referenced this pull request on Mar 4, 2026.
Description
This PR tunes the AsyncWorkQueue depth and max concurrency to give good out-of-the-box improvements in internal Guardrails latency. The results were generated using the tooling under benchmark/; the steps to reproduce and the results are copied below.
The UNSAFE_PROBABILITY of the content-safety Mock LLM was changed to 0% (i.e. 100% safe), so every request passes through two content-safety checks (input and output) plus the generation LLM. The benchmarking isolates the internal Guardrails latency by using Mock LLMs for both the content-safety and application LLMs. The content-safety Mock LLM has a fixed latency of 500ms, and the application LLM is fixed at 4s. Because every request is classified as safe, the lower bound on end-to-end latency is 500ms (input content-safety rail) + 4000ms (app LLM end-to-end latency) + 500ms (output content-safety rail) = 5s.
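The lower-bound arithmetic above can be written out explicitly; the constants are the mock latencies stated in the description:

```python
# Lower-bound end-to-end latency under the mock setup described above:
INPUT_RAIL_MS = 500    # input content-safety Mock LLM (fixed latency)
APP_LLM_MS = 4000      # application Mock LLM (fixed latency)
OUTPUT_RAIL_MS = 500   # output content-safety Mock LLM (fixed latency)

# Every request is classified safe, so all three stages always run in series.
lower_bound_ms = INPUT_RAIL_MS + APP_LLM_MS + OUTPUT_RAIL_MS
print(lower_bound_ms)  # 5000
```

Any measured end-to-end latency above 5000ms is therefore overhead added by Guardrails itself.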
Mock-LLM only benchmarking
The table below shows end-to-end latency percentiles in milliseconds for only the Application Mock LLM itself (which has a configured latency of 4000ms). The latency values already have 4000ms subtracted, so they represent the incremental latency on top of the configured latency. At a concurrency of 32, p50 and p99 latency are 12.33ms and 27.03ms respectively.
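The "incremental latency" metric in the table is obtained by subtracting the configured mock latency from each end-to-end sample before taking percentiles. A minimal sketch of that computation (the helper name and sample values are illustrative, not the PR's measurements):

```python
import statistics

CONFIGURED_MS = 4000.0  # Application Mock LLM's configured latency


def incremental_percentiles(e2e_latencies_ms):
    """Subtract the configured mock latency, then report p50/p99 of the overhead."""
    overhead = [t - CONFIGURED_MS for t in e2e_latencies_ms]
    # quantiles(n=100) yields cut points p1..p99; index 49 is p50, index 98 is p99.
    qs = statistics.quantiles(overhead, n=100, method="inclusive")
    return {"p50": qs[49], "p99": qs[98]}


# Illustrative end-to-end samples (not the PR's measurements):
samples = [4000 + x for x in (10, 12, 12, 13, 14, 15, 20, 27)]
print(incremental_percentiles(samples))
```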
Baseline LLMRails benchmarking
The table below has the end-to-end latency results relative to the lower bound of 5000ms (two content-safety rails at 500ms each, plus the application LLM at 4000ms). The incremental latency at concurrency 32 is 1,063ms at p50 and 1,257ms at p99. This shows that LLMRails internally adds a second or more of latency to half of the requests at concurrency 32.
IORails benchmarking
The table below shows end-to-end results for the new IORails engine introduced in v0.21. IORails doesn't support streaming in this release, so Time-To-First-Token and Inter-Token-Latency aren't measured. IORails is optimized purely for input and output rails, without support for dialog, retrieval, or execution rails. The benchmarking setup is identical to the LLMRails benchmark, except calls are routed to IORails using the NEMO_GUARDRAILS_IORAILS_ENGINE=1 environment variable.
This table shows that at a concurrency of 32, IORails reduces internal Guardrails p50 latency from 1,063ms to 38ms (a 1,025ms or 28x reduction) and p99 from 1,257ms to 78ms (a 1,179ms or 16x reduction).
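The reduction figures quoted above follow directly from the two tables; the arithmetic can be checked in a few lines (latency values taken from the description):

```python
# Internal Guardrails latency at concurrency 32, from the tables above (ms):
llmrails = {"p50": 1063, "p99": 1257}
iorails = {"p50": 38, "p99": 78}

for pct in ("p50", "p99"):
    reduction_ms = llmrails[pct] - iorails[pct]
    speedup = llmrails[pct] / iorails[pct]
    print(f"{pct}: -{reduction_ms} ms ({speedup:.0f}x)")
# p50: -1025 ms (28x)
# p99: -1179 ms (16x)
```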
Steps to reproduce
Guardrails and Mock LLM terminal
To run with IORails, add NEMO_GUARDRAILS_IORAILS_ENGINE=1 to the Guardrails invocation in line 4 of the Procfile. This changes the Procfile line from:

```
gr: MAIN_MODEL_ENGINE=nim MAIN_MODEL_BASE_URL=http://localhost:8000 poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000
```

to:

```
gr: NEMO_GUARDRAILS_IORAILS_ENGINE=1 MAIN_MODEL_ENGINE=nim MAIN_MODEL_BASE_URL=http://localhost:8000 poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000
```

Commands:
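The engine switch is a plain environment variable, so it can be toggled per process without touching the config. A hypothetical sketch of how such a toggle might be read (the actual selection logic in nemoguardrails is not shown in this PR; only the variable name comes from it):

```python
import os

def iorails_enabled() -> bool:
    """Hypothetical helper: treat NEMO_GUARDRAILS_IORAILS_ENGINE=1 as opt-in."""
    return os.environ.get("NEMO_GUARDRAILS_IORAILS_ENGINE", "0") == "1"

# Simulate the Procfile setting the variable for the server process:
os.environ["NEMO_GUARDRAILS_IORAILS_ENGINE"] = "1"
print(iorails_enabled())  # True
```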
AIPerf terminal
Commands:
Checklist