chore(iorails): Increase work queue concurrency and depth #1674
Merged
tgasser-nv merged 3 commits into develop on Mar 3, 2026
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Greptile Summary: Increased AsyncWorkQueue concurrency from 10 to 256 and queue depth from 100 to 256 to improve IORails performance under high load, with supporting benchmark configuration files.
| Filename | Overview |
|---|---|
| nemoguardrails/guardrails/guardrails.py | Increased MAX_QUEUE_SIZE from 100 to 256 and MAX_CONCURRENCY from 10 to 256 for improved throughput under high load |
| benchmark/aiperf/configs/sweep_concurrency_benchmark.yaml | New benchmark configuration file for sweeping concurrency values [1-256] to test Guardrails performance |
| benchmark/Procfile | Added MAIN_MODEL_ENGINE and MAIN_MODEL_BASE_URL environment variables to configure local mock LLM for benchmarking |
| benchmark/mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env | Changed UNSAFE_PROBABILITY from 0.03 to 0.0 (100% safe) to ensure all requests go through full pipeline for benchmarking |
Last reviewed commit: aa56e77
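The AsyncWorkQueue internals are not shown in this PR, but the effect of the two constants is straightforward: the queue depth bounds how many requests can wait, and the concurrency bounds how many run at once. A minimal sketch, assuming an asyncio-style design (class shape and method names are hypothetical, only the constant names and values come from the PR):

```python
import asyncio

# New tuning values from this PR (nemoguardrails/guardrails/guardrails.py)
MAX_QUEUE_SIZE = 256   # was 100
MAX_CONCURRENCY = 256  # was 10


class AsyncWorkQueue:
    """Hypothetical sketch: a bounded queue drained by a pool of workers."""

    def __init__(self, max_queue_size=MAX_QUEUE_SIZE, max_concurrency=MAX_CONCURRENCY):
        self._queue = asyncio.Queue(maxsize=max_queue_size)
        self._max_concurrency = max_concurrency
        self._workers = []

    async def submit(self, coro_fn, *args):
        """Enqueue a coroutine call; blocks when the queue is full."""
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((coro_fn, args, fut))
        return fut

    async def _worker(self):
        while True:
            coro_fn, args, fut = await self._queue.get()
            try:
                fut.set_result(await coro_fn(*args))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                self._queue.task_done()

    def start(self):
        """Spawn one worker task per unit of allowed concurrency."""
        self._workers = [
            asyncio.create_task(self._worker())
            for _ in range(self._max_concurrency)
        ]


async def main():
    q = AsyncWorkQueue()
    q.start()

    async def double(x):
        return 2 * x

    futs = [await q.submit(double, i) for i in range(5)]
    return [await f for f in futs]


print(asyncio.run(main()))  # [0, 2, 4, 6, 8]
```

With the old values, the eleventh in-flight request had to wait for a worker and the 101st enqueue blocked; raising both limits to 256 defers those stalls to much higher load.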
Tagging @Pouyanpi, @cparisien, @trebedea for review.
cparisien approved these changes on Mar 3, 2026.
Pouyanpi pushed a commit that referenced this pull request on Mar 4, 2026.
Description
This PR tunes the AsyncWorkQueue depth and max concurrency to give good out-of-the-box improvements in internal Guardrails latency. The results were generated using the tooling under benchmark/; the steps to reproduce and the results are copied below.
The UNSAFE_PROBABILITY of the content-safety Mock LLM was changed to 0% (i.e. 100% safe), so every request passes through two content-safety checks (input and output) plus the generation LLM. The benchmarking isolates the internal Guardrails latency by using Mock LLMs for both the content-safety and application LLMs. The content-safety Mock LLM has a fixed latency of 500ms, and the application LLM is fixed at 4s. Because every request is classified as safe, the lower bound on end-to-end latency is 500ms (input content-safety rail) + 4000ms (app LLM end-to-end latency) + 500ms (output content-safety rail) = 5s.
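The lower-bound arithmetic above can be written out explicitly; the constants are the mock latencies stated in the description:

```python
# Lower-bound end-to-end latency under the mock setup described above:
INPUT_RAIL_MS = 500    # input content-safety Mock LLM (fixed latency)
APP_LLM_MS = 4000      # application Mock LLM (fixed latency)
OUTPUT_RAIL_MS = 500   # output content-safety Mock LLM (fixed latency)

# Every request is classified safe, so all three stages always run in series.
lower_bound_ms = INPUT_RAIL_MS + APP_LLM_MS + OUTPUT_RAIL_MS
print(lower_bound_ms)  # 5000
```

Any measured end-to-end latency above 5000ms is therefore overhead added by Guardrails itself.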
Mock-LLM only benchmarking
The table below shows end-to-end latency percentiles in milliseconds for only the Application Mock LLM itself (which has a configured latency of 4000ms). The latency values already have 4000ms subtracted, so they represent the incremental latency on top of the configured latency. At a concurrency of 32, p50 and p99 latency are 12.33ms and 27.03ms respectively.
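The "incremental latency" metric in the table is obtained by subtracting the configured mock latency from each end-to-end sample before taking percentiles. A minimal sketch of that computation (the helper name and sample values are illustrative, not the PR's measurements):

```python
import statistics

CONFIGURED_MS = 4000.0  # Application Mock LLM's configured latency


def incremental_percentiles(e2e_latencies_ms):
    """Subtract the configured mock latency, then report p50/p99 of the overhead."""
    overhead = [t - CONFIGURED_MS for t in e2e_latencies_ms]
    # quantiles(n=100) yields cut points p1..p99; index 49 is p50, index 98 is p99.
    qs = statistics.quantiles(overhead, n=100, method="inclusive")
    return {"p50": qs[49], "p99": qs[98]}


# Illustrative end-to-end samples (not the PR's measurements):
samples = [4000 + x for x in (10, 12, 12, 13, 14, 15, 20, 27)]
print(incremental_percentiles(samples))
```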
Baseline LLMRails benchmarking
The table below has the end-to-end latency results relative to the lower bound of 5000ms (two content-safety rails at 500ms each, plus the application LLM at 4000ms). The incremental latency at concurrency 32 is 1,063ms at p50 and 1,257ms at p99. This shows that LLMRails internally adds a second or more of latency to half of the requests at concurrency 32.
IORails benchmarking
The table below shows end-to-end results for the new IORails engine introduced in v0.21. IORails doesn't support streaming in this release, so Time-To-First-Token and Inter-Token-Latency aren't measured. IORails is optimized purely for input and output rails, without support for dialog, retrieval, or execution rails. The benchmarking setup is identical to the LLMRails benchmark, except calls are routed to IORails using the NEMO_GUARDRAILS_IORAILS_ENGINE=1 environment variable.
This table shows that at a concurrency of 32, IORails reduces internal Guardrails p50 latency from 1,063ms to 38ms (a 1,025ms or 28x reduction) and p99 from 1,257ms to 78ms (a 1,179ms or 16x reduction).
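The reduction figures quoted above follow directly from the two tables; the arithmetic can be checked in a few lines (latency values taken from the description):

```python
# Internal Guardrails latency at concurrency 32, from the tables above (ms):
llmrails = {"p50": 1063, "p99": 1257}
iorails = {"p50": 38, "p99": 78}

for pct in ("p50", "p99"):
    reduction_ms = llmrails[pct] - iorails[pct]
    speedup = llmrails[pct] / iorails[pct]
    print(f"{pct}: -{reduction_ms} ms ({speedup:.0f}x)")
# p50: -1025 ms (28x)
# p99: -1179 ms (16x)
```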
Steps to reproduce
Guardrails and Mock LLM terminal
To run with IORails, add NEMO_GUARDRAILS_IORAILS_ENGINE=1 to the Guardrails invocation in line 4 of the Procfile. This changes the Procfile line from:

```
gr: MAIN_MODEL_ENGINE=nim MAIN_MODEL_BASE_URL=http://localhost:8000 poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000
```

to:

```
gr: NEMO_GUARDRAILS_IORAILS_ENGINE=1 MAIN_MODEL_ENGINE=nim MAIN_MODEL_BASE_URL=http://localhost:8000 poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000
```

Commands:
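The engine switch is a plain environment variable, so it can be toggled per process without touching the config. A hypothetical sketch of how such a toggle might be read (the actual selection logic in nemoguardrails is not shown in this PR; only the variable name comes from it):

```python
import os

def iorails_enabled() -> bool:
    """Hypothetical helper: treat NEMO_GUARDRAILS_IORAILS_ENGINE=1 as opt-in."""
    return os.environ.get("NEMO_GUARDRAILS_IORAILS_ENGINE", "0") == "1"

# Simulate the Procfile setting the variable for the server process:
os.environ["NEMO_GUARDRAILS_IORAILS_ENGINE"] = "1"
print(iorails_enabled())  # True
```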
AIPerf terminal
Commands:
Checklist