
chore(iorails): Increase work queue concurrency and depth #1674

Merged
tgasser-nv merged 3 commits into develop from chore/async-work-queue-config on Mar 3, 2026

Conversation

@tgasser-nv (Collaborator) commented Feb 27, 2026

Description

This PR tunes the AsyncWorkQueue depth and maximum concurrency to give good out-of-the-box internal Guardrails latency. The results were generated using the tooling under benchmark/ and the steps-to-reproduce below, and are copied in the tables that follow. The UNSAFE_PROBABILITY of the content-safety Mock LLM was changed to 0% (i.e. 100% safe), so every request passes through both content-safety checks (input and output) as well as the generation LLM call.

The benchmarking isolates the internal Guardrails latency by using Mock LLMs for both the content-safety and application LLMs. The content-safety Mock LLM has a fixed latency of 500 ms, and the application Mock LLM is fixed at 4 s. Since every request is classified as safe, the lower bound on end-to-end latency is 500 ms (input content-safety rail) + 4000 ms (app LLM end-to-end latency) + 500 ms (output content-safety rail) = 5 s.
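The lower bound works out as a simple serial sum (a trivial check, with the latencies taken from the setup above):

```python
# Serial path for a safe request: input rail -> app LLM -> output rail.
INPUT_RAIL_MS = 500    # content-safety Mock LLM (input check)
APP_LLM_MS = 4000      # application Mock LLM
OUTPUT_RAIL_MS = 500   # content-safety Mock LLM (output check)

lower_bound_ms = INPUT_RAIL_MS + APP_LLM_MS + OUTPUT_RAIL_MS  # 5000 ms
```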

Mock-LLM only benchmarking

The table below shows end-to-end latency percentiles in milliseconds for only the Application Mock LLM itself (which has a configured latency of 4000 ms). The latency values already have 4000 ms subtracted, so this is the incremental latency on top of the configured latency. At a concurrency of 32, p50 and p99 latency are 12.33 ms and 27.03 ms respectively.

[image: Mock-LLM-only latency percentile table]
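A hedged sketch of how such incremental percentiles can be derived (the function name and sample data below are invented for illustration, not the PR's data):

```python
import statistics

# Subtract the Mock LLM's configured 4000 ms so the percentiles reflect
# only the incremental latency that Guardrails itself adds.
CONFIGURED_LATENCY_MS = 4000.0

def incremental_percentiles(samples_ms, configured_ms=CONFIGURED_LATENCY_MS):
    """Return (p50, p99) of latency on top of the configured mock latency."""
    incremental = sorted(s - configured_ms for s in samples_ms)
    cut_points = statistics.quantiles(incremental, n=100)  # p1 .. p99
    return cut_points[49], cut_points[98]

# Example with synthetic end-to-end samples of 4001..4100 ms:
p50, p99 = incremental_percentiles([4000.0 + i for i in range(1, 101)])
```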

Baseline LLMRails benchmarking

The table below shows end-to-end latency results relative to the lower bound of 5000 ms (two content-safety rails at 500 ms each, plus the application LLM at 4000 ms). The incremental latency at concurrency 32 is 1063 ms at p50 and 1257 ms at p99. In other words, at concurrency 32 LLMRails internally adds more than a second of latency to half of the requests.

[image: baseline LLMRails latency percentile table]

IORails benchmarking

The table below shows end-to-end results for the new IORails engine introduced in v0.21. IORails doesn't support streaming in this release, so Time-To-First-Token and Inter-Token-Latency aren't measured. IORails is optimized purely for input and output rails, without support for dialog, retrieval, or execution rails. The benchmarking setup is identical to the LLMRails benchmark, except calls are routed to IORails using the NEMO_GUARDRAILS_IORAILS_ENGINE=1 environment variable.

This table shows that at a concurrency of 32, IORails reduces internal Guardrails p50 latency from 1063 ms to 38 ms (a 1025 ms, or 28x, reduction) and p99 latency from 1257 ms to 78 ms (a 1179 ms, or 16x, reduction).

[image: IORails latency percentile table]
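The quoted reduction figures can be re-derived directly from the two tables' p50/p99 values:

```python
# p50/p99 values from the LLMRails and IORails tables above (ms).
llmrails_p50, iorails_p50 = 1063, 38
llmrails_p99, iorails_p99 = 1257, 78

p50_saving_ms = llmrails_p50 - iorails_p50  # 1025 ms absolute reduction
p50_speedup = llmrails_p50 / iorails_p50    # ~28x
p99_saving_ms = llmrails_p99 - iorails_p99  # 1179 ms absolute reduction
p99_speedup = llmrails_p99 / iorails_p99    # ~16x
```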

Steps to reproduce

Guardrails and Mock LLM terminal

To run with IORails, add NEMO_GUARDRAILS_IORAILS_ENGINE=1 to the Guardrails invocation in line 4. This changes the Procfile line from:

gr: MAIN_MODEL_ENGINE=nim MAIN_MODEL_BASE_URL=http://localhost:8000 poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000

to:

gr: NEMO_GUARDRAILS_IORAILS_ENGINE=1 MAIN_MODEL_ENGINE=nim MAIN_MODEL_BASE_URL=http://localhost:8000 poetry run nemoguardrails server --config ../examples/configs/content_safety_local --default-config-id content_safety_local --port 9000

Commands:

$ poetry install --with dev -E "server nvidia openai"
$ poetry run pip install honcho
$ cd benchmark
$ poetry run honcho start
...
14:44:21 app_llm.1 | INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
14:44:21 cs_llm.1  | INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
14:44:31 gr.1      | INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)

AIPerf terminal

Commands:

# Create the virtual environment if it doesn't exist
$ python -m venv benchmark_env
$ source benchmark_env/bin/activate
(benchmark_env) $ pip install -r benchmark/requirements.txt
(benchmark_env) $ python -m benchmark.aiperf --config-file benchmark/aiperf/configs/sweep_concurrency_benchmark.yaml

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.
  • I've added tests if applicable.
  • @mentions of the person or team responsible for reviewing proposed changes.

@codecov

codecov bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@tgasser-nv tgasser-nv self-assigned this Feb 27, 2026
@tgasser-nv tgasser-nv marked this pull request as ready for review February 27, 2026 20:59
@greptile-apps
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

Increased AsyncWorkQueue concurrency from 10 to 256 and queue depth from 100 to 256 to improve IORails performance under high load, with supporting benchmark configuration files.

  • Core change: MAX_CONCURRENCY and MAX_QUEUE_SIZE constants in guardrails.py tuned based on extensive benchmarking showing 28x p50 latency reduction (1063ms → 38ms) at concurrency 32
  • Benchmark infrastructure: Added new AIPerf sweep configuration and updated Procfile/mock LLM configs to reproduce the benchmarking results
  • Safe increase: AsyncWorkQueue uses lightweight asyncio tasks suitable for I/O-bound LLM calls, with proper error handling and backoff mechanisms in place
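The concurrency cap described above can be sketched with a plain asyncio semaphore. This is a minimal illustration of the mechanism only; the actual AsyncWorkQueue in nemoguardrails/guardrails/guardrails.py may be structured differently.

```python
import asyncio

# Minimal sketch of capping in-flight async work with a semaphore, the
# mechanism a bounded work queue relies on for I/O-bound LLM calls.
MAX_QUEUE_SIZE = 256   # queue depth (raised from 100 in this PR)
MAX_CONCURRENCY = 256  # in-flight tasks (raised from 10 in this PR)

async def run_all(items):
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def handle(item):
        async with semaphore:       # at most MAX_CONCURRENCY run at once
            await asyncio.sleep(0)  # stand-in for an I/O-bound LLM call
            return item * 2

    return await asyncio.gather(*(handle(i) for i in items))

results = asyncio.run(run_all(range(8)))
```

Because the work is I/O-bound, the raised cap mostly adds cheap suspended coroutines rather than CPU load, which is why a 25.6x increase is reasonable here.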

Confidence Score: 5/5

  • Safe to merge - well-tested performance optimization with comprehensive benchmarking
  • All changes are focused on performance tuning backed by thorough benchmarking data. The 25.6x concurrency increase is justified by the significant latency improvements shown in the PR description. AsyncWorkQueue uses lightweight asyncio tasks appropriate for I/O-bound work, with proper error handling. Supporting benchmark files are well-structured and follow existing patterns.
  • No files require special attention

Important Files Changed

  • nemoguardrails/guardrails/guardrails.py — Increased MAX_QUEUE_SIZE from 100 to 256 and MAX_CONCURRENCY from 10 to 256 for improved throughput under high load
  • benchmark/aiperf/configs/sweep_concurrency_benchmark.yaml — New benchmark configuration file for sweeping concurrency values [1-256] to test Guardrails performance
  • benchmark/Procfile — Added MAIN_MODEL_ENGINE and MAIN_MODEL_BASE_URL environment variables to configure the local mock LLM for benchmarking
  • benchmark/mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env — Changed UNSAFE_PROBABILITY from 0.03 to 0.0 (100% safe) to ensure all requests go through the full pipeline for benchmarking

Last reviewed commit: aa56e77

@greptile-apps greptile-apps bot (Contributor) left a comment

4 files reviewed, no comments

@tgasser-nv
Collaborator Author

Tagging @Pouyanpi , @cparisien , @trebedea for review

@tgasser-nv tgasser-nv merged commit 451ed0d into develop Mar 3, 2026
43 checks passed
@tgasser-nv tgasser-nv deleted the chore/async-work-queue-config branch March 3, 2026 22:15