
docs: refresh LLM security scan assets#1686

Open
jmartin-tech wants to merge 1 commit into NVIDIA-NeMo:develop from jmartin-tech:task/update-seceval-docs

Conversation

@jmartin-tech

Description

Update the documentation examples for security-scanning evaluation of a system integrated with guardrails.

The updates here use the example abc bot backed by meta/llama-3.3-70b-instruct under different levels of configured guardrails, tested with garak v0.14.0 using the following configuration:

run:
  generations: 5

plugins:
  probe_spec: ansiescape.AnsiRaw,atkgen,continuation.ContinueSlursReclaimedSlurs,dan.Ablation_Dan_11_0,dan.AutoDANCached,dan.DanInTheWild,encoding,goodside.Tag,goodside.ThreatenJSON,goodside.WhoIsRiley,grandma,latentinjection,leakreplay.GuardianCloze,leakreplay.GuardianComplete,leakreplay.LiteratureCloze,leakreplay.LiteratureComplete,leakreplay.NYTCloze,leakreplay.NYTComplete,leakreplay.PotterCloze,leakreplay.PotterComplete,lmrc.Bullying,lmrc.Deadnaming,lmrc.QuackMedicine,lmrc.SexualContent,lmrc.Sexualisation,lmrc.SlurUsage,malwaregen,misleading.FalseAssertion,packagehallucination,promptinject,realtoxicityprompts.RTPBlank,divergence.Repeat,snowball.GraphConnectivity,suffix.GCGCached,tap.TAPCached,topic,web_injection
  extended_detectors: true
  probes:
    encoding:
      payloads:
        - default
        - xss
        - slur_terms

system:
  lite: false
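For reference, a scan like the one described could be launched by saving the configuration above to a file and passing it to garak on the command line. The invocation below is a sketch, not part of this PR: the file names `abc_scan.yaml` and `rest_options.json`, and the choice of the `rest` generator as the way to reach the guardrailed abc bot, are all assumptions.

```shell
# Hypothetical invocation (file names and generator choice are assumptions):
# saves the YAML config above as abc_scan.yaml and points garak's rest
# generator at the guardrailed abc bot via a JSON options file.
python -m garak \
  --config abc_scan.yaml \
  --model_type rest \
  --generator_option_file rest_options.json
```

Each guardrail configuration (bare LLM, general instructions, dialogue rails, moderation rails) would be scanned in a separate run against its own endpoint, producing one `.report.html` per configuration.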

Notable changes that have occurred in the garak project:

  • many probe family names changed
  • reported family-level statistics now use the lower quartile instead of the mean for aggregated calculations

The updated Markdown table and PNG isolate findings to the probe families that show variation across the different configurations.

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1686

@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR refreshes the LLM vulnerability scanning documentation by replacing the previous results (based on gpt-3.5-turbo-instruct and an older garak version) with new scan results using meta/llama-3.3-70b-instruct and garak v0.14.0. The updated assets include four new HTML reports, a new results PNG, and an updated Markdown table with revised probe family names (reflecting garak's renamed modules) and lower-quartile aggregated statistics instead of mean values.

Key changes:

  • Model reference updated from gpt-3.5-turbo-instruct to meta/llama-3.3-70b-instruct throughout the doc.
  • Probe family table overhauled: 12 old module rows replaced with 16 new ones matching garak v0.14.0 naming conventions (e.g., ansiescape, atkgen, divergence, grandma, latentinjection, promptinject, suffix, tap, topic, web_injection added; continuation, knownbadsignatures, realpublicityprompts, snowball, xss removed).
  • The closing narrative text (lines 75–77) contains factual inaccuracies relative to the new data. Specifically, the text claims general instructions improve protection and dialogue rails provide "good protection," but the table shows counterexamples: module topic protection drops from 45% to 13% with general instructions, module divergence drops from 40% to 24%, and module tap remains at 0–11% protection even with full guardrails.

Confidence Score: 4/5

  • Purely documentation and static asset update; safe to merge but with one narrative accuracy issue to address in the Markdown file.
  • The PR updates documentation assets (HTML reports, PNG, Markdown) with new scan results from garak v0.14.0 using a different model. All six files are documentation or static assets with no production code. The HTML reports and PNG are data-only replacements (confidence 5/5 each). However, the Markdown file's closing narrative (lines 75–77) contains factual inaccuracies: it claims general instructions improve protection and dialogue rails provide "good protection," but the updated table shows counterexamples like module topic (protection drops 45% → 13%), module divergence (40% → 24%), and module tap (0–11% across all configs). This is a legitimate issue that should be fixed but does not block merging.
  • docs/evaluation/llm-vulnerability-scanning.md — lines 75–77 contain claims about general instructions and dialogue rails that are contradicted by counterexamples in the newly updated table.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ABC Bot\nmeta/llama-3.3-70b-instruct] --> B[bare_llm\nNo protection]
    A --> C[with_gi\nGeneral Instructions]
    A --> D[with_gi_dr\nGeneral Instructions\n+ Dialogue Rails]
    A --> E[with_gi_dr_mo\nGeneral Instructions\n+ Dialogue Rails\n+ Moderation Rails]

    B --> F[Garak v0.14.0\n16 probe families]
    C --> F
    D --> F
    E --> F

    F --> G[HTML Reports\n4x .report.html]
    F --> H[Results Table\nllm-vulnerability-scanning.md]
    F --> I[PNG Chart\nabc-llm-vulnerability-scan-results.png]

Last reviewed commit: 3e74ce3

Comment on lines +75 to 77
Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table), the LLM app is safer only when using a mix of dialogue and moderation rails.
It is worth noticing that even using only dialogue rails results in good protection.

Narrative inconsistent with table data for several modules

The closing paragraph claims "using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table)" and "even using only dialogue rails results in good protection," but the updated table contains several notable counterexamples:

  • module topic: with_gi decreases protection from 45% (bare_llm) to 13%, directly contradicting the claim about general instructions.
  • module divergence: with_gi drops from 40% (bare_llm) to 24%.
  • module tap: with_gi_dr (dialogue rails) is 0% (same as bare_llm) and even the full with_gi_dr_mo configuration only achieves 11% — far from "good protection."

These are legitimate empirical findings from the updated data. A more nuanced description that acknowledges these outlier modules would improve accuracy:

Suggested change

Original:

Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table), the LLM app is safer only when using a mix of dialogue and moderation rails.
It is worth noticing that even using only dialogue rails results in good protection.

Suggested:

Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate for many vulnerability categories, the results are mixed — some modules (e.g., `topic`, `divergence`) show lower protection with general instructions than without.
The LLM app is generally safer when using a mix of dialogue and moderation rails, though certain probe families (e.g., `tap`, `topic`) remain difficult to fully mitigate even with full guardrails.
It is worth noticing that for most categories, even using only dialogue rails results in substantially improved protection.

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


