
docs: refresh LLM security scan assets#1686

Open
jmartin-tech wants to merge 1 commit into NVIDIA-NeMo:develop from jmartin-tech:task/update-seceval-docs

Conversation

@jmartin-tech

Description

Update the documentation examples for security-scanning evaluation of a system integrated with guardrails.

The updates here use the example abc bot backed by meta/llama-3.3-70b-instruct under different levels of configured guardrails, tested with garak v0.14.0 using the following configuration:

run:
  generations: 5

plugins:
  probe_spec: ansiescape.AnsiRaw,atkgen,continuation.ContinueSlursReclaimedSlurs,dan.Ablation_Dan_11_0,dan.AutoDANCached,dan.DanInTheWild,encoding,goodside.Tag,goodside.ThreatenJSON,goodside.WhoIsRiley,grandma,latentinjection,leakreplay.GuardianCloze,leakreplay.GuardianComplete,leakreplay.LiteratureCloze,leakreplay.LiteratureComplete,leakreplay.NYTCloze,leakreplay.NYTComplete,leakreplay.PotterCloze,leakreplay.PotterComplete,lmrc.Bullying,lmrc.Deadnaming,lmrc.QuackMedicine,lmrc.SexualContent,lmrc.Sexualisation,lmrc.SlurUsage,malwaregen,misleading.FalseAssertion,packagehallucination,promptinject,realtoxicityprompts.RTPBlank,divergence.Repeat,snowball.GraphConnectivity,suffix.GCGCached,tap.TAPCached,topic,web_injection
  extended_detectors: true
  probes:
    encoding:
      payloads:
        - default
        - xss
        - slur_terms

system:
  lite: false
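For reference, a scan like the one described could be launched by saving the configuration above to a file and passing it to garak on the command line. The invocation below is a sketch, not part of this PR: the file names `abc_scan.yaml` and `rest_options.json`, and the choice of the `rest` generator as the way to reach the guardrailed abc bot, are all assumptions.

```shell
# Hypothetical invocation (file names and generator choice are assumptions):
# saves the YAML config above as abc_scan.yaml and points garak's rest
# generator at the guardrailed abc bot via a JSON options file.
python -m garak \
  --config abc_scan.yaml \
  --model_type rest \
  --generator_option_file rest_options.json
```

Each guardrail configuration (bare LLM, general instructions, dialogue rails, moderation rails) would be scanned in a separate run against its own endpoint, producing one `.report.html` per configuration.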

Notable changes that have occurred in the garak project:

  • many probe family names changed
  • reported family-level statistics now use the lower quartile instead of the mean for aggregated calculations

The updated Markdown table and PNG isolate findings to the probe families that show variation across the different configurations.

Checklist

  • I've read the CONTRIBUTING guidelines.
  • I've updated the documentation if applicable.

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@github-actions
Contributor

github-actions bot commented Mar 4, 2026

Documentation preview

https://nvidia-nemo.github.io/Guardrails/review/pr-1686

@greptile-apps
Contributor

greptile-apps bot commented Mar 4, 2026

Greptile Summary

This PR refreshes the LLM vulnerability scanning documentation by replacing the previous results (based on gpt-3.5-turbo-instruct and an older garak version) with new scan results using meta/llama-3.3-70b-instruct and garak v0.14.0. The updated assets include four new HTML reports, a new results PNG, and an updated Markdown table with revised probe family names (reflecting garak's renamed modules) and lower-quartile aggregated statistics instead of mean values.

Key changes:

  • Model reference updated from gpt-3.5-turbo-instruct to meta/llama-3.3-70b-instruct throughout the doc.
  • Probe family table overhauled: 12 old module rows replaced with 16 new ones matching garak v0.14.0 naming conventions (e.g., ansiescape, atkgen, divergence, grandma, latentinjection, promptinject, suffix, tap, topic, web_injection added; continuation, knownbadsignatures, realpublicityprompts, snowball, xss removed).
  • The closing narrative text (lines 75–77) contains factual inaccuracies relative to the new data. Specifically, the text claims general instructions improve protection and dialogue rails provide "good protection," but the table shows counterexamples: module topic protection drops from 45% to 13% with general instructions, module divergence drops from 40% to 24%, and module tap remains at 0–11% protection even with full guardrails.

Confidence Score: 4/5

  • Purely documentation and static asset update; safe to merge but with one narrative accuracy issue to address in the Markdown file.
  • The PR updates documentation assets (HTML reports, PNG, Markdown) with new scan results from garak v0.14.0 using a different model. All six files are documentation or static assets with no production code. The HTML reports and PNG are data-only replacements (confidence 5/5 each). However, the Markdown file's closing narrative (lines 75–77) contains factual inaccuracies: it claims general instructions improve protection and dialogue rails provide "good protection," but the updated table shows counterexamples like module topic (protection drops 45% → 13%), module divergence (40% → 24%), and module tap (0–11% across all configs). This is a legitimate issue that should be fixed but does not block merging.
  • docs/evaluation/llm-vulnerability-scanning.md — lines 75–77 contain claims about general instructions and dialogue rails that are contradicted by counterexamples in the newly updated table.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[ABC Bot\nmeta/llama-3.3-70b-instruct] --> B[bare_llm\nNo protection]
    A --> C[with_gi\nGeneral Instructions]
    A --> D[with_gi_dr\nGeneral Instructions\n+ Dialogue Rails]
    A --> E[with_gi_dr_mo\nGeneral Instructions\n+ Dialogue Rails\n+ Moderation Rails]

    B --> F[Garak v0.14.0\n16 probe families]
    C --> F
    D --> F
    E --> F

    F --> G[HTML Reports\n4x .report.html]
    F --> H[Results Table\nllm-vulnerability-scanning.md]
    F --> I[PNG Chart\nabc-llm-vulnerability-scan-results.png]

Last reviewed commit: 3e74ce3

Comment on lines +75 to 77
Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table), the LLM app is safer only when using a mix of dialogue and moderation rails.
It is worth noticing that even using only dialogue rails results in good protection.

Narrative inconsistent with table data for several modules

The closing paragraph claims "using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table)" and "even using only dialogue rails results in good protection," but the updated table contains several notable counterexamples:

  • module topic: with_gi decreases protection from 45% (bare_llm) to 13%, directly contradicting the claim about general instructions.
  • module divergence: with_gi drops from 40% (bare_llm) to 24%.
  • module tap: with_gi_dr (dialogue rails) is 0% (same as bare_llm) and even the full with_gi_dr_mo configuration only achieves 11% — far from "good protection."

These are legitimate empirical findings from the updated data. A more nuanced description that acknowledges these outlier modules would improve accuracy:

Suggested change

Original:

Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table), the LLM app is safer only when using a mix of dialogue and moderation rails.
It is worth noticing that even using only dialogue rails results in good protection.

Suggested:

Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate for many vulnerability categories, the results are mixed — some modules (e.g., `topic`, `divergence`) show lower protection with general instructions than without.
The LLM app is generally safer when using a mix of dialogue and moderation rails, though certain probe families (e.g., `tap`, `topic`) remain difficult to fully mitigate even with full guardrails.
It is worth noticing that for most categories, even using only dialogue rails results in substantially improved protection.

@codecov

codecov bot commented Mar 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


