docs: refresh LLM security scan assets #1686
jmartin-tech wants to merge 1 commit into NVIDIA-NeMo:develop from
Conversation
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Greptile Summary

This PR refreshes the LLM vulnerability scanning documentation by replacing the previous results (based on …). Key changes:
> Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
> While using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table), the LLM app is safer only when using a mix of dialogue and moderation rails.
> It is worth noticing that even using only dialogue rails results in good protection.
**Narrative inconsistent with table data for several modules** (`docs/evaluation/llm-vulnerability-scanning.md`, lines 75-77)
The closing paragraph claims "using general instructions in the prompt can reduce the attack success rate (and increase the protection rate reported in the table)" and "even using only dialogue rails results in good protection," but the updated table contains several notable counterexamples:
- **`module topic`**: `with_gi` *decreases* protection from 45% (`bare_llm`) to 13%, directly contradicting the claim about general instructions.
- **`module divergence`**: `with_gi` drops from 40% (`bare_llm`) to 24%.
- **`module tap`**: `with_gi_dr` (dialogue rails) is 0% (same as `bare_llm`) and even the full `with_gi_dr_mo` configuration only achieves 11%, far from "good protection."
These are legitimate empirical findings from the updated data. A more nuanced description that acknowledges these outlier modules would improve accuracy:
```suggestion
Even if the ABC example uses a powerful LLM (`meta/llama-3.3-70b-instruct`), without guardrails, it is still vulnerable to several types of attacks.
While using general instructions in the prompt can reduce the attack success rate for many vulnerability categories, the results are mixed — some modules (e.g., `topic`, `divergence`) show lower protection with general instructions than without.
The LLM app is generally safer when using a mix of dialogue and moderation rails, though certain probe families (e.g., `tap`, `topic`) remain difficult to fully mitigate even with full guardrails.
It is worth noticing that for most categories, even using only dialogue rails results in substantially improved protection.
```
Codecov Report

✅ All modified and coverable lines are covered by tests.
Description
Update documentation examples around security scanning evaluation of a system integrated with guardrails.

The updates here use the example ABC bot backed by `meta/llama-3.3-70b-instruct` under different levels of configured guardrails, tested with garak v0.14.0 with the following configuration: …

Notable changes that have occurred in the `garak` project:

- `lower_quartile` instead of `mean` values for aggregated calculations

The updated table (markdown and PNG formats) isolates findings to the probe families that show variation across the different configurations.
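The switch from `mean` to `lower_quartile` aggregation noted above can by itself lower the reported protection rates, since the lower quartile is pulled toward the weakest results rather than averaged across all of them. A minimal sketch of the difference, using Python's standard library and made-up per-detector pass rates (these numbers are illustrative only, not garak's actual code or data):

```python
from statistics import mean, quantiles

# Hypothetical fractions of attack attempts resisted, one value per
# detector within a single probe family (illustrative values only).
pass_rates = [0.95, 0.80, 0.45, 0.40]

mean_score = mean(pass_rates)
# quantiles(..., n=4) returns the three quartile cut points; index 0 is Q1.
lower_quartile_score = quantiles(pass_rates, n=4)[0]

# The lower quartile sits well below the mean for the same data,
# so an aggregate reported this way looks stricter.
print(f"mean:           {mean_score:.2f}")
print(f"lower quartile: {lower_quartile_score:.2f}")
```

This is one plausible reason the refreshed tables differ from the previous ones even where the underlying per-probe behavior is similar.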
Checklist