Commit e095f1e

feat(library): add PolicyAI Integration for Content Moderation (#1576)

1 parent 6bd0393 commit e095f1e
File tree

8 files changed: +1019 −2 lines changed

README.md

Lines changed: 2 additions & 2 deletions

@@ -295,9 +295,9 @@ Evaluating the safety of a LLM-based conversational application is a complex task

  ## How is this different?

- There are many ways guardrails can be added to an LLM-based conversational application. For example: explicit moderation endpoints (e.g., OpenAI, ActiveFence), critique chains (e.g. constitutional chain), parsing the output (e.g. guardrails.ai), individual guardrails (e.g., LLM-Guard), hallucination detection for RAG applications (e.g., Got It AI, Patronus Lynx).
+ There are many ways guardrails can be added to an LLM-based conversational application. For example: explicit moderation endpoints (e.g., OpenAI, ActiveFence, PolicyAI), critique chains (e.g. constitutional chain), parsing the output (e.g. guardrails.ai), individual guardrails (e.g., LLM-Guard), hallucination detection for RAG applications (e.g., Got It AI, Patronus Lynx).

- NeMo Guardrails aims to provide a flexible toolkit that can integrate all these complementary approaches into a cohesive LLM guardrails layer. For example, the toolkit provides out-of-the-box integration with ActiveFence, AlignScore and LangChain chains.
+ NeMo Guardrails aims to provide a flexible toolkit that can integrate all these complementary approaches into a cohesive LLM guardrails layer. For example, the toolkit provides out-of-the-box integration with ActiveFence, PolicyAI, AlignScore and LangChain chains.

  To the best of our knowledge, NeMo Guardrails is the only guardrails toolkit that also offers a solution for modeling the dialog between the user and the LLM. This enables on one hand the ability to guide the dialog in a precise way. On the other hand it enables fine-grained control for when certain guardrails should be used, e.g., use fact-checking only for certain types of questions.
Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@

# PolicyAI Integration

NeMo Guardrails supports using the [PolicyAI](https://musubilabs.ai) content moderation API as an input and output rail out-of-the-box (you need to have the `POLICYAI_API_KEY` environment variable set).

PolicyAI provides flexible policy-based content moderation, allowing you to define custom policies for your specific use cases and manage them through tags.

## Setup

1. Sign up for a PolicyAI account at [musubilabs.ai](https://musubilabs.ai)
2. Create your policies and organize them with tags
3. Set the required environment variables:

```bash
export POLICYAI_API_KEY="your-api-key"
export POLICYAI_BASE_URL="https://api.musubilabs.ai"  # Optional, this is the default
export POLICYAI_TAG_NAME="prod"  # Optional, defaults to "prod"
```
## Usage

### Basic Input Moderation

```yaml
rails:
  input:
    flows:
      - policyai moderation on input
```

### Basic Output Moderation

```yaml
rails:
  output:
    flows:
      - policyai moderation on output
```

### Using Different Tags

To use different policy tags for different environments, set the `POLICYAI_TAG_NAME` environment variable:

```bash
# For staging environment
export POLICYAI_TAG_NAME="staging"

# For production environment
export POLICYAI_TAG_NAME="prod"
```
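The tag resolution order used by the integration's action can be sketched as follows. This is a minimal standalone sketch, not the library code itself; `resolve_tag_name` is a hypothetical helper that mirrors the documented precedence: an explicit argument wins, then the `POLICYAI_TAG_NAME` environment variable, then the default `"prod"`.

```python
import os


def resolve_tag_name(tag_name=None):
    """Hypothetical helper mirroring the integration's tag resolution:
    explicit argument, then POLICYAI_TAG_NAME, then "prod"."""
    if tag_name is None:
        tag_name = os.environ.get("POLICYAI_TAG_NAME", "prod")
    return tag_name


os.environ.pop("POLICYAI_TAG_NAME", None)
print(resolve_tag_name())  # "prod" (no env var set)
os.environ["POLICYAI_TAG_NAME"] = "staging"
print(resolve_tag_name())  # "staging" (env var wins over default)
print(resolve_tag_name("canary"))  # "canary" (argument wins over env var)
```

Because the environment variable is read at call time, switching environments only requires changing `POLICYAI_TAG_NAME`; no configuration file edits are needed.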
## Complete Example

```yaml
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - policyai moderation on input

  output:
    flows:
      - policyai moderation on output
```
## How It Works

1. **Input Rails**: When a user sends a message, PolicyAI evaluates it against all policies attached to the configured tag. If any policy returns `UNSAFE`, the message is blocked.

2. **Output Rails**: Before the bot's response is sent to the user, PolicyAI evaluates it. If the content violates any policy, the response is replaced with a refusal message.

## Response Format

PolicyAI returns the following information for each evaluation:

- `assessment`: `"SAFE"` or `"UNSAFE"`
- `category`: The category of violation (if `UNSAFE`)
- `severity`: Severity level from 0 (safe) to 3 (high severity)
- `reason`: Human-readable explanation
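A small sketch of how a caller might interpret this result shape. The two sample dicts below are made-up examples in the documented format; the blocking rule (block exactly when `assessment` is `"UNSAFE"`) matches the integration's output mapping, while `category`, `severity`, and `reason` are informational.

```python
def is_blocked(result):
    """Block exactly when the assessment is "UNSAFE"; treat a missing
    assessment field as "SAFE", as the integration does."""
    return result.get("assessment", "SAFE") == "UNSAFE"


# Hypothetical evaluation results shaped like the fields documented above.
safe = {"assessment": "SAFE", "category": "Safe", "severity": 0,
        "reason": "Content passed all policy checks"}
unsafe = {"assessment": "UNSAFE", "category": "Hate", "severity": 3,
          "reason": "Policy violation detected"}

print(is_blocked(safe))    # False
print(is_blocked(unsafe))  # True
```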
## Customizing Behavior

To customize the behavior when content is flagged, you can override the default flows in your config:

```text
define subflow policyai moderation on input
  """Custom PolicyAI input moderation."""
  $result = execute call_policyai_api(text=$user_message)

  if $result.assessment == "UNSAFE"
    bot inform content policy violation
    stop

define bot inform content policy violation
  "I'm sorry, but I cannot process that request. Please rephrase your message."
```
## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `POLICYAI_API_KEY` | Yes | - | Your PolicyAI API key |
| `POLICYAI_BASE_URL` | No | `https://api.musubilabs.ai` | PolicyAI API base URL |
| `POLICYAI_TAG_NAME` | No | `prod` | Default policy tag to use |
## Error Handling

If the PolicyAI API is unavailable or returns an error, the action will raise an exception. To implement fail-open or fail-closed behavior, you can wrap the action in a try-catch block in your custom flows.
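The fail-open versus fail-closed choice can be sketched in plain Python. This is a hedged illustration, not part of the integration: `moderate_with_fallback` and `broken_api` are hypothetical names, and `call_api` stands in for the moderation action, which raises on API errors as described above.

```python
import asyncio


async def moderate_with_fallback(call_api, text, fail_open=False):
    """Return True when the content should be blocked. On API failure,
    fail-open allows the content through; fail-closed blocks it."""
    try:
        result = await call_api(text)
        return result["assessment"] == "UNSAFE"
    except Exception:
        return not fail_open  # fail-closed blocks, fail-open allows


async def broken_api(text):
    # Simulates an unavailable PolicyAI endpoint.
    raise ValueError("PolicyAI call failed")


print(asyncio.run(moderate_with_fallback(broken_api, "hi", fail_open=True)))   # False: allowed
print(asyncio.run(moderate_with_fallback(broken_api, "hi", fail_open=False)))  # True: blocked
```

Fail-closed is the safer default for moderation, at the cost of availability when the moderation service is down.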
## Learn More

- [PolicyAI Documentation](https://docs.musubilabs.ai)
- [Musubi Labs](https://musubilabs.ai)

docs/configure-rails/guardrail-catalog/third-party.md

Lines changed: 21 additions & 0 deletions

@@ -34,6 +34,26 @@ rails:

  For more details, check out the [ActiveFence Integration](community/active-fence.md) page.

+ ## PolicyAI
+
+ The NeMo Guardrails library supports using [PolicyAI](https://musubilabs.ai) by Musubi Labs as an input and output rail out-of-the-box (you need to have the `POLICYAI_API_KEY` environment variable set).
+
+ PolicyAI provides policy-based content moderation, allowing you to define custom policies and organize them with tags for environment-based management.
+
+ ### Example usage
+
+ ```yaml
+ rails:
+   input:
+     flows:
+       - policyai moderation on input
+   output:
+     flows:
+       - policyai moderation on output
+ ```
+
+ For more details, check out the [PolicyAI Integration](community/policyai.md) page.

  ## AutoAlign

  The NeMo Guardrails library supports using the AutoAlign's guardrails API (you need to have the `AUTOALIGN_API_KEY` environment variable set).

@@ -283,6 +303,7 @@ Llama Guard <community/llama-guard>
  Pangea AI Guard <community/pangea>
  Patronus Evaluate API <community/patronus-evaluate-api>
  Patronus Lynx <community/patronus-lynx>
+ PolicyAI <community/policyai>
  Presidio <community/presidio>
  Private AI <community/privateai>
  Prompt Security <community/prompt-security>
Lines changed: 14 additions & 0 deletions

# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 159 additions & 0 deletions

# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
PolicyAI Integration for NeMo Guardrails.

PolicyAI provides content moderation and policy enforcement capabilities
for LLM applications. This integration allows using PolicyAI as an input
and output rail for content moderation.

For more information, see: https://musubilabs.ai
"""

import json
import logging
import os
from typing import Optional

import aiohttp

from nemoguardrails.actions import action

log = logging.getLogger(__name__)
def call_policyai_api_mapping(result: dict) -> bool:
    """
    Mapping for call_policyai_api.

    Expects result to be a dict with:
    - "assessment": "SAFE" or "UNSAFE"
    - "category": the violation category (if UNSAFE)
    - "severity": severity level 0-3

    Block (return True) if:
    1. Assessment is "UNSAFE"
    """
    assessment = result.get("assessment", "SAFE")
    return assessment == "UNSAFE"
@action(is_system_action=True, output_mapping=call_policyai_api_mapping)
async def call_policyai_api(
    text: Optional[str] = None,
    tag_name: Optional[str] = None,
    **kwargs,
):
    """
    Call the PolicyAI API to evaluate content.

    Args:
        text: The text content to evaluate.
        tag_name: Optional tag name for the PolicyAI evaluation.
            If not provided, uses POLICYAI_TAG_NAME env var or "prod".

    Returns:
        dict with:
        - assessment: "SAFE" or "UNSAFE"
        - category: the violation category (if UNSAFE)
        - severity: severity level 0-3
        - reason: explanation for the decision
    """
    api_key = os.environ.get("POLICYAI_API_KEY")

    if api_key is None:
        raise ValueError("POLICYAI_API_KEY environment variable not set.")

    base_url = os.environ.get("POLICYAI_BASE_URL", "https://api.musubilabs.ai")
    base_url = base_url.rstrip("/")

    # Get tag name from parameter, env var, or default
    if tag_name is None:
        tag_name = os.environ.get("POLICYAI_TAG_NAME", "prod")

    url = f"{base_url}/policyai/v1/decisions/evaluate/{tag_name}"

    headers = {
        "Musubi-Api-Key": api_key,
        "Content-Type": "application/json",
    }

    data = {
        "content": [
            {
                "type": "TEXT",
                "content": text,
            }
        ],
    }
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(
            url=url,
            headers=headers,
            json=data,
        ) as response:
            if response.status != 200:
                raise ValueError(
                    f"PolicyAI call failed with status code {response.status}.\nDetails: {await response.text()}"
                )
            response_json = await response.json()
            log.info(json.dumps(response_json, indent=2))

    # PolicyAI returns results in "data" array for tag-based evaluation
    results = response_json.get("data", [])

    # Fail-closed: If no policies are attached to the tag, raise an error
    # rather than silently allowing content through
    if not results:
        raise ValueError(
            f"PolicyAI returned no policy results for tag '{tag_name}'. "
            "Ensure policies are attached to this tag."
        )

    # Check if all policies failed evaluation
    successful_results = [r for r in results if r.get("status") != "failed"]
    if not successful_results:
        raise ValueError(
            f"All PolicyAI policy evaluations failed for tag '{tag_name}'. Check policy configurations."
        )
    # Aggregate results - if ANY policy returns UNSAFE, overall is UNSAFE
    overall_assessment = "SAFE"
    triggered_category = "Safe"
    max_severity = 0
    reason = "Content passed all policy checks"

    for result in successful_results:
        assessment = result.get("assessment", "SAFE")
        if assessment == "UNSAFE":
            overall_assessment = "UNSAFE"
            triggered_category = result.get("category", "Unknown")
            max_severity = max(max_severity, result.get("severity", 0))
            reason = result.get("reason", "Policy violation detected")
            break  # Stop at first UNSAFE result

    # Pre-format exception message for Colang 1.x compatibility
    # (Colang 1.x doesn't support string concatenation in create event)
    exception_message = f"PolicyAI moderation triggered. Content violated policy: {triggered_category}"

    return {
        "assessment": overall_assessment,
        "category": triggered_category,
        "severity": max_severity,
        "reason": reason,
        "exception_message": exception_message,
    }
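The aggregation step of this action can be exercised in isolation. Below is a standalone sketch (the `aggregate_policy_results` name and the sample per-policy results are hypothetical, not part of the library) that reproduces the same rules: skip evaluations whose `status` is `"failed"`, and mark the overall result `UNSAFE` at the first policy that reports `UNSAFE`.

```python
def aggregate_policy_results(results):
    """Skip failed evaluations; the first UNSAFE result determines the
    overall assessment, category, severity, and reason."""
    overall, category, severity, reason = (
        "SAFE", "Safe", 0, "Content passed all policy checks")
    for r in (r for r in results if r.get("status") != "failed"):
        if r.get("assessment", "SAFE") == "UNSAFE":
            overall = "UNSAFE"
            category = r.get("category", "Unknown")
            severity = max(severity, r.get("severity", 0))
            reason = r.get("reason", "Policy violation detected")
            break  # stop at first UNSAFE result
    return {"assessment": overall, "category": category,
            "severity": severity, "reason": reason}


# Hypothetical per-policy results: one safe, one failed, one unsafe.
results = [
    {"assessment": "SAFE", "severity": 0},
    {"status": "failed"},
    {"assessment": "UNSAFE", "category": "Violence", "severity": 2,
     "reason": "Violent content"},
]
print(aggregate_policy_results(results)["assessment"])  # UNSAFE
print(aggregate_policy_results(results)["severity"])    # 2
```

Note the early `break`: once a policy reports `UNSAFE`, later policies are not consulted, so the reported severity is that of the first violation rather than the maximum across all policies.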
Lines changed: 33 additions & 0 deletions

"""
PolicyAI Integration Flows (Colang 2.x)

PolicyAI provides content moderation and policy enforcement capabilities.
For more information, see: https://musubilabs.ai

Supported features:
- Input moderation: Check user messages against configured policies
- Output moderation: Check bot responses against configured policies
- Tag-based evaluation: Use POLICYAI_TAG_NAME env var to specify policy tag
"""

flow policyai moderation on input
  """Guardrail based on PolicyAI assessment."""
  $result = await CallPolicyaiApiAction(text=$user_message)

  if $result.assessment == "UNSAFE"
    if $system.config.enable_rails_exceptions
      send PolicyAIModerationRailException(message="PolicyAI moderation triggered. Content violated policy: " + $result.category)
    else
      bot refuse to respond
      abort

flow policyai moderation on output
  """Guardrail based on PolicyAI assessment."""
  $result = await CallPolicyaiApiAction(text=$bot_message)

  if $result.assessment == "UNSAFE"
    if $system.config.enable_rails_exceptions
      send PolicyAIModerationRailException(message="PolicyAI moderation triggered. Content violated policy: " + $result.category)
    else
      bot refuse to respond
      abort
