Commit e095f1e

feat(library): add PolicyAI Integration for Content Moderation (#1576)

1 parent 6bd0393 commit e095f1e
File tree

8 files changed: +1019 −2 lines changed

README.md

Lines changed: 2 additions & 2 deletions

@@ -295,9 +295,9 @@ Evaluating the safety of a LLM-based conversational application is a complex task

  ## How is this different?

- There are many ways guardrails can be added to an LLM-based conversational application. For example: explicit moderation endpoints (e.g., OpenAI, ActiveFence), critique chains (e.g. constitutional chain), parsing the output (e.g. guardrails.ai), individual guardrails (e.g., LLM-Guard), hallucination detection for RAG applications (e.g., Got It AI, Patronus Lynx).
+ There are many ways guardrails can be added to an LLM-based conversational application. For example: explicit moderation endpoints (e.g., OpenAI, ActiveFence, PolicyAI), critique chains (e.g. constitutional chain), parsing the output (e.g. guardrails.ai), individual guardrails (e.g., LLM-Guard), hallucination detection for RAG applications (e.g., Got It AI, Patronus Lynx).

- NeMo Guardrails aims to provide a flexible toolkit that can integrate all these complementary approaches into a cohesive LLM guardrails layer. For example, the toolkit provides out-of-the-box integration with ActiveFence, AlignScore and LangChain chains.
+ NeMo Guardrails aims to provide a flexible toolkit that can integrate all these complementary approaches into a cohesive LLM guardrails layer. For example, the toolkit provides out-of-the-box integration with ActiveFence, PolicyAI, AlignScore and LangChain chains.

  To the best of our knowledge, NeMo Guardrails is the only guardrails toolkit that also offers a solution for modeling the dialog between the user and the LLM. This enables on one hand the ability to guide the dialog in a precise way. On the other hand it enables fine-grained control for when certain guardrails should be used, e.g., use fact-checking only for certain types of questions.
Lines changed: 116 additions & 0 deletions

@@ -0,0 +1,116 @@

# PolicyAI Integration

NeMo Guardrails supports using the [PolicyAI](https://musubilabs.ai) content moderation API as an input and output rail out-of-the-box (you need to have the `POLICYAI_API_KEY` environment variable set).

PolicyAI provides flexible policy-based content moderation, allowing you to define custom policies for your specific use cases and manage them through tags.

## Setup

1. Sign up for a PolicyAI account at [musubilabs.ai](https://musubilabs.ai)
2. Create your policies and organize them with tags
3. Set the required environment variables:

```bash
export POLICYAI_API_KEY="your-api-key"
export POLICYAI_BASE_URL="https://api.musubilabs.ai"  # Optional, this is the default
export POLICYAI_TAG_NAME="prod"  # Optional, defaults to "prod"
```
## Usage

### Basic Input Moderation

```yaml
rails:
  input:
    flows:
      - policyai moderation on input
```

### Basic Output Moderation

```yaml
rails:
  output:
    flows:
      - policyai moderation on output
```

### Using Different Tags

To use different policy tags for different environments, set the `POLICYAI_TAG_NAME` environment variable:

```bash
# For staging environment
export POLICYAI_TAG_NAME="staging"

# For production environment
export POLICYAI_TAG_NAME="prod"
```
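The tag resolution order used by the integration's action can be sketched as follows. This is a minimal standalone sketch, not the library code itself; `resolve_tag_name` is a hypothetical helper that mirrors the documented precedence: an explicit argument wins, then the `POLICYAI_TAG_NAME` environment variable, then the default `"prod"`.

```python
import os


def resolve_tag_name(tag_name=None):
    """Hypothetical helper mirroring the integration's tag resolution:
    explicit argument, then POLICYAI_TAG_NAME, then "prod"."""
    if tag_name is None:
        tag_name = os.environ.get("POLICYAI_TAG_NAME", "prod")
    return tag_name


os.environ.pop("POLICYAI_TAG_NAME", None)
print(resolve_tag_name())  # "prod" (no env var set)
os.environ["POLICYAI_TAG_NAME"] = "staging"
print(resolve_tag_name())  # "staging" (env var wins over default)
print(resolve_tag_name("canary"))  # "canary" (argument wins over env var)
```

Because the environment variable is read at call time, switching environments only requires changing `POLICYAI_TAG_NAME`; no configuration file edits are needed.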
## Complete Example

```yaml
models:
  - type: main
    engine: openai
    model: gpt-4

rails:
  input:
    flows:
      - policyai moderation on input

  output:
    flows:
      - policyai moderation on output
```
## How It Works

1. **Input Rails**: When a user sends a message, PolicyAI evaluates it against all policies attached to the configured tag. If any policy returns `UNSAFE`, the message is blocked.

2. **Output Rails**: Before the bot's response is sent to the user, PolicyAI evaluates it. If the content violates any policy, the response is replaced with a refusal message.

## Response Format

PolicyAI returns the following information for each evaluation:

- `assessment`: `"SAFE"` or `"UNSAFE"`
- `category`: The category of violation (if `UNSAFE`)
- `severity`: Severity level from 0 (safe) to 3 (high severity)
- `reason`: Human-readable explanation
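A small sketch of how a caller might interpret this result shape. The two sample dicts below are made-up examples in the documented format; the blocking rule (block exactly when `assessment` is `"UNSAFE"`) matches the integration's output mapping, while `category`, `severity`, and `reason` are informational.

```python
def is_blocked(result):
    """Block exactly when the assessment is "UNSAFE"; treat a missing
    assessment field as "SAFE", as the integration does."""
    return result.get("assessment", "SAFE") == "UNSAFE"


# Hypothetical evaluation results shaped like the fields documented above.
safe = {"assessment": "SAFE", "category": "Safe", "severity": 0,
        "reason": "Content passed all policy checks"}
unsafe = {"assessment": "UNSAFE", "category": "Hate", "severity": 3,
          "reason": "Policy violation detected"}

print(is_blocked(safe))    # False
print(is_blocked(unsafe))  # True
```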
## Customizing Behavior

To customize the behavior when content is flagged, you can override the default flows in your config:

```text
define subflow policyai moderation on input
  """Custom PolicyAI input moderation."""
  $result = execute call_policyai_api(text=$user_message)

  if $result.assessment == "UNSAFE"
    bot inform content policy violation
    stop

define bot inform content policy violation
  "I'm sorry, but I cannot process that request. Please rephrase your message."
```
## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `POLICYAI_API_KEY` | Yes | - | Your PolicyAI API key |
| `POLICYAI_BASE_URL` | No | `https://api.musubilabs.ai` | PolicyAI API base URL |
| `POLICYAI_TAG_NAME` | No | `prod` | Default policy tag to use |
## Error Handling

If the PolicyAI API is unavailable or returns an error, the action will raise an exception. To implement fail-open or fail-closed behavior, you can wrap the action in a try-catch block in your custom flows.
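The fail-open versus fail-closed choice can be sketched in plain Python. This is a hedged illustration, not part of the integration: `moderate_with_fallback` and `broken_api` are hypothetical names, and `call_api` stands in for the moderation action, which raises on API errors as described above.

```python
import asyncio


async def moderate_with_fallback(call_api, text, fail_open=False):
    """Return True when the content should be blocked. On API failure,
    fail-open allows the content through; fail-closed blocks it."""
    try:
        result = await call_api(text)
        return result["assessment"] == "UNSAFE"
    except Exception:
        return not fail_open  # fail-closed blocks, fail-open allows


async def broken_api(text):
    # Simulates an unavailable PolicyAI endpoint.
    raise ValueError("PolicyAI call failed")


print(asyncio.run(moderate_with_fallback(broken_api, "hi", fail_open=True)))   # False: allowed
print(asyncio.run(moderate_with_fallback(broken_api, "hi", fail_open=False)))  # True: blocked
```

Fail-closed is the safer default for moderation, at the cost of availability when the moderation service is down.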
## Learn More

- [PolicyAI Documentation](https://docs.musubilabs.ai)
- [Musubi Labs](https://musubilabs.ai)

docs/configure-rails/guardrail-catalog/third-party.md

Lines changed: 21 additions & 0 deletions

@@ -34,6 +34,26 @@ rails:

  For more details, check out the [ActiveFence Integration](community/active-fence.md) page.

+ ## PolicyAI
+
+ The NeMo Guardrails library supports using [PolicyAI](https://musubilabs.ai) by Musubi Labs as an input and output rail out-of-the-box (you need to have the `POLICYAI_API_KEY` environment variable set).
+
+ PolicyAI provides policy-based content moderation, allowing you to define custom policies and organize them with tags for environment-based management.
+
+ ### Example usage
+
+ ```yaml
+ rails:
+   input:
+     flows:
+       - policyai moderation on input
+   output:
+     flows:
+       - policyai moderation on output
+ ```
+
+ For more details, check out the [PolicyAI Integration](community/policyai.md) page.

  ## AutoAlign

  The NeMo Guardrails library supports using the AutoAlign's guardrails API (you need to have the `AUTOALIGN_API_KEY` environment variable set).

@@ -283,6 +303,7 @@ Llama Guard <community/llama-guard>
  Pangea AI Guard <community/pangea>
  Patronus Evaluate API <community/patronus-evaluate-api>
  Patronus Lynx <community/patronus-lynx>
+ PolicyAI <community/policyai>
  Presidio <community/presidio>
  Private AI <community/privateai>
  Prompt Security <community/prompt-security>
Lines changed: 14 additions & 0 deletions

# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 159 additions & 0 deletions

# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
PolicyAI Integration for NeMo Guardrails.

PolicyAI provides content moderation and policy enforcement capabilities
for LLM applications. This integration allows using PolicyAI as an input
and output rail for content moderation.

For more information, see: https://musubilabs.ai
"""

import json
import logging
import os
from typing import Optional

import aiohttp

from nemoguardrails.actions import action

log = logging.getLogger(__name__)
def call_policyai_api_mapping(result: dict) -> bool:
    """
    Mapping for call_policyai_api.

    Expects result to be a dict with:
    - "assessment": "SAFE" or "UNSAFE"
    - "category": the violation category (if UNSAFE)
    - "severity": severity level 0-3

    Block (return True) if:
    1. Assessment is "UNSAFE"
    """
    assessment = result.get("assessment", "SAFE")
    return assessment == "UNSAFE"
@action(is_system_action=True, output_mapping=call_policyai_api_mapping)
async def call_policyai_api(
    text: Optional[str] = None,
    tag_name: Optional[str] = None,
    **kwargs,
):
    """
    Call the PolicyAI API to evaluate content.

    Args:
        text: The text content to evaluate.
        tag_name: Optional tag name for the PolicyAI evaluation.
            If not provided, uses POLICYAI_TAG_NAME env var or "prod".

    Returns:
        dict with:
        - assessment: "SAFE" or "UNSAFE"
        - category: the violation category (if UNSAFE)
        - severity: severity level 0-3
        - reason: explanation for the decision
    """
    api_key = os.environ.get("POLICYAI_API_KEY")

    if api_key is None:
        raise ValueError("POLICYAI_API_KEY environment variable not set.")

    base_url = os.environ.get("POLICYAI_BASE_URL", "https://api.musubilabs.ai")
    base_url = base_url.rstrip("/")

    # Get tag name from parameter, env var, or default
    if tag_name is None:
        tag_name = os.environ.get("POLICYAI_TAG_NAME", "prod")

    url = f"{base_url}/policyai/v1/decisions/evaluate/{tag_name}"

    headers = {
        "Musubi-Api-Key": api_key,
        "Content-Type": "application/json",
    }

    data = {
        "content": [
            {
                "type": "TEXT",
                "content": text,
            }
        ],
    }
    timeout = aiohttp.ClientTimeout(total=30)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(
            url=url,
            headers=headers,
            json=data,
        ) as response:
            if response.status != 200:
                raise ValueError(
                    f"PolicyAI call failed with status code {response.status}.\nDetails: {await response.text()}"
                )
            response_json = await response.json()
            log.info(json.dumps(response_json, indent=2))

    # PolicyAI returns results in "data" array for tag-based evaluation
    results = response_json.get("data", [])

    # Fail-closed: If no policies are attached to the tag, raise an error
    # rather than silently allowing content through
    if not results:
        raise ValueError(
            f"PolicyAI returned no policy results for tag '{tag_name}'. "
            "Ensure policies are attached to this tag."
        )

    # Check if all policies failed evaluation
    successful_results = [r for r in results if r.get("status") != "failed"]
    if not successful_results:
        raise ValueError(
            f"All PolicyAI policy evaluations failed for tag '{tag_name}'. Check policy configurations."
        )
    # Aggregate results - if ANY policy returns UNSAFE, overall is UNSAFE
    overall_assessment = "SAFE"
    triggered_category = "Safe"
    max_severity = 0
    reason = "Content passed all policy checks"

    for result in successful_results:
        assessment = result.get("assessment", "SAFE")
        if assessment == "UNSAFE":
            overall_assessment = "UNSAFE"
            triggered_category = result.get("category", "Unknown")
            max_severity = max(max_severity, result.get("severity", 0))
            reason = result.get("reason", "Policy violation detected")
            break  # Stop at first UNSAFE result

    # Pre-format exception message for Colang 1.x compatibility
    # (Colang 1.x doesn't support string concatenation in create event)
    exception_message = f"PolicyAI moderation triggered. Content violated policy: {triggered_category}"

    return {
        "assessment": overall_assessment,
        "category": triggered_category,
        "severity": max_severity,
        "reason": reason,
        "exception_message": exception_message,
    }
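The aggregation step of this action can be exercised in isolation. Below is a standalone sketch (the `aggregate_policy_results` name and the sample per-policy results are hypothetical, not part of the library) that reproduces the same rules: skip evaluations whose `status` is `"failed"`, and mark the overall result `UNSAFE` at the first policy that reports `UNSAFE`.

```python
def aggregate_policy_results(results):
    """Skip failed evaluations; the first UNSAFE result determines the
    overall assessment, category, severity, and reason."""
    overall, category, severity, reason = (
        "SAFE", "Safe", 0, "Content passed all policy checks")
    for r in (r for r in results if r.get("status") != "failed"):
        if r.get("assessment", "SAFE") == "UNSAFE":
            overall = "UNSAFE"
            category = r.get("category", "Unknown")
            severity = max(severity, r.get("severity", 0))
            reason = r.get("reason", "Policy violation detected")
            break  # stop at first UNSAFE result
    return {"assessment": overall, "category": category,
            "severity": severity, "reason": reason}


# Hypothetical per-policy results: one safe, one failed, one unsafe.
results = [
    {"assessment": "SAFE", "severity": 0},
    {"status": "failed"},
    {"assessment": "UNSAFE", "category": "Violence", "severity": 2,
     "reason": "Violent content"},
]
print(aggregate_policy_results(results)["assessment"])  # UNSAFE
print(aggregate_policy_results(results)["severity"])    # 2
```

Note the early `break`: once a policy reports `UNSAFE`, later policies are not consulted, so the reported severity is that of the first violation rather than the maximum across all policies.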
Lines changed: 33 additions & 0 deletions

"""
PolicyAI Integration Flows (Colang 2.x)

PolicyAI provides content moderation and policy enforcement capabilities.
For more information, see: https://musubilabs.ai

Supported features:
- Input moderation: Check user messages against configured policies
- Output moderation: Check bot responses against configured policies
- Tag-based evaluation: Use POLICYAI_TAG_NAME env var to specify policy tag
"""

flow policyai moderation on input
  """Guardrail based on PolicyAI assessment."""
  $result = await CallPolicyaiApiAction(text=$user_message)

  if $result.assessment == "UNSAFE"
    if $system.config.enable_rails_exceptions
      send PolicyAIModerationRailException(message="PolicyAI moderation triggered. Content violated policy: " + $result.category)
    else
      bot refuse to respond
      abort

flow policyai moderation on output
  """Guardrail based on PolicyAI assessment."""
  $result = await CallPolicyaiApiAction(text=$bot_message)

  if $result.assessment == "UNSAFE"
    if $system.config.enable_rails_exceptions
      send PolicyAIModerationRailException(message="PolicyAI moderation triggered. Content violated policy: " + $result.category)
    else
      bot refuse to respond
      abort
