Prompt Injection Workbench

A research workbench for developing and testing attacks against large language models, with a focus on prompt injection vulnerabilities and defenses.

Key Features

State Machine Design: Fine-grained control over agent execution for advanced attack scenarios
SWE-bench Support: Benchmark agents on real-world code editing tasks from SWE-bench
Hydra Configuration: Powerful experiment orchestration with parameter sweeps
Extensible Architecture: Plugin system for custom agents, attacks, and environments
Usage Limits: Built-in cost and resource controls
Experiment Tracking: Automatic caching and result organization

Quick Start

Installation

Install the core package with desired optional features:

# Full installation (all features)
uv sync --all-extras

# Or install only what you need:
uv sync --extra agentdojo      # AgentDojo benchmark support
uv sync --extra swebench       # SWE-bench support
uv sync --extra docker         # Docker sandbox manager
uv sync --extra playwright     # Web automation environment

# Combine multiple extras
uv sync --extra agentdojo --extra docker

Available optional dependencies:

Extra	Description
`agentdojo`	AgentDojo dataset, environment, and attacks
`swebench`	SWE-bench dataset for code editing benchmarks
`docker`	Docker sandbox manager
`playwright`	Web automation environment

Set up environment variables:

cp .env.example .env  # Fill in API keys

Export default configuration:

# Export to ./config (default)
uv run prompt-siren config export

# Export to custom directory
uv run prompt-siren config export ./my_config

Run experiments:

# Run benign-only evaluation
uv run prompt-siren run benign +dataset=agentdojo-workspace

# Run with attack
uv run prompt-siren run attack +dataset=agentdojo-workspace +attack=template_string

# Run SWE-bench evaluation (requires Docker)
uv run prompt-siren run benign +dataset=swebench

# Run SWE-bench with specific instances
uv run prompt-siren run benign +dataset=swebench dataset.config.instance_ids='["django__django-11179"]'

# Run SWE-bench Lite (smaller benchmark)
uv run prompt-siren run benign +dataset=swebench dataset.config.dataset_name="SWE-bench/SWE-bench_Lite"

# Override parameters
uv run prompt-siren run benign +dataset=agentdojo-workspace agent.config.model=azure:gpt-5

# Parameter sweep (multirun)
uv run prompt-siren run benign --multirun +dataset=agentdojo-workspace agent.config.model=azure:gpt-5,azure:gpt-5-nano

# Validate configuration without running
uv run prompt-siren config validate +dataset=agentdojo-workspace

# Use config file with environment/attack included (no overrides needed)
uv run prompt-siren run attack --config-dir=./my_config

Tip: Environment and attack can be specified via CLI overrides or included directly in config files. See the Configuration Guide for details.

Analyzing Results

After running experiments, use the results command to aggregate and analyze results:

# View results with default settings (pass@1, grouped by all configs)
uv run prompt-siren results

# Specify custom jobs directory
uv run prompt-siren results --jobs-dir=./jobs

# Group results by different dimensions
uv run prompt-siren results --group-by=model
uv run prompt-siren results --group-by=env
uv run prompt-siren results --group-by=agent
uv run prompt-siren results --group-by=attack

# Compute pass@k metrics (k>1)
uv run prompt-siren results --k=5
uv run prompt-siren results --k=10

# Compute multiple pass@k metrics simultaneously
uv run prompt-siren results --k=1 --k=5 --k=10

# Different output formats
uv run prompt-siren results --format=json
uv run prompt-siren results --format=csv

Understanding pass@k Metrics

pass@1 (default): Averages scores across all runs for each task. A continuous metric showing average performance.
pass@k (k>1): Binary success metric. A task "passes" if at least one of k runs achieves a perfect score (1.0). Uses statistical estimation when more than k samples are available.

Results Columns

The results table includes:

Configuration columns: dataset, agent_type, agent_name, attack_type
Metric columns: benign_pass@k, attack_pass@k - The pass@k scores
Metadata columns:
- n_tasks - Total number of tasks aggregated
- avg_n_samples - Average number of runs per task
- k - The k value (when computing multiple pass@k metrics)

Platform Requirements

Python: 3.10+
Package Manager for development: uv (for dependency management)
Operating System: Linux or macOS (Windows not supported)
Docker: Required for SWE-bench integration and some environments
- Must be running and accessible
- Base images should have /bin/bash available (Alpine images need bash package)

Documentation

Configuration Guide - Hydra configuration and parameter overrides
Usage Limits - Resource limits and cost controls
Plugins - Adding custom agents, attacks, and environments

Development

Linting, Formatting and Typechecking

# Lint and format
uv run ruff check --fix
uv run ruff format
uv run ty check

# Test
uv run pytest -v

Pre-building Docker Images for SWE-bench

The main prompt-siren CLI only works with pre-built Docker images. To build all required images for SWE-bench evaluations, use the prompt-siren-build-images command:

# Build all images (benign, malicious service containers, and pairs)
uv run prompt-siren-build-images

# Build images for specific instances only
uv run prompt-siren-build-images --instance-ids django__django-11179 --instance-ids astropy__astropy-12907

# Limit the number of instances to build (useful for testing)
uv run prompt-siren-build-images --max-instances 5

# Use SWE-bench Lite dataset
uv run prompt-siren-build-images --dataset "SWE-bench/SWE-bench_Lite"

# Skip building certain image types
uv run prompt-siren-build-images --skip-benign      # Skip benign task images
uv run prompt-siren-build-images --skip-malicious   # Skip malicious task service images
uv run prompt-siren-build-images --skip-pairs       # Skip combined pair images

# Rebuild existing images instead of skipping them
uv run prompt-siren-build-images --rebuild-existing

# Tag and push images to a registry
uv run prompt-siren-build-images --registry my-registry.com/myrepo

# Enable verbose logging for debugging
uv run prompt-siren-build-images --verbose

# Specify a custom cache directory
uv run prompt-siren-build-images --cache-dir /path/to/cache

Common workflows:

# Quick test: Build images for a single instance
uv run prompt-siren-build-images --instance-ids django__django-11179

# Full build for production: Build all images and push to registry
uv run prompt-siren-build-images --registry my-registry.com/swebench

# Rebuild after updates: Force rebuild of existing images
uv run prompt-siren-build-images --rebuild-existing

# Build only benign images (no attack scenarios)
uv run prompt-siren-build-images --skip-malicious --skip-pairs

Note: Once images are built, they are cached locally and will be reused by the main prompt-siren run command. The main CLI does not build images on-the-fly—it expects pre-built images to be available.

License

Prompt Siren is licensed under an MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 79 Commits
.github		.github
config/attack		config/attack
docs		docs
resources/img		resources/img
src/prompt_siren		src/prompt_siren
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE.md		LICENSE.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Prompt Injection Workbench

Key Features

Quick Start

Installation

Analyzing Results

Understanding pass@k Metrics

Results Columns

Platform Requirements

Documentation

Development

Linting, Formatting and Typechecking

Pre-building Docker Images for SWE-bench

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 7

Uh oh!

Languages

License

facebookresearch/prompt-siren

Folders and files

Latest commit

History

Repository files navigation

Prompt Injection Workbench

Key Features

Quick Start

Installation

Analyzing Results

Understanding pass@k Metrics

Results Columns

Platform Requirements

Documentation

Development

Linting, Formatting and Typechecking

Pre-building Docker Images for SWE-bench

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 7

Uh oh!

Languages

Packages