A comprehensive benchmarking system for Large Language Models (LLMs)
Installation • Quick Start • Features • Documentation • Contributing
Note: Karenina is still experimental and under active, fast-paced development. APIs and features may change without notice. A first stable release will be available soon. Stay tuned!
- About Karenina
- Architecture
- Understanding the Problem
- Quick Start
- Command-Line Interface
- Why Templates
- Templates vs Rubrics
- Features
- Installation
- Documentation
- Contributing
Karenina is a framework for turning domain expertise and concepts into standardized, runnable benchmarks. The core challenge it addresses is making the formulation of domain-specific benchmarks accessible to experts without an LLM engineering background, so that they can focus their time and expertise on knowledge rather than infrastructure.
Key Concepts:
- Benchmarks are expressed as parametrizable code templates, which an LLM-as-a-judge model fills in to evaluate performance
- Standardized schema (building on existing standards such as schema.org) enables rich, consistent, and extensible benchmark definitions
- Tools to generate benchmarks at scale while maintaining quality and consistency
- JSON-LD format enables seamless integration between Python library and GUI interface
- Utilities to run and manage benchmarks, although Karenina's primary focus remains on standardization and accessibility rather than execution infrastructure
At the heart of Karenina are two key concepts: templates and rubrics. Templates verify factual correctness through structured answer parsing, while rubrics assess qualitative traits, format compliance, and quantitative metrics.
Karenina is a standalone Python library that can be used independently for all benchmarking workflows through Python code.
Karenina uses a hexagonal architecture (Ports & Adapters) for LLM interactions. Three protocol interfaces define what the application needs:
- LLMPort: Basic LLM text generation
- AgentPort: Agentic LLM with tool use and MCP support
- ParserPort: Structured output parsing into Pydantic models
Each supported interface (langchain, claude_agent_sdk, claude_tool, openrouter, openai_endpoint, manual) provides adapter implementations for these ports. An adapter factory handles instantiation, and an AdapterInstructionRegistry manages interface-specific prompt transformations, keeping adapters as pure executors that receive pre-assembled prompts.
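The ports-and-adapters split described above can be illustrated with a small, self-contained sketch. This is not Karenina's actual code: only the name LLMPort comes from the text, while the generate method, the EchoAdapter and UppercaseAdapter classes, and the build_llm_adapter factory are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Protocol


class LLMPort(Protocol):
    """Port: what the application needs from any text-generating backend."""

    def generate(self, prompt: str) -> str: ...  # hypothetical method name


class EchoAdapter:
    """Hypothetical adapter that returns the prompt unchanged."""

    def generate(self, prompt: str) -> str:
        return prompt


class UppercaseAdapter:
    """Hypothetical adapter standing in for a second backend."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


# Hypothetical registry: interface name -> adapter constructor.
_ADAPTERS: dict[str, Callable[[], LLMPort]] = {
    "echo": EchoAdapter,
    "upper": UppercaseAdapter,
}


def build_llm_adapter(interface: str) -> LLMPort:
    """Factory: pick the adapter for an interface name, failing loudly if unknown."""
    try:
        return _ADAPTERS[interface]()
    except KeyError:
        raise ValueError(f"Unknown interface: {interface!r}") from None


def run(port: LLMPort, prompt: str) -> str:
    """Application code depends only on the port, never on a concrete adapter."""
    return port.generate(prompt)


print(run(build_llm_adapter("echo"), "hello"))
print(run(build_llm_adapter("upper"), "hello"))
```

Because run only knows about the port, adding a new backend in this sketch means writing one adapter class and registering it, with no change to the calling code.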
To make the framework more accessible, a web-based graphical interface is available for users who prefer not to work with code. This no-code interface covers most features provided by the backend, including:
- Visual question and metadata extraction from files (Excel, CSV, TSV)
- Template generation with interactive preview and editing
- No-code rubric curation (LLM-based, regex, and metric traits)
- Checkpointing and verification execution with real-time progress monitoring
- Results visualization and export management
The GUI makes the Karenina framework accessible to domain experts, curators, and non-technical users who want to create and run benchmarks without writing Python code.
Implementation: The graphical interface is built using two companion packages:
- karenina-server - Exposes the karenina backend as a FastAPI-based REST API
- karenina-gui - TypeScript/React web application providing the user interface
Note: Coordination and deployment instructions for the full web-based stack are still a work in progress and will be released soon.
Let us introduce how Karenina approaches the problem of LLM benchmarking by considering a simple example: we want to task an LLM with a simple multiple-choice question:
question = "Which protein regulates programmed cell death (apoptosis)?"
possible_answers = ["BCL2", "p53", "Insulin", "Hemoglobin"]

When we query a standard LLM, it usually responds in free text (e.g., "BCL2 is the protein that regulates apoptosis by preventing cell death."). To evaluate such an answer programmatically we could use the following approaches:
We directly instruct the answering model to return a response in a machine-friendly format.
Example prompt:
You are answering a multiple-choice question.
Return only the letter of your choice.
Question: Which protein regulates programmed cell death (apoptosis)?
Options:
A) BCL2
B) p53
C) Insulin
D) Hemoglobin
Answer:
Model output: A
The main advantage of this approach is its simplicity and reliability: once the model respects the instruction, evaluation can be fully automated with minimal overhead. However, its weakness lies in the fragility of prompt adherence. Many general-purpose LLMs do not always comply with rigid output constraints, especially across diverse domains or when questions become complex. In practice, this means users must design very careful prompts and may still face occasional formatting failures. Moreover, every time we have a different answer/question format we may need to come up with different dedicated prompting and parsing strategies.
Instead of constraining the answering model, we can keep its output free-form and rely on a judge LLM to interpret it.
Example:
- Answering model output: "BCL2 is an anti-apoptotic protein that prevents cell death."
- Judge model prompt: The following is a student's answer to a multiple-choice question. Question: Which protein regulates programmed cell death (apoptosis)? Options: BCL2, p53, Insulin, Hemoglobin. Student's answer: "BCL2 is an anti-apoptotic protein that prevents cell death." Which option does this correspond to? Provide a justification.
- Judge model output: "The student clearly selected BCL2, which is correct as it regulates apoptosis."
The advantage here is flexibility: the answering model is free to behave naturally, without tight formatting constraints, which is particularly useful in open-ended or exploratory settings. However, this shifts the ambiguity to the judge's response, which is also often free text. While the judge usually interprets correctly, the result again requires parsing, and subtle differences in wording may cause errors or inconsistencies. Thus, while this strategy increases robustness to different kinds of answers, it does so at the cost of reintroducing unstructured evaluation one step later.
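To make the parsing problem concrete, here is a deliberately naive sketch that extracts the selected option from a free-text judge reply by keyword matching. The options_mentioned helper and the second judge reply are hypothetical; the first reply is the one from the example above.

```python
OPTIONS = ["BCL2", "p53", "Insulin", "Hemoglobin"]


def options_mentioned(judge_output: str) -> list[str]:
    """Return every option whose name appears in the judge's free-text reply."""
    lowered = judge_output.lower()
    return [option for option in OPTIONS if option.lower() in lowered]


clear = "The student clearly selected BCL2, which is correct as it regulates apoptosis."
ambiguous = "The answer points to BCL2 rather than p53, so the student is correct."

print(options_mentioned(clear))      # exactly one option: easy to score
print(options_mentioned(ambiguous))  # two options: the selection is unclear
```

Both replies express the same verdict, yet the second wording defeats the keyword match, which is why free-text judge output needs a more reliable structure.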
To reduce ambiguity, Karenina adopts a third approach that combines the advantages of both:
- The answering model remains unconstrained, generating natural free text
- The judge model is required to return results in a structured format (JSON), validated through a Pydantic class
This setup allows the judge to flexibly interpret free text while ensuring that its own output remains standardized and machine-readable.
1. Define a Pydantic template:
from karenina.schemas.entities import BaseAnswer
from pydantic import Field

class Answer(BaseAnswer):
    # The Field description tells the judge what to extract from the free-text response
    answer: str = Field(description="The name of the protein mentioned in the response")

    def model_post_init(self, __context):
        # Ground truth for this question
        self.correct = {"answer": "BCL2"}

    def verify(self) -> bool:
        # Case- and whitespace-insensitive comparison against the ground truth
        return self.answer.strip().upper() == self.correct["answer"].strip().upper()

Key aspects:
- The answer attribute uses a Field description to guide the judge
- The verify method implements custom validation logic
2. Answering model generates free text:
"BCL2 is the protein that regulates apoptosis by preventing cell death."
3. Judge model parses into structured format:
from langchain_core.output_parsers import PydanticOutputParser

parser = PydanticOutputParser(pydantic_object=Answer)
prompt = parser.get_format_instructions()
prompt += "\nLLM Answer: BCL2 is the protein that regulates apoptosis by preventing cell death."
# llm is any LangChain chat model; its reply is a message whose content holds the JSON text
judge_reply = llm.invoke(prompt)

Judge output (structured JSON):
{"answer": "BCL2"}

4. Verification step:
# parser.parse validates the judge's JSON text and returns an Answer instance
populated_answer = parser.parse(judge_reply.content)
result = populated_answer.verify()  # True

Get started with Karenina in just a few minutes! This example demonstrates the core workflow: create a benchmark, add questions, generate templates, and run verification.
from karenina import Benchmark

# Create a new benchmark
benchmark = Benchmark.create(
    name="Genomics Knowledge Benchmark",
    description="Testing LLM knowledge of genomics and molecular biology",
    version="1.0.0",
    creator="Your Name"
)

# Add questions with answers
questions = [
    ("How many chromosomes are in a human somatic cell?", "46"),
    ("What is the approved drug target of Venetoclax?", "BCL2"),
    ("How many protein subunits does hemoglobin A have?", "4")
]
question_ids = []
for q, a in questions:
    qid = benchmark.add_question(
        question=q,
        raw_answer=a,
        author={"name": "Bio Curator"}
    )
    question_ids.append(qid)

Note: You can also extract questions from Excel, CSV, or TSV files. See Adding Questions for file extraction examples.
from karenina.schemas import ModelConfig

# Configure the LLM for template generation
model_config = ModelConfig(
    id="gpt-4.1-mini",
    model_provider="openai",
    model_name="gpt-4.1-mini",
    temperature=0.1,
    interface="langchain"
)

# Generate templates for all questions
benchmark.generate_all_templates(model_config=model_config)

Note: Templates can also be written manually for complex custom logic. See Templates Guide for details.
from karenina.schemas import LLMRubricTrait

# Add a global rubric trait to assess answer quality
benchmark.add_global_rubric_trait(
    LLMRubricTrait(
        name="Conciseness",
        description="Rate how concise the answer is (1-5)",
        kind="score"
    )
)

from karenina.schemas import VerificationConfig

# Configure verification
config = VerificationConfig(
    answering_models=[model_config],
    parsing_models=[model_config],
    rubric_enabled=True
)

# Run verification
results = benchmark.run_verification(config)

# Analyze results
passed = sum(1 for r in results if r.template.verify_result)
print(f"Pass Rate: {(passed/len(results)*100):.1f}%")

# Save benchmark checkpoint
benchmark.save("genomics_benchmark.jsonld")

# Export results to CSV
from pathlib import Path
benchmark.export_verification_results_to_file(
    file_path=Path("results.csv"),
    format="csv"
)

Congratulations! You've created your first Karenina benchmark with automatic template generation and rubric-based evaluation.
Next steps: Explore the complete tutorial for:
- Question-specific rubrics (regex and metric-based)
- File extraction from Excel/CSV
- Multiple model comparison
- Few-shot prompting
- Result analysis and visualization
For users who prefer working from the terminal, Karenina provides a comprehensive CLI for running verifications without writing Python code. The CLI is ideal for automation, CI/CD pipelines, and quick testing.
# Run verification with a preset configuration
karenina verify checkpoint.jsonld --preset default.json --verbose

# Run with CLI arguments only (no preset required)
karenina verify checkpoint.jsonld \
    --answering-model gpt-4.1-mini \
    --parsing-model gpt-4.1-mini \
    --output results.csv

# Override preset values with CLI flags
karenina verify checkpoint.jsonld \
    --preset default.json \
    --answering-model gpt-4o \
    --questions 0-5

# Interactive configuration builder
karenina verify checkpoint.jsonld --interactive --mode basic

# List available presets
karenina preset list

# Show preset configuration
karenina preset show gpt-oss

# Delete a preset
karenina preset delete old-config

# Start the web server (serves GUI + API)
karenina serve --port 8080

# Initialize the webapp (first-time setup)
karenina init

# Inspect progressive save state from a previous run
karenina verify-status results/

- Flexible Configuration: Use presets, CLI arguments, or interactive mode
- Question Filtering: Select specific questions by index or ID (e.g., 0-5 or 0,2,4)
- Multiple Output Formats: Export results to JSON or CSV with comprehensive metadata
- Progress Monitoring: Real-time progress bars with pass/fail indicators
- Fail-Fast Validation: Validates inputs before running to avoid wasted API calls
- CI/CD Ready: Easy integration with GitHub Actions and other automation tools
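As an illustration of the question filtering mentioned in the list above, the sketch below expands a specification such as 0-5 or 0,2,4 into a list of indices. The parse_question_spec function is hypothetical, and treating ranges as inclusive on both ends is an assumption rather than documented Karenina behavior.

```python
def parse_question_spec(spec: str) -> list[int]:
    """Expand a spec like "0-5" or "0,2,4" into a sorted list of unique indices.

    Assumption: ranges are inclusive on both ends.
    """
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid range: {part!r}")
            indices.update(range(start, end + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


print(parse_question_spec("0-5"))
print(parse_question_spec("0,2,4"))
```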
The CLI supports flexible configuration with clear precedence:
CLI flags > Preset values > Environment variables > Defaults
This means you can use presets for base configuration and override specific values with CLI arguments as needed.
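The precedence rule can be pictured as a layered lookup. The sketch below uses Python's collections.ChainMap to model it; the configuration keys and values are hypothetical and are not taken from Karenina's actual configuration schema.

```python
from collections import ChainMap

# Hypothetical configuration layers, lowest priority last.
defaults = {"answering_model": "default-model", "output": "results.json", "verbose": False}
environment = {"answering_model": "env-model"}
preset = {"answering_model": "preset-model", "output": "results.csv"}
cli_flags = {"answering_model": "gpt-4o"}

# ChainMap searches its mappings left to right, so earlier layers win.
config = ChainMap(cli_flags, preset, environment, defaults)

print(config["answering_model"])  # gpt-4o (from the CLI flags)
print(config["output"])           # results.csv (from the preset)
print(config["verbose"])          # False (from the defaults)
```

Each key resolves to the highest-priority layer that defines it, so a preset can supply most settings while a single flag overrides one of them.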
For complete CLI documentation, including all options, examples, and CI/CD integration guides, see CLI Verification.
Templates play a central role in Karenina by standardizing how answers are parsed, verified, and evaluated. Their use provides several key benefits:
Templates allow parsing to happen directly through the judge LLM. The free-text answer from the answering model is mapped into a structured format (e.g., a Pydantic class), ensuring that:
- Evaluation logic is bundled with the question-answer pair itself
- The same benchmark can seamlessly accommodate different answer formats without custom code
Since LLMs are proficient at code generation, they can often auto-generate Pydantic classes from raw question-answer pairs. This means that large portions of benchmark creation can be partially automated, reducing manual effort while improving consistency.
By embedding the evaluation schema in templates, the judge LLM's task is simplified. Instead of reasoning about both the content and the evaluation logic, the judge focuses only on interpreting the free-text answer and filling in the template.
Templates make it straightforward to extend benchmarks:
- New tasks can be added by defining new templates without re-engineering downstream code
- The same evaluation logic can be reused across multiple benchmarks with minimal adaptation
By encoding evaluation criteria into explicit, inspectable templates, benchmarks become more transparent. This allows developers to:
- Audit the evaluation rules directly
- Debug failures more easily by inspecting the structured outputs rather than opaque free text
While templates excel at verifying factual correctness, many evaluation scenarios require assessing qualitative traits, format compliance, or quantitative metrics. This is where rubrics complement templates.
| Aspect | Answer Templates | Rubrics |
|---|---|---|
| Purpose | Verify factual correctness | Assess qualitative traits, format, and metrics |
| Evaluation Method | Programmatic field comparison | Four approaches: • LLM judgment • Regex patterns • Custom Python functions • Term extraction + metrics |
| Best for | Precise, unambiguous answers | Subjective qualities, format validation, custom logic, quantitative analysis |
| Trait Types | Single verification method | Four types: • LLM-based (qualitative) • Regex-based (format) • Callable (custom Python) • Metric-based (term extraction) |
| Output | Pass/fail per field | • Boolean (binary traits) • Scores 1-5 (score traits) • Class index (literal traits) • Precision/Recall/F1 (metric traits) |
| Examples | "BCL2", "46 chromosomes" | • "Is the answer concise?" (LLM) • Match email pattern (regex) • Extract diseases for F1 score (metric) |
| Scope | Per question | Global or per question |
Karenina supports four types of rubric traits, each suited for different evaluation needs:
1. LLM-Based Traits
AI-evaluated qualitative assessments where a judge LLM evaluates subjective qualities:
- Score-based (1-5): "Rate the scientific accuracy of the answer"
- Binary (pass/fail): "Does the answer mention safety concerns?"
- Literal (classification): "Classify the tone as: formal, casual, or technical"
2. Regex Pattern Traits
Deterministic validation using regular expressions for format compliance:
- "Answer must contain a DNA sequence (pattern: [ATCG]+)"
- "Response must include enzyme names (pattern: \w+ase\b)"
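A small sketch of how such patterns behave, using Python's re module directly rather than Karenina's trait classes. The regex_trait helper and the stricter DNA pattern are hypothetical additions for illustration.

```python
import re

# Patterns quoted in the examples above.
DNA_PATTERN = re.compile(r"[ATCG]+")
ENZYME_PATTERN = re.compile(r"\w+ase\b")

# Hypothetical stricter variant: a standalone run of at least four bases.
STRICT_DNA_PATTERN = re.compile(r"\b[ATCG]{4,}\b")


def regex_trait(pattern: re.Pattern[str], response: str) -> bool:
    """Pass if the pattern occurs anywhere in the response."""
    return pattern.search(response) is not None


print(regex_trait(ENZYME_PATTERN, "DNA polymerase extends the primer."))
# The loose DNA pattern matches the capital "T" in "The", a false positive.
print(regex_trait(DNA_PATTERN, "The cell divides."))
print(regex_trait(STRICT_DNA_PATTERN, "The cell divides."))
print(regex_trait(STRICT_DNA_PATTERN, "Sequence: ATCGGA."))
```

Regex traits are deterministic, but a loose pattern can match more than intended, so it is worth testing patterns against a few negative examples.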
3. Callable Traits
Custom Python functions for domain-specific evaluation logic:
- Word count validation: "Is the response between 50-500 words?"
- Custom scoring: "Count technical terms from a predefined list"
- Complex business rules that can't be expressed as regex
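The two examples above could be written as plain Python functions along these lines. How such functions are registered as callable traits in Karenina is not shown here; the function names and the term list are hypothetical.

```python
# Hypothetical predefined list of technical terms.
TECHNICAL_TERMS = {"apoptosis", "kinase", "mitochondria"}


def word_count_in_range(response: str, minimum: int = 50, maximum: int = 500) -> bool:
    """Binary check: is the response between minimum and maximum words long?"""
    return minimum <= len(response.split()) <= maximum


def count_technical_terms(response: str) -> int:
    """Score: number of distinct technical terms that appear in the response."""
    words = {word.strip(".,;:!?()").lower() for word in response.split()}
    return len(words & TECHNICAL_TERMS)


print(word_count_in_range("too short"))
print(count_technical_terms("Apoptosis depends on mitochondria, not kinase activity alone."))
```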
4. Metric-Based Traits
Quantitative evaluation using confusion matrix metrics:
- Define terms that SHOULD appear (True Positives)
- Define terms that SHOULD NOT appear (False Positives)
- System computes precision, recall, F1, and optionally specificity/accuracy
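The metric computation can be sketched as follows, under the assumption that a term counts as present when it appears as a case-insensitive substring of the response. The term_metrics function is hypothetical, and Karenina's own term extraction may work differently (for example, by asking a judge LLM to extract the terms).

```python
def term_metrics(
    response: str, expected_terms: list[str], forbidden_terms: list[str]
) -> dict[str, float]:
    """Compute precision, recall and F1 from expected and forbidden terms."""
    text = response.lower()
    tp = sum(1 for term in expected_terms if term.lower() in text)   # expected and present
    fn = len(expected_terms) - tp                                    # expected but missing
    fp = sum(1 for term in forbidden_terms if term.lower() in text)  # forbidden but present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


metrics = term_metrics(
    response="The patient shows diabetes, hypertension, and influenza.",
    expected_terms=["diabetes", "hypertension", "obesity"],
    forbidden_terms=["influenza", "asthma"],
)
print(metrics)
```

In this example two of three expected terms are found and one forbidden term appears, so precision, recall and F1 all equal 2/3.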
When to use what:
- Use templates when you need to verify specific factual content or structured data
- Use LLM-based rubrics for subjective quality assessment (clarity, conciseness, tone)
- Use regex rubrics for format compliance and deterministic keyword checks
- Use callable rubrics for custom logic that requires programmatic evaluation
- Use metric rubrics when evaluating classification accuracy by extracting and measuring term coverage
- Use both together for comprehensive evaluation covering correctness AND quality
Learn more about Templates → | Learn more about Rubrics →
Karenina provides comprehensive tools for every stage of the benchmarking workflow:
- Question Management: Extract questions from files (Excel, CSV, TSV) with rich metadata support
- Answer Templates: Pydantic-based templates for structured evaluation and programmatic verification
- Rubric Evaluation: Assess qualitative traits using four types:
  - LLM-based traits (binary pass/fail or 1-5 scale)
  - Regex-based traits (pattern matching for format validation)
  - Callable traits (custom Python functions)
  - Metric-based traits (precision, recall, F1, accuracy)
- Benchmark Verification: Run evaluations with six supported interfaces:
  - langchain (OpenAI, Google Gemini, Anthropic Claude via LangChain)
  - claude_agent_sdk (Native Anthropic Agent SDK)
  - claude_tool (Claude-specific tool use with native structured output)
  - openrouter (OpenRouter platform)
  - openai_endpoint (OpenAI-compatible endpoints for local models)
  - manual (Manual trace replay for testing/debugging)
- 13-Stage Verification Pipeline: Modular, configurable pipeline from template validation through answer generation, parsing, verification, embedding checks, rubric evaluation, and deep-judgment; each stage can be enabled or disabled independently
- Ports & Adapters Architecture: Hexagonal design with protocol interfaces (LLMPort, AgentPort, ParserPort) decoupled from backend implementations, enabling easy addition of new LLM providers
- Sufficiency Check: Validate response quality before parsing (optional stage)
- Deep-Judgment Parsing: Extract verbatim excerpts, reasoning traces, and confidence scores with configurable modes (disabled, enable_all, per-trait custom)
- Abstention Detection: Identify when models refuse to answer questions
- Embedding Check: Semantic similarity fallback using SentenceTransformers to reduce false negatives
- Few-Shot Prompting: Configure examples globally or per question with flexible selection modes
- Task-Centric Evaluation (TaskEval): Attach verification criteria to existing agent traces for evaluation without re-running
- Multi-Model Comparison: Run evaluations across multiple answering models in a single batch
- Async Execution: Parallel processing with configurable worker pools for faster batch runs
- GEPA Integration: Prompt optimization framework with train/test splitting, feedback generation, and improvement tracking
- MCP Integration: Support for Model Context Protocol servers and tool use tracking
- Search-Enhanced Validation: Tavily search integration for hallucination detection and evidence cross-referencing
- Database Persistence: SQLite storage with versioning and 10+ analytical views
- Export & Reporting: CSV and JSON formats for analysis with selective column export
- Preset Management: Save and reuse verification configurations with full hierarchy support
- Progressive Save: Automatic checkpointing during long verification runs with resume capability
View complete feature catalog →
- Python 3.11 or higher
- Git
- uv (a fast Python package manager, recommended)

If you don't have uv installed:
curl -LsSf https://astral.sh/uv/install.sh | sh

For other installation methods, see uv's documentation.
Note: Karenina is not yet published to PyPI. Install from the GitHub repository:
# Clone the repository
git clone https://github.com/biocypher/karenina.git
cd karenina
# Install with uv (recommended)
uv pip install -e .
# Or use pip
pip install -e .

The -e flag installs in editable mode, allowing you to pull updates with git pull without reinstalling.
Configure API keys for LLM providers:
| Provider | Variable | Models |
|---|---|---|
| OpenAI | OPENAI_API_KEY | GPT-4, GPT-4 mini |
| Google | GOOGLE_API_KEY | Gemini |
| Anthropic | ANTHROPIC_API_KEY | Claude |
| OpenRouter | OPENROUTER_API_KEY | Unified access |
Recommended: Create a .env file in your project root
OPENAI_API_KEY="sk-..."
GOOGLE_API_KEY="AIza..."
ANTHROPIC_API_KEY="sk-ant-..."

Then add .env to .gitignore to prevent committing secrets:
echo ".env" >> .gitignore

Alternative: Export to your shell
export OPENAI_API_KEY="sk-..."

Note: API keys can also be passed programmatically via extra_kwargs in ModelConfig. See the Configuration Guide for all options including feature toggles, execution control, and database settings.
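Before running a benchmark, it can help to check which provider keys are actually visible to the Python process. The sketch below uses only the standard library; the missing_api_keys helper is hypothetical, and the variable names are the ones listed in the table above.

```python
import os
from collections.abc import Mapping

# Provider -> environment variable, as listed in the table above.
PROVIDER_KEYS = {
    "OpenAI": "OPENAI_API_KEY",
    "Google": "GOOGLE_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def missing_api_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the providers whose API key variable is unset or empty."""
    return [provider for provider, var in PROVIDER_KEYS.items() if not env.get(var)]


print("Providers without a key:", missing_api_keys())
```

Only the providers you intend to use need a key, so a non-empty result is not necessarily an error.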
Test that Karenina is installed correctly:
from karenina import Benchmark

# Create a simple benchmark
benchmark = Benchmark.create(
    name="test-benchmark",
    description="Installation verification",
    version="1.0.0"
)
print("✓ Karenina installed successfully!")
print(f"✓ Benchmark created: {benchmark.name}")

For detailed setup instructions, troubleshooting, and development installation, see the Installation Guide.
Ready to explore more of Karenina's capabilities? Check out our comprehensive documentation:
You can view the full documentation with a live preview using MkDocs:
# From the karenina directory
uv run mkdocs serve

Then open your browser to http://127.0.0.1:8000 to browse the documentation with full navigation and search.
- Documentation Index - Complete documentation overview with navigation
- Installation Guide - Detailed setup instructions and requirements
- Quick Start Tutorial - Step-by-step guide to your first benchmark
- Features Overview - Complete feature catalog
- Defining Benchmarks - Benchmark creation and metadata
- Adding Questions - File extraction and management
- Templates - Creating and customizing answer templates
- Rubrics - Evaluation criteria and trait types
- Verification - Running evaluations and analyzing results
- CLI Verification - Command-line interface for automation
- Saving & Loading - Checkpoints, database, and export
- Deep-Judgment - Extract detailed feedback with excerpts
- Few-Shot Prompting - Guide responses with examples
- Abstention Detection - Handle model refusals
- Embedding Check - Semantic similarity fallback
- Presets - Save and reuse verification configurations
- API Reference - Complete API documentation
- Configuration - Environment variables and defaults
- Troubleshooting - Common issues and solutions
We welcome contributions to Karenina! Please see our contributing guidelines for more information on how to get involved.