Consolidated model evaluation framework for the matric ecosystem.
v0.1.0 - Production-ready with 1500+ tests passing.
Standardized benchmarking of LLMs across multiple inference providers:
- Public benchmarks: HumanEval, MBPP, GSM8K, ARC, IFEval, LiveCodeBench, DS-1000, MMLU, MT-Bench
- Custom tests: Application-specific evaluations for matric-cli and matric-memory
- Tool calling: 6-scenario evaluation with correctness scoring
- LLM-as-Judge: Multi-turn conversation and reasoning assessment
- Multi-provider: Evaluate across Ollama, vLLM, llama.cpp, OpenRouter, and Chutes
- Thinking models: Extended reasoning support with thinking-on/off modes
```shell
# From Gitea PyPI registry
pip install matric-eval --index-url https://git.integrolabs.net/api/packages/roctinam/pypi/simple/

# Or install from source
git clone https://git.integrolabs.net/roctinam/matric-eval.git
cd matric-eval
uv sync
```

```shell
# Smoke test on a specific model (defaults to Ollama)
matric-eval run --tier smoke --model llama3.2:3b

# Use a different provider
matric-eval run --provider vllm --model meta-llama/Llama-3.2-3B --tier smoke
matric-eval run --provider openrouter --api-key $OPENROUTER_API_KEY --model anthropic/claude-3.5-sonnet

# Multi-provider matrix evaluation
matric-eval run --matrix eval-matrix.yaml

# List available providers and their status
matric-eval list-providers --check-availability

# List available benchmarks
matric-eval list-benchmarks

# List available Ollama models
matric-eval list-models

# Get model recommendations from results
matric-eval recommend --results-dir ./results

# Validate run completeness
matric-eval validate --results-dir ./results
```

| Command | Description |
|---|---|
| run | Run model evaluation with tier and provider selection |
| list-benchmarks | List available benchmarks with descriptions |
| list-models | List available Ollama models |
| list-providers | List available inference providers |
| recommend | Generate model recommendations from results |
| validate | Check run completeness and identify gaps |
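The `validate` command's gap check can be sketched as a set difference between expected and present results. The layout below (one `{model}/{benchmark}.json` file per completed run) is an assumption for illustration, not matric-eval's actual on-disk format:

```python
from pathlib import Path

def find_gaps(results_dir: str, models: list[str], benchmarks: list[str]) -> list[tuple[str, str]]:
    """Return (model, benchmark) pairs with no result file on disk."""
    root = Path(results_dir)
    return [
        (m, b)
        for m in models
        for b in benchmarks
        if not (root / m / f"{b}.json").exists()
    ]

# An empty (or missing) results dir reports every pair as a gap.
gaps = find_gaps("/tmp/empty-results", ["llama3.2:3b"], ["humaneval", "gsm8k"])
print(gaps)  # [('llama3.2:3b', 'humaneval'), ('llama3.2:3b', 'gsm8k')]
```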
| Provider | Type | CLI Flag | Description |
|---|---|---|---|
| Ollama | Local | --provider ollama | Default. Local Ollama instance |
| llama.cpp | Local | --provider llama-cpp | Direct GGUF model serving |
| vLLM | Local/Cloud | --provider vllm | High-throughput GPU inference |
| OpenRouter | Cloud | --provider openrouter | 100+ models via unified API |
| Chutes | Cloud | --provider chutes | Serverless GPU inference |
| Tier | Tests per Benchmark | Duration | Use Case |
|---|---|---|---|
| smoke | 5 | ~2 min | Quick validation |
| quick | 75 | ~20 min | Statistical sampling |
| full | all | ~2+ hours | Complete evaluation |
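The durations above imply roughly 16 seconds per test (75 tests in ~20 min). A back-of-the-envelope estimate for a full run, assuming sequential execution (real throughput depends on model, provider, and parallelism):

```python
# Rough per-tier runtime estimate; 16 s/test is inferred from the tier table
# above, not a guarantee from matric-eval itself.
TIER_SAMPLES = {"smoke": 5, "quick": 75}
SECONDS_PER_TEST = 16

def estimate_minutes(tier: str, num_benchmarks: int) -> float:
    return TIER_SAMPLES[tier] * num_benchmarks * SECONDS_PER_TEST / 60

print(estimate_minutes("quick", 1))  # 20.0 min, matching the table
print(estimate_minutes("smoke", 9))  # ~12 min to smoke-test all nine benchmarks
```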
| Benchmark | Category | Tests | Description |
|---|---|---|---|
| HumanEval | Code Generation | 164 | Function completion |
| MBPP | Code Generation | 974 | Python problems |
| GSM8K | Math Reasoning | 1,319 | Grade school math |
| ARC | Reasoning | 1,172 | Science questions |
| IFEval | Instruction Following | 541 | Constraint checking |
| LiveCodeBench | Competitive Programming | 1,055 | Contest problems (release_v6) |
| DS-1000 | Data Science | 1,000 | Pandas/NumPy tasks |
| MMLU | Knowledge | 14,042 | Multiple choice questions (57 subjects) |
| MT-Bench | Multi-turn | 80 | Conversation quality |
| Tool Calling | Agentic | 6 | Function invocation |
```
Application -> matric-eval

1. DISCOVER -> Query provider for available models
2. PUBLIC   -> Run standard benchmarks via Inspect AI
3. RANK     -> Filter top performers
4. CUSTOM   -> Run app-specific tests
5. CONFIG   -> Generate recommendations
```

Provider Abstraction:

```
CLI -> EvaluationEngine -> Provider -> Inspect AI -> Backend
                               |
                   +-----------+-----------+
                   |           |           |
                Ollama       vLLM     OpenRouter ...
```
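The provider layer can be sketched as a structural interface that the engine codes against, so backends stay interchangeable. The names here (`Provider`, `generate`, `run_eval`) are illustrative, not the package's actual API:

```python
from typing import Protocol

class Provider(Protocol):
    """Common surface the engine expects from every backend (illustrative)."""
    name: str
    def generate(self, model: str, prompt: str) -> str: ...

class OllamaProvider:
    name = "ollama"
    def generate(self, model: str, prompt: str) -> str:
        # A real implementation would call the local Ollama HTTP API;
        # stubbed here so the sketch is self-contained.
        return f"[{self.name}/{model}] response to: {prompt}"

def run_eval(provider: Provider, model: str, prompts: list[str]) -> list[str]:
    # The engine only talks to the Provider interface, so Ollama, vLLM,
    # OpenRouter, etc. can be swapped without touching evaluation logic.
    return [provider.generate(model, p) for p in prompts]

outputs = run_eval(OllamaProvider(), "llama3.2:3b", ["2+2=?"])
```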
For multi-provider comparison, create a YAML matrix config:
```yaml
evaluation:
  models:
    - llama3.2:3b
    - mistral:7b
  providers:
    - ollama
    - vllm
  benchmarks:
    - humaneval
    - gsm8k
  tier: smoke

matrix:
  mode: cartesian
  exclude:
    - model: mistral:7b
      provider: vllm
```

Then run:

```shell
matric-eval run --matrix eval-matrix.yaml
```
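Cartesian mode expands to every model x provider pair, minus the explicit exclusions. A sketch of that expansion (field names mirror the YAML above; the real logic lives inside matric-eval):

```python
from itertools import product

def expand_matrix(models: list[str], providers: list[str], exclude: list[dict]) -> list[tuple[str, str]]:
    """All (model, provider) combinations minus excluded pairs."""
    excluded = {(e["model"], e["provider"]) for e in exclude}
    return [pair for pair in product(models, providers) if pair not in excluded]

pairs = expand_matrix(
    ["llama3.2:3b", "mistral:7b"],
    ["ollama", "vllm"],
    [{"model": "mistral:7b", "provider": "vllm"}],
)
print(pairs)
# [('llama3.2:3b', 'ollama'), ('llama3.2:3b', 'vllm'), ('mistral:7b', 'ollama')]
```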
For matric-cli integration:

```shell
npm install @matric/eval-client --registry https://git.integrolabs.net/api/packages/roctinam/npm/
```

```typescript
import { createClient } from '@matric/eval-client';

const client = createClient();
const results = await client.run({ tier: 'smoke', models: ['llama3.2:3b'] });
const recommendations = await client.recommend({ resultsDir: './results' });
```

Drop any git repo or directory into datasets/ and it's auto-discovered as a benchmark:
```shell
# Clone a dataset repo
git clone https://example.com/my-eval-data.git datasets/my-eval

# Or add as submodule
git submodule add https://example.com/my-eval-data.git datasets/my-eval

# It just works
matric-eval list-benchmarks  # shows "my-eval"
matric-eval run --benchmark my-eval --tier smoke --model llama3.2:3b
```

Zero config for JSONL files with input/target fields. For more control, add a dataset.yaml:
```yaml
name: my-benchmark
description: Domain-specific evaluation
scorer: match
tiers: { smoke: 5, quick: 50, full: 0 }
field_mapping: { input: question, target: answer }
```

Configure the dataset root with EVAL_DATASETS_DIR=/path/to/datasets.
- Multi-Provider: Evaluate across Ollama, vLLM, llama.cpp, OpenRouter, Chutes
- External Datasets: Auto-discover datasets from git clones/submodules with zero config
- Thinking Models: Extended reasoning support with auto-detection
- Checkpoint/Resume: Fault-tolerant evaluation with automatic recovery
- Evaluation Matrix: YAML-based multi-provider comparison runs
- Parallel Execution: Concurrent model evaluation
- Structured Logging: JSON logs for observability
- Model Recommendations: Capability-based model selection
- 1500+ Tests: Comprehensive test suite with 80%+ coverage
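Checkpoint/resume can be little more than a persisted ledger of completed task IDs consulted before each task. A minimal sketch (the file layout is assumed, not matric-eval's actual checkpoint format):

```python
import json
from pathlib import Path

class Checkpoint:
    """Persist completed task IDs so an interrupted run can resume."""
    def __init__(self, path: Path):
        self.path = path
        self.done = set(json.loads(path.read_text())) if path.exists() else set()

    def mark(self, task_id: str) -> None:
        self.done.add(task_id)
        self.path.write_text(json.dumps(sorted(self.done)))

    def pending(self, tasks: list[str]) -> list[str]:
        return [t for t in tasks if t not in self.done]

ckpt = Checkpoint(Path("/tmp/matric-eval-ckpt.json"))
tasks = ["llama3.2:3b/humaneval", "llama3.2:3b/gsm8k"]
for task in ckpt.pending(tasks):
    ckpt.mark(task)  # ...run the task, then record it; a crash loses at most one task
print(ckpt.pending(tasks))  # [] - a rerun against the same checkpoint skips everything
```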
- docs/ - Full project documentation
- Architecture - System design
- Requirements - Vision and use cases
- Testing - Development workflow
- CLAUDE.md - AI assistant context
MIT