Add evals for most AI workflows
Description
Create an evaluation framework to measure output quality and consistency across our AI-powered workflows.
Background
We currently have integration tests for functional correctness, but lack evaluations (evals) that measure the quality of AI outputs. As we iterate on prompts, switch models, or add providers, we need a systematic way to detect regressions and compare performance.
Workflows to evaluate
| Workflow | Output to evaluate |
|---|---|
| summarization | Title, description, and tag relevance/accuracy |
| moderation | Sexual/violence score precision vs. ground truth |
| chapters | Chapter boundary accuracy and title quality |
| burned-in-captions | Caption rendering correctness |
| translate-captions | Translation accuracy and fluency |
| translate-audio | TBD |
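To make the "labeled examples" concrete, a fixture entry for the `moderation` workflow could look like the sketch below. All field names (`asset_path`, `expected_sexual`, `tolerance`, and the `fixtures/` paths) are illustrative assumptions, not an existing schema in this repo.

```python
from dataclasses import dataclass

@dataclass
class ModerationFixture:
    asset_path: str           # input asset under the eval fixtures directory
    expected_sexual: float    # human-labeled ground-truth score, 0.0-1.0
    expected_violence: float  # human-labeled ground-truth score, 0.0-1.0
    tolerance: float = 0.1    # allowed absolute deviation when scoring

# Hypothetical curated dataset (paths and labels are placeholders)
FIXTURES = [
    ModerationFixture("fixtures/moderation/clip_001.mp4", 0.0, 0.8),
    ModerationFixture("fixtures/moderation/clip_002.mp4", 0.2, 0.0),
]
```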
Acceptance Criteria
- Eval dataset fixtures with labeled examples (input assets + expected/rated outputs)
- Scoring functions per workflow type (e.g., similarity, precision, human-preference proxy)
- Support for comparing results across providers (OpenAI, Anthropic, etc.)
- CI-integrated eval runner with configurable pass/fail thresholds
- Eval results output (JSON report, markdown summary, or dashboard)
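The runner and report criteria above could be sketched roughly as follows. The function and parameter names (`run_eval`, `score_fn`, `threshold`) are hypothetical, chosen only to illustrate the pass/fail-with-JSON-report shape:

```python
import json

def run_eval(results, expected, score_fn, threshold=0.8):
    """Score paired (result, expected) outputs; pass iff the mean score
    meets the configurable threshold. Returns a JSON-serializable report."""
    scores = [score_fn(r, e) for r, e in zip(results, expected)]
    mean = sum(scores) / len(scores)
    return {
        "scores": scores,
        "mean": round(mean, 3),
        "threshold": threshold,
        "passed": mean >= threshold,
    }

# Example: exact-match scoring on summarization tags
exact = lambda r, e: 1.0 if r == e else 0.0
report = run_eval(["news", "sports"], ["news", "music"], exact, threshold=0.5)
print(json.dumps(report))
```

In CI, the `passed` flag would gate the job, while the full report is uploaded as an artifact for cross-provider comparison.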
Suggested Approach
- Start with `summarization` and `moderation` as pilot workflows
- Use a small curated dataset (~3-10 examples per workflow)
- Define simple metrics first (exact match, factuality, threshold accuracy)
- Expand to LLM-as-judge for subjective quality
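As one example of the "simple metrics first" step, a threshold-accuracy metric for moderation scores could binarize each score at a cutoff and measure agreement with ground truth. This is a minimal sketch under assumed inputs, not a committed metric definition:

```python
def threshold_accuracy(predicted, ground_truth, cutoff=0.5):
    """Fraction of items where predicted and ground-truth scores fall on
    the same side of the cutoff (i.e., binary flag agreement)."""
    flags_pred = [p >= cutoff for p in predicted]
    flags_true = [t >= cutoff for t in ground_truth]
    agree = sum(p == t for p, t in zip(flags_pred, flags_true))
    return agree / len(flags_pred)

# e.g. model scores vs. labeled scores for four hypothetical clips
acc = threshold_accuracy([0.9, 0.1, 0.6, 0.2], [1.0, 0.0, 0.4, 0.0])
```

This gives a coarse but stable signal for regressions; finer-grained metrics (mean absolute error on the raw scores, LLM-as-judge for subjective quality) can layer on afterward.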
Related
- Existing integration tests: `tests/integration/`
- Function implementations: `src/functions/`