Add evals for most AI workflows
Description
Create an evaluation framework to measure output quality and consistency across our AI-powered workflows.
Background
We currently have integration tests for functional correctness, but lack evaluations (evals) that measure the quality of AI outputs. As we iterate on prompts, switch models, or add providers, we need a systematic way to detect regressions and compare performance.
Workflows to evaluate
| Workflow | Output to evaluate |
|---|---|
| summarization | Title, description, and tag relevance/accuracy |
| moderation | Sexual/violence score precision vs. ground truth |
| chapters | Chapter boundary accuracy and title quality |
| burned-in-captions | Caption rendering correctness |
| translate-captions | Translation accuracy and fluency |
| translate-audio | TBD |
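To make the "labeled examples" concrete, a fixture entry for the `moderation` workflow could look like the sketch below. All field names (`asset_path`, `expected_sexual`, `tolerance`, and the `fixtures/` paths) are illustrative assumptions, not an existing schema in this repo.

```python
from dataclasses import dataclass

@dataclass
class ModerationFixture:
    asset_path: str           # input asset under the eval fixtures directory
    expected_sexual: float    # human-labeled ground-truth score, 0.0-1.0
    expected_violence: float  # human-labeled ground-truth score, 0.0-1.0
    tolerance: float = 0.1    # allowed absolute deviation when scoring

# Hypothetical curated dataset (paths and labels are placeholders)
FIXTURES = [
    ModerationFixture("fixtures/moderation/clip_001.mp4", 0.0, 0.8),
    ModerationFixture("fixtures/moderation/clip_002.mp4", 0.2, 0.0),
]
```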
Acceptance Criteria
- Eval dataset fixtures with labeled examples (input assets + expected/rated outputs)
- Scoring functions per workflow type (e.g., similarity, precision, human-preference proxy)
- Support for comparing results across providers (OpenAI, Anthropic, etc.)
- CI-integrated eval runner with configurable pass/fail thresholds
- Eval results output (JSON report, markdown summary, or dashboard)
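The runner and report criteria above could be sketched roughly as follows. The function and parameter names (`run_eval`, `score_fn`, `threshold`) are hypothetical, chosen only to illustrate the pass/fail-with-JSON-report shape:

```python
import json

def run_eval(results, expected, score_fn, threshold=0.8):
    """Score paired (result, expected) outputs; pass iff the mean score
    meets the configurable threshold. Returns a JSON-serializable report."""
    scores = [score_fn(r, e) for r, e in zip(results, expected)]
    mean = sum(scores) / len(scores)
    return {
        "scores": scores,
        "mean": round(mean, 3),
        "threshold": threshold,
        "passed": mean >= threshold,
    }

# Example: exact-match scoring on summarization tags
exact = lambda r, e: 1.0 if r == e else 0.0
report = run_eval(["news", "sports"], ["news", "music"], exact, threshold=0.5)
print(json.dumps(report))
```

In CI, the `passed` flag would gate the job, while the full report is uploaded as an artifact for cross-provider comparison.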
Suggested Approach
- Start with `summarization` and `moderation` as pilot workflows
- Use a small curated dataset (~3-10 examples per workflow)
- Define simple metrics first (exact match, factuality, threshold accuracy)
- Expand to LLM-as-judge for subjective quality
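As one example of the "simple metrics first" step, a threshold-accuracy metric for moderation scores could binarize each score at a cutoff and measure agreement with ground truth. This is a minimal sketch under assumed inputs, not a committed metric definition:

```python
def threshold_accuracy(predicted, ground_truth, cutoff=0.5):
    """Fraction of items where predicted and ground-truth scores fall on
    the same side of the cutoff (i.e., binary flag agreement)."""
    flags_pred = [p >= cutoff for p in predicted]
    flags_true = [t >= cutoff for t in ground_truth]
    agree = sum(p == t for p, t in zip(flags_pred, flags_true))
    return agree / len(flags_pred)

# e.g. model scores vs. labeled scores for four hypothetical clips
acc = threshold_accuracy([0.9, 0.1, 0.6, 0.2], [1.0, 0.0, 0.4, 0.0])
```

This gives a coarse but stable signal for regressions; finer-grained metrics (mean absolute error on the raw scores, LLM-as-judge for subjective quality) can layer on afterward.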
Related
- Existing integration tests: `tests/integration/`
- Function implementations: `src/functions/`