
Add evals for all AI workflows #14

@monsieurBoutte

Description


Create an evaluation framework to measure output quality and consistency across our AI-powered workflows.

Background

We currently have integration tests for functional correctness, but lack evaluations (evals) that measure the quality of AI outputs. As we iterate on prompts, switch models, or add providers, we need a systematic way to detect regressions and compare performance.

Workflows to evaluate

Workflow              Output to evaluate
summarization         Title, description, and tag relevance/accuracy
moderation            Sexual/violence score precision vs ground truth
chapters              Chapter boundary accuracy and title quality
burned-in-captions    Caption rendering correctness
translate-captions    Translation accuracy and fluency
translate-audio       TBD
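
The outputs above could be captured as labeled fixtures, one list per workflow. A minimal sketch in Python; all paths, field names, and values here are hypothetical, not an existing schema:

```python
# Hypothetical eval fixtures. Each example pairs an input asset with the
# expected (or human-rated) output for that workflow.
summarization_fixtures = [
    {
        "input_asset": "fixtures/videos/cooking-demo.mp4",  # hypothetical path
        "expected": {
            "title": "Quick Weeknight Pasta",
            "tags": ["cooking", "pasta", "tutorial"],
        },
    },
]

moderation_fixtures = [
    {
        "input_asset": "fixtures/videos/action-clip.mp4",  # hypothetical path
        # Ground-truth labels to compare model scores against.
        "expected": {"sexual_score": 0.0, "violence_score": 0.6},
    },
]
```

Keeping fixtures as plain data (JSON-serializable dicts) makes them reusable across providers and easy to version alongside the integration tests.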

Acceptance Criteria

  • Eval dataset fixtures with labeled examples (input assets + expected/rated outputs)
  • Scoring functions per workflow type (e.g., similarity, precision, human-preference proxy)
  • Support for comparing results across providers (OpenAI, Anthropic, etc.)
  • CI-integrated eval runner with configurable pass/fail thresholds
  • Eval results output (JSON report, markdown summary, or dashboard)
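
To make the criteria concrete, here is a sketch of a per-workflow scoring function plus a runner with a configurable pass/fail threshold. Everything is illustrative: `run_eval`, `tag_overlap_score`, and the stubbed workflow are hypothetical names, not part of the current codebase.

```python
import json

def tag_overlap_score(expected_tags, actual_tags):
    """Jaccard similarity between expected and predicted tag sets."""
    expected, actual = set(expected_tags), set(actual_tags)
    if not expected and not actual:
        return 1.0
    return len(expected & actual) / len(expected | actual)

def run_eval(examples, workflow_fn, score_fn, threshold):
    """Score each example; the run passes if the mean score meets the threshold."""
    scores = [score_fn(ex["expected"], workflow_fn(ex["input"])) for ex in examples]
    mean = sum(scores) / len(scores)
    return {"scores": scores, "mean": mean, "passed": mean >= threshold}

# Usage with a stubbed workflow standing in for a real provider call:
examples = [{"input": "clip-1", "expected": ["cooking", "pasta"]}]
fake_workflow = lambda asset: ["pasta", "cooking", "dinner"]
report = run_eval(examples, fake_workflow, tag_overlap_score, threshold=0.5)
print(json.dumps(report))  # JSON report, suitable for a CI artifact
```

Swapping `workflow_fn` for different providers (OpenAI, Anthropic, etc.) gives cross-provider comparison for free, and the JSON report satisfies the results-output criterion.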

Suggested Approach

  1. Start with summarization and moderation as pilot workflows
  2. Use a small curated dataset (~3-10 examples per workflow)
  3. Define simple metrics first (exact match, factuality, threshold accuracy)
  4. Expand to LLM-as-judge for subjective quality
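
For step 3, "threshold accuracy" for the moderation pilot could start as the fraction of predicted scores that land within a tolerance of the ground-truth label. A minimal sketch; function names and the tolerance value are assumptions:

```python
def within_tolerance(expected, actual, tol=0.1):
    """True if a predicted moderation score is within tol of the ground truth."""
    return abs(expected - actual) <= tol

def threshold_accuracy(pairs, tol=0.1):
    """Fraction of (expected, actual) score pairs that fall within tolerance."""
    hits = sum(within_tolerance(e, a, tol) for e, a in pairs)
    return hits / len(pairs)

# Usage: three labeled pairs, two within the 0.1 tolerance.
pairs = [(0.0, 0.05), (0.6, 0.75), (0.9, 0.88)]
print(threshold_accuracy(pairs))
```

Exact match for titles/tags and this tolerance check for scores keep the pilot cheap; LLM-as-judge can replace them later for the subjective dimensions (fluency, title quality).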

Related

  • Existing integration tests: tests/integration/
  • Function implementations: src/functions/
