eval-advisor is a skill for advising on, brainstorming, designing, reviewing, and improving AI evaluation systems for LLM applications.
It guides teams through practical evaluation workflows:
- Running error analysis before writing evals
- Choosing the right evaluator type (code assertions, LLM-as-judge, guardrails)
- Validating LLM judges with human labels using TPR/TNR
- Sampling and analyzing traces effectively
- Generating structured synthetic test data when production traces are limited
- Avoiding common eval anti-patterns (generic metrics, Likert scales, unvalidated judges, 100% pass-rate suites)
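To make the TPR/TNR validation step concrete, here is a minimal, hypothetical sketch of comparing an LLM judge's verdicts against human labels. The label convention (True = pass, False = fail) and the example data are assumptions, not part of the skill itself:

```python
# Hypothetical sketch: validating an LLM judge against human labels
# using true-positive rate (TPR) and true-negative rate (TNR).
# Convention (an assumption): True = "pass", False = "fail".

def judge_agreement(human_labels, judge_labels):
    """Compute TPR and TNR of a judge relative to human ground truth."""
    pairs = list(zip(human_labels, judge_labels))
    tp = sum(1 for h, j in pairs if h and j)          # judge agrees on "pass"
    tn = sum(1 for h, j in pairs if not h and not j)  # judge agrees on "fail"
    pos = sum(1 for h, _ in pairs if h)               # human "pass" count
    neg = len(pairs) - pos                            # human "fail" count
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return tpr, tnr

# Illustrative data: 6 human-labeled traces vs. judge verdicts
human = [True, True, True, False, False, True]
judge = [True, True, False, False, True, True]
tpr, tnr = judge_agreement(human, judge)
print(f"TPR={tpr:.2f}, TNR={tnr:.2f}")  # TPR=0.75, TNR=0.50
```

Reporting TPR and TNR separately (rather than a single accuracy number) matters because pass/fail classes are usually imbalanced in real traces.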
Key files and directories:
- eval-advisor/SKILL.md: Main skill instructions and trigger guidance
- EVAL_MASTER.md: Canonical 12-workflow routing and file-loading index
- eval-advisor/references/: Deep-dive reference docs (error analysis, evaluator types, judge validation, sampling, synthetic data, anti-patterns)
- eval-advisor/workflows/: Actionable checklists and decision trees
- eval-advisor/templates/: Reusable templates for failure taxonomies, judge prompts, and synthetic data prompts
Use this skill when designing new evals, auditing existing eval suites, selecting evaluation strategies, or diagnosing quality failures in AI systems.
The core philosophy is:
- Look at real failures first (error analysis)
- Use application-specific, binary pass/fail criteria
- Prefer the cheapest reliable evaluator
- Validate any LLM judge rigorously before relying on it
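As one illustration of "application-specific, binary pass/fail criteria" at the cheapest evaluator tier, a plain code assertion can replace an LLM judge entirely. The scenario, function name, and checks below are hypothetical:

```python
import re

# Hypothetical sketch of an application-specific code assertion.
# Scenario (assumed): a support bot's refund reply must cite the
# refund policy and must not promise a specific dollar amount.

def eval_refund_response(output: str) -> bool:
    """Binary pass/fail check for one failure mode found in error analysis."""
    mentions_policy = "refund policy" in output.lower()
    promises_amount = bool(re.search(r"\$\d", output))  # e.g. "$50"
    return mentions_policy and not promises_amount

assert eval_refund_response("Per our refund policy, we'll review your case.")
assert not eval_refund_response("Sure, we'll refund you $50 today.")
```

A check like this is deterministic, free to run on every trace, and needs no validation against human labels, which is why it is preferred wherever it is reliable enough.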
This skill was created from public material released by Hamel Husain.