
Commit 7811a6d

feat: establish robust evaluation framework for workflow benchmarks (#457)
## Overview

This PR introduces a robust, automated evaluation framework for Gemini CLI example workflows (Triage, Review, Fixer, Assistant), fulfilling the requirements of #219.

### Key Features

- **Isolated `TestRig`**: Secure, concurrent environment using temporary `GEMINI_CLI_HOME` to prevent interference with local settings.
- **Mock MCP Server**: A dedicated `mock-mcp-server.ts` providing high-fidelity GitHub API simulation, enabling realistic PR Review benchmarks.
- **Gold-Standard Datasets**: Structured benchmarks in `evals/data/` using high-signal technical assertions (e.g., detecting `eval` vulnerabilities or quadratic complexity).
- **Nightly Automation**: Integrated GitHub Action matrix testing across 5 Gemini models.
- **Automated Reporting**: Aggregate reporting script for GitHub Job Summaries.

### Next Steps

- **Data Expansion**: Add more complex edge cases to the existing datasets.
- **Prompt Tuning**: Use this baseline to fine-tune the workflow prompts for even higher reliability.

Related to: #219
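The isolation idea behind `TestRig` can be sketched roughly as follows. This is a minimal illustration, not the `TestRig` from this commit: `createIsolatedHome` and its return shape are hypothetical, and only the `GEMINI_CLI_HOME` variable name comes from the description above. Each call creates a fresh temporary directory and returns a copy of the environment that points at it, so concurrent runs would not share local settings.

```typescript
import { existsSync, mkdtempSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical shape for illustration; the real TestRig API is not shown here.
interface IsolatedHome {
  homeDir: string;
  env: Record<string, string | undefined>;
  cleanup: () => void;
}

// Create a unique temporary directory and an environment that points
// GEMINI_CLI_HOME at it. The caller passes `env` to the child process.
function createIsolatedHome(prefix = "gemini-eval-"): IsolatedHome {
  const homeDir = mkdtempSync(join(tmpdir(), prefix));
  const env: Record<string, string | undefined> = {
    ...process.env,
    GEMINI_CLI_HOME: homeDir,
  };
  return {
    homeDir,
    env,
    // Remove the directory tree once the test case has finished.
    cleanup: () => {
      if (existsSync(homeDir)) {
        rmSync(homeDir, { recursive: true, force: true });
      }
    },
  };
}
```

Copying `process.env` rather than mutating it keeps parallel test cases from overwriting each other's `GEMINI_CLI_HOME`.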
1 parent b7c22b0 commit 7811a6d

20 files changed: +4148 -211 lines changed
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+name: 'Nightly Evaluations'
+
+on:
+  schedule:
+    - cron: '0 1 * * *' # 1 AM UTC
+  workflow_dispatch:
+    inputs:
+      iterations:
+        description: 'Number of iterations per test case'
+        required: true
+        default: '1'
+
+jobs:
+  evaluate:
+    runs-on: 'ubuntu-latest'
+    permissions:
+      contents: 'read'
+    strategy:
+      matrix:
+        model:
+          [
+            'gemini-3-pro-preview',
+            'gemini-3-flash-preview',
+            'gemini-2.5-pro',
+            'gemini-2.5-flash',
+            'gemini-2.5-flash-lite',
+          ]
+    name: 'Evaluate ${{ matrix.model }}'
+
+    steps:
+      - name: 'Checkout code'
+        uses: 'actions/checkout@v4' # ratchet:exclude
+
+      - name: 'Set up Node.js'
+        uses: 'actions/setup-node@v4' # ratchet:exclude
+        with:
+          node-version: '20'
+          cache: 'npm'
+
+      - name: 'Install dependencies'
+        run: |
+          npm ci
+
+      - name: 'Install Gemini CLI'
+        run: 'npm install -g @google/gemini-cli@latest'
+
+      - name: 'Run Evaluations'
+        env:
+          GEMINI_API_KEY: '${{ secrets.GEMINI_API_KEY }}'
+          GEMINI_MODEL: '${{ matrix.model }}'
+        run: |
+          npm run test:evals -- --reporter=json --outputFile=eval-results-${{ matrix.model }}.json
+
+      - name: 'Upload Results'
+        if: 'always()'
+        uses: 'actions/upload-artifact@v4' # ratchet:exclude
+        with:
+          name: 'eval-results-${{ matrix.model }}'
+          path: 'eval-results-${{ matrix.model }}.json'
+
+      - name: 'Job Summary'
+        if: 'always()'
+        run: |
+          npx tsx scripts/aggregate_evals.ts "eval-results-${{ matrix.model }}.json" >> "$GITHUB_STEP_SUMMARY"
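
The contents of `scripts/aggregate_evals.ts` are not shown in this excerpt. As a rough sketch of what a summary step of this kind might do, the function below turns a parsed report into Markdown suitable for `$GITHUB_STEP_SUMMARY`. It assumes the report follows the Jest-compatible layout written by Vitest's `json` reporter (`numTotalTests`, `numPassedTests`, `testResults[].assertionResults[]`); `renderSummary` and the interfaces are hypothetical names and cover only the fields used here.

```typescript
// Assumed subset of the Jest-compatible JSON report; not the script from this commit.
interface AssertionResult {
  fullName: string;
  status: string; // for example "passed" or "failed"
}

interface FileResult {
  name: string;
  assertionResults: AssertionResult[];
}

interface JsonReport {
  numTotalTests: number;
  numPassedTests: number;
  numFailedTests: number;
  testResults: FileResult[];
}

// Render a pass-rate line plus one table row per test case.
function renderSummary(model: string, report: JsonReport): string {
  const rate =
    report.numTotalTests === 0
      ? 0
      : (100 * report.numPassedTests) / report.numTotalTests;
  const lines = [
    `### Eval results: ${model}`,
    "",
    `Passed ${report.numPassedTests} of ${report.numTotalTests} (${rate.toFixed(1)}%)`,
    "",
    "| Test | Status |",
    "| --- | --- |",
  ];
  for (const file of report.testResults) {
    for (const t of file.assertionResults) {
      // Escape pipes so a test name cannot break the table layout.
      lines.push(`| ${t.fullName.replace(/\|/g, "\\|")} | ${t.status} |`);
    }
  }
  return lines.join("\n");
}
```

A caller would read the JSON file named on the command line, pass the parsed object to `renderSummary`, and print the result to standard output so the workflow can append it to the job summary.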

evals/README.md

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
+# Gemini CLI Workflow Evaluations
+
+This directory contains resources for evaluating and improving the example workflows using a TypeScript + Vitest framework.
+
+## Goals
+
+1. **Systematic Testing:** Ensure changes to prompts or configurations improve quality.
+2. **Regression Testing:** Catch degradations in performance.
+3. **Benchmarking:** Compare different models (e.g., `gemini-2.5-pro` vs `gemini-2.5-flash`).
+
+## Structure
+
+- `evals/`:
+  - `test-rig.ts`: Utility to set up a temporary environment for the CLI.
+  - `issue-triage.eval.ts`: Benchmark for the Issue Triage workflow.
+  - `pr-review.eval.ts`: Benchmark for the PR Review workflow.
+  - `issue-fixer.eval.ts`: Benchmark for the autonomous Issue Fixer.
+  - `gemini-assistant.eval.ts`: Benchmark for the interactive Assistant.
+  - `gemini-scheduled-triage.eval.ts`: Benchmark for batch triage.
+  - `data/*.jsonl`: Gold-standard datasets for each workflow.
+  - `vitest.config.ts`: Configuration for the evaluation runner.
+
+## How to Run
+
+### Prerequisites
+
+- `npm install`
+- `gemini-cli` installed and available in your PATH.
+- `GEMINI_API_KEY` environment variable set.
+
+### Run Locally
+
+```bash
+npm run test:evals
+```
+
+To run against a specific model:
+
+```bash
+GEMINI_MODEL=gemini-2.5-flash npm run test:evals
+```
+
+## Adding New Evals
+
+1. Create a new file in `evals/` ending in `.eval.ts`.
+2. Add corresponding test data in `evals/data/`.
+3. Use the `TestRig` to set up files, environment variables, and run the CLI.
+4. Assert the expected behavior (e.g., check `GITHUB_ENV` output or tool calls captured in telemetry).

evals/data/gemini-assistant.json

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+[
+  {
+    "id": "fix-typo",
+    "inputs": {
+      "TITLE": "Fix typo in utils.js",
+      "DESCRIPTION": "There is a typo in the helper function name.",
+      "EVENT_NAME": "issues",
+      "IS_PULL_REQUEST": "false",
+      "ISSUE_NUMBER": "10",
+      "REPOSITORY": "owner/repo",
+      "ADDITIONAL_CONTEXT": "Please fix it."
+    },
+    "expected_actions": ["AI Assistant: Plan of Action"],
+    "expected_plan_keywords": ["search", "grep", "read", "replace", "utils.js"]
+  },
+  {
+    "id": "add-feature",
+    "inputs": {
+      "TITLE": "Add login page",
+      "DESCRIPTION": "We need a login page.",
+      "EVENT_NAME": "issues",
+      "IS_PULL_REQUEST": "false",
+      "ISSUE_NUMBER": "11",
+      "REPOSITORY": "owner/repo",
+      "ADDITIONAL_CONTEXT": "Make it pretty."
+    },
+    "expected_actions": ["AI Assistant: Plan of Action"],
+    "expected_plan_keywords": [
+      "create",
+      "component",
+      "structure",
+      "design",
+      "implement"
+    ]
+  }
+]
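
How `gemini-assistant.eval.ts` scores the `expected_plan_keywords` field is not shown in this excerpt. One plausible approach, sketched here under that assumption, is a case-insensitive substring check that reports which keywords the generated plan mentions. The name `keywordCoverage` and the returned shape are hypothetical.

```typescript
interface KeywordCoverage {
  matched: string[];
  missing: string[];
  ratio: number; // fraction of keywords found in the plan, between 0 and 1
}

// Case-insensitive substring match of each keyword against the plan text.
function keywordCoverage(plan: string, keywords: string[]): KeywordCoverage {
  const text = plan.toLowerCase();
  const matched = keywords.filter((k) => text.includes(k.toLowerCase()));
  const missing = keywords.filter((k) => !text.includes(k.toLowerCase()));
  return {
    matched,
    missing,
    // An empty keyword list is treated as trivially satisfied.
    ratio: keywords.length === 0 ? 1 : matched.length / keywords.length,
  };
}
```

An eval could then assert on `ratio` against a threshold rather than requiring every keyword, since a model may phrase a valid plan without using all of the listed words.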
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+[
+  {
+    "id": "batch-1",
+    "inputs": {
+      "AVAILABLE_LABELS": "bug,enhancement,priority/p0",
+      "ISSUES_TO_TRIAGE": "[{\"number\": 1, \"title\": \"Crash on start\", \"body\": \"It crashes immediately.\"}, {\"number\": 2, \"title\": \"Add help button\", \"body\": \"Users need help.\"}]"
+    },
+    "expected": [
+      {
+        "issue_number": 1,
+        "labels_to_set": ["bug", "priority/p0"]
+      },
+      {
+        "issue_number": 2,
+        "labels_to_set": ["enhancement"]
+      }
+    ]
+  }
+]
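
Two details of this dataset are worth illustrating: `ISSUES_TO_TRIAGE` is a JSON document encoded as a string, and `expected` is keyed by `issue_number`. The sketch below decodes the input and compares a batch result against the expectation, ignoring label order. It assumes the workflow output has the same shape as `expected`; `parseIssues`, `sameLabels`, and `mismatchedIssues` are hypothetical helpers, not code from this commit.

```typescript
interface Issue {
  number: number;
  title: string;
  body: string;
}

interface TriageResult {
  issue_number: number;
  labels_to_set: string[];
}

// ISSUES_TO_TRIAGE is stored as a string, so it needs a second JSON.parse.
function parseIssues(raw: string): Issue[] {
  return JSON.parse(raw) as Issue[];
}

// Compare two label lists as sets, so ordering does not matter.
function sameLabels(a: string[], b: string[]): boolean {
  const setA = new Set(a);
  const setB = new Set(b);
  return setA.size === setB.size && Array.from(setA).every((l) => setB.has(l));
}

// Return the issue numbers whose labels are missing or differ from the expectation.
function mismatchedIssues(expected: TriageResult[], actual: TriageResult[]): number[] {
  const byNumber = new Map<number, string[]>();
  for (const r of actual) {
    byNumber.set(r.issue_number, r.labels_to_set);
  }
  return expected
    .filter((e) => {
      const got = byNumber.get(e.issue_number);
      return got === undefined || !sameLabels(e.labels_to_set, got);
    })
    .map((e) => e.issue_number);
}
```

Returning the mismatched issue numbers, rather than a single boolean, makes a failing eval easier to diagnose from the test report.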

evals/data/issue-fixer.json

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+[
+  {
+    "id": "new-page-request",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "1",
+      "ISSUE_TITLE": "Add a new landing page",
+      "ISSUE_BODY": "We need a landing page for the new product launch."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": ["explore", "create", "file", "add", "content"]
+  },
+  {
+    "id": "bug-fix-request",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "2",
+      "ISSUE_TITLE": "Fix login crash",
+      "ISSUE_BODY": "The app crashes when the user clicks 'forgot password'."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "search",
+      "reproduce",
+      "investigate",
+      "fix",
+      "logic"
+    ]
+  },
+  {
+    "id": "dependency-update",
+    "inputs": {
+      "REPOSITORY": "owner/repo",
+      "ISSUE_NUMBER": "5",
+      "ISSUE_TITLE": "Update lodash to the latest version",
+      "ISSUE_BODY": "We need to update lodash to address a known security vulnerability in older versions."
+    },
+    "expected_actions": ["update_issue", "gh issue comment"],
+    "expected_plan_keywords": [
+      "npm",
+      "install",
+      "update",
+      "package.json",
+      "verify"
+    ]
+  }
+]
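
Because a malformed dataset entry can silently skew a benchmark, a loader may want to validate the shape before running any cases. The sketch below checks the four fields visible in this file and rejects duplicate `id` values. It is an illustration under that assumption, not the loader used by `issue-fixer.eval.ts`; `FixerCase` and `parseFixerCases` are hypothetical names.

```typescript
interface FixerCase {
  id: string;
  inputs: Record<string, string>;
  expected_actions: string[];
  expected_plan_keywords: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((x) => typeof x === "string");
}

// Parse the dataset text and fail fast on structural problems.
function parseFixerCases(json: string): FixerCase[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) {
    throw new Error("dataset must be a JSON array");
  }
  const seen = new Set<string>();
  return data.map((raw, index) => {
    const c = (raw ?? {}) as Partial<FixerCase>;
    if (typeof c.id !== "string" || seen.has(c.id)) {
      throw new Error(`case ${index}: missing or duplicate id`);
    }
    seen.add(c.id);
    if (typeof c.inputs !== "object" || c.inputs === null) {
      throw new Error(`case ${c.id}: inputs must be an object`);
    }
    if (!isStringArray(c.expected_actions) || !isStringArray(c.expected_plan_keywords)) {
      throw new Error(`case ${c.id}: expected_* fields must be string arrays`);
    }
    return c as FixerCase;
  });
}
```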

evals/data/issue-triage.json

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+[
+  {
+    "id": "bug-1",
+    "inputs": {
+      "ISSUE_TITLE": "Application crashes on startup",
+      "ISSUE_BODY": "When I launch the app, it immediately closes with a segfault.",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["bug"],
+    "reason": "Explicit mention of crash and segfault."
+  },
+  {
+    "id": "feature-1",
+    "inputs": {
+      "ISSUE_TITLE": "Add dark mode",
+      "ISSUE_BODY": "It would be great to have a dark mode for better visibility at night.",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["enhancement"],
+    "reason": "Request for a new feature (dark mode)."
+  },
+  {
+    "id": "question-1",
+    "inputs": {
+      "ISSUE_TITLE": "How to run tests?",
+      "ISSUE_BODY": "I cannot find instructions on running the unit tests.",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["question", "documentation"],
+    "reason": "Asking for information/instructions regarding documentation."
+  },
+  {
+    "id": "security-1",
+    "inputs": {
+      "ISSUE_TITLE": "SQL Injection vulnerability in login form",
+      "ISSUE_BODY": "I found a way to bypass login using SQL injection on the username field.",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["bug", "security"],
+    "reason": "Specific security vulnerability mentioned."
+  },
+  {
+    "id": "empty-body",
+    "inputs": {
+      "ISSUE_TITLE": "Feature request: support pnpm",
+      "ISSUE_BODY": "",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["enhancement"],
+    "reason": "Title clearly indicates a feature request despite empty body."
+  },
+  {
+    "id": "vague-bug",
+    "inputs": {
+      "ISSUE_TITLE": "It broke",
+      "ISSUE_BODY": "I was using it and then it just stopped working. No error message.",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["bug"],
+    "reason": "Functional failure reported."
+  },
+  {
+    "id": "translation-req",
+    "inputs": {
+      "ISSUE_TITLE": "Traducción al español",
+      "ISSUE_BODY": "Necesitamos traducir la documentación al español.",
+      "AVAILABLE_LABELS": "bug,enhancement,question,documentation,security,duplicate,wontfix"
+    },
+    "expected": ["documentation", "enhancement"],
+    "reason": "Request for documentation work in another language."
+  }
+]
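
Several cases above expect more than one label, so an exact-match check can be harsher than intended when a model returns only one of them. As one possible alternative, sketched here and not taken from `issue-triage.eval.ts`, labels can be scored with precision and recall, with any label outside `AVAILABLE_LABELS` flagged separately. `scoreLabels` and `LabelScore` are hypothetical names.

```typescript
interface LabelScore {
  precision: number; // share of predicted labels that were expected
  recall: number; // share of expected labels that were predicted
  invalid: string[]; // predicted labels not present in AVAILABLE_LABELS
}

// AVAILABLE_LABELS is a comma-separated string in the dataset.
function scoreLabels(
  expected: string[],
  predicted: string[],
  availableCsv: string,
): LabelScore {
  const available = new Set(availableCsv.split(",").map((s) => s.trim()));
  const expectedSet = new Set(expected);
  const predictedList = Array.from(new Set(predicted));
  const hits = predictedList.filter((l) => expectedSet.has(l)).length;
  return {
    precision: predictedList.length === 0 ? 0 : hits / predictedList.length,
    recall: expectedSet.size === 0 ? 1 : hits / expectedSet.size,
    invalid: predictedList.filter((l) => !available.has(l)),
  };
}
```

For the `security-1` case, a prediction of only `bug` would score full precision but half recall, which distinguishes an incomplete answer from a wrong one.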

evals/data/pr-review.json

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+[
+  {
+    "id": "logic-error",
+    "inputs": {
+      "REPOSITORY": "google-github-actions/run-gemini-cli",
+      "PULL_REQUEST_NUMBER": "454",
+      "ADDITIONAL_CONTEXT": "Focus on logic errors and edge cases."
+    },
+    "expected_tools": [
+      "pull_request_read.get_diff",
+      "add_comment_to_pending_review"
+    ],
+    "expected_findings": ["eval", "untrusted", "calculation", "input"]
+  },
+  {
+    "id": "security-vulnerability",
+    "inputs": {
+      "REPOSITORY": "google-github-actions/run-gemini-cli",
+      "PULL_REQUEST_NUMBER": "454",
+      "ADDITIONAL_CONTEXT": "Security review requested. Check for injection and data exposure."
+    },
+    "expected_tools": [
+      "pull_request_read.get_diff",
+      "add_comment_to_pending_review"
+    ],
+    "expected_findings": ["eval", "injection", "arbitrary", "execution"]
+  },
+  {
+    "id": "performance-optimization",
+    "inputs": {
+      "REPOSITORY": "google-github-actions/run-gemini-cli",
+      "PULL_REQUEST_NUMBER": "454",
+      "ADDITIONAL_CONTEXT": "The current implementation is slow on large datasets. Look for performance bottlenecks."
+    },
+    "expected_tools": [
+      "pull_request_read.get_diff",
+      "add_comment_to_pending_review"
+    ],
+    "expected_findings": ["nested", "loop", "quadratic", "n^2"]
+  }
+]
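
Each PR review case pairs `expected_tools` with `expected_findings`. A check over both might look like the sketch below, which assumes the eval has already collected the tool call names and the review text (how `pr-review.eval.ts` captures them is not shown here). Requiring every tool but only a minimum number of findings is an assumption made for this example, since lists such as `quadratic` and `n^2` read as alternative phrasings. `checkReview` and its types are hypothetical.

```typescript
interface ReviewCase {
  expected_tools: string[];
  expected_findings: string[];
}

interface ReviewCheck {
  missingTools: string[];
  foundFindings: string[];
  pass: boolean;
}

// Every expected tool must have been called; at least `minFindings`
// of the expected finding terms must appear in the review text.
function checkReview(
  testCase: ReviewCase,
  toolCalls: string[],
  reviewText: string,
  minFindings = 1,
): ReviewCheck {
  const called = new Set(toolCalls);
  const missingTools = testCase.expected_tools.filter((t) => !called.has(t));
  const text = reviewText.toLowerCase();
  const foundFindings = testCase.expected_findings.filter((f) =>
    text.includes(f.toLowerCase()),
  );
  return {
    missingTools,
    foundFindings,
    pass: missingTools.length === 0 && foundFindings.length >= minFindings,
  };
}
```

Raising `minFindings` tightens the benchmark for models that already pass with a single matched term.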
