Description
Introduce a test-driven evaluation mechanism for AI agents within the Ballerina test framework. This enhancement will allow developers to write evaluation tests that treat AI agent behavior as first-class testable functionality, using familiar `bal test` workflows. Features include:
- Aggregate pass/fail evaluation across datasets using a configurable `minPassRate` threshold
- Support for multiple evaluation runs (a `runs` field) to account for non-deterministic outputs from AI agents
- Enhanced test reports with pass rates, dataset inputs, and timestamped historical results
Describe your problem(s)
Integration engineers currently have to write repetitive boilerplate code when creating AI evaluation tests to:
- Compare actual results with expected values for each dataset entry and record individual outcomes
- Aggregate outcomes across multiple runs and calculate pass rates across datasets
- Manually apply threshold-based evaluation for non-deterministic AI outputs
This approach leads to verbose, error-prone, and difficult-to-maintain tests, slowing test development and hindering adoption of AI evaluation best practices.
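To illustrate the boilerplate described above, here is a minimal sketch of the kind of manual aggregation an engineer writes today. All names (`callAgent`, the dataset entries, the thresholds) are hypothetical placeholders, not part of any existing API:

```ballerina
import ballerina/test;

// Hypothetical agent invocation; stands in for a real AI agent call.
function callAgent(string prompt) returns string|error {
    return "a1";
}

@test:Config {}
function evaluateAgentManually() returns error? {
    // Hand-maintained dataset of (input, expected output) pairs.
    [string, string][] dataset = [["q1", "a1"], ["q2", "a2"]];
    int runs = 5;             // repeat each entry to absorb non-determinism
    float minPassRate = 0.8;  // threshold applied by hand

    int passed = 0;
    int total = dataset.length() * runs;
    foreach var [prompt, expected] in dataset {
        foreach int i in 0 ..< runs {
            string actual = check callAgent(prompt);
            if actual == expected {
                passed += 1;
            }
        }
    }

    // Manual threshold-based evaluation across all entry-runs.
    float passRate = <float>passed / <float>total;
    if passRate < minPassRate {
        test:assertFail(string `pass rate ${passRate} is below ${minPassRate}`);
    }
}
```

Every evaluation test repeats this compare-aggregate-threshold scaffolding, which is exactly what the proposal aims to move into the framework.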
Describe your solution(s)
The proposed solution addresses these issues through the following enhancements to the test framework.
- Add `minPassRate` and `runs` fields in `@test:Config`
- Treat the dataset as a single evaluation test, calculating the aggregate pass rate across entries and runs
- Fail the test only if the overall pass rate falls below the configured threshold
- Generate enhanced test reports showing pass rates, dataset inputs, and timestamped outputs
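Under the proposal, the boilerplate collapses into configuration. The sketch below shows how the proposed fields might look in practice; the field names `minPassRate` and `runs` come from this proposal and are not yet part of the test framework, and `callAgent` and the data provider contents are hypothetical:

```ballerina
import ballerina/test;

// Hypothetical agent invocation used for illustration only.
function callAgent(string prompt) returns string|error {
    return "a1";
}

// Dataset supplied through the standard data-provider mechanism.
function agentEvalData() returns string[][] {
    return [["q1", "a1"], ["q2", "a2"]];
}

@test:Config {
    dataProvider: agentEvalData,
    minPassRate: 0.8,  // proposed: fail only if overall pass rate < 80%
    runs: 5            // proposed: execute each dataset entry 5 times
}
function evaluateAgent(string prompt, string expected) returns error? {
    string actual = check callAgent(prompt);
    test:assertEquals(actual, expected);
}
```

The framework would then record each entry-run outcome, compute the aggregate pass rate across the whole dataset, and report it alongside the dataset inputs in the enhanced test report.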
For further details, see the spec proposal.
Related area
Compilation
Related issue(s) (optional)
Related issue: ballerina-platform/ballerina-spec#1402
Suggested label(s) (optional)
No response
Suggested assignee(s) (optional)
No response