[New Feature]: Enhance ballerina test framework to support evaluation #44471

@MohamedSabthar

Description

Introduce a test-driven evaluation mechanism for AI agents within the Ballerina test framework. This enhancement will allow developers to write evaluation tests that treat AI agent behavior as first-class testable functionality, using familiar bal test workflows. Features include:

  • Aggregate pass/fail evaluation across datasets using a configurable minPassRate threshold
  • Support for multiple evaluation runs (runs field) to account for non-deterministic outputs from AI agents
  • Enhanced test reports with pass rates, dataset inputs, and timestamped historical results

Describe your problem(s)

Integration engineers currently have to write repetitive boilerplate code when creating AI evaluation tests to:

  • Compare actual results with expected values for each dataset entry and record individual outcomes
  • Aggregate outcomes across multiple runs and calculate pass rates across datasets
  • Manually apply threshold-based evaluation for non-deterministic AI outputs

This approach leads to verbose, error-prone, and difficult-to-maintain tests, slowing test development and hindering adoption of AI evaluation best practices.

Describe your solution(s)

The proposed solution addresses these issues through the following enhancements to the test framework:

  • Add minPassRate and runs fields to @test:Config
  • Treat the dataset as a single evaluation test, calculating the aggregate pass rate across entries and runs
  • Fail the test only if the overall pass rate falls below the configured threshold
  • Generate enhanced test reports showing pass rates, dataset inputs, and timestamped outputs
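
To make the intended semantics concrete, here is a minimal sketch of the aggregation described in the list above. It is written in Python purely for illustration and is not the Ballerina implementation: the names `evaluate` and `EvaluationResult` are hypothetical, equality is assumed as the comparison between actual and expected values, and a pass rate exactly equal to the threshold is assumed to pass (the test fails only when the rate falls below it).

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class EvaluationResult:
    """Hypothetical summary of one evaluation test."""
    passed: int       # (entry, run) evaluations whose output matched
    total: int        # len(dataset) * runs
    pass_rate: float  # passed / total
    ok: bool          # True when pass_rate >= min_pass_rate


def evaluate(
    agent: Callable[[Any], Any],
    dataset: Sequence[Tuple[Any, Any]],
    runs: int = 1,
    min_pass_rate: float = 1.0,
) -> EvaluationResult:
    """Treat the whole dataset as one test and aggregate over entries and runs.

    Assumption: exceptions raised by the agent are not caught here; how the
    real framework would count them is not stated in this issue.
    """
    if runs < 1 or not dataset:
        raise ValueError("need at least one run and one dataset entry")

    passed = 0
    for _ in range(runs):
        for given, expected in dataset:
            # Individual mismatches are only counted, never raised.
            if agent(given) == expected:
                passed += 1

    total = len(dataset) * runs
    pass_rate = passed / total
    # Fail only when the overall rate falls below the threshold.
    return EvaluationResult(passed, total, pass_rate, pass_rate >= min_pass_rate)


# Usage: a stand-in "agent" that answers one of four entries incorrectly.
dataset = [("2+2", "4"), ("3+3", "6"), ("5+5", "10"), ("1+1", "2")]
answers = {"2+2": "4", "3+3": "6", "5+5": "10", "1+1": "3"}

result = evaluate(answers.get, dataset, runs=3, min_pass_rate=0.75)
strict = evaluate(answers.get, dataset, runs=3, min_pass_rate=0.8)
```

In this sketch, 4 entries across 3 runs give 12 evaluations, of which 9 match, so the pass rate is 0.75. The test passes with a threshold of 0.75 and fails with 0.8, even though the same entry mismatches in every run.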

For further details see the spec proposal

Related area

-> Compilation

Related issue(s) (optional)

Related issue: ballerina-platform/ballerina-spec#1402

Suggested label(s) (optional)

No response

Suggested assignee(s) (optional)

No response
