Description
Introduce a test-driven evaluation mechanism for AI agents within the Ballerina test framework. This enhancement will allow developers to write evaluation tests that treat AI agent behavior as first-class testable functionality, using familiar `bal test` workflows. Features include:
- Aggregate pass/fail evaluation across datasets using a configurable `minPassRate` threshold
- Support for multiple evaluation runs (a `runs` field) to account for non-deterministic outputs from AI agents
- Enhanced test reports with pass rates, dataset inputs, and timestamped historical results
Describe your problem(s)
Integration engineers currently have to write repetitive boilerplate code when creating AI evaluation tests to:
- Compare actual results with expected values for each dataset entry and record individual outcomes
- Aggregate outcomes across multiple runs and calculate pass rates across datasets
- Manually apply threshold-based evaluation for non-deterministic AI outputs
This approach leads to verbose, error-prone, and difficult-to-maintain tests, slowing test development and hindering adoption of AI evaluation best practices.
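To illustrate the boilerplate described above, here is a minimal sketch of the kind of manual aggregation an engineer writes today. All names (`callAgent`, the dataset entries, the thresholds) are hypothetical placeholders, not part of any existing API:

```ballerina
import ballerina/test;

// Hypothetical agent invocation; stands in for a real AI agent call.
function callAgent(string prompt) returns string|error {
    return "a1";
}

@test:Config {}
function evaluateAgentManually() returns error? {
    // Hand-maintained dataset of (input, expected output) pairs.
    [string, string][] dataset = [["q1", "a1"], ["q2", "a2"]];
    int runs = 5;             // repeat each entry to absorb non-determinism
    float minPassRate = 0.8;  // threshold applied by hand

    int passed = 0;
    int total = dataset.length() * runs;
    foreach var [prompt, expected] in dataset {
        foreach int i in 0 ..< runs {
            string actual = check callAgent(prompt);
            if actual == expected {
                passed += 1;
            }
        }
    }

    // Manual threshold-based evaluation across all entry-runs.
    float passRate = <float>passed / <float>total;
    if passRate < minPassRate {
        test:assertFail(string `pass rate ${passRate} is below ${minPassRate}`);
    }
}
```

Every evaluation test repeats this compare-aggregate-threshold scaffolding, which is exactly what the proposal aims to move into the framework.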
Describe your solution(s)
The proposed solution addresses these issues through the following enhancements to the test framework.
- Add `minPassRate` and `runs` fields in `@test:Config`
- Treat the dataset as a single evaluation test, calculating the aggregate pass rate across entries and runs
- Fail the test only if the overall pass rate falls below the configured threshold
- Generate enhanced test reports showing pass rates, dataset inputs, and timestamped outputs
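Under the proposal, the boilerplate collapses into configuration. The sketch below shows how the proposed fields might look in practice; the field names `minPassRate` and `runs` come from this proposal and are not yet part of the test framework, and `callAgent` and the data provider contents are hypothetical:

```ballerina
import ballerina/test;

// Hypothetical agent invocation used for illustration only.
function callAgent(string prompt) returns string|error {
    return "a1";
}

// Dataset supplied through the standard data-provider mechanism.
function agentEvalData() returns string[][] {
    return [["q1", "a1"], ["q2", "a2"]];
}

@test:Config {
    dataProvider: agentEvalData,
    minPassRate: 0.8,  // proposed: fail only if overall pass rate < 80%
    runs: 5            // proposed: execute each dataset entry 5 times
}
function evaluateAgent(string prompt, string expected) returns error? {
    string actual = check callAgent(prompt);
    test:assertEquals(actual, expected);
}
```

The framework would then record each entry-run outcome, compute the aggregate pass rate across the whole dataset, and report it alongside the dataset inputs in the enhanced test report.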
For further details, see the spec proposal.
Related area
Compilation
Related issue(s) (optional)
Related issue: ballerina-platform/ballerina-spec#1402
Suggested label(s) (optional)
No response
Suggested assignee(s) (optional)
No response