#284 Added failure handling for benchmarks by jathavaan · Pull Request #285 · kartAI/doppa

jathavaan · 2026-05-20T06:28:03Z

This pull request introduces improved error handling and reporting for benchmarks and Databricks runs, as well as updates to the benchmark result schema. The main enhancements include capturing and recording partial results on failures, adding detailed error diagnostics for failed Databricks tasks, and updating the schema version to V4. These changes increase the robustness and observability of benchmark executions and make troubleshooting easier.

Benchmark error handling and reporting:

The benchmark monitor (monitor.py) now captures exceptions during both warmup and timed benchmark iterations, records partial results for failed runs, and logs failure details. Failed runs are saved with a "failed" status, including error messages and partial metrics, and the schema version is updated to V4.
The _measure_io utility now returns any exception raised by the measured function, allowing the caller to handle and record errors with associated metrics.
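The "return the exception instead of raising" pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual `_measure_io` implementation: the name `measure_call`, its signature, and the use of wall-clock time as the only metric are assumptions made for the example.

```python
from __future__ import annotations

import time
from typing import Any, Callable


def measure_call(func: Callable[[], Any]) -> tuple[Any, float, Exception | None]:
    """Run func and return (result, elapsed_seconds, error) instead of raising.

    Hypothetical helper: the real _measure_io also tracks I/O metrics.
    """
    started = time.perf_counter()
    result: Any = None
    error: Exception | None = None
    try:
        result = func()
    except Exception as exc:
        # Hand the exception back so the caller can record it with the metrics.
        error = exc
    elapsed = time.perf_counter() - started
    return result, elapsed, error
```

The caller then decides what to do with a non-`None` error, for example storing the elapsed time gathered so far alongside the failure reason.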

Orchestrator-level error resilience:

The benchmark orchestrator (main.py) wraps each experiment run in a try/except block, logging orchestrator-level failures and continuing with remaining experiments instead of aborting the whole batch.
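A rough sketch of this continue-on-failure loop, under assumptions: `run_all`, the `experiments` mapping, and the logger name are hypothetical and do not reflect the actual structure of main.py.

```python
import logging
from typing import Callable

logger = logging.getLogger("orchestrator")


def run_all(experiments: dict[str, Callable[[], None]]) -> list[str]:
    """Run every experiment and return the names of those that raised."""
    failed: list[str] = []
    for name, run in experiments.items():
        try:
            run()
        except Exception:
            # Log with traceback, then move on to the next experiment.
            logger.exception("Experiment %s failed; continuing with the rest", name)
            failed.append(name)
    return failed
```

Returning the failed names is a design choice of this sketch; it lets the caller report a summary once the batch has finished.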

Databricks error diagnostics:

The Databricks service now attempts to fetch and append detailed notebook error information from the Databricks API when a run fails, providing richer diagnostics for failed jobs. A helper method _fetch_run_error is introduced for this purpose.
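The best-effort enrichment can be illustrated without any network access. The sketch below assumes the run-output payload carries `error` and `error_trace` fields, as the diff excerpt in this PR suggests; `format_run_error`, `describe_failure`, and the `fetch_output` callable are hypothetical names, not the service's real methods.

```python
from typing import Any, Callable


def format_run_error(payload: dict[str, Any]) -> str:
    """Join the non-empty error fields of a run-output payload."""
    error = str(payload.get("error") or "").strip()
    error_trace = str(payload.get("error_trace") or "").strip()
    return "\n".join(part for part in (error, error_trace) if part)


def describe_failure(base_message: str, fetch_output: Callable[[], dict[str, Any]]) -> str:
    """Append fetched error details to base_message, ignoring fetch errors."""
    try:
        details = format_run_error(fetch_output())
    except Exception:
        # Best effort: a failed lookup must not mask the original failure.
        details = ""
    return f"{base_message}\n{details}" if details else base_message
```

Swallowing exceptions from the lookup keeps the original failure message intact even when the diagnostics call itself fails.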

Schema versioning:

A new schema version V4 is introduced to support the enhanced benchmark result format.
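As an illustration of what a V4 failed sample might look like, the sketch below combines the `status`, `failure_reason`, and `elapsed_time` fields visible in the diff excerpt with a schema version stamp. The enum values (`"v3"`, `"v4"`), the `failed_sample` helper, and the shape of `partial_metrics` are assumptions, not taken from the repository.

```python
from __future__ import annotations

from enum import Enum
from typing import Any


class SchemaVersion(Enum):
    V3 = "v3"  # assumed value
    V4 = "v4"  # assumed value


def failed_sample(failure: Exception, partial_metrics: dict[str, Any]) -> dict[str, Any]:
    """Build a record for a failed run, keeping whatever metrics were gathered."""
    return {
        "schema_version": SchemaVersion.V4.value,
        "status": "failed",
        "failure_reason": str(failure),
        "elapsed_time": None,  # no complete timing exists for a failed run
        **partial_metrics,
    }
```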

Signed-off-by: Jathavaan Shankarr <jathavaan12@gmail.com>

Copilot

Pull request overview

This PR improves robustness and observability of benchmark executions by capturing failures (including partial metrics) instead of aborting outright, enriching Databricks failure diagnostics, and bumping the benchmark result schema to v4.

Changes:

Extend @monitor benchmarking to record per-iteration success/failed samples and continue saving metadata/cost analytics even when an iteration fails.
Add Databricks best-effort retrieval of notebook error / error_trace from runs/get-output when a run finishes unsuccessfully.
Update schema version enum to include V4, and make the orchestrator resilient to per-experiment exceptions.
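The per-iteration sample recording in the first item can be sketched as below. This is a simplified model rather than the actual `@monitor` decorator: `run_iterations` is a hypothetical name, and stopping after the first failed iteration is an assumption of this sketch.

```python
from __future__ import annotations

from typing import Any, Callable


def run_iterations(func: Callable[[int], Any], iterations: int) -> dict[str, Any]:
    """Run func for each iteration and record one sample per attempt."""
    samples: list[dict[str, Any]] = []
    failure: Exception | None = None
    for iteration in range(1, iterations + 1):
        try:
            func(iteration)
        except Exception as exc:
            failure = exc
            samples.append(
                {"iteration": iteration, "status": "failed", "failure_reason": str(exc)}
            )
            break  # assumption: no further iterations after a failure
        samples.append(
            {"iteration": iteration, "status": "success", "failure_reason": None}
        )
    # The summary is built on both paths, so metadata survives a failure.
    return {"status": "failed" if failure else "success", "samples": samples}
```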

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/application/common/monitor.py` | Adds failure capture, partial sample persistence, and schema v4 fields to benchmark samples. |
| `src/application/common/monitor_utils.py` | Updates `_measure_io` to return an exception object instead of raising, enabling caller-controlled failure handling. |
| `src/infra/infrastructure/services/databricks_service.py` | Fetches supplementary error details for failed Databricks runs via `runs/get-output`. |
| `src/domain/enums/schema_version.py` | Adds `SchemaVersion.V4`. |
| `main.py` | Wraps each experiment run so the orchestrator continues on per-experiment failures. |

+            ingress_sum: int = 0
+            egress_sum: int = 0
+            start_time = datetime.datetime.now(datetime.UTC)
+            failure: Exception | None = None


+                assert failure_started_at is not None
+                assert failure_ended_at is not None
+                assert failure_partial_sample is not None


                    benchmark_run=benchmark_run,
                    query_id=query_id,
-                    iteration=iteration,
+                    iteration=failure_iteration or 1,


-                            "schema_version": SchemaVersion.V3.value,
+                            "status": "failed",
+                            "failure_reason": str(failure),
+                            "elapsed_time": None,


+        payload = response.json()
+        error = str(payload.get("error") or "").strip()
+        error_trace = str(payload.get("error_trace") or "").strip()
+        parts = [p for p in (error, error_trace) if p]
+        return "\n".join(parts)


#284 Added failure handling for benchmarks

c8d18e6

Signed-off-by: Jathavaan Shankarr <jathavaan12@gmail.com>

jathavaan self-assigned this May 20, 2026

Copilot AI review requested due to automatic review settings May 20, 2026 06:28

jathavaan linked an issue May 20, 2026 that may be closed by this pull request

Register failed runs as failures #284

Closed

jathavaan enabled auto-merge May 20, 2026 06:28

Copilot started reviewing on behalf of jathavaan May 20, 2026 06:28

jathavaan merged commit b5febb6 into main May 20, 2026
34 checks passed

jathavaan deleted the feature/284-register-failed-runs-as-failures branch May 20, 2026 06:33

Copilot AI reviewed May 20, 2026

jathavaan restored the feature/284-register-failed-runs-as-failures branch May 20, 2026 06:40

jathavaan deleted the feature/284-register-failed-runs-as-failures branch May 24, 2026 11:47

jathavaan merged 1 commit into main from feature/284-register-failed-runs-as-failures
