#284 Added failure handling for benchmarks #285

Merged
jathavaan merged 1 commit into main from feature/284-register-failed-runs-as-failures on May 20, 2026

Conversation

@jathavaan

Collaborator

This pull request introduces improved error handling and reporting for benchmarks and Databricks runs, as well as updates to the benchmark result schema. The main enhancements include capturing and recording partial results on failures, adding detailed error diagnostics for failed Databricks tasks, and updating the schema version to V4. These changes increase the robustness and observability of benchmark executions and make troubleshooting easier.

Benchmark error handling and reporting:

  • The benchmark monitor (monitor.py) now captures exceptions during both warmup and timed benchmark iterations, records partial results for failed runs, and logs failure details. Failed runs are saved with a "failed" status, including error messages and partial metrics, and the schema version is updated to V4.

  • The _measure_io utility now returns any exception raised by the measured function, allowing the caller to handle and record errors with associated metrics.
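
The "return the exception instead of raising it" pattern described above can be sketched as follows. This is only an illustration, not the code from monitor_utils.py: the names `Measurement` and `measure` are hypothetical, and the real `_measure_io` also tracks I/O counters that are omitted here.

```python
import datetime
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Measurement:
    """Hypothetical container: timing data plus either a result or a failure."""
    started_at: datetime.datetime
    ended_at: datetime.datetime
    result: Any = None
    failure: Optional[Exception] = None


def measure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Measurement:
    """Run func and hand any exception back to the caller instead of raising."""
    started_at = datetime.datetime.now(datetime.timezone.utc)
    result: Any = None
    failure: Optional[Exception] = None
    try:
        result = func(*args, **kwargs)
    except Exception as exc:
        # Keep the exception so the caller can record it next to the metrics.
        failure = exc
    ended_at = datetime.datetime.now(datetime.timezone.utc)
    return Measurement(started_at, ended_at, result, failure)
```

With this shape the caller checks `measurement.failure` and can still persist the timestamps collected before the error occurred.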

Orchestrator-level error resilience:

  • The benchmark orchestrator (main.py) wraps each experiment run in a try/except block, logging orchestrator-level failures and continuing with remaining experiments instead of aborting the whole batch.
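
A minimal sketch of this continue-on-failure loop, assuming each experiment can be represented as a name and a zero-argument callable. The function `run_all` and the logger name are hypothetical and do not come from main.py.

```python
import logging
from typing import Callable, Iterable, List, Tuple

logger = logging.getLogger("orchestrator")  # hypothetical logger name


def run_all(experiments: Iterable[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run every experiment and return the names of those that raised."""
    failed: List[str] = []
    for name, run in experiments:
        try:
            run()
        except Exception:
            # Log with traceback, then move on to the next experiment.
            logger.exception("Experiment %s failed at orchestrator level", name)
            failed.append(name)
    return failed
```

Returning the failed names is a choice made for this sketch; it lets the caller summarise the batch once every experiment has been attempted.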

Databricks error diagnostics:

  • The Databricks service now attempts to fetch and append detailed notebook error information from the Databricks API when a run fails, providing richer diagnostics for failed jobs. A helper method _fetch_run_error is introduced for this purpose.
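
The best-effort nature of this lookup can be illustrated with a sketch that separates payload parsing from the HTTP call. The parsing mirrors the `error` / `error_trace` handling quoted in the review below; the function names and the `fetch_output` callable are assumptions for illustration, and the actual request to runs/get-output is left out.

```python
from typing import Any, Callable, Mapping


def extract_run_error(payload: Mapping[str, Any]) -> str:
    """Join the non-empty 'error' and 'error_trace' fields of a run output payload."""
    error = str(payload.get("error") or "").strip()
    error_trace = str(payload.get("error_trace") or "").strip()
    return "\n".join(part for part in (error, error_trace) if part)


def describe_failure(
    base_message: str,
    fetch_output: Callable[[], Mapping[str, Any]],
) -> str:
    """Append notebook error details when available; never fail because of them."""
    try:
        details = extract_run_error(fetch_output())
    except Exception:
        # Diagnostics are supplementary: fall back to the original message.
        return base_message
    return f"{base_message}\n{details}" if details else base_message
```

Passing the fetch step in as a callable keeps the sketch testable without network access; in the service it would wrap the authenticated API request.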

Schema versioning:

  • A new schema version V4 is introduced to support the enhanced benchmark result format.

Signed-off-by: Jathavaan Shankarr <jathavaan12@gmail.com>
@jathavaan jathavaan self-assigned this May 20, 2026
Copilot AI review requested due to automatic review settings May 20, 2026 06:28
@jathavaan jathavaan linked an issue May 20, 2026 that may be closed by this pull request
@jathavaan jathavaan enabled auto-merge May 20, 2026 06:28
@jathavaan jathavaan merged commit b5febb6 into main May 20, 2026
34 checks passed
@jathavaan jathavaan deleted the feature/284-register-failed-runs-as-failures branch May 20, 2026 06:33

Copilot AI left a comment

Contributor


Pull request overview

This PR improves robustness and observability of benchmark executions by capturing failures (including partial metrics) instead of aborting outright, enriching Databricks failure diagnostics, and bumping the benchmark result schema to v4.

Changes:

  • Extend @monitor benchmarking to record per-iteration success/failed samples and continue saving metadata/cost analytics even when an iteration fails.
  • Add Databricks best-effort retrieval of notebook error / error_trace from runs/get-output when a run finishes unsuccessfully.
  • Update schema version enum to include V4, and make the orchestrator resilient to per-experiment exceptions.
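
As a rough illustration of per-iteration samples, the sketch below records one entry per iteration and keeps the samples collected before a failure. The field names `status`, `failure_reason`, and `elapsed_time` follow the snippets quoted in the review; the function name `run_iterations` and the decision to stop at the first failing iteration are assumptions, not confirmed behavior of the @monitor decorator.

```python
import time
from typing import Any, Callable, Dict, List


def run_iterations(func: Callable[[], Any], iterations: int) -> List[Dict[str, Any]]:
    """Run func up to `iterations` times and return one sample per attempt."""
    samples: List[Dict[str, Any]] = []
    for iteration in range(1, iterations + 1):
        started = time.perf_counter()
        try:
            func()
        except Exception as exc:
            # Record the failure instead of discarding earlier samples.
            samples.append({
                "iteration": iteration,
                "status": "failed",
                "failure_reason": str(exc),
                "elapsed_time": None,
            })
            break  # assumption: later iterations are skipped after a failure
        samples.append({
            "iteration": iteration,
            "status": "success",
            "failure_reason": None,
            "elapsed_time": time.perf_counter() - started,
        })
    return samples
```

The returned list can be persisted whether or not the final entry is a failure, which is the property the change aims for.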

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/application/common/monitor.py | Adds failure capture, partial sample persistence, and schema v4 fields to benchmark samples. |
| src/application/common/monitor_utils.py | Updates _measure_io to return an exception object instead of raising, enabling caller-controlled failure handling. |
| src/infra/infrastructure/services/databricks_service.py | Fetches supplementary error details for failed Databricks runs via runs/get-output. |
| src/domain/enums/schema_version.py | Adds SchemaVersion.V4. |
| main.py | Wraps each experiment run so the orchestrator continues on per-experiment failures. |

Comment on lines +47 to +50
ingress_sum: int = 0
egress_sum: int = 0
start_time = datetime.datetime.now(datetime.UTC)
failure: Exception | None = None
Comment on lines +190 to +192
assert failure_started_at is not None
assert failure_ended_at is not None
assert failure_partial_sample is not None
benchmark_run=benchmark_run,
query_id=query_id,
iteration=iteration,
iteration=failure_iteration or 1,
"schema_version": SchemaVersion.V3.value,
"status": "failed",
"failure_reason": str(failure),
"elapsed_time": None,
Comment on lines +485 to +489
payload = response.json()
error = str(payload.get("error") or "").strip()
error_trace = str(payload.get("error_trace") or "").strip()
parts = [p for p in (error, error_trace) if p]
return "\n".join(parts)
@jathavaan jathavaan restored the feature/284-register-failed-runs-as-failures branch May 20, 2026 06:40
@jathavaan jathavaan deleted the feature/284-register-failed-runs-as-failures branch May 24, 2026 11:47

Development

Successfully merging this pull request may close these issues.

Register failed runs as failures

2 participants