#284 Added failure handling for benchmarks #285
Merged
Signed-off-by: Jathavaan Shankarr <jathavaan12@gmail.com>
Pull request overview
This PR improves robustness and observability of benchmark executions by capturing failures (including partial metrics) instead of aborting outright, enriching Databricks failure diagnostics, and bumping the benchmark result schema to v4.
Changes:
- Extend `@monitor` benchmarking to record per-iteration success/failed samples and continue saving metadata/cost analytics even when an iteration fails.
- Add Databricks best-effort retrieval of notebook `error`/`error_trace` from `runs/get-output` when a run finishes unsuccessfully.
- Update schema version enum to include `V4`, and make the orchestrator resilient to per-experiment exceptions.
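
As a rough illustration of the first bullet, per-iteration sample recording could look something like the sketch below. This is not the PR's `@monitor` implementation: the decorator signature, the `sink` list, and stopping after the first failed iteration are all assumptions made for the example. The `status`, `failure_reason`, and `elapsed_time` keys mirror the fields quoted in the review comments.

```python
import datetime
import functools
import time
from typing import Any, Callable, Dict, List, Optional


def monitor(iterations: int, sink: List[Dict[str, Any]]) -> Callable:
    """Hypothetical stand-in for @monitor: one recorded sample per iteration."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., None]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> None:
            for iteration in range(1, iterations + 1):
                started_at = datetime.datetime.now(datetime.timezone.utc)
                t0 = time.perf_counter()
                failure: Optional[Exception] = None
                try:
                    fn(*args, **kwargs)
                except Exception as exc:  # capture instead of aborting
                    failure = exc
                elapsed = time.perf_counter() - t0
                failed = failure is not None
                sink.append({
                    "iteration": iteration,
                    "started_at": started_at.isoformat(),
                    "status": "failed" if failed else "success",
                    "failure_reason": str(failure) if failed else None,
                    # A failed iteration has no meaningful timing.
                    "elapsed_time": None if failed else elapsed,
                })
                if failed:
                    break  # assumption: later iterations are skipped

        return wrapper

    return decorator
```

With this shape, a benchmark that fails on its second iteration still leaves one `"success"` sample and one `"failed"` sample in `sink`, so downstream metadata and cost analytics have something to persist.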
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/application/common/monitor.py` | Adds failure capture, partial sample persistence, and schema v4 fields to benchmark samples. |
| `src/application/common/monitor_utils.py` | Updates `_measure_io` to return an exception object instead of raising, enabling caller-controlled failure handling. |
| `src/infra/infrastructure/services/databricks_service.py` | Fetches supplementary error details for failed Databricks runs via `runs/get-output`. |
| `src/domain/enums/schema_version.py` | Adds `SchemaVersion.V4`. |
| `main.py` | Wraps each experiment run so the orchestrator continues on per-experiment failures. |
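
The `_measure_io` change in the table follows a "return the exception instead of raising" pattern. A minimal sketch of that pattern, assuming a hypothetical `measure_io` helper and an injected `read_counters` callable (the real `_measure_io` signature and how it reads I/O counters are not shown in this PR page):

```python
from typing import Callable, Optional, Tuple


def measure_io(
    fn: Callable[[], None],
    read_counters: Callable[[], Tuple[int, int]],
) -> Tuple[int, int, Optional[Exception]]:
    """Run fn and return (ingress_bytes, egress_bytes, failure).

    read_counters is a hypothetical callable returning cumulative
    (bytes_received, bytes_sent); it is injected to keep the sketch testable.
    """
    in_before, out_before = read_counters()
    failure: Optional[Exception] = None
    try:
        fn()
    except Exception as exc:
        # Hand the exception back so the caller can record it
        # together with the partial I/O metrics.
        failure = exc
    in_after, out_after = read_counters()
    return in_after - in_before, out_after - out_before, failure
```

Because the counters are read after the `try` block either way, the caller gets the I/O delta up to the point of failure and can decide whether to log, persist, or re-raise.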
Comment on lines +47 to +50

    ingress_sum: int = 0
    egress_sum: int = 0
    start_time = datetime.datetime.now(datetime.UTC)
    failure: Exception | None = None
Comment on lines +190 to +192

    assert failure_started_at is not None
    assert failure_ended_at is not None
    assert failure_partial_sample is not None

    benchmark_run=benchmark_run,
    query_id=query_id,
    iteration=iteration,
    iteration=failure_iteration or 1,

    "schema_version": SchemaVersion.V3.value,
    "status": "failed",
    "failure_reason": str(failure),
    "elapsed_time": None,
Comment on lines +485 to +489

    payload = response.json()
    error = str(payload.get("error") or "").strip()
    error_trace = str(payload.get("error_trace") or "").strip()
    parts = [p for p in (error, error_trace) if p]
    return "\n".join(parts)
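
To make the "best-effort" aspect of this retrieval concrete, here is a small sketch that separates payload formatting from fetching. The formatting logic mirrors the quoted lines; the function names `format_run_error` and `fetch_run_error_best_effort`, and the injected `fetch_payload` callable standing in for the HTTP call to `runs/get-output`, are assumptions for illustration rather than the PR's actual code.

```python
from typing import Any, Callable, Mapping


def format_run_error(payload: Mapping[str, Any]) -> str:
    """Join the non-empty error and error_trace fields with a newline."""
    error = str(payload.get("error") or "").strip()
    error_trace = str(payload.get("error_trace") or "").strip()
    parts = [p for p in (error, error_trace) if p]
    return "\n".join(parts)


def fetch_run_error_best_effort(
    fetch_payload: Callable[[], Mapping[str, Any]],
) -> str:
    """Return formatted error details, or "" if they cannot be fetched.

    fetch_payload is a hypothetical callable wrapping the request; any
    failure here must not mask the original run failure.
    """
    try:
        return format_run_error(fetch_payload())
    except Exception:
        return ""
```

Swallowing the exception is a deliberate trade-off in this sketch: the diagnostics are supplementary, so a failed lookup degrades to an empty string instead of raising a second error on top of the failed run.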
This pull request introduces improved error handling and reporting for benchmarks and Databricks runs, as well as updates to the benchmark result schema. The main enhancements include capturing and recording partial results on failures, adding detailed error diagnostics for failed Databricks tasks, and updating the schema version to V4. These changes increase the robustness and observability of benchmark executions and make troubleshooting easier.
Benchmark error handling and reporting:
- The benchmark monitor (`monitor.py`) now captures exceptions during both warmup and timed benchmark iterations, records partial results for failed runs, and logs failure details. Failed runs are saved with a `"failed"` status, including error messages and partial metrics, and the schema version is updated to V4.
- The `_measure_io` utility now returns any exception raised by the measured function, allowing the caller to handle and record errors with associated metrics.

Orchestrator-level error resilience:
- The orchestrator (`main.py`) wraps each experiment run in a try/except block, logging orchestrator-level failures and continuing with remaining experiments instead of aborting the whole batch.

Databricks error diagnostics:
- When a Databricks run finishes unsuccessfully, the notebook `error` and `error_trace` are retrieved on a best-effort basis from `runs/get-output`. `_fetch_run_error` is introduced for this purpose.

Schema versioning:
- `V4` is introduced to support the enhanced benchmark result format.
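
The orchestrator-level resilience described in this PR can be sketched as a loop that logs and continues. The `run_all` function, the `experiments` mapping, and the returned list of failed names are hypothetical; the actual structure of `main.py` is not shown here.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("orchestrator")


def run_all(experiments: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every experiment; return the names of those that raised."""
    failed: List[str] = []
    for name, run in experiments.items():
        try:
            run()
        except Exception:
            # logger.exception records the traceback alongside the message.
            logger.exception("Experiment %s failed; continuing", name)
            failed.append(name)
    return failed
```

Returning the failed names is one possible design choice: it lets the caller report a summary or set a non-zero exit code after the whole batch has run.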