Reporting transient or permanent failures, and job retries

Recently, the `ParserFailureRateTooHighOrMissing` alert fired (https://github.com/m-lab/dev-tracker/issues/727) due to an actual spike in task errors (individual archives).

Investigation traced the spike to `ETLSourceError`, which can be caused by transient connectivity problems between the parser and the GCS API servers. This is something we cannot control directly. The alert resolved on its own once connectivity was restored.

Ideally:
* the data pipeline alerts should not fire on recoverable, transient events.
* the data pipeline should differentiate between transient and permanent errors (where possible).
* the data pipeline (parser or gardener, as appropriate) should retry until some absolute threshold is reached, then abandon the task.

Currently:
* the parser tries to open a task archive; if that fails, it stops processing that task and does not always report the error to the gardener.
* the gardener only appears to update task state with the states `Parsing` and `ParsingComplete`, and with periodic heartbeats.
* the gardener will retry failed bq jobs, but does not appear to retry tasks issued by the `/v2/jobs/next` API (or the control path is very opaque).

Reporting transient or permanent failures, and job retries #1100