Stop Seqera Platform tracing monitoring when consistent connection failure #6823
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
pditommaso
left a comment
I think this could be approached differently. The defensive approach (i.e. just logging errors as warnings) was adopted in the early days, to make sure monitoring was not impacting existing pipelines.
However, I'm starting to think that, when launching pipelines from/with Platform, the lack of telemetry should be considered a hard failure and cause an execution error (i.e. an exception).
This should be relatively safe because there's already an exponential retry condition in HxClient. Therefore, if there's a persistent backend problem, it may be worth just stopping.
Thoughts?
Unauthorized/forbidden errors are not in the retriable list, so these errors will produce an immediate failure. However, if telemetry errors produce failures, the head job will finish early and it could reduce the cases for …
Another strange behaviour related to this topic that I saw when testing occurs when providing an incorrect access token: telemetry is silently ignored even when …
Yeah, and that's desirable. That's an error that cannot be recovered from.
This sounds like a +1 to me.
Yes, I think it should.
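For illustration, the distinction discussed above (retriable errors go through exponential backoff, while unauthorized/forbidden responses fail immediately) could be sketched roughly as below. This is not the HxClient implementation: the class name, the set of retriable status codes, and the backoff parameters are all assumptions made for the example.

```java
import java.util.Set;

// Hypothetical sketch; not the HxClient code. All names and values are assumed.
final class RetryPolicySketch {
    // Assumed set of retriable HTTP statuses; 401 and 403 are deliberately absent.
    static final Set<Integer> RETRIABLE = Set.of(429, 500, 502, 503, 504);

    // A status outside the set is treated as a permanent failure.
    static boolean isRetriable(int status) {
        return RETRIABLE.contains(status);
    }

    // Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis.
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.max(0, Math.min(attempt, 30)); // bound the shift to limit overflow risk
        return Math.min(maxMillis, baseMillis * (1L << shift));
    }

    public static void main(String[] args) {
        for (int status : new int[] {401, 403, 503}) {
            System.out.println(status + " retriable: " + isRetriable(status));
        }
        for (int attempt = 0; attempt < 4; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                    + backoffMillis(attempt, 250, 10_000) + " ms");
        }
    }
}
```

With a policy of this shape, a bad access token surfaces on the first call instead of being retried, which is the "cannot be recovered" case mentioned above.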
Let's add an env var to control the hard-stop behaviour. When enabled (the default), the run stops on this error; when disabled, it falls back to the legacy (current) warning.
Just to be clear: with your proposal, the stop mechanism implemented in this PR is not required. We either stop the pipeline or always warn. Or do you want to keep it when warning? If it's not required, I will create a new PR and close this one once the other is ready.
The new behaviour should report an error, i.e. stop the run; when disabled, it should just warn as it does now.
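A minimal sketch of the toggle described above, assuming a hypothetical variable name (`EXAMPLE_TELEMETRY_HARD_STOP`; the thread does not name the actual variable) and a hypothetical helper class. It only illustrates the intended semantics: hard stop by default, legacy warning when explicitly disabled.

```java
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical sketch; the variable name and class are not taken from Nextflow.
final class TelemetryFailurePolicy {
    static final String ENV_VAR = "EXAMPLE_TELEMETRY_HARD_STOP"; // assumed name
    private static final Logger LOG = Logger.getLogger("TelemetryFailurePolicy");

    // Enabled unless the variable is explicitly set to "false".
    static boolean hardStopEnabled(Map<String, String> env) {
        String value = env.get(ENV_VAR);
        return value == null || !value.trim().equalsIgnoreCase("false");
    }

    // Either abort the run with an exception or fall back to a warning.
    static void onTelemetryError(Exception cause, Map<String, String> env) {
        if (hardStopEnabled(env)) {
            throw new IllegalStateException("Telemetry failure, aborting run", cause);
        }
        LOG.warning("Telemetry failure ignored: " + cause.getMessage());
    }

    public static void main(String[] args) {
        Exception err = new RuntimeException("connection refused");
        onTelemetryError(err, Map.of(ENV_VAR, "false")); // legacy mode: warns only
        try {
            onTelemetryError(err, Map.of());             // default: hard stop
        } catch (IllegalStateException e) {
            System.out.println("run aborted: " + e.getMessage());
        }
    }
}
```

Passing the environment as a map (for example `System.getenv()`) keeps the decision easy to test without touching the real process environment.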
@pditommaso, the abort on error is implemented in #6827.
d9fa5cd to d752bc2
Closing. Similar behaviour is implemented in #6827.
This pull request enhances the robustness of the `TowerClient` in the `nf-tower` plugin by introducing logic to detect and handle consistent failures when communicating with the Seqera Platform. If repeated failures occur over a certain time interval, the client will skip sending further trace reports to avoid unnecessary resource consumption. The changes also include corresponding tests for this new behavior.
A consistent failure is determined when more than 20 consecutive trace calls have failed and the last success was at least 10 alive intervals ago. Nextflow will stop sending traces when failures are constantly repeated for 10-20 mins.
Failure detection and handling improvements:
- Added fields (`failuresThreshold`, `failuresCount`, `lastSuccess`) and constants to `TowerClient` to monitor the number of consecutive failed trace calls and the time since the last successful call.
- Added `hasConsistentFailures()`, which determines if failures have exceeded the threshold and sufficient time has passed since the last success, triggering the skip logic.
Testing:
- Added tests to `TowerClientTest` to verify the consistent failure detection logic under various scenarios.
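The detection rule described above can be sketched as follows. This is a standalone illustration in Java, not the `TowerClient` code from the PR: the class, its constructor, and the one-minute alive interval used in `main` are assumptions; only the field names and the "more than 20 failures and at least 10 alive intervals since the last success" condition come from the description.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of the consistent-failure check; not the PR's implementation.
final class ConsistentFailureTracker {
    private final int failuresThreshold;   // e.g. 20 consecutive failures
    private final Duration minSilence;     // e.g. 10 alive intervals
    private int failuresCount = 0;
    private Instant lastSuccess;

    ConsistentFailureTracker(int failuresThreshold, Duration aliveInterval,
                             int intervals, Instant start) {
        this.failuresThreshold = failuresThreshold;
        this.minSilence = aliveInterval.multipliedBy(intervals);
        this.lastSuccess = start; // treat start-up as the last known good moment
    }

    // A successful trace call resets the consecutive-failure counter.
    void recordSuccess(Instant now) {
        failuresCount = 0;
        lastSuccess = now;
    }

    void recordFailure() {
        failuresCount++;
    }

    // True when both conditions hold: too many failures AND enough time without success.
    boolean hasConsistentFailures(Instant now) {
        return failuresCount > failuresThreshold
                && Duration.between(lastSuccess, now).compareTo(minSilence) >= 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.now();
        ConsistentFailureTracker tracker =
                new ConsistentFailureTracker(20, Duration.ofMinutes(1), 10, start);
        for (int i = 0; i < 21; i++) {
            tracker.recordFailure();
        }
        System.out.println(tracker.hasConsistentFailures(start.plus(Duration.ofMinutes(5))));
        System.out.println(tracker.hasConsistentFailures(start.plus(Duration.ofMinutes(10))));
    }
}
```

Requiring both conditions means a short burst of failed calls does not disable tracing on its own; the time check keeps the client sending until the outage has also lasted long enough.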