
Stop Seqera Platform tracing monitoring when consistent connection failure #6823

Closed
jorgee wants to merge 2 commits into master from stop-tower-tracing-monitor-when-consistent-failures

Conversation

@jorgee
Contributor

@jorgee jorgee commented Feb 10, 2026

This pull request improves the robustness of the TowerClient in the nf-tower plugin by introducing logic to detect and handle consistent failures when communicating with Seqera Platform. If failures repeat over a certain time interval, the client skips sending further trace reports to avoid unnecessary resource consumption. The changes also include tests for the new behavior.

A consistent failure is detected when more than 20 consecutive trace calls have failed and the last success was at least 10 alive intervals ago. In practice, Nextflow stops sending traces when failures repeat constantly for 10-20 minutes.

Failure detection and handling improvements:

  • Added failure-tracking fields (failuresThreshold, failuresCount, lastSuccess) and constants to TowerClient to monitor the number of consecutive failed trace calls and the time since the last successful call.
  • Implemented a new method hasConsistentFailures() that determines whether failures have exceeded the threshold and sufficient time has passed since the last success, triggering the skip logic.
  • Modified the main sending loop to check for consistent failures and, if detected, skip sending heartbeat and progress reports and clear pending tasks to prevent memory issues.
  • Updated the HTTP response handling to reset the failure counter and update the last-success timestamp on success, and to increment the counter on failure.
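The failure-tracking logic described above might look roughly like the sketch below. The field and method names follow the PR description, but the alive-interval length is an assumption, and the real TowerClient is written in Groovy, so this Java version only illustrates the logic, not the actual implementation.

```java
// Hypothetical sketch of the consistent-failure check described in this PR.
// The 1-minute alive interval is an assumption for illustration.
public class FailureTracker {
    static final int FAILURES_THRESHOLD = 20;      // consecutive failed trace calls
    static final long ALIVE_INTERVAL_MS = 60_000;  // assumed alive-interval length
    static final int ALIVE_INTERVALS = 10;         // last success >= 10 intervals ago

    int failuresCount = 0;
    long lastSuccess = System.currentTimeMillis();

    // Called when a trace call succeeds: reset the counter, refresh the timestamp
    void onSuccess() {
        failuresCount = 0;
        lastSuccess = System.currentTimeMillis();
    }

    // Called when a trace call fails: just count it
    void onFailure() {
        failuresCount++;
    }

    // Both conditions must hold: too many consecutive failures AND
    // enough wall-clock time since the last success
    boolean hasConsistentFailures(long now) {
        return failuresCount > FAILURES_THRESHOLD
            && (now - lastSuccess) >= ALIVE_INTERVALS * ALIVE_INTERVAL_MS;
    }

    public static void main(String[] args) {
        FailureTracker t = new FailureTracker();
        t.lastSuccess = 0;                        // pretend the last success was long ago
        for (int i = 0; i < 21; i++) t.onFailure();
        long now = ALIVE_INTERVALS * ALIVE_INTERVAL_MS;  // exactly 10 intervals later
        System.out.println("skip=" + t.hasConsistentFailures(now));
    }
}
```

Requiring both a failure count and an elapsed-time condition means a short burst of failures during a transient outage does not by itself silence telemetry.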

Testing:

  • Added a new test in TowerClientTest to verify the consistent-failure detection logic under various scenarios.

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

Member

@pditommaso pditommaso left a comment


I think this may be approached differently. This defensive approach (i.e. just logging errors as warnings) was adopted in the early days, to make sure monitoring was not impacting existing pipelines.

However, I'm starting to think that when launching pipelines from/with Platform, the lack of telemetry should be considered a hard failure and cause an execution error (i.e. an exception).

This should be relatively safe because there's already an exponential retry condition in HxClient. Therefore, if there's a persistent backend failure, it may be worth just stopping.

Thoughts?

@jorgee
Contributor Author

jorgee commented Feb 11, 2026

Unauthorized/forbidden errors are not in the retriable list, so these errors will produce an immediate failure. However, if telemetry errors produce failures, the head job will finish early, which could reduce the cases of unknown statuses in pipelines.

Another strange behaviour related to this topic that I have seen when testing is when providing an incorrect access token: it silently ignores the telemetry even when the --with-tower option is included. There is an error message in .nextflow.log but not in stdout. I don't know why it works that way, but with the new behaviour that @pditommaso mentioned, it should also produce an error, shouldn't it?

@pditommaso
Member

> Unauthorized/forbidden errors are not in the retriable list, so these errors will produce an immediate failure

Yeah, and that's desirable: that's an error that cannot be recovered.

> However, if telemetry errors produce failures, the head job will finish early and it could reduce the cases for unknown statuses in pipelines.

This sounds like a +1 to me.

> There is an error message in .nextflow.log but not in stdout. I don't know why it works that way, but with the new behaviour that @pditommaso mentioned, it should also produce an error, shouldn't it?

Yes, I think it should.

@pditommaso
Member

Let's add an env var to control the hard-stop behaviour. When enabled (the default), it stops on this error; when disabled, it falls back to the legacy (current) warning.
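The proposed toggle could be sketched as follows. The variable name NXF_TOWER_HARD_STOP is hypothetical (no name was agreed in this thread); only the default-on, warn-when-disabled shape comes from the comment above.

```java
// Sketch of an env-var toggle for the proposed hard-stop behaviour.
// NXF_TOWER_HARD_STOP is a hypothetical name, not from the PR.
public class TelemetryFailureMode {

    // Hard stop is enabled by default; only an explicit "false" disables it
    static boolean hardStopEnabled() {
        String v = System.getenv("NXF_TOWER_HARD_STOP");
        return v == null || !v.equalsIgnoreCase("false");
    }

    // When enabled, a telemetry failure aborts the run; otherwise it only warns
    static void onTelemetryFailure(String msg) {
        if (hardStopEnabled())
            throw new IllegalStateException("Telemetry failure: " + msg);
        else
            System.err.println("WARN: telemetry failure ignored: " + msg);
    }

    public static void main(String[] args) {
        System.out.println("hardStop=" + hardStopEnabled());
    }
}
```

Defaulting to the hard stop while keeping an escape hatch lets existing pipelines opt back into the legacy warn-only behaviour without a code change.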

@jorgee
Contributor Author

jorgee commented Feb 11, 2026

Just to be clear: with your proposal, the stop mechanism implemented in this PR is not required. We either stop the pipeline or always warn. Or do you want to keep it for the warning case? If it's not required, I will create a new PR and close this one once the other is ready.

@pditommaso
Member

The new behaviour should report an error, i.e. stop the run; when disabled, it should just warn as it does now.

@jorgee
Contributor Author

jorgee commented Feb 12, 2026

@pditommaso, the abort-on-error behaviour is implemented in #6827.

@jorgee
Contributor Author

jorgee commented Apr 14, 2026

Closing. Similar behaviour is implemented in #6827

@jorgee jorgee closed this Apr 14, 2026
