
Stop Seqera Platform tracing monitoring when consistent connection failure #6823

Closed
jorgee wants to merge 2 commits into master from stop-tower-tracing-monitor-when-consistent-failures

Conversation

@jorgee
Contributor

@jorgee jorgee commented Feb 10, 2026

This pull request improves the robustness of the TowerClient in the nf-tower plugin by introducing logic to detect and handle consistent failures when communicating with Seqera Platform. If failures repeat over a certain time interval, the client skips sending further trace reports to avoid unnecessary resource consumption. The changes also include tests for the new behavior.

A consistent failure is detected when more than 20 consecutive trace calls have failed and the last success was at least 10 alive intervals ago. In practice, Nextflow stops sending traces when failures repeat constantly for 10-20 minutes.

Failure detection and handling improvements:

  • Added failure-tracking fields (failuresThreshold, failuresCount, lastSuccess) and constants to TowerClient to monitor the number of consecutive failed trace calls and the time since the last successful call.
  • Implemented a new method hasConsistentFailures() that determines whether failures have exceeded the threshold and sufficient time has passed since the last success, triggering the skip logic.
  • Modified the main sending loop to check for consistent failures and, if detected, skip sending heartbeat and progress reports and clear pending tasks to prevent memory issues.
  • Updated the HTTP response handling to reset the failure counter and update the last-success timestamp on success, and to increment the counter on failure.
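The failure-tracking logic described above might look roughly like the sketch below. The field and method names follow the PR description, but the alive-interval length is an assumption, and the real TowerClient is written in Groovy, so this Java version only illustrates the logic, not the actual implementation.

```java
// Hypothetical sketch of the consistent-failure check described in this PR.
// The 1-minute alive interval is an assumption for illustration.
public class FailureTracker {
    static final int FAILURES_THRESHOLD = 20;      // consecutive failed trace calls
    static final long ALIVE_INTERVAL_MS = 60_000;  // assumed alive-interval length
    static final int ALIVE_INTERVALS = 10;         // last success >= 10 intervals ago

    int failuresCount = 0;
    long lastSuccess = System.currentTimeMillis();

    // Called when a trace call succeeds: reset the counter, refresh the timestamp
    void onSuccess() {
        failuresCount = 0;
        lastSuccess = System.currentTimeMillis();
    }

    // Called when a trace call fails: just count it
    void onFailure() {
        failuresCount++;
    }

    // Both conditions must hold: too many consecutive failures AND
    // enough wall-clock time since the last success
    boolean hasConsistentFailures(long now) {
        return failuresCount > FAILURES_THRESHOLD
            && (now - lastSuccess) >= ALIVE_INTERVALS * ALIVE_INTERVAL_MS;
    }

    public static void main(String[] args) {
        FailureTracker t = new FailureTracker();
        t.lastSuccess = 0;                        // pretend the last success was long ago
        for (int i = 0; i < 21; i++) t.onFailure();
        long now = ALIVE_INTERVALS * ALIVE_INTERVAL_MS;  // exactly 10 intervals later
        System.out.println("skip=" + t.hasConsistentFailures(now));
    }
}
```

Requiring both a failure count and an elapsed-time condition means a short burst of failures during a transient outage does not by itself silence telemetry.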

Testing:

  • Added a new test in TowerClientTest to verify the consistent-failure detection logic under various scenarios.

Signed-off-by: jorgee <jorge.ejarque@seqera.io>

Member

@pditommaso pditommaso left a comment


I think this may be approached differently. This defensive approach (i.e. just logging errors as warnings) was adopted in the early days, to make sure monitoring was not impacting existing pipelines.

However, I'm starting to think that when launching pipelines from/with Platform, the lack of telemetry should be considered a hard failure and cause an execution error (i.e. an exception).

This should be relatively safe because there's already an exponential retry condition in HxClient. Therefore, if there's a persistent backend failure, it may be worth just stopping.

Thoughts?

@jorgee
Contributor Author

jorgee commented Feb 11, 2026

Unauthorized/forbidden errors are not in the retriable list, so these errors will produce an immediate failure. However, if telemetry errors produce failures, the head job will finish early, which could reduce the cases of unknown statuses in pipelines.

Another strange behaviour related to this topic that I have seen when testing is when providing an incorrect access token: it silently ignores the telemetry even when the --with-tower option is included. There is an error message in .nextflow.log but not in stdout. I don't know why it works that way, but with the new behaviour that @pditommaso mentioned, it should also produce an error, shouldn't it?

@pditommaso
Member

> Unauthorized/forbidden errors are not in the retriable list, so these errors will produce an immediate failure

Yeah, and that's desirable: that's an error that cannot be recovered.

> However, if telemetry errors produce failures, the head job will finish early and it could reduce the cases for unknown statuses in pipelines.

This sounds like a +1 to me.

> There is an error message in .nextflow.log but not in stdout. I don't know why it works that way, but with the new behaviour that @pditommaso mentioned, it should also produce an error, shouldn't it?

Yes, I think it should.

@pditommaso
Member

Let's add an env var to control the hard-stop behaviour. When enabled (the default), it stops on this error; when disabled, it falls back to the legacy (current) warning.
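The proposed toggle could be sketched as follows. The variable name NXF_TOWER_HARD_STOP is hypothetical (no name was agreed in this thread); only the default-on, warn-when-disabled shape comes from the comment above.

```java
// Sketch of an env-var toggle for the proposed hard-stop behaviour.
// NXF_TOWER_HARD_STOP is a hypothetical name, not from the PR.
public class TelemetryFailureMode {

    // Hard stop is enabled by default; only an explicit "false" disables it
    static boolean hardStopEnabled() {
        String v = System.getenv("NXF_TOWER_HARD_STOP");
        return v == null || !v.equalsIgnoreCase("false");
    }

    // When enabled, a telemetry failure aborts the run; otherwise it only warns
    static void onTelemetryFailure(String msg) {
        if (hardStopEnabled())
            throw new IllegalStateException("Telemetry failure: " + msg);
        else
            System.err.println("WARN: telemetry failure ignored: " + msg);
    }

    public static void main(String[] args) {
        System.out.println("hardStop=" + hardStopEnabled());
    }
}
```

Defaulting to the hard stop while keeping an escape hatch lets existing pipelines opt back into the legacy warn-only behaviour without a code change.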

@jorgee
Contributor Author

jorgee commented Feb 11, 2026

Just to be clear: with your proposal, the stop mechanism implemented in this PR is not required. We either stop the pipeline or always warn. Or do you want to keep it for the warning case? If it's not required, I will create a new PR and close this one once the other is ready.

@pditommaso
Member

The new behaviour should report an error, i.e. stop the run; when disabled, it should just warn as it does now.

@jorgee
Contributor Author

jorgee commented Feb 12, 2026

@pditommaso, the abort-on-error behaviour is implemented in #6827.

@jorgee
Contributor Author

jorgee commented Apr 14, 2026

Closing. Similar behaviour is implemented in #6827

@jorgee jorgee closed this Apr 14, 2026
