E.g. an exit code of 0:53 from Slurm means an I/O error, and in that case we might not get an outlog to debug with. Is there any way we could capture this info to make it more visible? The goal is to easily distinguish jobs that failed due to a cluster problem and just need a re-run from jobs that failed due to internal issues and need their outlog reviewed.
If this were captured in the qcstatus on XNAT, we could automatically reset jobs based on qcstatus, or similar. This could be either a command-line tool, or a process (or an extra cron job) that automatically restarted cluster-related failures (up to X times, to avoid infinite resubmissions).
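A minimal sketch of what the classification and retry cap could look like. Everything here is an assumption for illustration: the `CLUSTER_EXIT_CODES` set, the `MAX_RETRIES` cap, and the helper names (`slurm_exit_code`, `classify_failure`, `should_resubmit`) are hypothetical, not part of any existing tool. The only concrete dependency is `sacct`, which can report a job's `ExitCode` field.

```python
import subprocess

# Hypothetical config: which Slurm exit codes count as cluster-side
# failures (safe to re-run) vs internal failures (review the outlog).
CLUSTER_EXIT_CODES = {"0:53"}  # e.g. I/O error; extend as needed
MAX_RETRIES = 3                # the "X times" cap against infinite resubmission


def slurm_exit_code(job_id: str) -> str:
    """Query sacct for a job's ExitCode (formatted as "exitcode:signal")."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "-n", "-P", "-o", "ExitCode"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[0].strip()


def classify_failure(exit_code: str) -> str:
    """Return "cluster" for failures that just need a re-run,
    "internal" for failures whose outlog should be reviewed."""
    return "cluster" if exit_code in CLUSTER_EXIT_CODES else "internal"


def should_resubmit(exit_code: str, retries_so_far: int) -> bool:
    """Only auto-resubmit cluster-side failures, up to MAX_RETRIES."""
    return (classify_failure(exit_code) == "cluster"
            and retries_so_far < MAX_RETRIES)
```

The classification result is what would be written into qcstatus on XNAT, so that either a command-line tool or the cron process could call `should_resubmit` on failed jobs.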