Catch specific slurm errors? #536

@baxpr

Description

For example, an exit code of 0:53 from Slurm indicates an I/O error, and in that case we might not get an outlog to debug with. Is there any way we could capture this info to make it more visible? The goal would be to easily distinguish jobs that failed due to a cluster problem and just need a re-run from jobs that failed due to internal issues and need their outlog reviewed.

If this were captured in the qcstatus on XNAT, we could automatically reset based on qcstatus, or similar: either a command line tool, or some process (or an extra cron job) that automatically restarted cluster-related failures (up to X times, to avoid infinite resubmission).
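A minimal sketch of what this classification could look like, assuming `sacct` is available on the submit host. The set of signal numbers treated as cluster-side failures is an illustrative assumption (only signal 53, the I/O error mentioned above), not an authoritative list, and `job_exit_code` / `should_resubmit` are hypothetical helper names, not existing dax functions:

```python
import subprocess

# Signals we treat as cluster-side failures that warrant an automatic
# resubmit. Illustrative assumption: only 53 (the I/O error from the
# issue description). Slurm reports ExitCode as "returncode:signal".
CLUSTER_SIGNALS = {53}


def classify_exit(exit_code: str) -> str:
    """Classify a Slurm ExitCode string like "0:53" as 'cluster'
    (safe to auto-resubmit) or 'job' (needs outlog review)."""
    try:
        rc, sig = (int(part) for part in exit_code.split(":"))
    except ValueError:
        return "job"  # unparsable: be conservative, require review
    return "cluster" if sig in CLUSTER_SIGNALS else "job"


def job_exit_code(jobid: str) -> str:
    """Fetch the ExitCode field for a finished job via sacct
    (-n: no header, -P: parsable pipe-delimited output)."""
    out = subprocess.run(
        ["sacct", "-j", jobid, "-n", "-P", "--format=ExitCode"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = [ln.strip() for ln in out.splitlines() if ln.strip()]
    return lines[0] if lines else "0:0"


def should_resubmit(exit_code: str, attempts: int, max_attempts: int = 3) -> bool:
    """Resubmit only cluster-side failures, capped at max_attempts
    to avoid infinite submissions (the "up to X times" above)."""
    return classify_exit(exit_code) == "cluster" and attempts < max_attempts
```

A cron job could then call `should_resubmit(job_exit_code(jobid), attempts)` for each failed assessor and reset its qcstatus when it returns True; the attempt counter would need to be persisted somewhere (e.g. in the qcstatus note itself).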
