E.g. an exit code of 0:53 from Slurm means an I/O error, and in that case we might not get an outlog to debug with. Is there any way we could capture this info to make it more visible? The goal is to easily distinguish jobs that failed due to a cluster problem and just need a re-run from jobs that failed due to internal issues and need their outlog reviewed.
If this were captured in the qcstatus on XNAT, we could automatically reset jobs based on qcstatus, or similar. This could be either a command-line tool, or a process (or an extra cron job) that automatically restarted cluster-related failures (up to X times, to avoid infinite resubmissions).
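A minimal sketch of what the classification and retry cap could look like. Everything here is an assumption for illustration: the `CLUSTER_EXIT_CODES` set, the `MAX_RETRIES` cap, and the helper names (`slurm_exit_code`, `classify_failure`, `should_resubmit`) are hypothetical, not part of any existing tool. The only concrete dependency is `sacct`, which can report a job's `ExitCode` field.

```python
import subprocess

# Hypothetical config: which Slurm exit codes count as cluster-side
# failures (safe to re-run) vs internal failures (review the outlog).
CLUSTER_EXIT_CODES = {"0:53"}  # e.g. I/O error; extend as needed
MAX_RETRIES = 3                # the "X times" cap against infinite resubmission


def slurm_exit_code(job_id: str) -> str:
    """Query sacct for a job's ExitCode (formatted as "exitcode:signal")."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "-n", "-P", "-o", "ExitCode"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()[0].strip()


def classify_failure(exit_code: str) -> str:
    """Return "cluster" for failures that just need a re-run,
    "internal" for failures whose outlog should be reviewed."""
    return "cluster" if exit_code in CLUSTER_EXIT_CODES else "internal"


def should_resubmit(exit_code: str, retries_so_far: int) -> bool:
    """Only auto-resubmit cluster-side failures, up to MAX_RETRIES."""
    return (classify_failure(exit_code) == "cluster"
            and retries_so_far < MAX_RETRIES)
```

The classification result is what would be written into qcstatus on XNAT, so that either a command-line tool or the cron process could call `should_resubmit` on failed jobs.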