Hanging processes #300
Description
Hello,
I am running ppanggolin on about 40K species on a Slurm-managed cluster, and after some time the multi-threaded processes hang indefinitely.
My initial thought was that the total memory used by all the processes exceeded the available memory, and indeed, by lowering the number of CPUs I was able to run more of them. But it still ends up hanging after some time, no matter the number of genomes in the input.
This is how I'm using ppanggolin:

```bash
readarray -t LIST < pangenome_out_ALL/list_chunks_2.txt
FILE=${LIST[$SLURM_ARRAY_TASK_ID]}
for chunk in $(cat "pangenome_out_ALL/chunk_lists_2/${FILE}") ; do
    if ! ppanggolin panmodule --anno "list_gtdb_species_3/species_metadat_IMG_GTDB_fixed-${chunk}.tsv" -c 16 -o "pangenome_out_ALL/output/${chunk}" --clusters "clu_gtdb_species_ALL/prot_clu-${chunk}.tsv" --infer_singletons --rarefaction ; then
        echo "########### failed for ${chunk}"
        rm -r "pangenome_out_ALL/output/${chunk}"
    fi
done
```
I was hoping that failed runs would exit rather than hang, which is why I used a for loop...
This is the output just before I cancel the hanging job:

```
 66%|██████▌   | 356/540 [00:05<00:02, 67.92samples partitioned/s]
2024-11-05 12:04:20 partition.py:l226 WARNING Partitioning did not work (the number of genomes used is probably too low), see logs here to obtain more details /tmp/tmpfv6pgcr0/17
 89%|████████▉ | 481/540 [00:07<00:01, 58.32samples partitioned/s]
2024-11-05 12:04:22 partition.py:l226 WARNING Partitioning did not work (the number of genomes used is probably too low), see logs here to obtain more details /tmp/tmpfv6pgcr0/54
 91%|█████████ | 492/540 [00:07<00:00, 69.42samples partitioned/s]
2024-11-05 12:04:22 partition.py:l226 WARNING Partitioning did not work (the number of genomes used is probably too low), see logs here to obtain more details /tmp/tmpfv6pgcr0/56
100%|█████████▉| 539/540 [00:18<00:00, 43.59samples partitioned/s]
slurmstepd: error: *** JOB 12448523 ON n0052.dori0 CANCELLED AT 2024-11-06T09:42:15 ***
```
This is what I see when I run `top -u myusername` after logging in to the node:

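Since `top` alone doesn't say where the workers are stuck, the process state and kernel wait channel can help tell an I/O hang from a lock wait. These are standard procps options, nothing ppanggolin-specific; the column choices are just what I find useful:

```bash
# List my processes with state and kernel wait channel:
#   STAT "D"            -> uninterruptible I/O wait (often disk/NFS)
#   STAT "S" + futex_*  -> waiting on a lock, typical of a deadlocked worker pool
ps -u "${USER:-$(id -un)}" -o pid,stat,wchan:32,etime,args
```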
Is this normal? Have you ever experienced this?
I have no idea what causes the problem or how to fix it, and I'm not really sure whether it's relevant to open this issue; I apologize if it's not.
The only workaround I can see is setting a timeout, but since the species vary a lot in size it is hard to pick a limit that is neither too long nor too short.
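One idea I had is that the timeout doesn't have to be a single fixed value. Here is a rough sketch of a wrapper using GNU coreutils `timeout` that scales the limit with the chunk size; the helper name and the 60 s/genome budget are my own guesses, not anything from ppanggolin:

```bash
# Hypothetical wrapper: pick a per-chunk time limit proportional to the number
# of genomes (lines) in the annotation list, with a 30-minute floor, so small
# and large species get different budgets. All numbers are guesses to tune.
run_with_timeout() {
    local tsv="$1"; shift
    local genomes limit
    genomes=$(wc -l < "$tsv")
    limit=$(( genomes * 60 ))            # assume ~60 s per genome
    [ "$limit" -lt 1800 ] && limit=1800  # never less than 30 minutes
    # timeout sends SIGTERM at the limit, SIGKILL 60 s later,
    # and exits with status 124 when the limit was hit
    timeout --kill-after=60 "$limit" "$@"
}
```

In the loop above, the call would become `run_with_timeout "$tsv" ppanggolin panmodule ...` (with `$tsv` pointing at the `--anno` file), so the existing `if !` branch would also catch timed-out chunks via exit status 124.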
Thanks in advance,
Eric