clusterupdate of long nucleotide sequences: any performance advice?

Hi all,

I need some help tuning a clusterupdate job.

I'm interested in a rough clustering of many (~5M) long (up to 400 kb), very similar (~98% identity, 95% coverage) nucleotide sequences. I also want to be able to update this clustering with new sequences while keeping the same representatives.

My best attempt so far is:
```
mmseqs clusterupdate sequenceDB sequenceDBcandidate clusterDB sequenceDBupdated clusterDBupdated tmpDir --max-seq-len 400000 --cov-mode 0 -c 0.95 --min-seq-id 0.98 --search-type 3
```

On a dataset with sequenceDB=237557 and sequenceDBcandidate=3572432, I ran a job with 50 cores and 80 GB of RAM (more than the `Estimated memory consumption`) that timed out after 3 days. I was surprised to see that the logs report only ~1 min of compute time for the `Index table: fill` step, although it seemed to me that the job was stuck there for a rather long time.

There must be something I do not understand; could you put me on the right track?

Thanks!
