Hi all,
I need some help to tune a clusterupdate job.
I'm interested in a rough clustering of many (~5M) long (up to 400 kb), very similar (~98% identity, 95% coverage) nucleotide sequences. I want to be able to update this clustering with new sequences while keeping the same representatives.
My best attempt so far is:
mmseqs clusterupdate sequenceDB sequenceDBcandidate clusterDB sequenceDBupdated clusterDBupdated tmpDir --max-seq-len 400000 --cov-mode 0 -c 0.95 --min-seq-id 0.98 --search-type 3
On a dataset with 237,557 sequences in sequenceDB and 3,572,432 in sequenceDBcandidate, I ran a job (50 cores and 80 GB of RAM, i.e. more than the "Estimated memory consumption") that timed out after 3 days. I was surprised to see that the logs report only ~1 min of compute time for the "Index table: fill" step, although it seemed to be stuck there for a rather long time.
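For scale, here is a rough upper bound on the size of the candidate set, worked out from the numbers above. It assumes every candidate sequence is at the 400 kb maximum, which certainly overestimates the real total; the variable names are only illustrative.

```shell
# Upper bound on the total number of nucleotides in sequenceDBcandidate,
# assuming (pessimistically) that every sequence has the maximum length.
n_candidates=3572432   # sequences in sequenceDBcandidate
max_len=400000         # value passed to --max-seq-len
total_bases=$((n_candidates * max_len))
echo "at most ${total_bases} bases"
```

So the candidate set holds at most about 1.4 x 10^12 bases; the true figure is lower, since 400 kb is only the maximum length.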
There must be something I do not understand; could you put me on the right track?
Thanks!