clusterupdate of long nucleotide sequences: any performance advice? #1059

@bdelepine

Hi all,

I need some help to tune a clusterupdate job.

I'm interested in a rough clustering of many (~5M) long (up to 400 kb), very similar (~98% identity, 95% coverage) nucleotide sequences. I want to be able to update this clustering with new sequences while keeping the same representatives.

My best attempt so far is:

mmseqs clusterupdate sequenceDB sequenceDBcandidate clusterDB sequenceDBupdated clusterDBupdated tmpDir --max-seq-len 400000 --cov-mode 0 -c 0.95 --min-seq-id 0.98 --search-type 3

On a dataset with sequenceDB=237557 and sequenceDBcandidate=3572432 sequences, I ran a job (with 80 GB of RAM, i.e. more than the "Estimated memory consumption" reported in the log, and 50 cores) that timed out after 3 days. I was surprised to see that the logs report only ~1m of compute time for the "Index table: fill" step, although the job seemed to be stuck there for a rather long time.

There must be something I do not understand; could you put me on the right track?

Thanks!
