Inflated size of corrected reads compared to input fasta #26

@AnnabelWhibley

Hello Pierre,

I am running CONSENT-correct on a 20x PacBio dataset for a 1Gb genome. The version was cloned from your git repository on the 18th Feb 2021 (i.e. the most current version).

It has yet to complete, but in the process of trying to figure out how close to done it might be, I have been checking the output.
The dataset contains 3.2M uncorrected reads, but over 13M corrected reads have been written to corrected.fasta so far. This is not simply a case of reads being split: the input contains 23Gb of sequence, while the output already exceeds 82Gb.

Other issues in this repository suggest that this behaviour has been seen before, but I would value your opinion on whether and how I can salvage something from this run (or how to avoid the problem when repeating it).

I checked header uniqueness by sorting the output headers and running uniq, and found that duplicated headers explain the inflation.
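For anyone wanting to reproduce the check, here is a minimal Python sketch of the same idea as sort plus uniq. It is not part of CONSENT; the helper names `read_fasta` and `summarize` are hypothetical, and it assumes a plain-text FASTA file in which each record starts with a `>` header line.

```python
from collections import Counter
from typing import Dict, Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA handle (hypothetical helper)."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(handle: TextIO) -> Dict[str, int]:
    """Count records, distinct headers, repeated records and total bases."""
    counts = Counter()
    total_bases = 0
    for header, seq in read_fasta(handle):
        counts[header] += 1
        total_bases += len(seq)
    records = sum(counts.values())
    return {
        "records": records,
        "unique_headers": len(counts),
        "duplicate_records": records - len(counts),
        "total_bases": total_bases,
    }
```

Calling `summarize` on an open handle to corrected.fasta and comparing `records` with `unique_headers` shows how much of the output consists of repeated headers. Whether the repeated records also carry identical sequences is not checked here, so this only measures the duplication and does not by itself decide which copy to keep.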

Many thanks,
Annabel
