Inflated size of corrected reads compared to input fasta

Hello Pierre,  

I am running CONSENT-correct on a 20x PacBio dataset for a 1Gb genome. The version was cloned from your git repository on the 18th Feb 2021 (i.e. the most current version). 

It has yet to complete, but in the process of trying to figure out how close to done it might be, I have been checking the output. 
There are 3.2M uncorrected reads in the dataset, but over 13M corrected reads have been written to corrected.fasta so far. This is not simply a case of reads being split, as there are 23Gb of sequence in the input dataset and >82Gb in the output. 

I have seen some indications in the issues thread that this behaviour has been seen before but I would value your opinion on if/how I can salvage something from this run (or how to avoid this problem on repeating).

I checked for header uniqueness by sorting the output headers and running `uniq`, and found that duplicated headers can explain the inflation.
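For anyone wanting to repeat that check without sorting an 80Gb file, here is a minimal Python sketch of the same idea. It is not part of CONSENT: the function names `read_fasta` and `summarize_duplicates` are hypothetical, and it assumes the output is plain FASTA in which a repeated header means the same read was written more than once. It counts how often each header occurs and how many bases would remain if only the first copy of each header were kept.

```python
from collections import Counter


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_duplicates(handle):
    """Count records, distinct headers, and bases with and without repeats."""
    counts = Counter()
    total_bases = 0
    first_copy_bases = 0
    for header, seq in read_fasta(handle):
        counts[header] += 1
        total_bases += len(seq)
        # Only the first record seen for a header counts as "unique" sequence.
        if counts[header] == 1:
            first_copy_bases += len(seq)
    return {
        "records": sum(counts.values()),
        "unique_headers": len(counts),
        "total_bases": total_bases,
        "first_copy_bases": first_copy_bases,
    }
```

It could be run as `summarize_duplicates(open("corrected.fasta"))`. If `unique_headers` is close to the input read count and `first_copy_bases` is close to the input size, duplicated records account for the inflation. Note that copies sharing a header may not carry identical sequence, so keeping the first copy is only one possible way to salvage the run.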

Many thanks,
Annabel

Inflated size of corrected reads compared to input fasta #26
