Recalculate sequence slice coordinates after gap clipping #119

@vagkaratzas

Description of feature

When clipping mode is on (trim_msa, trim_ends_only), the gappy ends of an alignment can be removed, so an input sequence such as sequence_name may end up as sequence_name/5-147. The pipeline needs a new local module that parses the sequence names and recalculates the updated sequence coordinates every time a clipping tool is used. During parsing there are two cases: either the sequence name does not originally contain a /, so the new start and end must be calculated from its match against the original sequence, or it was already a slice of a sequence (contains /), so its start and end must be recalculated relative to the existing start and end.
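A minimal sketch of the two parsing cases, in Python. Everything here is an assumption for illustration: the function name recalculate_name is hypothetical, coordinates are taken to be 1-based and inclusive, gaps are taken to be - or ., and the clipped sequence is located with a plain substring search (first occurrence wins, which is ambiguous for repeated regions).

```python
import re

# Assumed naming convention: "<id>/<start>-<end>", 1-based inclusive coordinates.
SLICE_RE = re.compile(r"^(?P<base>.+)/(?P<start>\d+)-(?P<end>\d+)$")
GAP_RE = re.compile(r"[-.]")


def recalculate_name(name: str, original_seq: str, clipped_seq: str) -> str:
    """Return the sequence name with /start-end updated after end clipping.

    original_seq is the sequence (or slice) that `name` referred to before
    clipping; clipped_seq is the aligned row after clipping.
    """
    original = GAP_RE.sub("", original_seq).upper()
    clipped = GAP_RE.sub("", clipped_seq).upper()
    if not clipped:
        raise ValueError(f"{name}: no residues left after clipping")

    # Offset of the clipped residues inside the original (first occurrence).
    offset = original.find(clipped)
    if offset < 0:
        raise ValueError(f"{name}: clipped sequence not found in original")

    match = SLICE_RE.match(name)
    if match:
        # Case 2: already a slice, so shift relative to the old start.
        base, old_start = match["base"], int(match["start"])
    else:
        # Case 1: no "/" region yet, so the original starts at position 1.
        base, old_start = name, 1

    new_start = old_start + offset
    new_end = new_start + len(clipped) - 1
    return f"{base}/{new_start}-{new_end}"


print(recalculate_name("sequence_name", "MKTAYIAKQR", "--TAYIAK--"))
print(recalculate_name("sequence_name/5-14", "MKTAYIAKQR", "--TAYIAK--"))
```

The substring search only works when clipping removes residues from the ends; if interior columns are removed, the clipped residues are no longer contiguous in the original and the lookup fails.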

However, this does not make sense when trim_msa is true and trim_ends_only is false (not advisable to do so), because in that case gaps in the middle of the sequences (not just at the ends) can also be removed, and the meaning of the initial sequence range is lost. The logic flow should be controlled accordingly with conditional statements based on these parameters.
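The parameter gating could look roughly like the following Python sketch. The function name coordinate_action and its return values are hypothetical, and it assumes that no clipping happens at all when trim_msa is false.

```python
def coordinate_action(trim_msa: bool, trim_ends_only: bool) -> str:
    """Decide what to do with sequence coordinates for a parameter combination."""
    if not trim_msa:
        # Assumption: no clipping takes place, so names stay untouched.
        return "keep"
    if trim_ends_only:
        # Only the ends were clipped, so a /start-end range is still meaningful.
        return "recalculate"
    # Interior gaps may have been removed; a contiguous range no longer applies.
    return "skip"


for flags in [(False, False), (True, True), (True, False)]:
    print(flags, coordinate_action(*flags))
```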

To avoid extra calculations mid-execution, instead of calculating the new coordinates every time after clipping (which may happen in different places with future updates), one option is to calculate the actual coordinates once, just before outputting the final FASTA and MSA files of the pipeline execution. However, we will need to make sure that no intermediate files with wrong coordinates are saved, depending on the user-selected parameters.

The region (/start-end) should also be output in the family reps _mqc file.

Labels: enhancement (New feature or request)