
vg autoindex should not chop nodes to be smaller than the documented maximum length of 1024 bp by default #4829

@Knight-JChev

The title of the issue is the deduction I made from my tests, explained below.

1. What were you trying to do?
Count, for each aligned GAF segment, the average number of bases aligned to it, using the GFA to retrieve segment lengths.

2. What did you want to happen?
I wanted every GAF segment to be present in the GFA.

3. What actually happened?
Some GAF segments were not present in the GFA. The missing segments either had IDs falling between two GFA segments (e.g. segment 58 in the GAF but only 57 and 59 in the GFA) or were 'out-of-bound' (e.g. segment 1,000 in the GAF where the "last" segment in the GFA is 700).

4. What data and command can the vg dev team use to make the problem happen?
From my tests, it should be reproducible with any data. I tried it on 3 different plant graphs (two clipped graphs that I made and one full graph that I was provided).
I made the GFA with Minigraph-Cactus. (cactus v3.0.1; minigraph 0.21-r606)

cactus-pangenome ./js /path/to/seqfile_local --outDir ./PG_graph --outName PG --reference Ref \
    --vcf --viz --giraffe --gfa --gbz --stats --mgCores 30 --mapCores 30 --consCores 30
  • Unzip the produced gfa.gz file.
  • Use vg autoindex to obtain the indexes needed for vg giraffe (issue happens for both sr-giraffe and lr-giraffe workflows, with --parameter-preset r10 for long-reads).
  • Align using giraffe with the newly-generated index files.
gunzip PG.gfa.gz  # Not sure if VG manages compressed gfa files
vg autoindex --no-guessing --workflow sr-giraffe --prefix PG --gfa PG.gfa
vg giraffe \
    -t 4 -o GAF --progress \
    -Z ./PG.giraffe.gbz \
    -m ./PG.shortread.withzip.min \
    -z ./PG.shortread.zipcodes \
    -d ./PG.dist \
    -f ./fancy_reads.fastq \
    > ./giraffe_aln_SR_PG.gaf

Check for GAF segments not present in the GFA. I wrote a Python script for this that I can share if you wish, but honestly, a quick look at the files will show the problem.
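For reference, the check can be sketched roughly as follows. This is a minimal illustration and not the script mentioned above: the function names are hypothetical, and it assumes an uncompressed, tab-separated GFA with segment names in the second field of `S` lines and a GAF whose path is in column 6 as oriented steps such as `>57>58>59`.

```python
import re

# One oriented step in a GAF path, e.g. ">58" or "<59".
STEP = re.compile(r"[<>]([^<>]+)")


def gfa_segment_names(gfa_lines):
    """Collect segment names from the S lines of a GFA."""
    names = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) > 1:
            names.add(fields[1])
    return names


def gaf_segment_names(gaf_lines):
    """Collect every segment name visited in the GAF path column (column 6)."""
    names = set()
    for line in gaf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6 or fields[5] == "*":
            continue  # short line or unaligned read
        path = fields[5]
        steps = STEP.findall(path)
        # A path without orientation markers is a single segment name.
        names.update(steps if steps else [path])
    return names


def missing_segments(gfa_path, gaf_path):
    """Return the GAF segment names that have no S line in the GFA."""
    with open(gfa_path) as gfa, open(gaf_path) as gaf:
        return gaf_segment_names(gaf) - gfa_segment_names(gfa)
```

With the files from the commands above, `missing_segments("PG.gfa", "giraffe_aln_SR_PG.gaf")` would return the set of segment names that appear in the alignments but not in the GFA.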

5. What does running vg version say?
I downloaded the image from quay.io.

  singularity pull docker://quay.io/vgteam/vg:v1.72.0
vg version v1.72.0 "Littlefoot"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 on Linux
Linked against libstd++ 20230528
Using HTSlib headers 101990, library 1.19.1
Built by root@buildkitsandbox

6. Personal notes, and a bit of insight
When I constructed the GFA, I used the --giraffe option of cactus-pangenome. It produced a .d2.gbz file along with a .gbz file, which are indexes for the filtered and clipped graphs respectively, if I'm not mistaken. Interestingly, the phenomenon described above does not happen when using these indexes.
Also, it seems that only the .gbz file is affected by this issue, not the other indexes. I tested the minimizer, distance and zipcode indexes generated from the GFA file together with the .gbz generated by cactus-pangenome, and I observed the same results as when all the indexes came directly from cactus-pangenome.

Please inform me if I did something wrong or if you need any other information.
