Bug: Failed to download viral library due to trailing slash in NCBI assembly_summary.txt #1014

@inspirewind

Description

When using the k2 Python wrapper to download and build the viral library, the download step fails because it attempts to fetch a malformed URL (missing the base name, resulting in //_genomic.fna.gz). This issue does not occur when downloading the bacteria library.

Error Logs

[INFO - 2026-03-09 10:04:29,521]: Adding viral to /mnt/STK4T/kraken2/PRISM_DB
[INFO - 2026-03-09 10:04:30,702]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt
[INFO - 2026-03-09 10:04:33,610]: Calculating MD5 sum for /mnt/STK4T/kraken2/PRISM_DB/library/viral/assembly_summary.txt
[INFO - 2026-03-09 10:04:33,644]: MD5 sum of /mnt/STK4T/kraken2/PRISM_DB/library/viral/assembly_summary.txt is 24700a5d3ce18bb3912f01d792576309
[INFO - 2026-03-09 10:04:33,644]: Saved assembly_summary.txt to /mnt/STK4T/kraken2/PRISM_DB/library/viral
[INFO - 2026-03-09 10:04:34,707]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1//_genomic.fna.gz
[WARNING - 2026-03-09 10:04:34,939]: Cannot find file: genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1//_genomic.fna.gz.
Please report this issue to NCBI.
[INFO - 2026-03-09 10:04:34,940]: Calculating MD5 sum for /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1/_genomic.fna.gz
[INFO - 2026-03-09 10:04:34,940]: MD5 sum of /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1/_genomic.fna.gz is d41d8cd98f00b204e9800998ecf8427e
[INFO - 2026-03-09 10:04:34,940]: Saved _genomic.fna.gz to /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1
[WARNING - 2026-03-09 10:04:34,941]: Unable to download genomes/all/GCF/001/964/455/GCF_001964455.1_ViralMultiSegProj362049//_genomic.fna.gz. Reason: c8
, will try again

The bug is caused by a slight formatting difference in NCBI's assembly_summary.txt between different taxonomic groups, specifically the presence of a trailing slash in the ftp_path column for viral genomes.

In the viral summary file, the ftp_path ends with a slash /:

##See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date assembly_type group genome_size genome_size_ungapped gc_percent replicon_count scaffold_count contig_count annotation_provider annotation_name annotation_date total_gene_count protein_coding_gene_count non_coding_gene_count pubmed_id
GCF_000839185.1 PRJNA485481 na na na 10243 3431481 Cowpox virus strain=Brighton Red na latest Complete Genome Major Full 2003-05-19 ViralProj14174 Molecular Genetics and Microbiology, Duke University Medical Center GCA_000839185.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/ na ICTV species exemplar na haploid viral 224499 224499 33.500000 NCBI RefSeq Annotation submitted by NCBI RefSeq 2018-08-13 233 233 0 2014645;2309453;6961398;8091665

In the bacteria summary, there is no trailing slash:

##See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date assembly_type group genome_size genome_size_ungapped gc_percent replicon_count scaffold_count contig_count annotation_provider annotation_name annotation_date total_gene_count protein_coding_gene_count non_coding_gene_count pubmed_id
GCF_036600855.1 PRJNA224116 SAMN38772065 JBAFXD000000000.1 na 7 7 Azorhizobium caulinodans strain=CNM20190194 na latest Contig Major Full 2024-02-14 ASM3660085v1 Instituto de Salud Carlos III GCA_036600855.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1 na na na haploid bacteria 5666542 5666542 67.000000 0 171 171 NCBI RefSeq GCF_036600855.1-RS_2026_02_13 2026-02-13 5328 5192 58 na
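To check whether a given assembly_summary.txt is affected, one option is to count how many data rows have an ftp_path ending in a slash. The snippet below is only a sketch and not part of k2: count_trailing_slashes and the synthetic rows are made up for illustration, and the column index 19 is taken from the fields[19] access in the k2 code quoted in this issue.

```python
FTP_PATH_COLUMN = 19  # 0-based index of ftp_path, matching fields[19] in the k2 excerpt


def count_trailing_slashes(lines):
    """Return (data_rows, rows_whose_ftp_path_ends_with_slash)."""
    total = with_slash = 0
    for line in lines:
        # Skip the "#"-prefixed header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= FTP_PATH_COLUMN:
            continue  # malformed or truncated row
        total += 1
        if fields[FTP_PATH_COLUMN].endswith("/"):
            with_slash += 1
    return total, with_slash


def _row(ftp_path):
    # Hypothetical 20-column row; every column except ftp_path is a placeholder.
    fields = ["na"] * 20
    fields[FTP_PATH_COLUMN] = ftp_path
    return "\t".join(fields) + "\n"


sample = [
    "#assembly_accession\tbioproject\n",
    _row("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/"),
    _row("https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1"),
]
print(count_trailing_slashes(sample))  # (2, 1)
```

On a real download, the same function can be called with an open file handle, for example `count_trailing_slashes(open("assembly_summary.txt"))`.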

Code Walkthrough

The download failure stems from how the make_manifest_from_assembly_summary function processes the ftp_path column from NCBI's assembly_summary.txt.

def make_manifest_from_assembly_summary(args, assembly_summary_file):
    # ... 
    suffix = "_protein.faa.gz" if args.protein else "_genomic.fna.gz"
    
    for line in assembly_summary_file:
        # ...
        fields = line.strip().split("\t")
        
        # 1. Extraction
        taxid, asm_level, ftp_path = fields[5], fields[11], fields[19] 
        
        # ...
        # 2. String Concatenation (The Bug)
        remote_path = ftp_path + "/" + os.path.basename(ftp_path) + suffix

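Because the viral ftp_path ends with a slash, os.path.basename(".../GCF_000839185.1_ViralProj14174/") evaluates to the empty string "". The concatenation in make_manifest_from_assembly_summary then reduces to ftp_path + "/" + "" + suffix, so remote_path ends in //_genomic.fna.gz instead of /GCF_000839185.1_ViralProj14174/GCF_000839185.1_ViralProj14174_genomic.fna.gz. For the bacteria entries, which have no trailing slash, basename returns the assembly directory name and the path is built correctly.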
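The behavior can be reproduced outside k2 with a few lines of Python. The sketch below also shows one possible workaround, stripping trailing slashes before calling os.path.basename. build_remote_path is a hypothetical helper written for this issue, not a function in k2, and the maintainers may prefer a different fix.

```python
import os

SUFFIX = "_genomic.fna.gz"


def build_remote_path(ftp_path, suffix=SUFFIX):
    """Build the genome file URL, tolerating a trailing slash in ftp_path."""
    normalized = ftp_path.rstrip("/")  # drop any trailing "/" before taking the basename
    return normalized + "/" + os.path.basename(normalized) + suffix


viral = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/"
bacteria = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1"

# Current behavior: basename of a path ending in "/" is "", so the name is lost.
buggy = viral + "/" + os.path.basename(viral) + SUFFIX
print(buggy)

# With normalization, both the viral and bacteria forms yield a valid path.
print(build_remote_path(viral))
print(build_remote_path(bacteria))
```

Paths without a trailing slash are unchanged by rstrip("/"), so the bacteria library would be built exactly as before.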
