Description
When using the k2 Python wrapper to download and build the viral library, the download step fails because it attempts to fetch a malformed URL (missing the base name, resulting in //_genomic.fna.gz). This issue does not occur when downloading the bacteria library.
Error Logs
[INFO - 2026-03-09 10:04:29,521]: Adding viral to /mnt/STK4T/kraken2/PRISM_DB
[INFO - 2026-03-09 10:04:30,702]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt
[INFO - 2026-03-09 10:04:33,610]: Calculating MD5 sum for /mnt/STK4T/kraken2/PRISM_DB/library/viral/assembly_summary.txt
[INFO - 2026-03-09 10:04:33,644]: MD5 sum of /mnt/STK4T/kraken2/PRISM_DB/library/viral/assembly_summary.txt is 24700a5d3ce18bb3912f01d792576309
[INFO - 2026-03-09 10:04:33,644]: Saved assembly_summary.txt to /mnt/STK4T/kraken2/PRISM_DB/library/viral
[INFO - 2026-03-09 10:04:34,707]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1//_genomic.fna.gz
[WARNING - 2026-03-09 10:04:34,939]: Cannot find file: genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1//_genomic.fna.gz.
Please report this issue to NCBI.
[INFO - 2026-03-09 10:04:34,940]: Calculating MD5 sum for /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1/_genomic.fna.gz
[INFO - 2026-03-09 10:04:34,940]: MD5 sum of /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1/_genomic.fna.gz is d41d8cd98f00b204e9800998ecf8427e
[INFO - 2026-03-09 10:04:34,940]: Saved _genomic.fna.gz to /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1
[WARNING - 2026-03-09 10:04:34,941]: Unable to download genomes/all/GCF/001/964/455/GCF_001964455.1_ViralMultiSegProj362049//_genomic.fna.gz. Reason: c8
, will try again
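One detail in the log is worth pointing out: the reported checksum d41d8cd98f00b204e9800998ecf8427e is the MD5 digest of empty input, which suggests that the _genomic.fna.gz file saved after the failed fetch contains no data. A quick way to confirm this, using only the standard library:

```python
import hashlib

# MD5 of zero bytes; compare against the checksum printed in the log.
EMPTY_MD5 = hashlib.md5(b"").hexdigest()
LOGGED_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # copied from the log above

print(EMPTY_MD5 == LOGGED_MD5)
```

If the comparison holds, the "Saved _genomic.fna.gz" message refers to an empty placeholder rather than a real genome archive.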
The bug is caused by a slight formatting difference in NCBI's assembly_summary.txt between different taxonomic groups, specifically the presence of a trailing slash in the ftp_path column for viral genomes.
In the viral summary file, the ftp_path ends with a slash /:
##See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date assembly_type group genome_size genome_size_ungapped gc_percent replicon_count scaffold_count contig_count annotation_provider annotation_name annotation_date total_gene_count protein_coding_gene_count non_coding_gene_count pubmed_id
GCF_000839185.1 PRJNA485481 na na na 10243 3431481 Cowpox virus strain=Brighton Red na latest Complete Genome Major Full 2003-05-19 ViralProj14174 Molecular Genetics and Microbiology, Duke University Medical Center GCA_000839185.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/ na ICTV species exemplar na haploid viral 224499 224499 33.500000 NCBI RefSeq Annotation submitted by NCBI RefSeq 2018-08-13 233 233 0 2014645;2309453;6961398;8091665
In the bacteria summary, there is no trailing slash:
##See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
#assembly_accession bioproject biosample wgs_master refseq_category taxid species_taxid organism_name infraspecific_name isolate version_status assembly_level release_type genome_rep seq_rel_date asm_name asm_submitter gbrs_paired_asm paired_asm_comp ftp_path excluded_from_refseq relation_to_type_material asm_not_live_date assembly_type group genome_size genome_size_ungapped gc_percent replicon_count scaffold_count contig_count annotation_provider annotation_name annotation_date total_gene_count protein_coding_gene_count non_coding_gene_count pubmed_id
GCF_036600855.1 PRJNA224116 SAMN38772065 JBAFXD000000000.1 na 7 7 Azorhizobium caulinodans strain=CNM20190194 na latest Contig Major Full 2024-02-14 ASM3660085v1 Instituto de Salud Carlos III GCA_036600855.1 identical https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1 na na na haploid bacteria 5666542 5666542 67.000000 0 171 171 NCBI RefSeq GCF_036600855.1-RS_2026_02_13 2026-02-13 5328 5192 58 na
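To check how many rows of a given assembly_summary.txt are affected, a small scan of the ftp_path column is enough. This is a minimal sketch, not part of k2: the function name `accessions_with_trailing_slash` is hypothetical, and it assumes the file is tab-separated with ftp_path at index 19, as the column header above indicates.

```python
def accessions_with_trailing_slash(lines, ftp_col=19):
    """Return accessions whose ftp_path ends with '/'.

    Hypothetical diagnostic helper; assumes tab-separated rows with
    ftp_path at index `ftp_col` and comment lines starting with '#'.
    """
    flagged = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > ftp_col and fields[ftp_col].endswith("/"):
            flagged.append(fields[0])
    return flagged


# Synthetic rows that mimic only the two columns of interest.
viral_row = "\t".join(
    ["GCF_000839185.1"] + ["na"] * 18
    + ["https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/"]
)
bacteria_row = "\t".join(
    ["GCF_036600855.1"] + ["na"] * 18
    + ["https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1"]
)
sample = ["#assembly_accession\t...\n", viral_row + "\n", bacteria_row + "\n"]
flagged = accessions_with_trailing_slash(sample)
print(flagged)
```

On a real file, the same function can be called with an open file handle, for example `accessions_with_trailing_slash(open("assembly_summary.txt"))`.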
Code Walkthrough
The download failure stems from how the make_manifest_from_assembly_summary function processes the ftp_path column from NCBI's assembly_summary.txt.
def make_manifest_from_assembly_summary(args, assembly_summary_file):
    # ...
    suffix = "_protein.faa.gz" if args.protein else "_genomic.fna.gz"
    for line in assembly_summary_file:
        # ...
        fields = line.strip().split("\t")
        # 1. Extraction
        taxid, asm_level, ftp_path = fields[5], fields[11], fields[19]
        # ...
        # 2. String Concatenation (The Bug)
        remote_path = ftp_path + "/" + os.path.basename(ftp_path) + suffix

os.path.basename(".../GCF_000839185.1_ViralProj14174/") evaluates to an empty string "". Consequently, when the script constructs the remote_path in make_manifest_from_assembly_summary, it concatenates the URL without the base name, leading to the //_genomic.fna.gz path.
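The behaviour can be reproduced in isolation, along with one possible fix: stripping any trailing slash from ftp_path before building the remote path. This is a sketch under assumptions, not the maintainers' patch. The helper name `build_remote_path` is hypothetical and simply mirrors the concatenation shown in the walkthrough.

```python
import os


def build_remote_path(ftp_path, suffix="_genomic.fna.gz"):
    """Hypothetical helper mirroring the concatenation in the walkthrough.

    Normalizes ftp_path first so os.path.basename never sees a
    trailing '/' and therefore never returns an empty string.
    """
    ftp_path = ftp_path.rstrip("/")
    return ftp_path + "/" + os.path.basename(ftp_path) + suffix


viral = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/"
bacteria = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1"

# The cause: basename of a path ending in '/' is empty.
print(repr(os.path.basename(viral)))

# Without normalization, the viral row yields the malformed path.
buggy = viral + "/" + os.path.basename(viral) + "_genomic.fna.gz"
print(buggy.endswith("//_genomic.fna.gz"))

# With normalization, both rows produce a well-formed path.
print(build_remote_path(viral))
print(build_remote_path(bacteria))
```

Because `rstrip("/")` is a no-op on paths without a trailing slash, the bacteria rows are unaffected by this change.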