Bug: Failed to download viral library due to trailing slash in NCBI assembly_summary.txt #1014

When using the k2 Python wrapper to download and build the `viral` library, the download step fails because it attempts to fetch a malformed URL (missing the base name, resulting in `//_genomic.fna.gz`). This issue does not occur when downloading the bacteria library.

### Error Logs

> [INFO - 2026-03-09 10:04:29,521]: Adding viral to /mnt/STK4T/kraken2/PRISM_DB
> [INFO - 2026-03-09 10:04:30,702]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/assembly_summary.txt
> [INFO - 2026-03-09 10:04:33,610]: Calculating MD5 sum for /mnt/STK4T/kraken2/PRISM_DB/library/viral/assembly_summary.txt
> [INFO - 2026-03-09 10:04:33,644]: MD5 sum of /mnt/STK4T/kraken2/PRISM_DB/library/viral/assembly_summary.txt is 24700a5d3ce18bb3912f01d792576309
> [INFO - 2026-03-09 10:04:33,644]: Saved assembly_summary.txt to /mnt/STK4T/kraken2/PRISM_DB/library/viral
> [INFO - 2026-03-09 10:04:34,707]: Beginning download of ftp.ncbi.nlm.nih.gov/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1//_genomic.fna.gz
> [WARNING - 2026-03-09 10:04:34,939]: Cannot find file: genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1//_genomic.fna.gz.
> Please report this issue to NCBI.
> [INFO - 2026-03-09 10:04:34,940]: Calculating MD5 sum for /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1/_genomic.fna.gz
> [INFO - 2026-03-09 10:04:34,940]: MD5 sum of /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1/_genomic.fna.gz is d41d8cd98f00b204e9800998ecf8427e
> [INFO - 2026-03-09 10:04:34,940]: Saved _genomic.fna.gz to /mnt/STK4T/kraken2/PRISM_DB/library/viral/genomes/all/GCF/023/141/955/GCF_023141955.1_ASM2314195v1
> [WARNING - 2026-03-09 10:04:34,941]: Unable to download genomes/all/GCF/001/964/455/GCF_001964455.1_ViralMultiSegProj362049//_genomic.fna.gz. Reason: c8
> , will try again
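
One detail in the log above may help triage: the MD5 reported for the saved `_genomic.fna.gz` is the digest of zero bytes, which suggests that an empty file was written after the failed fetch. A minimal check, using only the standard library:

```python
import hashlib

# MD5 digest of empty input; compare with the value reported in the log.
empty_md5 = hashlib.md5(b"").hexdigest()
print(empty_md5)
```

If the printed value matches the one in the log, the "Saved _genomic.fna.gz" line refers to an empty file rather than a real genome archive.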


The bug is caused by a slight formatting difference in NCBI's `assembly_summary.txt` between different taxonomic groups, specifically the presence of a trailing slash in the `ftp_path` column for viral genomes.

In the viral summary file, the `ftp_path` value ends with a trailing slash (`/`):

> ##See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
> #assembly_accession	bioproject	biosample	wgs_master	refseq_category	taxid	species_taxid	organism_name	infraspecific_name	isolate	version_status	assembly_level	release_type	genome_rep	seq_rel_date	asm_name	asm_submitter	gbrs_paired_asm	paired_asm_comp	ftp_path	excluded_from_refseq	relation_to_type_material	asm_not_live_date	assembly_type	group	genome_size	genome_size_ungapped	gc_percent	replicon_count	scaffold_count	contig_count	annotation_provider	annotation_name	annotation_date	total_gene_count	protein_coding_gene_count	non_coding_gene_count	pubmed_id
> GCF_000839185.1	PRJNA485481	na	na	na	10243	3431481	Cowpox virus	strain=Brighton Red	na	latest	Complete Genome	Major	Full	2003-05-19	ViralProj14174	Molecular Genetics and Microbiology, Duke University Medical Center	GCA_000839185.1	identical	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/	na	ICTV species exemplar	na	haploid	viral	224499	224499	33.500000	NCBI RefSeq	Annotation submitted by NCBI RefSeq	2018-08-13	233	233	0	2014645;2309453;6961398;8091665

In the bacteria summary, there is no trailing slash:

> ##See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
> #assembly_accession	bioproject	biosample	wgs_master	refseq_category	taxid	species_taxid	organism_name	infraspecific_name	isolate	version_status	assembly_level	release_type	genome_rep	seq_rel_date	asm_name	asm_submitter	gbrs_paired_asm	paired_asm_comp	ftp_path	excluded_from_refseq	relation_to_type_material	asm_not_live_date	assembly_type	group	genome_size	genome_size_ungapped	gc_percent	replicon_count	scaffold_count	contig_count	annotation_provider	annotation_name	annotation_date	total_gene_count	protein_coding_gene_count	non_coding_gene_count	pubmed_id
> GCF_036600855.1	PRJNA224116	SAMN38772065	JBAFXD000000000.1	na	7	7	Azorhizobium caulinodans	strain=CNM20190194	na	latest	Contig	Major	Full	2024-02-14	ASM3660085v1	Instituto de Salud Carlos III	GCA_036600855.1	identical	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1	na	na	na	haploid	bacteria	5666542	5666542	67.000000	0	171	171	NCBI RefSeq	GCF_036600855.1-RS_2026_02_13	2026-02-13	5328	5192	58	na
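
To see how many rows of a given summary file are affected, something like the following sketch could be used. The function name `rows_with_trailing_slash` and the `_row` helper are hypothetical and not part of k2. The sample rows are abbreviated stand-ins that carry only the accession and `ftp_path` values quoted above, with every other column filled with `na`.

```python
FTP_PATH_COLUMN = 19  # zero-based index of ftp_path in the header shown above


def rows_with_trailing_slash(lines):
    """Return the assembly accessions whose ftp_path ends with '/'."""
    flagged = []
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= FTP_PATH_COLUMN:
            continue  # malformed or truncated row
        if fields[FTP_PATH_COLUMN].endswith("/"):
            flagged.append(fields[0])
    return flagged


def _row(accession, ftp_path):
    # Hypothetical helper: builds an abbreviated 20-column row for illustration.
    fields = ["na"] * 20
    fields[0] = accession
    fields[FTP_PATH_COLUMN] = ftp_path
    return "\t".join(fields) + "\n"


sample = [
    "#assembly_accession\t...\n",
    _row("GCF_000839185.1",
         "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/"),
    _row("GCF_036600855.1",
         "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1"),
]

print(rows_with_trailing_slash(sample))
```

Against a real file, the same function can be called with an open file handle, for example `rows_with_trailing_slash(open("assembly_summary.txt"))`.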

### Code Walkthrough
The download failure stems from how the `make_manifest_from_assembly_summary` function processes the `ftp_path` column from NCBI's `assembly_summary.txt`.
``` python
def make_manifest_from_assembly_summary(args, assembly_summary_file):
    # ... 
    suffix = "_protein.faa.gz" if args.protein else "_genomic.fna.gz"
    
    for line in assembly_summary_file:
        # ...
        fields = line.strip().split("\t")
        
        # 1. Extraction
        taxid, asm_level, ftp_path = fields[5], fields[11], fields[19] 
        
        # ...
        # 2. String Concatenation (The Bug)
        remote_path = ftp_path + "/" + os.path.basename(ftp_path) + suffix
```

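Because the viral `ftp_path` ends with `/`, `os.path.basename(".../GCF_000839185.1_ViralProj14174/")` evaluates to an empty string `""`. The expression in `make_manifest_from_assembly_summary` therefore reduces to `ftp_path + "/" + "" + suffix`, which yields `.../GCF_000839185.1_ViralProj14174//_genomic.fna.gz` instead of the expected `.../GCF_000839185.1_ViralProj14174/GCF_000839185.1_ViralProj14174_genomic.fna.gz`. The bacteria paths have no trailing slash, so `os.path.basename` returns the assembly directory name and the URL is built correctly.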
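
One possible fix is to strip any trailing slash from `ftp_path` before taking its base name. The sketch below is not the actual k2 code: `build_remote_path` is a hypothetical helper that isolates the concatenation step, and the two paths are the `ftp_path` values quoted earlier in this report.

```python
import os

SUFFIX = "_genomic.fna.gz"


def build_remote_path(ftp_path, suffix=SUFFIX):
    """Hypothetical helper: build the download path, tolerating a trailing '/'."""
    ftp_path = ftp_path.rstrip("/")  # normalize ".../name/" to ".../name"
    return ftp_path + "/" + os.path.basename(ftp_path) + suffix


viral = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/839/185/GCF_000839185.1_ViralProj14174/"
bacterial = "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/036/600/855/GCF_036600855.1_ASM3660085v1"

# Current behavior: basename of a path ending in "/" is empty.
print(repr(os.path.basename(viral)))

# With normalization, both inputs produce a well-formed path.
print(build_remote_path(viral))
print(build_remote_path(bacterial))
```

In the original function, the equivalent change would presumably be a single `ftp_path = ftp_path.rstrip("/")` before `remote_path` is built, which leaves the bacteria case unchanged.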
