A Nextflow pipeline for phage discovery from metagenomic assemblies.
Takes a combined contig FASTA and runs viral identification, quality assessment, annotation, and classification end-to-end.
contigs.fasta
│
├─ geNomad    viral classification + provirus detection
├─ R filter   DTR topology always kept; Provirus filtered by score
├─ CheckV     completeness + quality estimation
├─ R filter   quality-based filtering with DTR bypass
├─ Pharokka   phage genome annotation (MMseqs2 + PyHMMER)
│
└─ PhaBOX*    taxonomy, lifestyle, host, vOTU clustering, phylogenetic tree
(*optional; requires separate installation)
- Nextflow >= 23.04
- One of:
  - conda / mamba, or
  - Docker (laptop/workstation), or
  - Singularity / Apptainer (HPC)
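
The requirements above can be checked before anything is downloaded. The following is a minimal preflight sketch, not part of phinder: the `version_ge` helper is hypothetical, it relies on `sort -V`, and it assumes `nextflow -version` prints a line of the form `version X.Y.Z build N`.

```shell
# version_ge A B: succeeds when version A >= version B (relies on `sort -V`).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

min_nextflow="23.04"

if command -v nextflow >/dev/null 2>&1; then
    # Assumption: the banner contains a line like "version 23.10.1 build 5891".
    nf_version=$(nextflow -version 2>/dev/null | awk '$1 == "version" { print $2; exit }')
    if version_ge "${nf_version:-0}" "$min_nextflow"; then
        echo "nextflow $nf_version: OK"
    else
        echo "nextflow ${nf_version:-unknown}: older than $min_nextflow"
    fi
else
    echo "nextflow: not found on PATH"
fi

# Any one of these is enough to provision the tools.
engines_found=""
for engine in conda mamba docker singularity apptainer; do
    if command -v "$engine" >/dev/null 2>&1; then
        engines_found="$engines_found $engine"
    fi
done
echo "engines found:${engines_found:- none}"
```

The script only reports what it finds; it does not install or change anything.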
phinder needs two things: the reference databases (downloaded once) and a way
to provision the tools (a -profile — containers or conda).
1. Download the databases
curl -O https://raw.githubusercontent.com/andrewbudge/phinder/main/setup.sh
bash setup.sh --skip-envs   # databases only → ~/.phinder_dbs
Drop --skip-envs to also build pinned conda environments (genomad_phinder,
checkv_phinder, pharokka_phinder); this is only needed for the conda profile below.
Add --with-phabox2 to set up the optional PhaBOX step.
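
Once setup.sh has finished, a quick check that the three database directories exist can save a failed first run. This is a sketch under assumptions: the directory names are taken from the --*_db paths used in the run commands of this README, and both the `check_dbs` helper and the `PHINDER_DB_DIR` variable are hypothetical, not something phinder provides.

```shell
# check_dbs DIR: report which expected database directories exist under DIR.
# Returns non-zero if any of them is missing.
check_dbs() {
    db_root=$1
    missing=0
    for db in genomad_db checkv_db pharokka_db; do
        if [ -d "$db_root/$db" ]; then
            echo "found    $db_root/$db"
        else
            echo "missing  $db_root/$db"
            missing=1
        fi
    done
    return "$missing"
}

# PHINDER_DB_DIR is a made-up override; the fallback is the setup.sh default.
check_dbs "${PHINDER_DB_DIR:-$HOME/.phinder_dbs}" || echo "run setup.sh before starting the pipeline"
```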
2. Run
With Docker — pinned images, nothing to build (recommended):
nextflow run andrewbudge/phinder \
--input contigs.fasta \
--genomad_db ~/.phinder_dbs/genomad_db \
--checkv_db ~/.phinder_dbs/checkv_db \
--pharokka_db ~/.phinder_dbs/pharokka_db \
-profile docker
With conda (uses the envs built by setup.sh; run it without --skip-envs):
nextflow run andrewbudge/phinder \
--input contigs.fasta \
--genomad_db ~/.phinder_dbs/genomad_db \
--checkv_db ~/.phinder_dbs/checkv_db \
--pharokka_db ~/.phinder_dbs/pharokka_db \
--genomad_env "$(conda info --base)/envs/genomad_phinder" \
--checkv_env "$(conda info --base)/envs/checkv_phinder" \
--pharokka_env "$(conda info --base)/envs/pharokka_phinder" \
-profile conda
Use -resume on reruns to skip completed steps. For HPC / Singularity / Apptainer,
see Execution profiles.
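
The long flag lists above can also be kept in a file. Nextflow accepts a YAML or JSON file through its -params-file option, with keys named after the pipeline parameters. The sketch below assumes phinder's parameters can be supplied that way like any other Nextflow pipeline's; the file name `phinder_params.yml` is made up for illustration.

```shell
db_root="$HOME/.phinder_dbs"
params_file="phinder_params.yml"

# The heredoc delimiter is unquoted, so $db_root is expanded when the file is
# written. A literal "~" would not be expanded inside a YAML file.
cat > "$params_file" <<EOF
input: "contigs.fasta"
genomad_db: "$db_root/genomad_db"
checkv_db: "$db_root/checkv_db"
pharokka_db: "$db_root/pharokka_db"
EOF

# Print the command instead of running it.
echo "nextflow run andrewbudge/phinder -params-file $params_file -profile docker -resume"
```

Keeping the parameters file next to the results is one way to record how a run was configured.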
Reproducible runs: pin a released version with -r, e.g. nextflow run andrewbudge/phinder -r v0.2.0 ....
Without -r, Nextflow tracks the default branch (main), which moves. Pin -r, record the DB versions from DB_MANIFEST.tsv, and you can reproduce a run exactly.
Before running on real data, confirm Nextflow and your tool profile are wired up correctly — no databases required. This runs the whole pipeline on a tiny bundled dataset in stub mode, where every step emits placeholder outputs in seconds:
nextflow run andrewbudge/phinder -profile test -stub-run
A [SUCCESS] line with all 8 processes completed means your setup is good. To
also check that your engine pulls images / builds envs, add it to the profile:
nextflow run andrewbudge/phinder -profile test,docker -stub-run   # or test,conda
This is the same check phinder's CI runs on every change.
A single combined contig FASTA — uncompressed or gzipped. If working with multiple samples, merge and rename headers before passing to phinder to avoid ID collisions:
# Example: prefix each sample's contigs with its ID
zcat sample1_contigs.fasta.gz | sed 's/>/>sample1_/' >> combined.fasta
zcat sample2_contigs.fasta.gz | sed 's/>/>sample2_/' >> combined.fasta
Coverage filtering, circular contig extraction, and other pre-processing are
left to the user — phinder is assembler-agnostic and makes no assumptions about
header format. If your assembler embeds coverage=N in headers (e.g. metaMDBG),
phinder will parse it and flag low-coverage candidates.
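
For more than a couple of samples, the merge-and-rename step can be wrapped in a small function that also checks for ID collisions afterwards. This is a sketch, not part of phinder: `merge_contigs` and its SAMPLE=FILE argument form are hypothetical, and sample names are assumed to contain no characters that are special to sed (such as `/` or `&`).

```shell
# merge_contigs OUT SAMPLE=FILE [SAMPLE=FILE ...]
# Prefixes every FASTA header with "<SAMPLE>_" and writes all records to OUT.
merge_contigs() {
    out=$1
    shift
    : > "$out"    # start from an empty file so a rerun does not append twice
    for pair in "$@"; do
        sample=${pair%%=*}
        file=${pair#*=}
        case $file in
            *.gz) gzip -dc "$file" ;;
            *)    cat "$file" ;;
        esac | sed "s/^>/>${sample}_/" >> "$out"
    done
    # An ID is the header text up to the first whitespace; list any repeats.
    dups=$(grep '^>' "$out" | awk '{ print $1 }' | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate contig IDs:" >&2
        echo "$dups" >&2
        return 1
    fi
}

# Usage:
#   merge_contigs combined.fasta sample1=sample1_contigs.fasta.gz sample2=sample2_contigs.fasta.gz
```

The function returns non-zero when two records end up with the same ID, which usually means the same sample name was used twice.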
results/
├── genomad/
│ ├── filtered_genomad.tsv filtered geNomad hits
│ └── output/ full geNomad output
├── checkv/
│ ├── potential_phage.tsv final candidate table
│ └── output/ full CheckV output
├── candidates/
│ └── candidate_phages.fna candidate phage sequences
├── pharokka/
│ └── output/ per-contig annotation files
├── phabox/ (if --phabox2_env provided)
│ ├── end_to_end/ taxonomy + lifestyle + host predictions
│ ├── votu/ AAI-based vOTU clusters
│ └── tree/ phylogenetic tree (terl + portal markers)
└── pipeline_info/
└── versions.yml tool versions used in this run
pipeline_info/versions.yml records the version of every tool the run
invoked — pair it with the database DB_MANIFEST.tsv to fully describe a run
when reporting results.
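
A quick sanity check on a finished run is to compare the candidate table with the candidate FASTA. The sketch below uses the paths from the output tree above and assumes, without confirmation from phinder itself, that potential_phage.tsv has a single header row and one row per candidate, and that candidate_phages.fna has one record per candidate. `summarize_run` is a hypothetical helper.

```shell
# summarize_run OUTDIR: count table rows and FASTA records, and compare them.
summarize_run() {
    outdir=$1
    table="$outdir/checkv/potential_phage.tsv"
    fasta="$outdir/candidates/candidate_phages.fna"
    # Non-empty lines after the header row.
    n_table=$(awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$table")
    # FASTA records are counted by their ">" header lines.
    n_fasta=$(awk '/^>/ { n++ } END { print n + 0 }' "$fasta")
    echo "table_rows=$n_table fasta_records=$n_fasta"
    [ "$n_table" -eq "$n_fasta" ]
}

# Usage:
#   summarize_run results || echo "counts differ; check the run log"
```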
| Parameter | Default | Description |
|---|---|---|
| --input | required | Combined contig FASTA (.fa or .fa.gz) |
| --outdir | results | Output directory |
| --cpu_fraction | 0.5 | Fraction of detected CPU cores to use (1.0 = all) |
| --mem_fraction | 0.5 | Fraction of detected RAM to use (1.0 = all) |
| --max_cpus | unset | Exact core budget; overrides --cpu_fraction |
| --max_memory | unset | Exact memory budget, e.g. '64.GB'; overrides --mem_fraction |
| --avail_cpus | auto-detected | Override detected core count (if auto-detection is wrong) |
| --avail_mem | auto-detected | Override detected RAM in bytes (if auto-detection is wrong) |
| --genomad_db | required | Path to geNomad database |
| --checkv_db | required | Path to CheckV database |
| --pharokka_db | required | Path to Pharokka database |
| --phabox_db | required if --phabox2_env set | Path to PhaBOX database |
| --genomad_env | builds from envs/genomad.yml | Path to existing geNomad conda env |
| --checkv_env | builds from envs/checkv.yml | Path to existing CheckV conda env |
| --pharokka_env | builds from envs/pharokka.yml | Path to existing Pharokka conda env |
| --phabox2_env | unset (PhaBOX skipped) | Path to existing phabox2 conda env |
| --min_provirus_score | 0.9 | geNomad Provirus minimum virus_score |
| --checkv_quality_keep | High-quality,Complete | Comma-separated CheckV quality tiers to keep |
| --min_coverage | 3 | Coverage threshold for low-coverage flagging |
| --genomad_splits | 20 | geNomad MMseqs2 database splits (lower = faster, higher RAM) |
| --pharokka_gene_predictor | prodigal-gv | Gene predictor for Pharokka |
| --phabox_skip_phamer | true | Skip Phamer (sequences already confirmed viral) |
| --phabox_votu_mode | AAI | vOTU clustering mode (AAI or ANI) |
| --phabox_tree_markers | terl,portal | Marker genes for phylogenetic tree |
geNomad filter:
- DTR topology → always kept (hallmark of complete linear phage genome)
- Provirus topology → kept if virus_score >= --min_provirus_score
- All other topologies → dropped
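
The geNomad rules above can be illustrated on a toy table. phinder applies them in an R step; the awk version below is only a sketch of the same decision logic, and the three-column layout (seq_name, topology, virus_score) is a simplified stand-in for the real geNomad summary.

```shell
# Toy input: seq_name, topology, virus_score (tab-separated, one header row).
tsv=$(mktemp)
{
    printf 'seq_name\ttopology\tvirus_score\n'
    printf 'c1\tDTR\t0.70\n'
    printf 'c2\tProvirus\t0.95\n'
    printf 'c3\tProvirus\t0.80\n'
    printf 'c4\tNo terminal repeats\t0.99\n'
} > "$tsv"

min_provirus_score=0.9

kept=$(awk -F'\t' -v min="$min_provirus_score" '
    NR == 1 { next }                                        # skip the header
    $2 == "DTR" { print $1; next }                          # DTR: always kept
    $2 == "Provirus" && $3 + 0 >= min + 0 { print $1 }      # Provirus: by score
' "$tsv" | paste -sd, -)

echo "kept: $kept"
rm -f "$tsv"
```

Here c1 is kept despite its low score because it is a DTR contig, c3 falls below the provirus threshold, and c4 is dropped because its topology is neither DTR nor Provirus.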
CheckV filter:
- Kept if checkv_quality is in --checkv_quality_keep
- DTR contigs bypass CheckV quality thresholds (CheckV under-scores DTRs due to absent host flanking regions)
- Low coverage is flagged (low_coverage = TRUE) but not dropped
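
The CheckV rules can be sketched the same way. Again, this is an awk illustration of the logic rather than phinder's R code. The column layout (contig_id, topology, checkv_quality, coverage) is hypothetical, and treating "low coverage" as strictly below --min_coverage is an assumption.

```shell
# Toy input: contig_id, topology, checkv_quality, coverage (tab-separated).
tsv=$(mktemp)
{
    printf 'contig_id\ttopology\tcheckv_quality\tcoverage\n'
    printf 'c1\tDTR\tLow-quality\t10\n'
    printf 'c2\tProvirus\tHigh-quality\t2\n'
    printf 'c3\tProvirus\tMedium-quality\t8\n'
    printf 'c4\tProvirus\tComplete\t5\n'
} > "$tsv"

checkv_quality_keep="High-quality,Complete"
min_coverage=3

result=$(awk -F'\t' -v keep="$checkv_quality_keep" -v min="$min_coverage" '
    BEGIN {
        OFS = "\t"
        n = split(keep, tiers, ",")
        for (i = 1; i <= n; i++) ok[tiers[i]] = 1
    }
    NR == 1 { print $0, "low_coverage"; next }
    # Keep listed quality tiers; DTR contigs bypass the quality check.
    ($3 in ok) || $2 == "DTR" {
        # Assumption: flag rows whose coverage is strictly below the threshold.
        print $0, (($4 + 0 < min + 0) ? "TRUE" : "FALSE")
    }
' "$tsv")

echo "$result"
rm -f "$tsv"
```

In this toy table c3 is the only row dropped, and c2 is kept but flagged, since low coverage marks a candidate without removing it.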
Pick how tools are provisioned with -profile. Profiles are composable
(comma-separated):
| Profile | Tools provided by | Use when |
|---|---|---|
| conda | conda envs built from envs/*.yml | local conda/mamba install |
| mamba | same, resolved with mamba | faster conda solves |
| docker | pinned biocontainer images | laptop / workstation |
| singularity | same images, via Singularity | HPC without root |
| apptainer | same images, via Apptainer | HPC without root |
| slurm | (executor only) | submit processes as SLURM jobs |
Containers pull pinned, frozen tool images — no conda solve, identical on every
machine. The reference databases are still downloaded separately (via
setup.sh) regardless of profile, and passed with the --*_db flags; Nextflow
mounts them into the container automatically.
# Laptop, with Docker
nextflow run andrewbudge/phinder --input contigs.fasta \
--genomad_db ... --checkv_db ... --pharokka_db ... \
-profile docker
# HPC, Singularity images submitted as SLURM jobs
nextflow run andrewbudge/phinder ... -profile singularity,slurm
PhaBOX (optional) currently has no container image and runs via conda only. Combine
-profile docker with --phabox2_env <env> if you need it, or omit PhaBOX under the container profiles.
By default phinder uses half your machine — half the detected CPU cores and
half the RAM. This is deliberately conservative: it runs out of the box on a
laptop or a small VM without ever failing with "process requirement exceeds
available CPUs/memory", and it's polite on a shared login node. Work is split
across three tiers (process_low/medium/high); the heavy steps (geNomad,
CheckV) get the full budget, lighter steps get a share.
To go faster, give it more — two ways:
# By fraction — run on a bigger machine and use more of it
nextflow run andrewbudge/phinder ... --cpu_fraction 1.0 --mem_fraction 1.0 # the whole machine
nextflow run andrewbudge/phinder ... --cpu_fraction 0.75 --mem_fraction 0.75 # leave some headroom
# By exact amount — ultimate control (overrides the fractions)
nextflow run andrewbudge/phinder ... --max_cpus 32 --max_memory '128.GB'
--cpu_fraction/--max_cpus and --mem_fraction/--max_memory are
independent: for example, cap memory at an exact --max_memory '64.GB' while letting
cores stay at the default fraction. The chosen budget is what the heavy steps
(geNomad, CheckV) get; lighter steps take a share.
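
The precedence between the fraction and the exact budget can be written out as a small calculation. This is an illustration of the behaviour described above, not phinder's actual code: `cpu_budget` is a hypothetical helper, and rounding down with a floor of one core is an assumption.

```shell
# cpu_budget AVAIL FRACTION [MAX]
# AVAIL: detected cores, FRACTION: --cpu_fraction, MAX: --max_cpus (optional).
cpu_budget() {
    avail=$1
    fraction=$2
    max=${3:-}
    if [ -n "$max" ]; then
        echo "$max"    # an exact budget overrides the fraction
    else
        # Assumption: round down, but never go below one core.
        awk -v a="$avail" -v f="$fraction" 'BEGIN { n = int(a * f); if (n < 1) n = 1; print n }'
    fi
}

cpu_budget 16 0.5       # default fraction on a 16-core machine
cpu_budget 16 0.75      # --cpu_fraction 0.75
cpu_budget 64 0.5 32    # --max_cpus 32 on a 64-core machine
```

The same shape applies to memory with --mem_fraction and --max_memory, and the two budgets are resolved separately.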
If auto-detection guesses wrong (e.g. inside a container with cgroup limits, or
on a scheduler that hides the true node size), set the machine size explicitly
with --avail_cpus N and --avail_mem <bytes>. For per-step control, drop in a
-c custom.config overriding the process_low/medium/high labels.
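
Because --avail_mem takes a byte count, it is easy to be off by a factor of a thousand. The sketch below converts a size in GiB to bytes using 1024-based units, which is how Nextflow counts its own memory units. Whether phinder expects exactly this convention is an assumption, and `gib_to_bytes` is a hypothetical helper.

```shell
# gib_to_bytes N: N GiB expressed in bytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

avail_cpus=32
avail_mem=$(gib_to_bytes 64)

# Print the command instead of running it.
echo "nextflow run andrewbudge/phinder ... --avail_cpus $avail_cpus --avail_mem $avail_mem"
```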
geNomad memory use depends on --genomad_splits (default 20): more splits lower peak memory but slow the search down.
If you hit a memory wall, raise it; if you have RAM to spare and want speed, lower it.
Add -profile slurm to submit processes as SLURM jobs (compose with a tool
profile, e.g. conda or singularity):
nextflow run andrewbudge/phinder ... -profile singularity,slurm
On a scheduler each process is sized by its process_low/medium/high label.
Tune them for your partition with a -c custom.config, for example:
process {
withLabel: process_high { cpus = 32; memory = 128.GB; time = '24h' }
}
bash setup.sh [--db-dir DIR] [--skip-envs] [--skip-dbs] [--with-phabox2]
--db-dir DIR Database directory (default: ~/.phinder_dbs)
--skip-envs Skip conda environment creation
--skip-dbs Skip database downloads
--with-phabox2 Also install the phabox2 conda environment
phinder does not pin database versions — it orchestrates the underlying tools
and lets each one fetch its current database. That keeps you on the versions the
tool authors recommend, which is what most analyses want. If you need a specific
version instead, download it yourself and point the matching --*_db flag at it.
So that a run can still be described after the fact, setup.sh writes a
DB_MANIFEST.tsv into the database directory recording each database's version,
the tool version that fetched it, its source, and the date:
database db_version tool_version source recorded_utc
genomad 1.9 1.11.2 genomad download-database 2026-06-07T22:23:38Z
checkv 1.5 1.0.3 checkv download_database 2026-06-07T22:23:38Z
pharokka 1.8.0 1.8.2 install_databases.py 2026-06-07T22:23:38Z
phabox 2.2 2.2 github.com/.../phabox_db_v2_2.zip 2026-06-07T22:23:38Z
Re-running setup.sh refreshes the entries for whatever databases are present,
so the manifest always reflects what is on disk. Include it when reporting
results or filing issues.
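
For a methods section or an issue report, the manifest can be condensed to one line. The sketch below assumes the file is tab-separated with the header row shown above; `manifest_versions` is a hypothetical helper, not something setup.sh installs.

```shell
# manifest_versions FILE: print "database=db_version" pairs on a single line.
manifest_versions() {
    awk -F'\t' 'NR > 1 && NF { print $1 "=" $2 }' "$1" | paste -sd' ' -
}

# Usage:
#   manifest_versions ~/.phinder_dbs/DB_MANIFEST.tsv
```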
If you use phinder in your work, please cite the underlying tools:
- geNomad: Camargo, A.P., Roux, S., Schulz, F. et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol 42, 1303–1312 (2024). https://doi.org/10.1038/s41587-023-01953-y
- CheckV: Nayfach, S., Camargo, A.P., Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021). https://doi.org/10.1038/s41587-020-00774-7
- Pharokka: Bouras, G., Nepal, R., Houtak, G., Psaltis, A.J., Wormald, P.-J., Vreugde, S. Pharokka: a fast scalable bacteriophage annotation tool. Bioinformatics 39(1), btac776 (2023). https://doi.org/10.1093/bioinformatics/btac776
- PhaBOX: Shang, J., Peng, C., Guan, J., Cai, D., Wang, D., Sun, Y. PhaBOX2: an enhanced web server for discovering and analyzing viral contigs in metagenomic data. Nucleic Acids Research, gkag382 (2026). https://doi.org/10.1093/nar/gkag382