phinder

A Nextflow pipeline for phage discovery from metagenomic assemblies.

Takes a combined contig FASTA and runs viral identification, quality assessment, annotation, and classification end-to-end.

contigs.fasta
    │
    ├─ geNomad          viral classification + provirus detection
    ├─ R filter         DTR topology always kept; Provirus filtered by score
    ├─ CheckV           completeness + quality estimation
    ├─ R filter         quality-based filtering with DTR bypass
    ├─ Pharokka         phage genome annotation (MMseqs2 + PyHMMER)
    │
    └─ PhaBOX*          taxonomy, lifestyle, host, vOTU clustering, phylogenetic tree
                        (*optional — requires separate installation)

Requirements


Quick start

phinder needs two things: the reference databases (downloaded once) and a way to provision the tools (a -profile — containers or conda).

1. Download the databases

curl -O https://raw.githubusercontent.com/andrewbudge/phinder/main/setup.sh
bash setup.sh --skip-envs        # databases only → ~/.phinder_dbs

Drop --skip-envs to also build pinned conda environments (genomad_phinder, checkv_phinder, pharokka_phinder) — only needed for the conda profile below. Add --with-phabox2 to set up the optional PhaBOX step.

2. Run

With Docker — pinned images, nothing to build (recommended):

nextflow run andrewbudge/phinder \
    --input contigs.fasta \
    --genomad_db  ~/.phinder_dbs/genomad_db \
    --checkv_db   ~/.phinder_dbs/checkv_db \
    --pharokka_db ~/.phinder_dbs/pharokka_db \
    -profile docker

With conda — uses the envs built by setup.sh (run it without --skip-envs):

nextflow run andrewbudge/phinder \
    --input contigs.fasta \
    --genomad_db  ~/.phinder_dbs/genomad_db \
    --checkv_db   ~/.phinder_dbs/checkv_db \
    --pharokka_db ~/.phinder_dbs/pharokka_db \
    --genomad_env  $(conda info --base)/envs/genomad_phinder \
    --checkv_env   $(conda info --base)/envs/checkv_phinder \
    --pharokka_env $(conda info --base)/envs/pharokka_phinder \
    -profile conda

Use -resume on reruns to skip completed steps. For HPC / Singularity / Apptainer, see Execution profiles.

Reproducible runs: pin a released version with -r, e.g. nextflow run andrewbudge/phinder -r v0.2.0 .... Without -r, Nextflow runs the default branch (main), which changes over time. Pinning -r and recording the database versions from DB_MANIFEST.tsv, together with the tool versions in pipeline_info/versions.yml, gives you what you need to repeat a run.


Verify your install

Before running on real data, confirm Nextflow and your tool profile are wired up correctly — no databases required. This runs the whole pipeline on a tiny bundled dataset in stub mode, where every step emits placeholder outputs in seconds:

nextflow run andrewbudge/phinder -profile test -stub-run

A [SUCCESS] line with all 8 processes completed means your setup is good. To also check that your container engine can pull the images (or that conda can build the envs), add the matching tool profile after test:

nextflow run andrewbudge/phinder -profile test,docker -stub-run   # or test,conda

This is the same check phinder's CI runs on every change.


Input

A single combined contig FASTA — uncompressed or gzipped. If working with multiple samples, merge and rename headers before passing to phinder to avoid ID collisions:

# Example: prefix each sample's contigs with its ID
# (^> anchors the match so only the leading '>' of a header line is rewritten)
zcat sample1_contigs.fasta.gz | sed 's/^>/>sample1_/' >> combined.fasta
zcat sample2_contigs.fasta.gz | sed 's/^>/>sample2_/' >> combined.fasta

Coverage filtering, circular contig extraction, and other pre-processing are left to the user. phinder is assembler-agnostic and does not require any particular header format. If your assembler embeds coverage=N in headers (e.g. metaMDBG), phinder parses that value and flags low-coverage candidates (see --min_coverage).
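
After merging samples, it can be worth confirming that the prefixes really did remove every ID collision. The helper below is a minimal sketch, not part of phinder: the name dup_ids is hypothetical, it treats the first whitespace-delimited token after '>' as the contig ID, and it expects an uncompressed FASTA (decompress a gzipped file first).

```shell
# Print every contig ID that occurs more than once in a FASTA file.
# The ID is taken to be the first whitespace-delimited token after '>'.
# Hypothetical helper for illustration; phinder does not ship it.
dup_ids() {
    grep '^>' "$1" | awk '{ sub(/^>/, "", $1); print $1 }' | sort | uniq -d
}
```

Run it as dup_ids combined.fasta; empty output means every header is unique.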


Output

results/
├── genomad/
│   ├── filtered_genomad.tsv       filtered geNomad hits
│   └── output/                    full geNomad output
├── checkv/
│   ├── potential_phage.tsv        final candidate table
│   └── output/                    full CheckV output
├── candidates/
│   └── candidate_phages.fna       candidate phage sequences
├── pharokka/
│   └── output/                    per-contig annotation files
├── phabox/                        (if --phabox2_env provided)
│   ├── end_to_end/                 taxonomy + lifestyle + host predictions
│   ├── votu/                       AAI-based vOTU clusters
│   └── tree/                       phylogenetic tree (terl + portal markers)
└── pipeline_info/
    └── versions.yml               tool versions used in this run

pipeline_info/versions.yml records the version of every tool the run invoked — pair it with the database DB_MANIFEST.tsv to fully describe a run when reporting results.


All parameters

| Parameter | Default | Description |
|---|---|---|
| --input | required | Combined contig FASTA (.fa or .fa.gz) |
| --outdir | results | Output directory |
| --cpu_fraction | 0.5 | Fraction of detected CPU cores to use (1.0 = all) |
| --mem_fraction | 0.5 | Fraction of detected RAM to use (1.0 = all) |
| --max_cpus | unset | Exact core budget; overrides --cpu_fraction |
| --max_memory | unset | Exact memory budget, e.g. '64.GB'; overrides --mem_fraction |
| --avail_cpus | auto-detected | Override detected core count (if auto-detection is wrong) |
| --avail_mem | auto-detected | Override detected RAM in bytes (if auto-detection is wrong) |
| --genomad_db | required | Path to geNomad database |
| --checkv_db | required | Path to CheckV database |
| --pharokka_db | required | Path to Pharokka database |
| --phabox_db | required if --phabox2_env set | Path to PhaBOX database |
| --genomad_env | builds from envs/genomad.yml | Path to existing geNomad conda env |
| --checkv_env | builds from envs/checkv.yml | Path to existing CheckV conda env |
| --pharokka_env | builds from envs/pharokka.yml | Path to existing Pharokka conda env |
| --phabox2_env | unset (PhaBOX skipped) | Path to existing phabox2 conda env |
| --min_provirus_score | 0.9 | geNomad Provirus minimum virus_score |
| --checkv_quality_keep | High-quality,Complete | Comma-separated CheckV quality tiers to keep |
| --min_coverage | 3 | Coverage threshold for low-coverage flagging |
| --genomad_splits | 20 | geNomad MMseqs2 database splits (lower = faster, higher RAM) |
| --pharokka_gene_predictor | prodigal-gv | Gene predictor for Pharokka |
| --phabox_skip_phamer | true | Skip Phamer (sequences already confirmed viral) |
| --phabox_votu_mode | AAI | vOTU clustering mode (AAI or ANI) |
| --phabox_tree_markers | terl,portal | Marker genes for phylogenetic tree |

Filtering logic

geNomad filter:

  • DTR topology → always kept (a hallmark of a complete linear phage genome)
  • Provirus topology → kept if virus_score >= --min_provirus_score
  • All other topologies → dropped

CheckV filter:

  • Kept if checkv_quality in --checkv_quality_keep
  • DTR contigs bypass CheckV quality thresholds (CheckV under-scores DTRs due to absent host flanking regions)
  • Low coverage is flagged (low_coverage = TRUE) but not dropped
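
The two filters above can be restated as a short awk sketch. This is not phinder's R code, only an illustration of the rules as written. The function name phinder_filter_sketch, the five-column input layout, and the assumption that coverage is always numeric are all hypothetical; real geNomad and CheckV tables have different columns.

```shell
# Illustrative restatement of the filtering rules; NOT phinder's R code.
# $1 = hypothetical tab-delimited table with a header row and the columns:
#        contig  topology  virus_score  checkv_quality  coverage
# $2 = minimum Provirus virus_score   (default 0.9)
# $3 = comma-separated tiers to keep  (default High-quality,Complete)
# $4 = coverage threshold             (default 3)
phinder_filter_sketch() {
    awk -F'\t' -v min_score="${2:-0.9}" \
        -v keep="${3:-High-quality,Complete}" -v min_cov="${4:-3}" '
    BEGIN {
        OFS = "\t"
        n = split(keep, tiers, ",")
        for (i = 1; i <= n; i++) keep_tier[tiers[i]] = 1
    }
    NR == 1 { print $0, "low_coverage"; next }
    {
        is_dtr = ($2 == "DTR")
        # geNomad stage: DTR always passes, Provirus passes on score, the rest drop.
        if (!is_dtr && !($2 == "Provirus" && $3 + 0 >= min_score + 0)) next
        # CheckV stage: DTR contigs bypass the quality tiers.
        if (!is_dtr && !($4 in keep_tier)) next
        # Low coverage is flagged, never dropped.
        print $0, (($5 + 0 < min_cov + 0) ? "TRUE" : "FALSE")
    }' "$1"
}
```

Calling phinder_filter_sketch table.tsv applies the defaults. A DTR contig is printed whatever its score or quality tier, and a kept contig with coverage below the threshold is printed with low_coverage set to TRUE.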

Execution profiles

Pick how tools are provisioned with -profile. Profiles are composable (comma-separated):

| Profile | Tools provided by | Use when |
|---|---|---|
| conda | conda envs built from envs/*.yml | local conda/mamba install |
| mamba | same, resolved with mamba | faster conda solves |
| docker | pinned biocontainer images | laptop / workstation |
| singularity | same images, via Singularity | HPC without root |
| apptainer | same images, via Apptainer | HPC without root |
| slurm | (executor only) | submit processes as SLURM jobs |

Containers pull pinned, frozen tool images — no conda solve, identical on every machine. The reference databases are still downloaded separately (via setup.sh) regardless of profile, and passed with the --*_db flags; Nextflow mounts them into the container automatically.

# Laptop, with Docker
nextflow run andrewbudge/phinder --input contigs.fasta \
    --genomad_db ... --checkv_db ... --pharokka_db ... \
    -profile docker

# HPC, Singularity images submitted as SLURM jobs
nextflow run andrewbudge/phinder ... -profile singularity,slurm

PhaBOX (optional) currently has no container image and runs via conda only. Combine -profile docker with --phabox2_env <env> if you need it, or omit PhaBOX under the container profiles.


Performance & resources

By default phinder uses half your machine — half the detected CPU cores and half the RAM. This is deliberately conservative: it runs out of the box on a laptop or a small VM without ever failing with "process requirement exceeds available CPUs/memory", and it's polite on a shared login node. Work is split across three tiers (process_low/medium/high); the heavy steps (geNomad, CheckV) get the full budget, lighter steps get a share.

To go faster, give it more — two ways:

# By fraction — run on a bigger machine and use more of it
nextflow run andrewbudge/phinder ... --cpu_fraction 1.0 --mem_fraction 1.0   # the whole machine
nextflow run andrewbudge/phinder ... --cpu_fraction 0.75 --mem_fraction 0.75 # leave some headroom

# By exact amount: explicit budgets (override the fractions)
nextflow run andrewbudge/phinder ... --max_cpus 32 --max_memory '128.GB'

--cpu_fraction/--max_cpus and --mem_fraction/--max_memory are independent: for example, you can cap memory at an exact --max_memory '64.GB' while leaving cores at the default fraction.

If auto-detection guesses wrong (e.g. inside a container with cgroup limits, or on a scheduler that hides the true node size), set the machine size explicitly with --avail_cpus N and --avail_mem <bytes>. For per-step control, drop in a -c custom.config overriding the process_low/medium/high labels.
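
The budget rules described above come down to simple arithmetic: an exact value wins if set, otherwise the budget is the detected amount times the fraction. The sketch below is a hypothetical helper, not phinder's own code. In particular, rounding down and never going below 1 are assumptions, since the rounding behaviour is not documented here.

```shell
# Hypothetical helper: resolve a resource budget as the text describes.
#   $1 = detected amount (cores, or bytes of RAM)
#   $2 = fraction to use (e.g. 0.5)
#   $3 = optional exact budget; when set it overrides the fraction
# Rounding down and the floor of 1 are assumptions, not documented behaviour.
resolve_budget() {
    if [ -n "${3:-}" ]; then
        echo "$3"
        return
    fi
    awk -v total="$1" -v frac="$2" 'BEGIN {
        b = int(total * frac)
        if (b < 1) b = 1
        printf "%.0f\n", b
    }'
}
```

On Linux, resolve_budget "$(nproc)" 0.5 gives a rough idea of the default core budget for the heavy steps, and resolve_budget "$(nproc)" 0.5 32 mirrors the effect of passing --max_cpus 32.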

geNomad's peak memory falls as --genomad_splits (default 20) rises, at the cost of speed. If you hit a memory wall, raise it; if you have RAM to spare and want speed, lower it.


Running on HPC

Add -profile slurm to submit processes as SLURM jobs (compose with a tool profile, e.g. conda or singularity):

nextflow run andrewbudge/phinder ... -profile singularity,slurm

On a scheduler each process is sized by its process_low/medium/high label. Tune them for your partition with a -c custom.config — for example:

process {
    withLabel: process_high { cpus = 32; memory = 128.GB; time = '24h' }
}

Setup options

bash setup.sh [--db-dir DIR] [--skip-envs] [--skip-dbs] [--with-phabox2]

  --db-dir DIR      Database directory (default: ~/.phinder_dbs)
  --skip-envs       Skip conda environment creation
  --skip-dbs        Skip database downloads
  --with-phabox2    Also install the phabox2 conda environment

Database provenance

phinder does not pin database versions — it orchestrates the underlying tools and lets each one fetch its current database. That keeps you on the versions the tool authors recommend, which is what most analyses want. If you need a specific version instead, download it yourself and point the matching --*_db flag at it.

So that a run can still be described after the fact, setup.sh writes a DB_MANIFEST.tsv into the database directory recording each database's version, the tool version that fetched it, its source, and the date:

database  db_version  tool_version  source                             recorded_utc
genomad   1.9         1.11.2        genomad download-database          2026-06-07T22:23:38Z
checkv    1.5         1.0.3         checkv download_database           2026-06-07T22:23:38Z
pharokka  1.8.0       1.8.2         install_databases.py               2026-06-07T22:23:38Z
phabox    2.2         2.2           github.com/.../phabox_db_v2_2.zip  2026-06-07T22:23:38Z

Re-running setup.sh refreshes the entries for whatever databases are present, so the manifest always reflects what is on disk. Include it when reporting results or filing issues.
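
When writing up a run, the manifest can be queried for a single database. The following is a minimal sketch that assumes DB_MANIFEST.tsv is tab-delimited with the header row shown above; the function name manifest_versions is hypothetical and not something setup.sh provides.

```shell
# Print "db_version<TAB>tool_version" for one database in the manifest.
# Assumes a tab-delimited file whose first three columns are
# database, db_version, tool_version (as in the example above).
# Returns non-zero if the database is not listed.
manifest_versions() {
    awk -F'\t' -v db="$2" '
        NR > 1 && $1 == db { print $2 "\t" $3; found = 1 }
        END { exit(found ? 0 : 1) }' "$1"
}
```

With the default database directory, manifest_versions ~/.phinder_dbs/DB_MANIFEST.tsv checkv prints the CheckV database and tool versions recorded at setup time.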


Citation

If you use phinder in your work, please cite the underlying tools:

  • geNomad: Camargo, A.P., Roux, S., Schulz, F. et al. Identification of mobile genetic elements with geNomad. Nat Biotechnol 42, 1303–1312 (2024). https://doi.org/10.1038/s41587-023-01953-y
  • CheckV: Nayfach, S., Camargo, A.P., Schulz, F. et al. CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nat Biotechnol 39, 578–585 (2021). https://doi.org/10.1038/s41587-020-00774-7
  • Pharokka: Bouras, G., Nepal, R., Houtak, G., Psaltis, A.J., Wormald, P.-J., Vreugde, S. Pharokka: a fast scalable bacteriophage annotation tool. Bioinformatics 39(1), btac776 (2023). https://doi.org/10.1093/bioinformatics/btac776
  • PhaBOX: Shang, J., Peng, C., Guan, J., Cai, D., Wang, D., Sun, Y. PhaBOX2: an enhanced web server for discovering and analyzing viral contigs in metagenomic data. Nucleic Acids Res gkag382 (2026). https://doi.org/10.1093/nar/gkag382
