GenoPHI

(jee-no-fee)

Genotype-to-Phenotype Phage-Host Interaction Prediction

GenoPHI is a Python package for machine learning-based prediction of genotype-phenotype relationships using whole-genome sequence data. Originally designed for phage-host interaction prediction, GenoPHI supports both binary interaction prediction and regression tasks for any microbial phenotype. The package implements protein family-based and k-mer-based approaches to extract genomic features from amino acid sequences and predict phenotypes using CatBoost gradient boosting models.

Workflow Overview

Figure 1: GenoPHI workflow schematic showing the main analysis pipelines: Protein family-based workflow, K-mer-based workflow, and Predictive protein k-mer workflow. Each pathway includes feature extraction, selection, model training, and prediction steps.

Features

Protein Family-Based Analysis

MMseqs2 Clustering: Cluster protein sequences into protein families based on sequence similarity
Feature Table Generation: Create presence-absence matrices of protein families across genomes and consolidate into predictive features based on co-occurrence across genomes
Feature Selection: Identify predictive protein families (multiple available methods: RFE, SHAP, SHAP-RFE, ANOVA, Chi-squared, Lasso)
Model Training: Train CatBoost models with hyperparameter optimization
Phenotype Prediction: Predict interactions, resistance, or other phenotypes for new genomes
Feature Annotation: Identify predictive protein sequences from predictive features
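
The presence-absence step described above can be sketched in a few lines of Python. This is a minimal illustration, not GenoPHI's implementation: it assumes cluster assignments are available as (representative, member) pairs, and the function name `build_presence_absence` and the `protein_to_genome` mapping are hypothetical.

```python
from collections import defaultdict

def build_presence_absence(cluster_pairs, protein_to_genome):
    """Return (genomes, families, matrix) where matrix[i][j] is 1 if
    genome i contains at least one protein from family j.

    cluster_pairs: iterable of (representative_id, member_id) tuples.
    protein_to_genome: dict mapping protein ID to genome name (assumed input).
    """
    families_by_genome = defaultdict(set)
    for representative, member in cluster_pairs:
        # The cluster representative serves as the family identifier.
        families_by_genome[protein_to_genome[member]].add(representative)

    genomes = sorted(families_by_genome)
    families = sorted({f for fams in families_by_genome.values() for f in fams})
    matrix = [
        [1 if fam in families_by_genome[genome] else 0 for fam in families]
        for genome in genomes
    ]
    return genomes, families, matrix

# Toy input: three families spread over three genomes.
pairs = [("p1", "p1"), ("p1", "p4"), ("p2", "p2"), ("p3", "p3"), ("p3", "p5")]
protein_to_genome = {
    "p1": "Strain_001", "p2": "Strain_001",
    "p3": "Strain_002", "p4": "Strain_002",
    "p5": "Strain_003",
}
genomes, families, matrix = build_presence_absence(pairs, protein_to_genome)
for genome, row in zip(genomes, matrix):
    print(genome, row)
```

Families with identical rows across all genomes carry the same information, which is the motivation for consolidating co-occurring families into a single feature.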

K-mer-Based Analysis

K-mer Feature Extraction: Generate k-mer features from protein sequences with or without gene family context
Predictive K-mer Workflow: Extract k-mers specifically from predictive protein families identified in protein family analysis
Feature Selection & Modeling: Apply same robust feature selection and modeling pipelines
Flexible K-mer Lengths: Support for single k-mer length or ranges (e.g., 3-6)
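
As an illustration of what k-mer features over a range of lengths look like, the sketch below counts overlapping amino acid k-mers in a single protein. The function name `extract_kmers` is hypothetical, and GenoPHI's own extraction (including gene family context) may differ.

```python
from collections import Counter

def extract_kmers(sequence, k_min, k_max):
    """Count all overlapping k-mers of lengths k_min..k_max (inclusive)."""
    counts = Counter()
    for k in range(k_min, k_max + 1):
        # A sequence shorter than k yields an empty range and contributes nothing.
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts

kmers = extract_kmers("MKTAYIAK", 3, 4)
print(sorted(kmers))
```

An 8-residue sequence yields six 3-mers and five 4-mers, so feature counts grow quickly with sequence length and k range.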

Application Modes

Phage-Host Interaction Prediction: Binary prediction of infection outcomes between phages and bacterial strains
Single-Strain Phenotype Prediction: Predict strain-level phenotypes (e.g., antibiotic resistance, growth rate) without requiring phage data
Regression Tasks: Predict continuous phenotypes (e.g., infection efficiency, metabolic rates)
General Feature-Based Modeling: Use any feature table with a phenotype column for custom applications

Advanced Capabilities

Dynamic Feature Weighting: Account for feature frequency distributions to handle imbalanced features
Clustering-Based Selection: Use HDBSCAN or hierarchical clustering for intelligent feature grouping
Multiple Feature Selection Methods: RFE, SHAP-RFE, SelectKBest, Chi-squared, Lasso, SHAP
Comprehensive Performance Metrics: AUC-ROC, Precision-Recall, MCC, F1-score, Accuracy
SHAP Interpretability: Feature importance analysis and visualization for model explainability
Bootstrapping Support: Robust model evaluation with multiple train-test splits

Installation

System Requirements

Minimum Requirements:

Python 3.8 or higher
8 GB RAM
4 CPU cores
10 GB free disk space

Recommended for Large Datasets:

Python 3.10+
32+ GB RAM
8+ CPU cores
50+ GB free disk space (depending on dataset size)

Tested Operating Systems:

Linux (Ubuntu 20.04+, CentOS 7+)
macOS (Sonoma 14+, Apple Silicon)

Virtual Environment (Recommended)

Create and activate a conda environment:

conda create -n genophi python=3.10
conda activate genophi

Install GenoPHI

From PyPI (Recommended):

pip install genophi

From GitHub (Development):

git clone https://github.com/Noonanav/GenoPHI.git
cd GenoPHI
pip install -e .

For development with optional dependencies:

pip install -e ".[dev]"

Install MMseqs2

External Dependency: GenoPHI requires MMseqs2 for protein sequence clustering and assignment.

Install via conda/mamba:

conda install -c bioconda mmseqs2
# or
mamba install -c bioconda mmseqs2

For other installation methods, see the MMseqs2 wiki.

Verify Installation

Test that GenoPHI is properly installed:

# Check GenoPHI version
genophi --version

# Verify MMseqs2 is accessible
mmseqs version

# Run basic help command
genophi --help
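
The same check can be scripted, for example at the top of a pipeline. This sketch only uses the standard library; the helper name `find_executables` is hypothetical.

```python
import shutil

def find_executables(names=("genophi", "mmseqs")):
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in find_executables().items():
    print(f"{name}: {path if path else 'NOT FOUND on PATH'}")
```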

Typical Install Time

Full installation (conda environment + GenoPHI + MMseqs2) takes approximately 2-3 minutes on a standard desktop computer (tested on a MacBook Pro M2, 16 GB RAM, macOS Sonoma 14.3).

Demo

A small test dataset is included in the repository for demonstrating the software. To run the demo:

git clone https://github.com/Noonanav/GenoPHI.git
cd GenoPHI

genophi protein-family-workflow \
    --input_path_strain data/test_data/strain_AAs/ \
    --input_path_phage data/test_data/phage_AAs/ \
    --phenotype_matrix data/test_data/ecoli_test_interaction_matrix.csv \
    --output_dir demo_output/ \
    --threads 4 \
    --num_features 50 \
    --num_runs_fs 5 \
    --num_runs_modeling 10 \
    --method rfe \
    --filter_type strain

Test dataset: 25 E. coli strains and 25 phages with a binary interaction matrix.

Expected output: A demo_output/ directory containing MMseqs2 clustering results, feature selection outputs, trained models, performance metrics, and a workflow summary report. See Output Directory Structure for details.

Expected run time: ~25 minutes on a standard desktop computer (MacBook Pro M2, 16 GB RAM).

Quick Start

GenoPHI provides a unified command-line interface accessible through the genophi command:

# View available commands
genophi --help

# Get help for a specific command
genophi protein-family-workflow --help

Recommended Default Run

For most phage-host interaction prediction tasks, use these recommended settings:

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --input_path_phage phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output_dir results/ \
    --threads 8 \
    --num_features 100 \
    --num_runs_fs 25 \
    --num_runs_modeling 50 \
    --method rfe \
    --use_clustering \
    --cluster_method hierarchical \
    --n_clusters 20 \
    --filter_type strain \
    --use_shap

Key Parameters Explained:

--num_features 100: Select top 100 features (adjust based on dataset size)
--num_runs_fs 25: 25 iterations for robust feature selection
--num_runs_modeling 50: 50 modeling runs for reliable performance estimates
--method rfe: Recursive Feature Elimination (balanced performance)
--use_clustering: Enable sample clustering-aware filtering
--filter_type strain: Critical for phage-host prediction. Separates train/test splits by strain, so the model is always evaluated on strains it has not seen during training
--use_shap: Generate SHAP plots and feature importance analysis for model interpretability

Note: For phage-host interaction prediction, --filter_type strain is strongly recommended. This controls how train/test splits are made during feature selection and modeling, ensuring the model never sees the same strain in both training and testing. This forces the model to learn generalizable patterns rather than memorizing specific strain characteristics.

For single-strain phenotypes (no phage data):

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --phenotype_matrix phenotypes.csv \
    --output_dir results/ \
    --threads 8 \
    --sample_column strain \
    --phenotype_column resistance \
    --filter_type none

Usage

CLI Commands Overview

GenoPHI provides the following main commands:

Command	Description
`protein-family-workflow`	Recommended basic workflow: Complete protein family-based workflow
`full-workflow`	Protein families → k-mers from predictive proteins
`kmer-workflow`	Complete k-mer-based workflow from all proteins
`cluster`	Generate protein family clusters and feature tables
`select-features`	Perform feature selection on any feature table
`train`	Train predictive models on selected features
`predict`	Predict phenotypes using trained models
`select-and-train`	Feature selection + modeling from any feature table
`assign-features`	Assign features to new genomes
`assign-predict`	Assign features and predict (protein families)
`annotate`	Annotate predictive features with functional info
`kmer-assign-features`	Assign k-mer features to new genomes
`kmer-assign-predict`	Assign k-mer features and predict
`kmer-analysis`	Analyze k-mer composition and diversity

Workflows

1. Protein Family Workflow (Recommended)

The primary workflow for most applications. Performs complete protein family clustering, feature selection, and modeling.

Complete Workflow

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --input_path_phage phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output_dir results/ \
    --threads 8 \
    --num_features 100 \
    --num_runs_fs 25 \
    --num_runs_modeling 50 \
    --method rfe \
    --filter_type strain \
    --use_shap

Output Structure:

results/
├── strain/                  # Strain MMseqs2 outputs
├── phage/                   # Phage MMseqs2 outputs (if provided)
├── merged/                  # Merged strain+phage feature table directory (if phage input)
│   └── full_feature_table.csv
├── feature_selection/       # Selected features and occurrence counts
│   ├── filtered_feature_tables/
│   └── features_occurrence.csv
├── modeling_results/        # Models and performance metrics
│   ├── cutoff_3/, cutoff_4/, cutoff_5/, ...
│   ├── model_performance/   # Summary plots, metrics, predictive_proteins/
│   ├── select_features_model_performance.csv
│   └── select_features_model_predictions.csv
├── workflow_report.txt      # Runtime/performance summary
└── workflow_report.csv      # Parameters and runtime metrics

Single-Strain Phenotype Prediction

For strain-level phenotypes (no phage data required):

genophi protein-family-workflow \
    --input_path_strain strain_fastas/ \
    --phenotype_matrix strain_phenotypes.csv \
    --output_dir results/ \
    --threads 8 \
    --sample_column strain \
    --phenotype_column antibiotic_resistance \
    --task_type classification \
    --filter_type none

Phenotype Matrix Format:

strain,antibiotic_resistance
Strain_001,1
Strain_002,0
Strain_003,1

For regression:

--task_type regression \
--phenotype_column growth_rate

2. Full Workflow: Protein Families → Predictive K-mers

This workflow first identifies predictive protein families, then extracts k-mers specifically from those families for refined modeling. This combines the interpretability of protein families with the resolution of k-mer analysis.

genophi full-workflow \
    --input_strain strain_fastas/ \
    --input_phage phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output results/ \
    --k 5 \
    --threads 8

Workflow Steps:

Cluster proteins into families
Perform feature selection on protein families
Extract k-mers from predictive protein families only
Train models on k-mer features
Generate annotations for predictive k-mers

3. K-mer Workflow (De Novo)

Generate k-mer features from all proteins without prior protein family analysis:

genophi kmer-workflow \
    --input_strain_dir strain_fastas/ \
    --input_phage_dir phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output kmer_results/ \
    --k 4 \
    --threads 8 \
    --num_features 100 \
    --num_runs_fs 25 \
    --num_runs_modeling 50

K-mer Specific Parameters:

--k 4: K-mer length (default: 4)
--k_range: Generate k-mers from length 3 to k
--one_gene: Include features with only one gene (default: False)

Advanced Options:

--use_dynamic_weights \        # Apply dynamic weighting
--weights_method inverse_frequency \       # Weighting method
--no-clustering \              # Disable clustering-aware train/test splitting
--use_shap                     # Save SHAP-based interpretation outputs

4. Modular Step-by-Step Workflows

Step 1: Clustering and Feature Generation

genophi cluster \
    --input_strain strain_fastas/ \
    --input_phage phage_fastas/ \
    --phenotype_matrix interactions.csv \
    --output clustering_results/ \
    --min_seq_id 0.4 \
    --coverage 0.8 \
    --sensitivity 7.5 \
    --threads 8

Clustering Parameters:

--min_seq_id 0.4: Minimum sequence identity (range: 0-1)
--coverage 0.8: Minimum coverage (range: 0-1)
--sensitivity 7.5: MMseqs2 sensitivity (higher = more sensitive, slower)
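
For orientation, these three parameters correspond to standard MMseqs2 options (`--min-seq-id`, `-c`, `-s`). The sketch below builds an `mmseqs easy-cluster` command line with the same values. It is an assumption that this resembles what runs under the hood: which MMseqs2 subcommands GenoPHI actually calls is not stated here, and the helper name and file names are hypothetical. The command is only printed, not executed.

```python
import shlex

def mmseqs_cluster_command(fasta, out_prefix, tmp_dir,
                           min_seq_id=0.4, coverage=0.8,
                           sensitivity=7.5, threads=8):
    """Build (but do not run) an MMseqs2 easy-cluster command line."""
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity (0-1)
        "-c", str(coverage),              # minimum alignment coverage (0-1)
        "-s", str(sensitivity),           # search sensitivity
        "--threads", str(threads),
    ]

cmd = mmseqs_cluster_command("strain_combined.faa", "clusters", "tmp")
print(shlex.join(cmd))
```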

Step 2: Feature Selection

Feature selection works on any feature table with a phenotype column:

genophi select-features \
    --input feature_table.csv \
    --output feature_selection/ \
    --method rfe \
    --num_features 100 \
    --num_runs 25 \
    --filter_type strain \
    --phenotype_column interaction \
    --threads 8

Feature Selection Methods:

rfe: Recursive Feature Elimination (recommended)
shap_rfe: RFE using SHAP values
select_k_best: ANOVA F-test (fast)
chi_squared: Chi-squared test
lasso: L1 regularization
shap: Direct SHAP importance

Advanced Selection Options:

--use_dynamic_weights \               # Handle imbalanced features
--weights_method inverse_frequency \  # Weighting strategy
--no-clustering \                     # Disable clustering (enabled by default)
--cluster_method hierarchical \       # Clustering algorithm
--n_clusters 20                       # Number of clusters

Step 3: Model Training

Train models from selected features (directory input):

genophi train \
    --input feature_selection/filtered_feature_tables \
    --output models/ \
    --num_runs 50 \
    --phenotype_column interaction \
    --threads 8

For regression tasks:

--task_type regression \
--phenotype_column efficiency

Advanced Training Options:

--set_filter strain \                    # Filter type: none, strain, phage (default: strain)
--use_dynamic_weights \                  # Apply dynamic feature weighting
--weights_method inverse_frequency \     # Weighting: log10, inverse_frequency, balanced
--no-clustering \                        # Disable clustering-aware splits
--cluster_method hierarchical \          # Clustering: hdbscan, hierarchical (default: hierarchical)
--n_clusters 20                          # Number of clusters (default: 20)

Step 4: Select-and-Train (Combined)

Run feature selection and modeling together from any feature table:

genophi select-and-train \
    --full_feature_table custom_feature_table.csv \
    --output results/ \
    --method rfe \
    --num_features 100 \
    --num_runs_fs 25 \
    --num_runs_modeling 50 \
    --phenotype_column your_phenotype \
    --sample_column your_sample_id \
    --threads 8

This command is flexible and works with:

Protein family features
K-mer features
Custom features
Any feature table with any phenotype / output column

5. Prediction Workflows

Predict from Assigned Features

Generate predictions for new genome combinations using pre-assigned features:

genophi predict \
    --input_dir strain_feature_tables/ \
    --model_dir models/cutoff_10 \
    --output_dir predictions/ \
    --phage_feature_table phage_features.csv \
    --threads 8

Parameters:

--input_dir: Directory with strain-specific feature tables
--model_dir: Directory containing trained models
--phage_feature_table: Path to phage feature table (optional for single-strain mode)
--strain_source: Prefix for strain features (default: strain)
--phage_source: Prefix for phage features (default: phage)

Assign Features and Predict (Protein Families)

genophi assign-predict \
    --input_dir new_strains/ \
    --mmseqs_db results/tmp/strain/mmseqs_db \
    --clusters_tsv results/strain/clusters.tsv \
    --feature_map results/strain/features/selected_features.csv \
    --tmp_dir tmp_assign/ \
    --model_dir results/modeling_results/cutoff_10 \
    --phage_feature_table results/phage/features/feature_table.csv \
    --output_dir predictions/ \
    --genome_type strain

For new phages:

--input_dir new_phages/ \
--mmseqs_db results/tmp/phage/mmseqs_db \
--clusters_tsv results/phage/clusters.tsv \
--tmp_dir tmp_assign_phage/ \
--strain_feature_table results/strain/features/feature_table.csv \
--output_dir predictions/ \
--genome_type phage

K-mer Assignment and Prediction

genophi kmer-assign-predict \
    --input_dir new_strains/ \
    --mmseqs_db results/tmp/strain/mmseqs_db \
    --clusters_tsv results/strain/clusters.tsv \
    --feature_map results/strain/features/selected_features.csv \
    --filtered_kmers kmer_analysis/strain/filtered_kmers.csv \
    --aa_sequence_file kmer_results/strain_combined.faa \
    --tmp_dir tmp_kmer_assign/ \
    --model_dir kmer_results/modeling/modeling_results/cutoff_10 \
    --output_dir predictions/ \
    --genome_type strain

6. Feature Annotation

Identifies proteins associated with predictive protein families or k-mers and merges them with functional information:

genophi annotate \
    --feature_file_path results/feature_selection/filtered_feature_tables/select_feature_table_cutoff_10.csv \
    --feature2cluster_path results/strain/features/selected_features.csv \
    --cluster2protein_path results/strain/clusters.tsv \
    --fasta_dir_or_file strain_fastas/ \
    --modeling_dir results/modeling_results/cutoff_10 \
    --annotation_table_path annotations.csv \
    --output_dir annotations/ \
    --feature_type strain

Input Data Formats

FASTA Files

Protein sequences in FASTA format (.faa files):

>protein_id_1
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKRQ...
>protein_id_2
MRISTTITTTITITTGNGAG...

Important: Protein IDs must be unique across all genomes. If duplicates exist, GenoPHI will automatically prefix them with genome names.
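
To see in advance which IDs would be renamed, duplicates can be detected with a short script. This is a sketch under the assumption that the protein ID is the first whitespace-delimited token of each header line; the function names are hypothetical and this is not GenoPHI's own detection code.

```python
from collections import defaultdict

def parse_fasta_ids(text):
    """Yield the ID (first token after '>') of each FASTA record."""
    for line in text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                yield tokens[0]

def find_duplicate_ids(fasta_by_genome):
    """Return {protein_id: [genomes]} for IDs that occur more than once."""
    genomes_by_id = defaultdict(list)
    for genome, text in fasta_by_genome.items():
        for protein_id in parse_fasta_ids(text):
            genomes_by_id[protein_id].append(genome)
    return {pid: g for pid, g in genomes_by_id.items() if len(g) > 1}

# In practice the dict could be built from files, e.g.
# {p.stem: p.read_text() for p in Path("strain_fastas").glob("*.faa")}
fastas = {
    "Strain_001": ">protein_1 hypothetical protein\nMKTAYIAK\n>protein_2\nMRISTT\n",
    "Strain_002": ">protein_1\nMKTAYIAR\n>protein_3\nMSSLK\n",
}
duplicates = find_duplicate_ids(fastas)
print(duplicates)
```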

Phenotype Matrix

Phage-Host Interactions

Binary classification (infection/no infection):

strain,phage,interaction
Strain_001,Phage_A,1
Strain_001,Phage_B,0
Strain_002,Phage_A,1

Regression (infection efficiency):

strain,phage,efficiency
Strain_001,Phage_A,0.85
Strain_001,Phage_B,0.12
Strain_002,Phage_A,0.93

Single-Strain Phenotypes

Classification:

strain,antibiotic_resistance
Strain_001,1
Strain_002,0
Strain_003,1

Regression:

strain,growth_rate
Strain_001,0.42
Strain_002,0.38
Strain_003,0.51

Column Names: Use --strain_column, --phage_column, --sample_column, and --phenotype_column to specify your column names.
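
A quick pre-flight check can catch mismatched column names or sample names before a long run. The sketch below assumes that genome names are the FASTA filenames without the extension; the function name and return shape are hypothetical.

```python
import csv
import io

def check_phenotype_matrix(csv_text, genome_names,
                           sample_column="strain",
                           phenotype_column="interaction"):
    """Return (missing_columns, samples_without_genome, genomes_without_rows)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    missing = [c for c in (sample_column, phenotype_column) if c not in columns]
    if missing:
        return missing, [], []
    samples = {row[sample_column] for row in reader}
    genomes = set(genome_names)
    return [], sorted(samples - genomes), sorted(genomes - samples)

matrix_text = (
    "strain,phage,interaction\n"
    "Strain_001,Phage_A,1\n"
    "Strain_001,Phage_B,0\n"
    "Strain_002,Phage_A,1\n"
)
# Genome names as they might be derived from Strain_001.faa and Strain_003.faa.
report = check_phenotype_matrix(matrix_text, ["Strain_001", "Strain_003"])
print(report)
```

Here `Strain_002` has phenotype rows but no genome, and `Strain_003` has a genome but no phenotype rows.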

Feature Selection Methods

Method	Description	Best For	Speed
RFE (recommended)	Recursive Feature Elimination	General use, balanced performance	Medium
SHAP-RFE	RFE using SHAP values	Model-agnostic importance	Slow (High RAM)
SelectKBest	ANOVA F-test	Fast screening, linear relationships	Fast
Chi-squared	χ² test for independence	Categorical features	Fast
Lasso	L1 regularized regression	Sparse models, multicollinearity	Fast
SHAP	Shapley Additive Explanations	Direct feature importance	Slow (High RAM)

Dynamic Weighting

Handle imbalanced feature distributions:

--use_dynamic_weights \
--weights_method inverse_frequency  # or log10, balanced

When to use:

Features with highly variable occurrence frequencies
Some features present in most genomes, others very rare
Imbalanced positive/negative examples
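
To give a feel for inverse-frequency weighting, the sketch below weights each feature by the reciprocal of the fraction of samples in which it is present. This is one common definition and is an assumption here: the exact formulas behind GenoPHI's inverse_frequency, log10, and balanced options are not given in this document, and the function name is hypothetical.

```python
def inverse_frequency_weights(matrix):
    """Weight each feature column by 1 / (fraction of samples where it is present)."""
    n_samples = len(matrix)
    n_features = len(matrix[0])
    weights = []
    for j in range(n_features):
        present = sum(1 for row in matrix if row[j])
        # A feature absent from every sample gets weight 0.0 instead of dividing by zero.
        weights.append(n_samples / present if present else 0.0)
    return weights

# Rows are samples, columns are features (presence/absence).
presence = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
]
weights = inverse_frequency_weights(presence)
print(weights)
```

A feature present in every sample gets weight 1.0, while a feature present in half of the samples gets weight 2.0, so rarer features count for more.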

Clustering-Based Selection

Group correlated features for more robust selection:

--use_clustering \
--cluster_method hierarchical \  # or hdbscan
--n_clusters 20

HDBSCAN Options:

--cluster_method hdbscan \
--min_cluster_size 5 \
--min_samples 5 \
--cluster_selection_epsilon 0.0

Performance Metrics

Classification Metrics

AUC-ROC (Area Under ROC Curve): Overall discriminative ability $$\text{AUC} = \int_{0}^{1} \text{TPR}(\text{FPR}) \, d\text{FPR}$$
MCC (Matthews Correlation Coefficient): Balanced metric for all confusion matrix elements $$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
Precision: Proportion of true positives among predicted positives $$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity): Proportion of true positives among actual positives $$\text{Recall} = \frac{TP}{TP + FN}$$
F1 Score: Harmonic mean of precision and recall $$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Accuracy: Overall correct predictions $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
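
The count-based formulas above can be checked by hand with a small worked example. The sketch computes them directly from confusion matrix counts; AUC-ROC is omitted because it needs ranked prediction scores rather than counts. The function name is hypothetical, and returning 0.0 for undefined ratios is a convention chosen for this sketch.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1, accuracy, and MCC from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}

metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```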

Regression Metrics

RMSE: Root Mean Squared Error
MAE: Mean Absolute Error
R²: Coefficient of determination
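
For reference, the three regression metrics can be computed as follows. This is a plain-Python sketch using the standard definitions; the function name is hypothetical, and R² is reported as NaN when the true values have zero variance.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE, and R² for paired true and predicted values."""
    n = len(y_true)
    errors = [pred - true for true, pred in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }

# Example with infection-efficiency-like values.
scores = regression_metrics([0.85, 0.12, 0.93], [0.80, 0.20, 0.90])
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```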

Visualization Outputs

GenoPHI generates comprehensive visualizations:

Per-Run Plots (modeling_results/cutoff_*/run_*/):

Confusion matrices (classification)
ROC curves with AUC scores
Precision-Recall curves
SHAP feature importance bar plots
SHAP value scatter plots (beeswarm)

Summary Plots (modeling_results/model_performance/):

SHAP beeswarm plots across all runs
ROC curve comparisons across feature selection cutoffs
Precision-Recall curve comparisons
Hit rate and hit ratio curves

Example visualizations (figures omitted): SHAP feature importance summary; ROC curves comparing different feature selection cutoffs; Precision-Recall curves; hit rate analysis; hit ratio analysis.

Output Directory Structure

Protein Family Workflow

output_dir/
├── strain/
│   ├── clusters.tsv
│   ├── presence_absence_matrix.csv
│   └── features/
│       ├── feature_table.csv
│       ├── selected_features.csv
│       └── feature_assignments.csv
├── phage/                       # Created when --input_path_phage is provided
│   ├── clusters.tsv
│   ├── presence_absence_matrix.csv
│   └── features/
│       ├── feature_table.csv
│       ├── selected_features.csv
│       └── feature_assignments.csv
├── merged/                      # Created when phage input is provided
│   └── full_feature_table.csv
├── full_feature_table.csv       # Created for single-strain mode (no phage input)
├── feature_selection/
│   ├── filtered_feature_tables/
│   │   ├── select_feature_table_cutoff_3.csv
│   │   ├── select_feature_table_cutoff_10.csv
│   │   └── ...
│   └── features_occurrence.csv
├── modeling_results/
│   ├── cutoff_3/, cutoff_4/, cutoff_5/, ...
│   ├── model_performance/
│   │   ├── model_performance_metrics.csv
│   │   └── predictive_proteins/
│   ├── select_features_model_performance.csv
│   └── select_features_model_predictions.csv
├── tmp/
│   ├── strain/
│   └── phage/                   # Created when phage input provided
├── workflow_report.txt
└── workflow_report.csv
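
When post-processing results, the cutoff_* directories are easier to work with in numeric order (plain string sorting would put cutoff_10 before cutoff_3). The sketch below lists them; the helper name is hypothetical, and the demonstration runs on a throwaway directory that mimics the layout above rather than on real GenoPHI output.

```python
import re
import tempfile
from pathlib import Path

def list_cutoff_dirs(modeling_results):
    """Return [(cutoff, path)] for cutoff_<N> subdirectories, sorted by N."""
    found = []
    for child in Path(modeling_results).iterdir():
        match = re.fullmatch(r"cutoff_(\d+)", child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    return sorted(found, key=lambda item: item[0])

# Demonstration on a temporary directory with the same naming scheme.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("cutoff_3", "cutoff_10", "cutoff_4", "model_performance"):
        (Path(tmp) / name).mkdir()
    cutoffs = [cutoff for cutoff, _ in list_cutoff_dirs(tmp)]

print(cutoffs)  # [3, 4, 10]
```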

K-mer Workflow

output_dir/
├── strain_combined.faa
├── strain_proteins.csv
├── phage_combined.faa           # Optional
├── phage_proteins.csv           # Optional
├── feature_tables/
│   ├── strain_feature_table.csv
│   ├── final_feature_table.csv
│   ├── phage_feature_table.csv  # Optional
│   ├── phage_final_feature_table.csv  # Optional
│   └── selected_features.csv
├── full_feature_table.csv
├── modeling/
│   ├── feature_selection/
│   └── modeling_results/
│       ├── cutoff_*/
│       ├── model_performance/model_performance_metrics.csv
│       ├── select_features_model_performance.csv
│       └── select_features_model_predictions.csv
├── workflow_report.txt
└── kmer_workflow_report.csv

Python API

GenoPHI can also be used programmatically:

from genophi.workflows import (
    run_kmer_workflow,
    run_modeling_workflow_from_feature_table,
    assign_predict_workflow
)
from genophi.workflows.protein_family_workflow import run_protein_family_workflow

# Recommended: Protein family workflow
run_protein_family_workflow(
    input_path_strain="strain_fastas/",
    input_path_phage="phage_fastas/",
    phenotype_matrix="interactions.csv",
    output_dir="results/",
    threads=8,
    num_features=100,
    num_runs_fs=50,
    num_runs_modeling=100,
    method='rfe',
    filter_type='strain'
)

# K-mer workflow
run_kmer_workflow(
    input_strain_dir="strain_fastas/",
    input_phage_dir="phage_fastas/",
    phenotype_matrix="interactions.csv",
    output_dir="kmer_results/",
    k=5,
    threads=8,
    num_features=100
)

# Feature selection and modeling from any feature table
run_modeling_workflow_from_feature_table(
    full_feature_table="custom_features.csv",
    output_dir="modeling_results/",
    phenotype_column="your_phenotype",
    sample_column="your_sample_id",
    num_features=100,
    num_runs_fs=50,
    num_runs_modeling=100,
    method='rfe'
)

# Prediction workflow
assign_predict_workflow(
    input_dir="new_genomes/",
    mmseqs_db="results/tmp/strain/mmseqs_db",
    clusters_tsv="results/strain/clusters.tsv",
    feature_map="results/strain/features/selected_features.csv",
    tmp_dir="tmp_assign/",
    model_dir="results/modeling_results/cutoff_10",
    output_dir="predictions/",
    genome_type='strain',
    phage_feature_table_path="results/phage/features/feature_table.csv"
)

Advanced Usage

Custom Train-Test Splits

Control how data is split during training and testing:

--filter_type none      # Random split (default, but not recommended for phage-host)
--filter_type strain    # Leave-strain-out CV splits - RECOMMENDED for phage-host
--filter_type phage     # Leave-phage-out CV splits

Important Recommendation: For phage-host interaction prediction, always use --filter_type strain. This controls how train/test splits are made during each iteration:

strain: Ensures the same strain never appears in both training and testing sets within an iteration. Forces the model to learn features that generalize to completely new bacterial strains.
none: Random splits allow the same strains in both sets, leading to overly optimistic performance because the model can memorize strain-specific patterns.
phage: Leave-phage-out splits, useful for testing generalization to new phages.

The split type fundamentally changes what the model learns, not just how it's evaluated.
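
The idea behind strain-level splitting can be sketched as a grouped split: whole strains, not individual strain-phage rows, are assigned to train or test. This is a simplified illustration with a hypothetical function name. It does not reproduce GenoPHI's splitting code, which also supports clustering-aware splits.

```python
import random

def split_by_group(rows, group_key, test_fraction=0.2, seed=0):
    """Split rows so every value of group_key lands entirely in train or test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test

# Toy interaction table: 5 strains x 3 phages.
rows = [
    {"strain": f"Strain_{i:03d}", "phage": f"Phage_{p}", "interaction": (i + j) % 2}
    for i in range(1, 6)
    for j, p in enumerate("ABC")
]
train, test = split_by_group(rows, group_key="strain")
print(sorted({row["strain"] for row in test}))
```

Passing group_key="phage" gives the corresponding leave-phage-out behavior, while a plain random shuffle of rows corresponds to the "none" setting.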

Hyperparameter Tuning

Models use grid search for hyperparameter optimization. Default parameters are optimized for phage-host interaction prediction but can be customized in the code.

Handling Large Datasets

For large datasets, adjust memory and threading:

--max_ram 64 \           # Maximum RAM in GB
--threads 16 \           # Number of CPU threads
--clear_tmp              # Remove temporary files after completion

Regression Tasks

For continuous phenotypes:

--task_type regression \
--phenotype_column efficiency

GenoPHI will use appropriate regression metrics (RMSE, MAE, R²) instead of classification metrics.

Pre-processing Feature Clustering

Filter features by cluster presence before modeling:

--use_feature_clustering \
--feature_cluster_method hierarchical \
--feature_n_clusters 20 \
--feature_min_cluster_presence 2

This removes features that appear in fewer than feature_min_cluster_presence genome clusters. This option set is available in protein-family and full workflows (protein-family-workflow, cluster, and full-workflow).

Example Datasets

To help you get started, we recommend first testing GenoPHI on a small dataset organized as follows.

Test Dataset Structure

Minimal test case with 10 strains, 5 phages:

test_data/
├── strains/
│   ├── Strain_001.faa
│   ├── Strain_002.faa
│   └── ...
├── phages/
│   ├── Phage_A.faa
│   ├── Phage_B.faa
│   └── ...
└── interactions.csv

Troubleshooting

Common Issues

Issue: MMseqs2 not found

Solution: Ensure MMseqs2 is installed and in your PATH
conda install -c bioconda mmseqs2
which mmseqs  # Should show the path

Issue: Out of memory errors

Solution: 
- Reduce --max_ram parameter
- Process fewer genomes at once
- Use --clear_tmp to remove intermediate files
- Increase system swap space

Issue: Duplicate protein IDs

Solution: GenoPHI automatically detects duplicate protein IDs and prefixes them with genome names.
To avoid this renaming, ensure protein IDs are unique across all input files.

Issue: No predictive features found

Solution: 
- Try different feature selection methods (--method)
- Adjust num_features parameter
- Check that phenotype matrix has sufficient positive/negative examples
- Verify that interaction matrix matches genome filenames

Issue: Poor model performance

Solution: 
- Increase num_runs_fs and num_runs_modeling for more robust results
- Try different feature selection methods
- Use --use_dynamic_weights for imbalanced features
- Enable --use_clustering for feature grouping
- Check data quality and ensure phenotype matrix is correct
- Try different clustering parameters (min_seq_id, coverage)

Issue: Models take too long to train

Solution:
- Reduce num_runs_modeling
- Reduce num_features
- Increase --threads parameter
- Use faster feature selection methods (select_k_best, chi_squared)

Testing

GenoPHI includes a comprehensive test suite organized into multiple tiers for different testing scenarios.

Quick Validation

# Verify installation
pytest -m smoke -v

# Run all tests (requires MMseqs2)
pytest -v

Test Organization

Smoke tests (<5 seconds): Package installation verification
Integration tests (~30-45 min): Module-to-module interactions
End-to-end tests (~60-90 min): Complete workflow validation

See tests/README.md for detailed testing documentation, including:

How to run specific test tiers
Test data organization
Baseline metrics for regression testing
CI/CD recommendations

Running Specific Test Tiers

pytest -m smoke              # Quick installation check
pytest -m integration        # Module integration tests
pytest -m e2e                # End-to-end workflows
pytest -m "not requires_mmseqs2"  # Skip MMseqs2-dependent tests

Frequently Asked Questions (FAQ)

General Questions

Q: Can GenoPHI be used for organisms other than phages and bacteria?
A: Yes! While designed for phage-host interactions, GenoPHI works with any protein sequences and phenotypes.

Q: How many genomes do I need for reliable predictions?
A: Minimum: ~20 strains and 20 phages with ~400 interactions. Recommended: 50+ strains, 50+ phages, 5000+ interactions for robust models.

Q: What file formats are required?
A: FASTA files (.faa) for protein sequences and CSV for phenotype matrices. See Input Data Formats for details.

Technical Questions

Q: What's the difference between protein family and k-mer approaches?
A: Protein families group similar full-length proteins (interpretable, captures protein-level patterns). K-mers analyze short amino acid sequences (high resolution, captures local patterns). The full-workflow combines both by extracting k-mers from predictive protein families.

Q: Should I use single-strain or phage-host mode?
A: Use phage-host mode for interaction prediction. Use single-strain mode for strain-level phenotypes (resistance, growth rate, etc.) where phage data isn't relevant.

Q: Which feature selection method should I use?
A: Start with RFE (balanced performance).

Q: How do I interpret SHAP plots?
A: SHAP beeswarm plots show feature importance. Features at the top are most important. Red dots = high feature values, blue = low. Position right of center = positive impact on prediction. Enable with the --use_shap flag during model training.

Q: Can I use custom features instead of protein families?
A: Yes! Use select-and-train with any feature table containing a phenotype column (metabolic pathways, gene presence/absence, etc.).

Q: How do I handle imbalanced datasets?
A: Use --use_dynamic_weights with --weights_method inverse_frequency to balance feature importance. CatBoost also has built-in class balancing.

Troubleshooting Questions

Q: Models perform poorly - what should I try?
A: (1) Increase num_runs for more robust estimates, (2) Try different clustering parameters, (3) Enable dynamic weighting, (4) Check data quality and phenotype matrix accuracy.

Q: How much RAM do I need?
A: Minimum 8 GB. Recommend 16+ GB for 50+ genomes, 32+ GB for 100+ genomes. Use --max_ram to limit memory usage.

Best Practices

Start with recommended defaults for initial analysis
For phage-host predictions: Always use --filter_type strain. This forces the model to learn generalizable patterns by ensuring strains in the test set are never seen during training (critical!)
Run multiple iterations (num_runs_fs = 25, num_runs_modeling = 50) for robust results
Enable clustering (--use_clustering) for correlated features
Check data quality before modeling - ensure phenotype matrix matches genome filenames exactly
Use SHAP plots (--use_shap) to see which features drive predictions
For single-strain phenotypes: Use --filter_type none or omit the flag (random splits are appropriate when there is no phage data)
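To make the idea behind strain-level filtering concrete, the sketch below splits phage-host interaction rows so that every row for a given strain lands entirely in either the training set or the test set. It illustrates the general held-out-group technique only and is not GenoPHI's implementation; the function name and the "strain", "phage", and "interaction" keys are hypothetical.

```python
import random


def strain_holdout_split(rows, test_fraction=0.2, seed=0):
    """Split interaction rows so that no strain appears in both sets.

    Hypothetical helper for illustration; the "strain" key is an assumption.
    """
    strains = sorted({row["strain"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(strains)
    # Hold out whole strains, never individual strain-phage pairs.
    n_test = max(1, round(len(strains) * test_fraction))
    test_strains = set(strains[:n_test])
    train = [r for r in rows if r["strain"] not in test_strains]
    test = [r for r in rows if r["strain"] in test_strains]
    return train, test


# Toy data: five strains, each tested against two phages.
rows = [
    {"strain": s, "phage": p, "interaction": (i + j) % 2}
    for i, s in enumerate(["ECOR01", "ECOR02", "ECOR03", "ECOR04", "ECOR05"])
    for j, p in enumerate(["T4", "T7"])
]
train, test = strain_holdout_split(rows)
```

A purely random split of the same rows could place one strain's T4 row in training and its T7 row in testing, which lets the model score well by recognizing the strain rather than by learning transferable features.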

Version History

v1.0.0 (Current)

First stable release
Protein family-based workflow with MMseqs2 clustering
K-mer-based workflow with flexible k-mer lengths
Multiple feature selection methods (RFE, SHAP, SelectKBest, Chi-squared, Lasso)
CatBoost model training with hyperparameter optimization
SHAP-based interpretability
Support for classification and regression tasks
Single-strain and phage-host prediction modes
Unified CLI with 14 commands
Comprehensive visualization outputs

Upcoming Features

Web interface for prediction and visualization
Docker container for easy deployment

Publication Datasets

The datasets used in the GenoPHI publication are included in the data/ directory for reproducibility and benchmarking purposes.

data/
├── experimental_validation/
│   ├── BASEL_ECOR_interaction_matrix.csv    # BASEL collection against ECOR strains for model validation
│   └── ECOR27_TnSeq_high_fitness_genes.csv  # Filtered RB-TnSeq results
├── interaction_matrices/
│   ├── ecoli_interaction_matrix.csv          # E. coli phage-host interactions
│   ├── ecoli_interaction_matrix_subset.csv   # Smaller E. coli subset for testing
│   ├── klebsiella1_interaction_matrix.csv    # Klebsiella dataset 1
│   ├── klebsiella2_interaction_matrix.csv    # Klebsiella dataset 2
│   ├── pseudomonas_interaction_matrix.csv    # Pseudomonas interactions
│   └── vibrio_interaction_matrix.csv         # Vibrionaceae interactions
└── test_data/                                # Test datasets for test suite

Manuscript Scripts

Analysis scripts used to generate figures and results for the GenoPHI publication are available in the manuscript_scripts/ directory. These scripts demonstrate advanced usage patterns and reproduce the analyses presented in the paper.

Citation

If you use GenoPHI in your research, please cite:

@article{noonan2025genophi,
  author = {Noonan, Avery J. C. and Moriniere, Lucas and Rivera-López, Edwin O. and Patel, Krish and Pena, Melina and Svab, Madeline and Kazakov, Alexey and Deutschbauer, Adam and Dudley, Edward G. and Mutalik, Vivek K. and Arkin, Adam P.},
  title = {Phylogeny-agnostic strain-level prediction of phage-host interactions from genomes},
  year = {2025},
  doi = {10.1101/2025.11.15.688630},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/10.1101/2025.11.15.688630v1},
  journal = {bioRxiv}
}

Preprint: Noonan, A.J.C., Moriniere, L., Rivera-López, E.O., Patel, K., Pena, M., Svab, M., Kazakov, A., Deutschbauer, A., Dudley, E.G., Mutalik, V.K., & Arkin, A.P. (2025). Phylogeny-agnostic strain-level prediction of phage-host interactions from genomes. bioRxiv. https://doi.org/10.1101/2025.11.15.688630

Contributing

We welcome contributions to GenoPHI! Here's how you can help:

Ways to Contribute

Report bugs or suggest features via GitHub Issues
Submit pull requests for bug fixes or new features
Improve documentation by fixing typos or adding examples
Share your use cases and publications using GenoPHI
Test on different platforms and report compatibility issues

Development Guidelines

Fork the repository and create a feature branch
Follow PEP 8 style guidelines for Python code
Add tests for new functionality
Update documentation as needed
Submit a pull request with a clear description of changes

Code of Conduct

We are committed to providing a welcoming and inclusive environment. Please be respectful and constructive in all interactions.

License

This software is available under the MIT License. See the LICENSE file for details.

This software is subject to Lawrence Berkeley National Laboratory copyright. The U.S. Government retains certain rights as this software was developed under funding from the U.S. Department of Energy.

Support

For questions, issues, or feature requests:

Open an issue on GitHub
Contact: Avery Noonan (averynoonan@gmail.com)

Acknowledgments

This work was completed as part of the BRaVE Phage Foundry at Lawrence Berkeley National Laboratory, which is supported by the U.S. Department of Energy, Office of Science, Office of Biological & Environmental Research under contract number DE-AC02-05CH11231. This work was also supported by the National Science Foundation (NSF) of the United States under grant award No. 2220735 (EDGE CMT: Predicting bacteriophage susceptibility from Escherichia coli genotype).
