Predict lineages for new genome sequences using a trained model.
- Load model bundle (zstd-compressed)
- Validate model format version and integrity
- Stream input FASTA in batches of 512
- Vectorize each batch via the same feature hasher used during training
- Predict in parallel: each tree votes, and the most-voted class wins
- Write results immediately (constant memory)
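
The streaming and vectorization steps above can be sketched in Python as follows. This is an illustration, not pathotypr's implementation: the k-mer length, the number of hash buckets, the use of `zlib.crc32` as the hash function, and the helper names `read_fasta`, `batched`, `hash_kmers`, and `vectorize_stream` are all assumptions made for the example. Only the batch size of 512 comes from the steps listed above.

```python
import io
import zlib
from itertools import islice

BATCH_SIZE = 512    # batch size stated in the steps above
K = 4               # assumed k-mer length (illustration only)
N_BUCKETS = 1024    # assumed hashed feature dimension (illustration only)


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batched(records, size=BATCH_SIZE):
    """Group an iterator into lists of at most `size` items."""
    it = iter(records)
    while batch := list(islice(it, size)):
        yield batch


def hash_kmers(seq, k=K, n_buckets=N_BUCKETS):
    """Map a sequence to a sparse {bucket: count} vector.

    The same k, bucket count, and hash must be used at training and
    prediction time, otherwise features land in different columns.
    """
    vec = {}
    for i in range(len(seq) - k + 1):
        bucket = zlib.crc32(seq[i:i + k].upper().encode()) % n_buckets
        vec[bucket] = vec.get(bucket, 0) + 1
    return vec


def vectorize_stream(handle):
    """Yield (headers, vectors) per batch; only one batch is held in memory."""
    for batch in batched(read_fasta(handle)):
        headers = [h for h, _ in batch]
        vectors = [hash_kmers(s) for _, s in batch]
        yield headers, vectors


# Tiny in-memory FASTA standing in for a real input file.
demo = io.StringIO(">seq1\nACGTACGT\n>seq2\nacgt\nACGT\n")
for headers, vectors in vectorize_stream(demo):
    print(headers, [sum(v.values()) for v in vectors])
```

Because `vectorize_stream` is a generator, each batch can be predicted and written out before the next one is read, which is what keeps memory use proportional to the batch size rather than to the number of input sequences.
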

Example invocation:

    pathotypr predict \
      --input query_genomes.fasta \
      --model model.pathotypr.zst \
      --output predictions.tsv \
      --threads 8 \
      --excel

| Flag | Default | Description |
|---|---|---|
| `-i, --input` | required | FASTA file with query sequences (supports `.gz`) |
| `-m, --model` | required | Trained model bundle (`.pathotypr.zst`) |
| `-o, --output` | required | Output TSV path |
| `-t, --threads` | all cores | Number of CPU threads |
| `--excel` | off | Also generate `.xlsx` alongside TSV |

Output columns:

| Column | Description |
|---|---|
| `Header` | Sequence header from input FASTA |
| `Predicted_Lineage` | Most-voted class label |
| `Confidence` | Fraction of trees voting for the winner (0–1) |
| `Confidence_Margin` | Gap between winner and runner-up (0–1) |
| `Other_Votes` | Top 3 alternative predictions with vote fractions |
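
To make the column definitions concrete, the sketch below derives the four vote-based values from a list of per-tree votes. It is a minimal illustration under assumptions: the function name `summarize_votes`, the lineage labels, the forest size of 20, and the tie-breaking behaviour (first label seen wins) are all hypothetical and may differ from what pathotypr does internally.

```python
from collections import Counter


def summarize_votes(tree_votes, top_n=3):
    """Summarize per-tree class votes for one sequence.

    Returns (predicted, confidence, margin, other_votes), mirroring the
    Predicted_Lineage, Confidence, Confidence_Margin and Other_Votes columns.
    """
    n_trees = len(tree_votes)
    # most_common() sorts by count; ties keep first-seen order (an assumption
    # about tie-breaking, not documented behaviour of the tool).
    ranked = Counter(tree_votes).most_common()
    winner, winner_count = ranked[0]
    runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

    confidence = winner_count / n_trees
    margin = (winner_count - runner_up_count) / n_trees
    other_votes = [(label, count / n_trees) for label, count in ranked[1:1 + top_n]]
    return winner, confidence, margin, other_votes


# Hypothetical forest of 20 trees voting on made-up lineage labels.
votes = ["L4"] * 13 + ["L2"] * 5 + ["L1"] * 2
print(summarize_votes(votes))
```

Here 13 of 20 trees vote for the winner, giving a confidence of 0.65, and the runner-up receives 5 of 20, so the margin is (13 - 5) / 20 = 0.40.
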
- Streaming: processes 512 sequences at a time — RAM usage is O(batch), not O(N)
- Parallel: vectorization and tree voting are parallelized with rayon
- Gzip support: handles `.fasta.gz` transparently via needletail
- Confidence interpretation: 0.95 = 95% of trees agree; margin 0.40 = the winner's vote fraction is 0.40 (40 percentage points) above the runner-up's
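
One way to act on these confidence values downstream is to flag predictions that fall below a cutoff. The sketch below assumes the TSV starts with a header row using the column names documented here; the thresholds and the function name `flag_uncertain` are illustrative assumptions, not recommendations from the tool. The `Other_Votes` column is left out of the sample because its text format is not described.

```python
import csv
import io

MIN_CONFIDENCE = 0.80   # assumed threshold, tune for your data
MIN_MARGIN = 0.30       # assumed threshold, tune for your data


def flag_uncertain(handle, min_conf=MIN_CONFIDENCE, min_margin=MIN_MARGIN):
    """Return headers of rows whose confidence or margin is below a threshold."""
    flagged = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if (float(row["Confidence"]) < min_conf
                or float(row["Confidence_Margin"]) < min_margin):
            flagged.append(row["Header"])
    return flagged


# Made-up rows standing in for predictions.tsv; with a real file, pass
# open("predictions.tsv", newline="") instead of io.StringIO.
sample = io.StringIO(
    "Header\tPredicted_Lineage\tConfidence\tConfidence_Margin\n"
    "seq1\tL4\t0.95\t0.92\n"
    "seq2\tL2\t0.55\t0.15\n"
)
print(flag_uncertain(sample))
```

With these made-up rows only `seq2` is flagged, since both its confidence and its margin fall below the assumed thresholds.
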
For in-depth documentation of the underlying algorithms:
- Prediction — Streaming batch processing, ensemble voting, confidence and margin metrics
- Random Forest — Tree traversal via binary search on sparse rows, majority voting
- Feature Hashing — How sequences are vectorized identically at train and predict time