BCV-Uniandes/ESCAPE

ESCAPE: A Standardized Benchmark for Multilabel Antimicrobial Peptide Classification

Paper License: CC BY NC Project Website

Sebastian Ojeda, Rafael Velasquez, Nicolás Aparicio, Juanita Puentes, Paula Cárdenas, Nicolás Andrade, Gabriel González, Sergio Rincón, Carolina Muñoz-Camargo, and Pablo Arbeláez
Universidad de Los Andes, Colombia

Expanded Standardized Collection for Antimicrobial Peptide Evaluation (ESCAPE) is an experimental framework for multilabel antimicrobial peptide classification. It combines a large-scale curated dataset, a benchmark for evaluating models, and a transformer-based baseline that integrates both sequence and structural information.


ESCAPE Database

The ESCAPE Dataset integrates over 80,000 peptide sequences from 27 validated public repositories to address critical limitations in existing AMP resources, including data fragmentation, inconsistent annotations, and limited functional coverage. It distinguishes antimicrobial peptides from negative sequences and organizes their functional annotations into a biologically meaningful multilabel hierarchy covering antibacterial, antifungal, antiviral, and antiparasitic activities. The dataset comprises 21,409 experimentally validated AMPs and 60,950 non-AMPs filtered from unrelated sources.
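As a rough illustration of what a multilabel annotation over these four activities can look like, the sketch below encodes a peptide's activities as a multi-hot vector. The label order, the `encode_labels` helper, and the choice to represent a non-AMP as an all-zero vector are assumptions made for this example; the dataset's actual column names and encoding may differ.

```python
# Hypothetical label order; the dataset's actual column names may differ.
ACTIVITIES = ["antibacterial", "antifungal", "antiviral", "antiparasitic"]


def encode_labels(activities):
    """Return a multi-hot vector with one slot per activity class."""
    unknown = set(activities) - set(ACTIVITIES)
    if unknown:
        raise ValueError(f"unknown activity labels: {sorted(unknown)}")
    return [1 if name in activities else 0 for name in ACTIVITIES]


# A peptide with two reported activities.
print(encode_labels({"antibacterial", "antifungal"}))  # [1, 1, 0, 0]
# Assumption: a non-AMP carries no activity label at all.
print(encode_labels(set()))  # [0, 0, 0, 0]
```

Because a peptide can be active against several kinds of organisms at once, each class is an independent yes/no decision rather than one choice among four.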

The ESCAPE Dataset is available for download. You can access the complete ESCAPE Database on Harvard Dataverse.


ESCAPE Benchmark

We evaluate eight representative models for antimicrobial peptide classification under the multilabel framework defined by the ESCAPE Benchmark: AMPlify, AMP-BERT, TransImbAMP, amPEPpy, AMPs-Net, AVP-IFT, PEP-Net, and the ESCAPE Baseline. Each model was modified to support multilabel classification and trained with two-fold cross-validation. We report final performance by averaging the predictions of both folds in an ensemble. Evaluation uses two standard multilabel metrics, F1-score and mean Average Precision (mAP), both of which are suitable for datasets with class imbalance.
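For readers unfamiliar with these metrics, here is a minimal pure-Python sketch of how F1-score and mAP can be computed for multilabel predictions. It assumes macro averaging over classes and a 0.5 decision threshold for F1; neither choice is stated above, so the benchmark's own evaluation code may differ. All function names are illustrative.

```python
def f1_per_class(y_true, y_pred):
    """F1 for one class from binary ground truth and binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(y_true, scores):
    """Mean of precision@k taken at the rank of every positive sample."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def macro_scores(Y_true, Y_prob, threshold=0.5):
    """Average F1 and AP over classes; rows are samples, columns are classes."""
    n_classes = len(Y_true[0])
    f1s, aps = [], []
    for c in range(n_classes):
        truth = [row[c] for row in Y_true]
        probs = [row[c] for row in Y_prob]
        preds = [1 if p >= threshold else 0 for p in probs]
        f1s.append(f1_per_class(truth, preds))
        aps.append(average_precision(truth, probs))
    return sum(f1s) / n_classes, sum(aps) / n_classes


# Toy example: four peptides, two classes.
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_prob = [[0.9, 0.2], [0.8, 0.7], [0.6, 0.4], [0.1, 0.3]]
macro_f1, mean_ap = macro_scores(Y_true, Y_prob)
print(f"macro F1 = {macro_f1:.3f}, mAP = {mean_ap:.3f}")
```

F1 depends on the chosen threshold, while average precision is computed from the ranking of scores alone, which is one reason the two metrics are reported together.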

The table below lists the methods included in the ESCAPE Benchmark, their primary architectures and GitHub repositories, and the F1-score and mean Average Precision (mAP) each method achieves on the ESCAPE Dataset.

| Method | Primary Architecture | GitHub Repository | F1-score (%) | mAP (%) |
| --- | --- | --- | --- | --- |
| AMPs-Net | GCN | GitHub | 57.7 ± 0.70 | 54.6 ± 0.86 |
| TransImbAMP | Transformer-Based | GitHub | 62.0 ± 0.70 | 64.9 ± 1.11 |
| AMP-BERT | BERT | GitHub | 64.7 ± 0.64 | 66.9 ± 1.17 |
| amPEPpy | Random Forest (RF) | GitHub | 66.5 ± 0.37 | 68.5 ± 0.48 |
| PEP-Net | Transformer-Based | GitHub | 65.5 ± 0.61 | 68.4 ± 0.53 |
| AVP-IFT | Contrastive-Learning + Transformer | GitHub | 66.5 ± 0.59 | 68.8 ± 0.50 |
| AMPlify | Bi-LSTM with attention layers | GitHub | 68.5 ± 0.77 | 70.3 ± 0.87 |
| ESCAPE Baseline (ours) | Dual-branch transformer | GitHub | 69.8 ± 0.43 | 72.1 ± 0.60 |

📦 Getting Started

1. Clone the ESCAPE repository.

git clone https://github.com/BCV-Uniandes/ESCAPE.git

2. Install general dependencies. To set up the environment and install the necessary dependencies, run the following commands:

conda env create -f ESCAPE.yml
conda activate ESCAPE_env

🧪 Reproducing ESCAPE Benchmark Results

To reproduce the ESCAPE Benchmark results on the ESCAPE Dataset:

1. Update the paths to both model checkpoints in the src/ensemble.sh executable script.

2. Set the model architecture in the src/test_ESCAPE.py file.

3. Run the following command:

bash src/ensemble.sh

This script loads both trained models, averages their outputs, and computes the final metrics over the test set.
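The averaging step described above can be sketched as follows. This is not the code in src/ensemble.sh or src/test_ESCAPE.py; it is a minimal illustration that assumes each fold's model outputs per-class probabilities for the same test samples in the same order. The helper name `ensemble_average` is hypothetical.

```python
def ensemble_average(fold_probs):
    """Average per-class probabilities elementwise across folds.

    fold_probs[f][i][c] is the probability that fold f assigns to class c
    for test sample i.
    """
    n_folds = len(fold_probs)
    return [
        [sum(values) / n_folds for values in zip(*rows)]
        for rows in zip(*fold_probs)
    ]


# Two folds, two test peptides, two classes.
fold_a = [[1.0, 0.0], [0.5, 0.5]]
fold_b = [[0.5, 0.5], [0.0, 1.0]]
averaged = ensemble_average([fold_a, fold_b])
print(averaged)  # [[0.75, 0.25], [0.25, 0.75]]
```

The averaged probabilities are then scored against the test labels, so the reported metrics describe the two-model ensemble rather than either fold alone.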


ESCAPE Baseline

The ESCAPE Baseline is a dual-branch transformer architecture designed to classify antimicrobial peptides (AMPs) using both sequence and structural information. It processes amino acid sequences through a transformer encoder and structural representations through a second branch that encodes peptide distance matrices. These two modalities are fused using a bidirectional cross-attention mechanism, enabling the model to capture both biological context and spatial structure. This approach achieves state-of-the-art overall performance on the ESCAPE Benchmark, outperforming existing methods in both F1-score and mean Average Precision.
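To make the fusion idea concrete, the following NumPy sketch shows bidirectional cross-attention in its simplest form: sequence tokens attend to structure tokens, and structure tokens attend to sequence tokens. It is a simplified illustration, not the MultiModalClassifier in src/models.py. It assumes a single attention head, no learned projections, mean pooling, and concatenation of the two pooled vectors; the token counts and embedding size are arbitrary.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention: queries attend to context."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context


def bidirectional_fusion(seq_tokens, struct_tokens):
    """Let each modality attend to the other, then pool and concatenate."""
    seq_ctx = cross_attention(seq_tokens, struct_tokens)     # sequence <- structure
    struct_ctx = cross_attention(struct_tokens, seq_tokens)  # structure <- sequence
    return np.concatenate([seq_ctx.mean(axis=0), struct_ctx.mean(axis=0)])


rng = np.random.default_rng(0)
seq_tokens = rng.normal(size=(30, 64))     # e.g. 30 residues, 64-dim embeddings
struct_tokens = rng.normal(size=(49, 64))  # e.g. 49 distance-matrix patches
fused = bidirectional_fusion(seq_tokens, struct_tokens)
print(fused.shape)  # (128,)
```

In a trained model the queries, keys, and values would pass through learned linear projections and typically several heads; the sketch keeps only the data flow between the two branches.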

🧬 Structural Inputs

For the structural branch, each peptide is represented as a 224×224 distance matrix, where each element is the Euclidean distance between two Cα atoms in the 3D conformation. We extract these structures from UniProt when available, or predict them using RoseTTAFold or AlphaFold3. The resulting distance matrices are precomputed for all peptides and stored as .npy files.
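As an illustration of this representation, the sketch below builds a Cα distance matrix from a list of 3D coordinates. How the repository brings peptides of different lengths to a fixed 224×224 size is not described here, so the zero-padding and truncation used below are assumptions, as are the function name and the example file name.

```python
import numpy as np


def ca_distance_matrix(ca_coords, size=224):
    """Pairwise Euclidean distances between C-alpha atoms, zero-padded to size x size."""
    coords = np.asarray(ca_coords, dtype=np.float32)[:size]  # truncate long chains
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    padded = np.zeros((size, size), dtype=np.float32)
    n = coords.shape[0]
    padded[:n, :n] = dist
    return padded


# Three made-up C-alpha positions (in angstroms).
coords = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [3.0, 4.0, 12.0]]
matrix = ca_distance_matrix(coords)
print(matrix.shape)  # (224, 224)
print(matrix[0, 1])  # 5.0

# Storing the result as a .npy file (hypothetical file name):
# np.save("peptide_0001.npy", matrix)
```

The matrix is symmetric with a zero diagonal, since the distance from residue i to residue j equals the distance from j to i.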

1. Download distance matrices. You can download the distance matrices for the test set from this link.

2. Set the distance matrix path. Modify the path to the folder containing the distance matrices in the test.py file to ensure the model can load the correct structural inputs during evaluation.


📊 ESCAPE Baseline Evaluation

We evaluate the ESCAPE Baseline on the ESCAPE Benchmark using two standard metrics for multilabel classification: F1-score and mean Average Precision (mAP). This model achieves state-of-the-art overall performance, outperforming the seven existing AMP classifiers in the benchmark on both metrics. To reproduce the evaluation of the ESCAPE Baseline:

1. Download trained model checkpoints. You can download the .pth files for both folds from this link.

2. Update the script configuration. Set the correct paths to both checkpoints in the src/ensemble.sh script, and ensure that the MultiModalClassifier architecture from src/models.py is properly initialized in src/test_ESCAPE.py.

3. Run ensemble evaluation. Use the following command:

bash src/ensemble.sh

🔧 Training ESCAPE Baseline

This repository provides the complete training pipeline for the ESCAPE Baseline, including argument handling, model initialization, and data loading. All input paths and training parameters are defined in src/args.py and passed through the executable script src/train.sh. These arguments include the locations of the ESCAPE CSV files, the directory containing the structural distance matrices, optimization settings, and the model configuration.

1. Set the training arguments in src/args.py. This file defines all required parameters, including learning rate, batch size, number of epochs, and output directories. The model is selected through the --mode argument, which supports three options: sequence (sequence-only transformer), distance (distance-matrix transformer), and MultiModal (dual-branch architecture used as the baseline).

2. Modify the src/train.sh script to provide the correct paths to the ESCAPE training, validation, and test partitions, as well as the folder containing the 224×224 structural matrices. Any additional argument defined in args.py may also be adjusted directly from this script.

3. Run the following command to execute training:

bash src/train.sh
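For orientation, an argument parser of the kind described in step 1 might look like the sketch below. Only the --mode flag and its three values (sequence, distance, MultiModal) come from this README; every other flag name and default value is hypothetical and may not match src/args.py.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Training arguments (illustrative)")
    # --mode and its three choices come from the README; the rest are hypothetical.
    parser.add_argument("--mode", choices=["sequence", "distance", "MultiModal"],
                        default="MultiModal")
    parser.add_argument("--lr", type=float, default=1e-4)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--distance_dir", type=str, default="data/distance_matrices")
    return parser


# Simulate the command line that a script such as src/train.sh would pass.
args = build_parser().parse_args(["--mode", "sequence", "--lr", "3e-4"])
print(args.mode, args.lr)
```

Restricting --mode with `choices` makes a mistyped value fail at startup instead of partway through training.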

Website License

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
