Code for paper "Molecular Fingerprints Are Strong Models for Peptide Function Prediction".
ArXiv preprint: https://arxiv.org/abs/2501.17901
Dependencies are installed with uv. You can install the exact versions used from uv.lock
by running uv sync in a fresh Python 3.11 virtual environment.
To ensure you have everything set up, you can also run make setup. All datasets used are
already included in the data directory.
We have quite a lot of experiments and scripts. Here, we provide instructions on how to
reproduce the results. However, note that the base method is very simple and can be run in
a few lines of Python; see the train_fp_model function in src/main_classification_fingerprints/utils.py
for the code.
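For intuition, the fingerprint idea can be sketched in a few stdlib-only lines. The snippet below is an illustrative stand-in, not the actual train_fp_model: it hashes sequence k-mers into a binary vector and compares vectors with Tanimoto similarity, whereas the real code computes molecular fingerprints from peptide structures. All names and toy data here are assumptions for illustration.

```python
import zlib

def kmer_fingerprint(seq: str, k: int = 3, n_bits: int = 1024) -> list[int]:
    # Hash every length-k substring into a fixed-size binary vector --
    # a stdlib stand-in for a real molecular fingerprint.
    bits = [0] * n_bits
    for i in range(len(seq) - k + 1):
        bits[zlib.crc32(seq[i:i + k].encode()) % n_bits] = 1
    return bits

def tanimoto(a: list[int], b: list[int]) -> float:
    # Tanimoto (Jaccard) similarity, the standard fingerprint metric.
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

# Toy usage: label a query peptide by its most similar labelled sequence.
train = {"ACDEFGHIK": 1, "LMNPQRSTV": 0}
query = kmer_fingerprint("ACDEFGHIV")
pred = max(train, key=lambda s: tanimoto(query, kmer_fingerprint(s)))
```

In the actual pipeline, a standard classifier (e.g. a tree ensemble) is trained on such fingerprint vectors instead of the nearest-neighbour lookup shown here.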
All scripts should be run directly from their directory, not from the project root.
For example: cd src/main_classification_fingerprints, then python lrgb.py.
Main molecular fingerprint results are in src/main_classification_fingerprints.
Each script is named after its benchmark and is run like: python lrgb.py.
The baselines, amino acid counts and ESM2, are in src/baselines_aminoacid_counts and
src/baselines_esm, respectively. They are run exactly like the molecular fingerprint scripts.
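For intuition, the amino acid count baseline is essentially a 20-dimensional count vector per sequence. The minimal sketch below captures that general idea only; the actual scripts in src/baselines_aminoacid_counts may normalise counts or handle non-canonical residues differently.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues

def aa_count_vector(seq: str) -> list[int]:
    # Count each canonical residue; a downstream classifier
    # consumes this fixed-length vector as features.
    counts = Counter(seq)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]

features = aa_count_vector("ACCAW")  # 2x A, 2x C, 1x W
```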
LRGB binary fingerprints are in src/other_experiments/lrgb_binary_with_seq_length.py.
To evaluate the results of AMPBenchmark (Sidorczuk et al.), which are quite complex,
use the util_scripts/evaluate_ampbenchmark_results.py script after running the classification
code.
Sequence shuffling analysis scripts are in the src/analyses_sequence_shuffling directory.
Run them exactly like the classification experiments. Note that they take ~20x longer
than plain classification (2 variants, train vs train & test, 10 shuffle ratios).
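As a rough illustration of what a shuffle ratio means, the sketch below permutes a fixed fraction of residue positions. The function name and exact procedure are assumptions made for illustration; the real implementation lives in the scripts themselves.

```python
import random

def shuffle_fraction(seq: str, ratio: float, rng: random.Random) -> str:
    # Permute a `ratio` fraction of positions, leaving the rest in place,
    # so amino acid composition is preserved but local order is disrupted.
    positions = rng.sample(range(len(seq)), k=round(len(seq) * ratio))
    residues = [seq[i] for i in positions]
    rng.shuffle(residues)
    out = list(seq)
    for pos, res in zip(positions, residues):
        out[pos] = res
    return "".join(out)

shuffled = shuffle_fraction("ACDEFGHIKL", ratio=0.5, rng=random.Random(0))
```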
Length split analysis scripts are in the src/analyses_length_split directory. Run them
exactly like the classification experiments. Note that they take ~5x longer than plain
classification (5 length buckets).
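For intuition, equal-frequency length buckets can be computed as below. This is a hedged sketch: the function name and the equal-sized bucketing rule are assumptions, and the real bucket boundaries are defined in the scripts.

```python
def length_buckets(seqs: list[str], n_buckets: int = 5) -> list[int]:
    # Sort indices by sequence length and split into equal-sized groups,
    # returning a bucket index 0..n_buckets-1 for every sequence.
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    bucket = [0] * len(seqs)
    for rank, idx in enumerate(order):
        bucket[idx] = min(rank * n_buckets // len(seqs), n_buckets - 1)
    return bucket

buckets = length_buckets(["AA", "AAAA", "A", "AAAAA", "AAA"])
```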
For the long-range sequence motif experiment, run src/other_experiments/long_range_seq_motifs.py.
Note that this requires a GPU to run in reasonable time, due to ESM2 finetuning.
For Random Forest and Extremely Randomized Trees on LRGB, run src/other_experiments/lrgb_classifiers.py
script.
The util_scripts directory contains a few utility scripts:
- create_bert_benchmark_datasets.py creates train-test dataset splits for the BERT AMPs benchmark; requires CD-HIT to be installed
- parse_autopeptideml_results.py parses original results from AutoPeptideML, to create more readable CSV output
- parse_peptidereactor_results.py parses original results from PeptideReactor, which are a very complex JSON, and adds fingerprint results from training our models
src/plots_notebooks_stats contains Jupyter notebooks and scripts to create plots
and calculate statistics.