MLCIL/peptides_molecular_fingerprints_classification

Molecular Fingerprints Are Strong Models for Peptide Function Prediction

Code for paper "Molecular Fingerprints Are Strong Models for Peptide Function Prediction".

ArXiv preprint: https://arxiv.org/abs/2501.17901

Setup

Dependencies are managed with uv. To install the exact versions pinned in uv.lock, run uv sync in a fresh Python 3.11 virtual environment.

To ensure you have everything set up, you can also run make setup. All datasets used are already included in the data directory.

Reproducing results

The repository contains many experiments and scripts, and the sections below explain how to reproduce each set of results. Note, however, that the base method is very simple and can be run in a few lines of Python. See the train_fp_model function in src/main_classification_fingerprints/utils.py for the code.

Main classification results

All scripts should be run directly from their own directory, not from the project root. For example: cd src/main_classification_fingerprints, then python lrgb.py.
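When launching several experiments from a driver script, the same rule can be followed by setting the working directory of the child process instead of calling cd by hand. The following is a minimal sketch, not part of the repository: build_command and run_script are hypothetical helper names, and only the directory and script names come from the instructions above.

```python
import subprocess
import sys
from pathlib import Path


def build_command(script_dir: str, script_name: str) -> tuple[list[str], Path]:
    """Return the command and the working directory for one experiment script."""
    workdir = Path(script_dir)
    # Use the current interpreter so the active virtual environment is reused.
    return [sys.executable, script_name], workdir


def run_script(script_dir: str, script_name: str) -> int:
    """Run a script with its own directory as the working directory."""
    cmd, workdir = build_command(script_dir, script_name)
    completed = subprocess.run(cmd, cwd=workdir, check=False)
    return completed.returncode


# Equivalent to: cd src/main_classification_fingerprints, then python lrgb.py
# exit_code = run_script("src/main_classification_fingerprints", "lrgb.py")
```

The call at the bottom is left commented out because it only makes sense when executed from the project root of a checked-out repository.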

The main molecular fingerprint experiments are in src/main_classification_fingerprints. Each script is named after its benchmark, e.g. python lrgb.py.

The baselines, amino acid counts and ESM2, are in src/baselines_aminoacid_counts and src/baselines_esm, respectively. They are run exactly like the molecular fingerprint scripts.
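To illustrate what an amino acid count feature looks like, here is a minimal sketch using only the standard library. It is an assumption about the general idea, not the code in src/baselines_aminoacid_counts: the helper name amino_acid_counts, the fixed ordering of the 20 standard residues, and the choice to ignore non-standard letters are all illustrative.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code, in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_counts(sequence: str) -> list[int]:
    """Count occurrences of each standard amino acid in a peptide sequence.

    Letters outside the 20 standard residues are ignored (an assumption).
    """
    counts = Counter(sequence.upper())
    return [counts[aa] for aa in AMINO_ACIDS]


features = amino_acid_counts("ACCA")
print(features[:2])  # counts for A and C
```

Each peptide becomes a 20-dimensional vector that discards residue order entirely, which is what makes it a useful lower-bound baseline.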

LRGB binary fingerprints are in src/other_experiments/lrgb_binary_with_seq_length.py.

To evaluate the results of AMPBenchmark (Sidorczuk et al.), which are quite complex, use the util_scripts/evaluate_ampbenchmark_results.py script after running the classification code.

Shuffling

Scripts are in the src/analyses_sequence_shuffling directory. Run them exactly like the classification experiments. Note that they take ~20x longer than plain classification (2 variants, train vs. train & test, times 10 shuffle ratios).
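The ~20x factor follows from the experiment grid: 2 variants times 10 shuffle ratios gives 20 runs. The sketch below shows one possible reading of partial shuffling, in which a given fraction of positions is permuted and the rest stay in place. This interpretation, the function name shuffle_fraction, the variant labels, and the ratio values are all assumptions made for illustration; see the scripts themselves for the actual procedure.

```python
import random
from itertools import product


def shuffle_fraction(sequence: str, ratio: float, seed: int = 0) -> str:
    """Permute a randomly chosen fraction of positions, leaving the rest fixed."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)
    residues = list(sequence)
    n_shuffled = round(ratio * len(residues))
    # Pick the positions to disturb, then permute the residues among them.
    positions = rng.sample(range(len(residues)), n_shuffled)
    permuted = positions[:]
    rng.shuffle(permuted)
    for src, dst in zip(positions, permuted):
        residues[dst] = sequence[src]
    return "".join(residues)


# Hypothetical grid: the README only states 2 variants and 10 ratios.
VARIANTS = ("train", "train_and_test")
RATIOS = [i / 10 for i in range(1, 11)]
grid = list(product(VARIANTS, RATIOS))
print(len(grid))  # 20 runs, matching the ~20x runtime
```

Shuffling in this way preserves the amino acid composition of each peptide, so any drop in performance can be attributed to the loss of residue order.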

Length splits

Scripts are in the src/analyses_length_split directory. Run them exactly like the classification experiments. Note that they take ~5x longer than plain classification (5 length buckets).
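As an illustration of splitting by length, the sketch below sorts sequences by length and cuts them into 5 buckets of near-equal size. Equal-frequency bucketing is an assumption; the actual scripts may use fixed length thresholds or another scheme, and length_buckets is a hypothetical helper name.

```python
def length_buckets(sequences: list[str], n_buckets: int = 5) -> list[list[str]]:
    """Split sequences into n_buckets groups of near-equal size, ordered by length."""
    ordered = sorted(sequences, key=len)
    base, extra = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        # The first `extra` buckets take one additional sequence each.
        size = base + (1 if i < extra else 0)
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


peptides = ["A" * n for n in range(1, 11)]
for i, bucket in enumerate(length_buckets(peptides)):
    print(i, [len(p) for p in bucket])
```

Running one experiment per bucket is what produces the roughly 5x runtime mentioned above.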

Sequence motifs long-range task

Run the script src/other_experiments/long_range_seq_motifs.py. Note that this requires a GPU to run in a reasonable time, due to ESM2 fine-tuning.

Other classifiers

For Random Forest and Extremely Randomized Trees on LRGB, run the src/other_experiments/lrgb_classifiers.py script.

Additional scripts & plots

The util_scripts directory contains a few utility scripts:

  • create_bert_benchmark_datasets.py creates train-test dataset splits for BERT AMPs benchmark; requires CD-HIT to be installed
  • parse_autopeptideml_results.py parses original results from AutoPeptideML, to create more readable CSV output
  • parse_peptidereactor_results.py parses original results from PeptideReactor, which are a very complex JSON, and adds fingerprint results from training our models

src/plots_notebooks_stats contains Jupyter notebooks and scripts to create plots and calculate statistics.
