Molecular Fingerprints Are Strong Models for Peptide Function Prediction

Code for the paper "Molecular Fingerprints Are Strong Models for Peptide Function Prediction".

ArXiv preprint: https://arxiv.org/abs/2501.17901

Setup

Dependencies are installed with uv. You can install the exact versions pinned in uv.lock by running uv sync in a fresh Python 3.11 virtual environment.

To ensure you have everything set up, you can also run make setup. All datasets used are already included in the data directory.

Reproducing results

We have quite a lot of experiments and scripts. Here, we provide instructions on how to reproduce the results. However, note that the base method is very simple and can be run in a few lines of Python. See the train_fp_model function in src/main_classification_fingerprints/utils.py for the code.

Main classification results

All scripts should be run directly from their own directory, not from the project root. For example: cd src/main_classification_fingerprints, then python lrgb.py.

Main molecular fingerprint results are in src/main_classification_fingerprints. Each script is named after its benchmark. Run like: python lrgb.py.

The baselines, amino acid counts and ESM2, are in src/baselines_aminoacid_counts and src/baselines_esm, respectively. They are run exactly like the molecular fingerprint scripts.
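To make the amino acid counts baseline concrete, here is a minimal sketch of how a peptide sequence could be turned into a count vector. It is an illustration only: the function name aminoacid_counts, the alphabet order, and the choice to ignore non-standard letters are assumptions, and the scripts in src/baselines_aminoacid_counts may do this differently.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes, in a fixed (assumed) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aminoacid_counts(sequence: str) -> list[int]:
    """Return a 20-dimensional count vector for a peptide sequence.

    Hypothetical helper: letters outside the standard alphabet are ignored.
    """
    counts = Counter(sequence.upper())
    return [counts[aa] for aa in AMINO_ACIDS]


# One feature row per peptide; such rows can be fed to any tabular classifier.
features = [aminoacid_counts(seq) for seq in ["ACDA", "KKLW"]]
```

Features of this kind discard residue order entirely, which is what makes them a useful reference point for the fingerprint models.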

LRGB binary fingerprints are in src/other_experiments/lrgb_binary_with_seq_length.py.

To evaluate the results of AMPBenchmark (Sidorczuk et al.), which are quite complex, use the util_scripts/evaluate_ampbenchmark_results.py script after running the classification code.

Shuffling

Scripts are in the src/analyses_sequence_shuffling directory. Run them exactly like the classification experiments. Note that these take ~20x longer than plain classification (2 variants, train vs. train & test, times 10 shuffle ratios).
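As an illustration of what shuffling a sequence at a given ratio can mean, the sketch below permutes a randomly chosen fraction of positions and leaves the rest untouched. The function name shuffle_fraction, the seed parameter, and this particular definition of "shuffle ratio" are assumptions made for the example; the repository scripts may define the ratio differently.

```python
import random


def shuffle_fraction(sequence: str, ratio: float, seed: int = 0) -> str:
    """Permute a randomly chosen `ratio` of positions; keep the others in place.

    Hypothetical helper, not taken from the repository.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")
    rng = random.Random(seed)
    residues = list(sequence)
    n_shuffled = round(ratio * len(residues))
    # Pick which positions take part in the shuffle.
    positions = rng.sample(range(len(residues)), n_shuffled)
    # Permute the selected positions among themselves.
    permuted = positions[:]
    rng.shuffle(permuted)
    for src, dst in zip(positions, permuted):
        residues[dst] = sequence[src]
    return "".join(residues)


# Ten example ratios from 0.1 to 1.0, applied to one peptide.
variants = [shuffle_fraction("ACDEFGHIKLMNPQRSTVWY", r / 10) for r in range(1, 11)]
```

Because only positions are permuted, the amino acid composition of every variant is identical to that of the original sequence.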

Length splits

Scripts are in the src/analyses_length_split directory. Run them exactly like the classification experiments. Note that these take ~5x longer than plain classification (5 length buckets).
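The idea of length buckets can be sketched as follows. This example sorts sequences by length and cuts them into five groups of roughly equal size. Both the helper name length_buckets and the equal-frequency bucketing rule are assumptions for illustration; the actual scripts may use different bucket boundaries.

```python
def length_buckets(sequences: list[str], n_buckets: int = 5) -> list[list[str]]:
    """Split sequences into `n_buckets` roughly equal-sized groups by length.

    Hypothetical helper: bucket 0 holds the shortest sequences, the last
    bucket the longest. Earlier buckets absorb any remainder.
    """
    ordered = sorted(sequences, key=len)
    base, extra = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        size = base + (1 if i < extra else 0)
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


# Toy peptides with lengths 1..11, split into 5 buckets.
example = length_buckets(["A" * n for n in range(1, 12)])
```

Each bucket can then serve as a held-out set in turn, which is consistent with the roughly 5x runtime noted above.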

Sequence motifs long-range task

Run the script src/other_experiments/long_range_seq_motifs.py. Note that this requires a GPU to run in a reasonable time, due to ESM2 finetuning.

Other classifiers

For Random Forest and Extremely Randomized Trees on LRGB, run the src/other_experiments/lrgb_classifiers.py script.

Additional scripts & plots

The util_scripts directory contains a few utility scripts:

create_bert_benchmark_datasets.py creates train-test dataset splits for BERT AMPs benchmark; requires CD-HIT to be installed
parse_autopeptideml_results.py parses original results from AutoPeptideML, to create more readable CSV output
parse_peptidereactor_results.py parses original results from PeptideReactor, which are a very complex JSON, and adds fingerprint results from training our models

src/plots_notebooks_stats contains Jupyter notebooks and scripts to create plots and calculate statistics.
