Prosit MultiFrag is a recurrent neural network trained jointly on five types of fragmentation spectra: HCD, ECD, EID, UVPD, and ETciD. The model outputs 815 ion types, covering the ion series a, a+proton, b, c, c+proton, x, x+proton, y, z, and z+proton, up to fragment length 29 and product charge +3.
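The output dimension can be pictured as a grid over series, fragment length, and charge. The sketch below enumerates that grid; the short labels (e.g. `a+p` for a+proton) and the `{series}{length}^{charge}` naming are this sketch's own conventions, not the repository's. Note the full grid has 10 × 29 × 3 = 870 combinations; the model's 815 outputs are a subset, with the exact exclusions defined in the repository code.

```python
from itertools import product

# The ten ion series named in the model description; "a+p" is shorthand
# for a+proton (this labeling is illustrative, not the repo's).
SERIES = ["a", "a+p", "b", "c", "c+p", "x", "x+p", "y", "z", "z+p"]

# Full grid: 10 series x 29 fragment lengths x 3 product charges.
# The model's 815 outputs are a subset of these 870 combinations.
ion_types = [f"{s}{n}^{z}"
             for s, n, z in product(SERIES, range(1, 30), range(1, 4))]
print(len(ion_types))  # 870
```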
The original model, published on Koina, was trained on ~2.1 million unique PSMs obtained through MSFragger searches, divided roughly equally among the five fragmentation types. Each fragmentation type was run on digests prepared with five different enzymes: LysN, LysC, GluC, Trypsin, and Chymotrypsin. The instrument used was an Orbitrap Exploris (Thermo Fisher Scientific) equipped with an Omnitrap (Fasmatech). The raw files were provided by Dr. Shabaz Mohammed of the University of Oxford, Oxford, England. Project data can be found at https://zenodo.org/records/15755223
This repository provides all the code needed to reproduce the project, from processing raw data into training-ready datasets to model training and evaluation.
- `data/` - Annotation, data processing, and training set creation scripts
  - `yaml/`
    - `annotate.yaml` - Settings for how to annotate raw data
    - `create_dataset.yaml` - Settings for how to create training data
  - `enumerate_tokens.py` - Utilities for tokenizing modified sequences and determining the token dictionary
  - `mass_scale.py` - Utilities for calculating fragment masses and annotation
  - `merged_search_into_training_data.py` - Turns annotation results (`all_psms.parquet`) into a PyTorch/Huggingface-ready training dataset
  - `test_annotation.py` - Script for annotating raw data with search results
- `torch/` - Model training and testing code
  - `models/`
    - `model_parts.py` - Layers for building the peptide encoder
    - `peptide_encoder.py` - Transformer model
    - `prosit.py` - RNN model
  - `yaml/`
    - `master.yaml` - Main settings for running training
    - `loader.yaml` - Settings for datasets and dataloading filters
    - `eval.yaml` - Settings for evaluation
    - `model.yaml` - Architectural settings for the model
  - `loader_hf.py` - Code for the Huggingface dataset/loader
  - `main.py` - Main script for running model training and testing
  - `losses.py` - Code for the training loss and evaluation functions
  - `utils.py` - Miscellaneous utilities
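As background for the fragment-mass calculations that `mass_scale.py` handles, here is a minimal sketch of b/y-ion m/z calculation from monoisotopic residue masses. The constants and formulas are the standard textbook ones, not taken from the repository, and only a handful of residues are included for brevity.

```python
# Monoisotopic residue masses (Da) for a few amino acids (standard values).
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "T": 101.04768, "I": 113.08406, "D": 115.02694, "E": 129.04259}
PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # mass of water (Da)

def b_ion_mz(seq, length, charge):
    # b ion: sum of the first `length` residue masses, plus `charge` protons.
    mass = sum(RESIDUE[aa] for aa in seq[:length])
    return (mass + charge * PROTON) / charge

def y_ion_mz(seq, length, charge):
    # y ion: sum of the last `length` residue masses, plus water and protons.
    mass = sum(RESIDUE[aa] for aa in seq[-length:]) + WATER
    return (mass + charge * PROTON) / charge

print(round(b_ion_mz("PEPTIDE", 1, 1), 5))
print(round(y_ion_mz("PEPTIDE", 1, 1), 5))
```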
```
git clone git@github.com:wilhelm-lab/Prosit_multifrag.git
```
Libraries necessary to run the code in this repository:
- `torch` - Model training and deployment
- `datasets` - Huggingface datasets and utilities
- `wandb` - Model training monitoring
- `pandas` - Used throughout the project
- `numpy` - Used throughout the project
- `tqdm` - Used throughout the project
- `oktoberfest` - Utilities for processing raw data
- `yaml` - Used throughout the project
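These could be captured in a `requirements.txt` like the one below. Note this file is a suggestion, not shipped with the repository, and no versions are pinned; the `yaml` module is distributed on PyPI as `pyyaml`.

```
torch
datasets
wandb
pandas
numpy
tqdm
oktoberfest
pyyaml
```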
- Enter the `data` directory

  ```
  cd data
  ```

- Set configuration settings in `yaml/annotate.yaml`
- Run annotation for all fragmentation methods
  - The script must be run once for every fragmentation method

  ```
  python test_annotation.py
  ```

- Outputs annotation results in a file named `all_psms.parquet` - one for each fragmentation method
- Enter the `data` directory

  ```
  cd data
  ```

- Set configuration settings in `yaml/create_dataset.yaml`
- Run the script to create training parquet files from the annotation results

  ```
  python merged_search_into_training_data.py
  ```

- Get the token dictionary by enumerating the tokens in the resulting training files

  ```
  python enumerate_tokens.py {training_set_data_directory}
  ```
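Conceptually, token enumeration splits each modified sequence into per-residue tokens and collects the distinct tokens into a dictionary. The sketch below assumes a bracketed-modification convention (e.g. `C[UNIMOD:4]`); the repository's actual scheme lives in `enumerate_tokens.py` and may differ.

```python
import re
from itertools import chain

def tokenize(seq):
    # Split a modified sequence into per-residue tokens; a bracketed
    # modification stays attached to the residue it follows. This
    # bracket convention is an assumption of this sketch.
    return re.findall(r"[A-Z](?:\[[^\]]+\])?", seq)

def build_token_dict(sequences):
    # Enumerate every distinct token across the corpus and assign ids.
    tokens = sorted(set(chain.from_iterable(tokenize(s) for s in sequences)))
    return {tok: i for i, tok in enumerate(tokens)}

print(tokenize("AC[UNIMOD:4]DE"))  # ['A', 'C[UNIMOD:4]', 'D', 'E']
```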
- Enter the `torch` directory

  ```
  cd torch
  ```

- Set all configuration files to the appropriate settings for the model, data, and training
- Run training

  ```
  python main.py
  ```
- Enter the `torch` directory

  ```
  cd torch
  ```

- Set configuration settings in `eval.yaml`
- Run the main script with any argument

  ```
  python main.py {any argument}
  ```

- Outputs a parquet results file
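A common way to score predicted against observed fragment intensities in Prosit-style models is the normalized spectral contrast angle. The function below is a generic illustration of that metric; the repository's own loss and evaluation functions are defined in `torch/losses.py` and may differ in detail.

```python
import numpy as np

def spectral_angle(pred, true):
    # Normalized spectral contrast angle in [0, 1]; 1 means the two
    # intensity patterns are identical up to scale.
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    pred = pred / np.linalg.norm(pred)
    true = true / np.linalg.norm(true)
    cos = np.clip(np.dot(pred, true), -1.0, 1.0)
    return 1.0 - 2.0 * np.arccos(cos) / np.pi

print(spectral_angle([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
```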
Publication (pre-print): MultiFrag pre-print
For questions, please contact joel.lapin@tum.de