This repository contains all the code and scripts necessary to reproduce the analysis presented in the manuscript: "Chemical Probes in Scientific Literature: Expanding and Validating Target-Disease Evidence." Preprint here
This project leverages Natural Language Processing (NLP) to systematically analyse open-access scientific literature for evidence of novel target-disease (T-D) associations mediated by chemical probes. The pipeline identifies chemical probe-target-disease (P-T-D) triplets, quantifies their prevalence, and performs a temporal analysis to determine cases where probe-based evidence predates other non-literature evidence streams in platforms like Open Targets. The goal is to highlight the value of chemical probe literature as a resource for early-stage target validation and hypothesis generation in drug discovery.
├── 1_pilot/                            # Code and data for the initial pilot study.
│   ├── data/                           # Data files specific to the pilot.
│   ├── figs/                           # Figures generated during the pilot.
│   ├── databases.py                    # Scripts for database interactions in the pilot.
│   └── note_ProbesDataset.ipynb        # Exploratory notebook for the pilot dataset.
│
├── 2_probesLit/                        # Code and data for the main systematic analysis pipeline.
│   ├── data/                           # Intermediate and final datasets.
│   ├── figs/                           # Final figures for the manuscript.
│   ├── tables/                         # Final tables for the manuscript.
│   ├── 0_download_OT_files.py          # Step 0: Downloads all required Open Targets files.
│   ├── 1_get_ner_hq_probes.py          # Step 1: Fetches NER data for High-Quality (HQ) probes.
│   ├── 2_get_triplets.py               # Step 2: Extracts P-T-D triples from the NER data.
│   ├── 3_filter_targets.py             # Step 3: Filters triples based on validated targets.
│   ├── 4_get_OT_evidence.py            # Step 4: Retrieves existing evidence from Open Targets.
│   ├── 5_get_OT_dated_evidence.py      # Step 5: Retrieves time-stamped evidence from Open Targets.
│   ├── 6_map_disease_therapeutic_area.py # Step 6: Retrieves the therapeutic area (parent ontology ID) for each disease term from Open Targets.
│   ├── 7_get_drug_max_phase.py         # Step 7: Obtains the maximum clinical phase (if any drug is available) for each specific T-D pair.
│   ├── 8_plotting_data.ipynb           # Jupyter Notebook for final data visualisation.
│   └── tools.py                        # Helper scripts for handling data.
│
├── 3_probesHQ/                         # Scripts and data to build the High-Quality (HQ) chemical probes dictionary.
│   ├── files/                          # Raw data files for the HQ probe dictionary.
│   ├── probes_hq_dataset.ipynb         # Notebook for creating and analysing the HQ probe dataset.
│   └── tools.py                        # Helper scripts for handling probe data.
│
├── .gitignore                          # Specifies files and directories to be ignored by Git.
├── LICENCE                             # Licence specifications.
├── requirements.txt                    # Required packages to create the Python environment.
└── README.md                           # This file.
We are committed to transparency and reproducibility. This section outlines the availability of the data for both the main systematic pipeline and the preliminary pilot study.
The main systematic pipeline is fully reproducible. All scripts and code required to run the analysis are available in this GitLab repository. The public input files are detailed in the Input Files section. Due to their size, the intermediate and final datasets generated by the systematic pipeline are not stored on GitLab. They are publicly available for download from our dedicated repository at: [Insert Zenodo/Figshare/Repository URL here].
The pilot study served as an initial proof-of-concept. To support our findings, all code, intermediate data, final results, figures, and plots from this preliminary analysis are publicly available directly in the 1_pilot/ directory of this repository. Please note that the initial, dictionary-based annotated dataset used as the starting point for the pilot is not included due to its legacy format. However, the most valuable output of this pilot work, the manually curated bioactivity data, has been formally integrated into the public domain and is now available as part of the ChEMBL 36 database release.
- Annotated Literature
This is the primary dataset from the EuropePMC NER pipeline, containing entity matches from scientific articles. The analysis focuses exclusively on successful "matches": data where entities have been grounded to a standard Open Targets identifier.
- Public Access: The required successful matches are available on the Open Targets FTP server
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/intermediate/literature_match/
- Internal Access (for development): The full dataset including "FailedMatches" (version 24.09) is available to Open Targets partners at
gs://open-targets-pre-data-releases/24.09/output/etl/parquet/literature/matches
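As a rough illustration of what the first filtering pass over the matches data does, the sketch below uses a tiny invented sample; the field names ("pmid", "label", "type") are assumptions for illustration, not the official Open Targets schema:

```python
# Hypothetical miniature of the NER matches data (field names are
# assumptions, not the official Open Targets schema).
matches = [
    {"pmid": "111", "label": "SGC-GAK-1", "type": "CD"},
    {"pmid": "111", "label": "GAK", "type": "GP"},
    {"pmid": "222", "label": "aspirin", "type": "CD"},
]

# Invented example probe name; the real pipeline uses the HQ dictionary.
hq_probes = {"SGC-GAK-1"}

# Keep only rows whose label matches a High-Quality probe name.
probe_hits = [row for row in matches if row["label"] in hq_probes]
print([row["pmid"] for row in probe_hits])  # → ['111']
```

In the real pipeline this filtering is done with PySpark over Parquet files rather than Python lists, but the logic is the same.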
- Open Targets Evidence
This dataset contains all existing target-disease association evidence curated by Open Targets.
- Public Access: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.09/output/association_by_datasource_indirect
- Dated Open Targets Evidence
This dataset provides time stamps for the evidence, which is crucial for the temporal analysis.
- Public Access: A public version is available at Zenodo: https://zenodo.org/records/15922783
- Disease Therapeutic Area
Therapeutic area (parent ontology ID) for each disease term, as assigned by Open Targets.
- Drug Maximum Clinical Phase
Maximum clinical phase of any drug, if one exists, for the T-D pairs.
This project requires a local copy of the ChEMBL database (version 35) to extract chemical probe synonyms and other relevant metadata.
- Navigate to the ChEMBL 35 release page on the EBI FTP server:
https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_35/
- Download the MySQL dump file: chembl_35_mysql.tar.gz.
- Decompress the .tar.gz file in a local directory of your choice.
- Configure the path in the notebooks: You must provide the local file path to the SQL dump in the following two Jupyter notebooks:
1_pilot/note_ProbesDataset.ipynb
3_probesHQ/probes_hq_dataset.ipynb
In each notebook, find the configuration cell near the top and update the chemblpath variable with the location of the file on your system.
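The configuration cell amounts to a single assignment; the path below is a placeholder for your own system, not a real location:

```python
# Configuration cell (update for your system): path to the decompressed
# ChEMBL 35 MySQL dump downloaded from the EBI FTP server.
chemblpath = "/path/to/your/chembl_35/chembl_35_mysql"
```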
All scripts and notebooks in this project are written in Python. The code was developed and tested specifically with Python version 3.11.7 to ensure reproducibility and compatibility with the required packages. Using a different version may lead to unexpected errors.
Some scripts use PySpark version 3.3.1 to load and work with large datasets.
This pipeline is designed to process large-scale literature and evidence datasets. While PySpark is utilised for its efficiency, the computational and memory requirements remain substantial. Running the full analysis on a standard local machine is likely to be very slow or may fail due to insufficient memory.
For a timely and successful execution, we strongly recommend setting up the environment and running the scripts on a high-performance computing (HPC) cluster.
To provide a benchmark, the analysis for the manuscript was performed using the following resources. Users should aim for a comparable environment:
- Memory (RAM): The PySpark steps are the most memory-intensive stages of the pipeline. A minimum of 250GB is recommended for the script in step 1 (1_get_ner_hq_probes.py).
- CPU Cores: The pipeline was executed using 24 cores for script 1 and 6 for all others. Performance will scale with the number of available cores, particularly for the PySpark jobs.
- Total Runtime: The full systematic pipeline took approximately 1.5 hours to complete on the hardware specified above.
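Purely as an illustration of how those resources map onto a PySpark session (the scripts set their own configuration; the values here simply mirror the benchmark above):

```python
from pyspark.sql import SparkSession

# Illustrative only: a local PySpark session sized to the benchmark above
# (24 cores and generous driver memory for the step-1 literature join).
spark = (
    SparkSession.builder
    .appName("chemical_probes_lit")
    .master("local[24]")                    # CPU cores used for script 1
    .config("spark.driver.memory", "250g")  # memory for the PySpark steps
    .getOrCreate()
)
```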
- Clone the repository in your preferred local path
git clone https://gitlab.ebi.ac.uk/chembl/research/chemical_probes_lit.git
cd chemical_probes_lit
ls
You should see the project contents.
- Create the Python environment
It is recommended to create the Python environment outside the Git project. Go to your preferred path and create the environment.
cd your_path
python -m venv cprobesenv
- Activate the environment
Mac or Linux:
source cprobesenv/bin/activate
Windows:
.\cprobesenv\Scripts\activate.bat
- Install the required packages using the requirements.txt file
pip install -r requirements.txt
The systematic pipeline consists of 7 sequential Python scripts and a final Jupyter Notebook for plotting. It is critical to run these scripts in your Python environment and in the specified order, as each step generates an intermediate file that serves as the input for the next. The pipeline has been split into separate sequential steps to facilitate intermediate analysis and to speed up independent updates or modifications.
- Get NER Data for HQ Probes
Script: 2_probesLit/1_get_ner_hq_probes.py
Description: Processes the raw Open Targets literature data to extract all sentences containing mentions of the High-Quality (HQ) chemical probes.
Input: Raw literature data (--matches) and the HQ chemical probes dataset (--probes).
Output: An intermediate file containing annotated data from articles mentioning at least one HQ probe in the label field (epmc_ner_results_hq_probes.tsv), and a final file containing all annotated sentences from the articles found in the intermediate file (1_epmc_ner_results_hq_probes_all_sent.tsv).
Command to execute:
python chemical_probes_lit/2_probesLit/1_get_ner_hq_probes.py --matches [path_to_matches_folder] --probes [path_to_HQ_chemical_probes_dictionary_dataset]
- Get P-T-D Triples
Script: 2_probesLit/2_get_triplets.py
Description: Takes the output from step 1 and identifies potential probe-target-disease (P-T-D) triples co-occurring within the same sentence.
Input: File from step 1 and the HQ chemical probes dataset file.
Output: An intermediate file containing raw P-T-D triples (2_ner_probes_triplets.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/2_get_triplets.py --input data/1_epmc_ner_results_hq_probes_all_sent.tsv --probes [path_to_HQ_chemical_probes_dictionary_dataset]
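Conceptually, the co-occurrence step groups annotations by sentence and emits a triple whenever a probe, a target, and a disease are mentioned together. A minimal stdlib sketch with invented sentence IDs and entity names:

```python
from collections import defaultdict
from itertools import product

# Hypothetical per-sentence annotations: (sentence_id, entity_type, entity).
annotations = [
    ("s1", "probe",   "SGC-GAK-1"),
    ("s1", "target",  "GAK"),
    ("s1", "disease", "prostate cancer"),
    ("s2", "probe",   "SGC-GAK-1"),   # no disease mention → no triple
    ("s2", "target",  "GAK"),
]

# Group the entities mentioned in each sentence by their type.
by_sentence = defaultdict(lambda: {"probe": [], "target": [], "disease": []})
for sent_id, etype, entity in annotations:
    by_sentence[sent_id][etype].append(entity)

# Emit every probe-target-disease combination co-occurring in one sentence.
triples = [
    (p, t, d)
    for ents in by_sentence.values()
    for p, t, d in product(ents["probe"], ents["target"], ents["disease"])
]
print(triples)  # → [('SGC-GAK-1', 'GAK', 'prostate cancer')]
```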
- Filter Targets
Script: 2_probesLit/3_filter_targets.py
Description: Filters the triples from step 2, retaining only those where the identified target is a known, validated target for the specific chemical probe.
Input: Raw intermediate file with triples from step 2.
Output: An intermediate file with high-confidence triples, filtered to include only the exact validated target for each chemical probe (3_ner_probes_triplets_ptpairs.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/3_filter_targets.py --input data/2_ner_probes_triplets.csv --probes [path_to_HQ_chemical_probes_dictionary_dataset]
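The filtering logic amounts to keeping only triples whose (probe, target) pair appears in the HQ dictionary; a minimal sketch with made-up probe and target names:

```python
# Hypothetical validated probe -> target pairs from the HQ dictionary.
validated_pairs = {("SGC-GAK-1", "GAK"), ("A-485", "EP300")}

# Raw triples from step 2 (probe, target, disease). The second row pairs a
# probe with a target it does not actually modulate, so it is dropped.
raw_triples = [
    ("SGC-GAK-1", "GAK",  "prostate cancer"),
    ("SGC-GAK-1", "EGFR", "lung carcinoma"),
]

filtered = [t for t in raw_triples if (t[0], t[1]) in validated_pairs]
print(filtered)  # → [('SGC-GAK-1', 'GAK', 'prostate cancer')]
```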
- Get Open Targets Evidence
Script: 2_probesLit/4_get_OT_evidence.py
Description: Retrieves all existing evidence for the T-D pairs from the main Open Targets evidence dataset.
Input: Dataset with filtered triples from step 3 and the OT evidence dataset.
Output: An intermediate file merging probe evidence with existing OT evidence (4_ner_probes_triplets_ptpairs_ev.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/4_get_OT_evidence.py --input data/3_ner_probes_triplets_ptpairs.csv --evidence [path_to_OT_evidence_folder]
- Get Dated Open Targets Evidence
Script: 2_probesLit/5_get_OT_dated_evidence.py
Description: Retrieves and merges the time-stamped (dated) evidence for the T-D pairs.
Input: Data from step 4 and the dated OT evidence dataset.
Output: The final, fully annotated dataset ready for analysis (5_ner_probes_triplets_ptpairs_evd.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/5_get_OT_dated_evidence.py --input data/4_ner_probes_triplets_ptpairs_ev.csv --datedevidence [path_to_OT_dated_evidence_folder]
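For intuition, the temporal analysis that this dated evidence enables boils down to checking whether the earliest probe-based literature mention of a T-D pair precedes the earliest dated Open Targets evidence for that same pair. A stdlib sketch with invented years:

```python
# Hypothetical earliest years per (target, disease) pair.
probe_first_year = {("GAK", "prostate cancer"): 2019}
ot_first_year    = {("GAK", "prostate cancer"): 2022}

# Pairs where probe literature predates any other dated OT evidence;
# pairs with no OT evidence at all also count as probe-first.
novel = [
    pair
    for pair, year in probe_first_year.items()
    if year < ot_first_year.get(pair, float("inf"))
]
print(novel)  # → [('GAK', 'prostate cancer')]
```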
- Get the Therapeutic Area of the Disease Term from Open Targets
Script: 2_probesLit/6_map_disease_therapeutic_area.py
Description: Retrieves and merges the therapeutic area for each disease term.
Input: Data from step 5 and the OT disease data.
Output: The annotated dataset with the preferred therapeutic area ID for the disease (6_ner_probes_triplets_ptpairs_evd_ta.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/6_map_disease_therapeutic_area.py --input data/5_ner_probes_triplets_ptpairs_evd.csv --otdisease [path_to_OT_disease_file]
- Get the Maximum Clinical Phase of Drugs (from Open Targets) for the T-D Pairs
Script: 2_probesLit/7_get_drug_max_phase.py
Description: Retrieves and merges the maximum clinical phase for each T-D pair (e.g. 0, 1, 2, 3).
Input: Data from step 6 and the known_drug OT data.
Output: The annotated dataset with the maximum clinical phase found for the T-D pairs (7_ner_probes_triplets_ptpairs_dr.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/7_get_drug_max_phase.py --input data/6_ner_probes_triplets_ptpairs_evd_ta.csv --otdrug [path_to_OT_known_drug_file]
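The max-phase annotation is essentially a group-by-maximum over the known-drugs data; a stdlib sketch with invented rows and phases:

```python
# Hypothetical known-drug rows: (target, disease, clinical_phase).
known_drugs = [
    ("GAK", "prostate cancer", 1),
    ("GAK", "prostate cancer", 3),
    ("EGFR", "lung carcinoma", 4),
]

# Reduce to the maximum phase observed for each (target, disease) pair.
max_phase = {}
for target, disease, phase in known_drugs:
    key = (target, disease)
    max_phase[key] = max(max_phase.get(key, 0), phase)

print(max_phase[("GAK", "prostate cancer")])  # → 3
```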
- Plotting and Visualisation
File: 2_probesLit/8_plotting_data.ipynb
Description: This final Jupyter Notebook takes the processed data from the previous steps and generates the figures and tables presented in the manuscript. To run it, open and execute the cells sequentially in a Jupyter environment.
This project is licensed under the MIT License. This means you are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software for any purpose, including commercial use.
For the full license text, please see the LICENSE file in this repository.