This repository contains all the code and scripts necessary to reproduce the analysis presented in the manuscript: "Chemical Probes in Scientific Literature: Expanding and Validating Target-Disease Evidence." Preprint here
This project leverages Natural Language Processing (NLP) to systematically analyse open-access scientific literature for evidence of novel target-disease (T-D) associations mediated by chemical probes. The pipeline identifies chemical probe-target-disease (P-T-D) triplets, quantifies their prevalence, and performs a temporal analysis to determine cases where probe-based evidence predates other non-literature evidence streams in platforms like Open Targets. The goal is to highlight the value of chemical probe literature as a resource for early-stage target validation and hypothesis generation in drug discovery.
├── 1_pilot/                            # Code and data for the initial pilot study.
│   ├── data/                           # Data files specific to the pilot.
│   ├── figs/                           # Figures generated during the pilot.
│   ├── databases.py                    # Scripts for database interactions in the pilot.
│   └── note_ProbesDataset.ipynb        # Exploratory notebook for the pilot dataset.
│
├── 2_probesLit/                        # Code and data for the main systematic analysis pipeline.
│   ├── data/                           # Intermediate and final datasets.
│   ├── figs/                           # Final figures for the manuscript.
│   ├── tables/                         # Final tables for the manuscript.
│   ├── 0_download_OT_files.py          # Step 0: Downloads all required Open Targets files.
│   ├── 1_get_ner_hq_probes.py          # Step 1: Fetches NER data for High-Quality (HQ) probes.
│   ├── 2_get_triplets.py               # Step 2: Extracts P-T-D triples from the NER data.
│   ├── 3_filter_targets.py             # Step 3: Filters triples based on validated targets.
│   ├── 4_get_OT_evidence.py            # Step 4: Retrieves existing evidence from Open Targets.
│   ├── 5_get_OT_dated_evidence.py      # Step 5: Retrieves time-stamped evidence from Open Targets.
│   ├── 6_map_disease_therapeutic_area.py # Step 6: Retrieves the therapeutic area (parent ontology ID) for each disease term from Open Targets.
│   ├── 7_get_drug_max_phase.py         # Step 7: Obtains the maximum clinical phase (if any drug is available) for each specific T-D pair.
│   ├── 8_plotting_data.ipynb           # Jupyter Notebook for final data visualisation.
│   └── tools.py                        # Helper scripts for handling data.
│
├── 3_probesHQ/                         # Scripts and data to build the High-Quality (HQ) chemical probes dictionary.
│   ├── files/                          # Raw data files for the HQ probe dictionary.
│   ├── probes_hq_dataset.ipynb         # Notebook for creating and analysing the HQ probe dataset.
│   └── tools.py                        # Helper scripts for handling probe data.
│
├── .gitignore                          # Specifies files and directories to be ignored by Git.
├── LICENCE                             # Licence specifications.
├── requirements.txt                    # Required packages to create the Python environment.
└── README.md                           # This file.
We are committed to transparency and reproducibility. This section outlines the availability of the data for both the main systematic pipeline and the preliminary pilot study.
The main systematic pipeline is fully reproducible. All scripts and code required to run the analysis are available in this GitLab repository. The public input files are detailed in the Input Files section. Due to their size, the intermediate and final datasets generated by the systematic pipeline are not stored on GitLab. They are publicly available for download from our dedicated repository at: [Insert Zenodo/Figshare/Repository URL here].
The pilot study served as an initial proof-of-concept. To support our findings, all code, intermediate data, final results, figures, and plots from this preliminary analysis are publicly available directly in the 1_pilot/ directory of this repository. Please note that the initial, dictionary-based annotated dataset used as the starting point for the pilot is not included due to its legacy format. However, the most valuable output of this pilot work, the manually curated bioactivity data, has been formally integrated into the public domain and is now available as part of the ChEMBL 36 database release.
- Annotated Literature
This is the primary dataset from the EuropePMC NER pipeline, containing entity matches from scientific articles. The analysis focuses exclusively on successful "matches": data where entities have been grounded to a standard Open Targets identifier.
- Public Access: The required successful matches are available on the Open Targets FTP server
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/intermediate/literature_match/
- Internal Access (for development): The full dataset including "FailedMatches" (version 24.09) is available to Open Targets partners at
gs://open-targets-pre-data-releases/24.09/output/etl/parquet/literature/matches
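As a rough illustration of what the first filtering pass over the matches data does, the sketch below uses a tiny invented sample; the field names ("pmid", "label", "type") are assumptions for illustration, not the official Open Targets schema:

```python
# Hypothetical miniature of the NER matches data (field names are
# assumptions, not the official Open Targets schema).
matches = [
    {"pmid": "111", "label": "SGC-GAK-1", "type": "CD"},
    {"pmid": "111", "label": "GAK", "type": "GP"},
    {"pmid": "222", "label": "aspirin", "type": "CD"},
]

# Invented example probe name; the real pipeline uses the HQ dictionary.
hq_probes = {"SGC-GAK-1"}

# Keep only rows whose label matches a High-Quality probe name.
probe_hits = [row for row in matches if row["label"] in hq_probes]
print([row["pmid"] for row in probe_hits])  # → ['111']
```

In the real pipeline this filtering is done with PySpark over Parquet files rather than Python lists, but the logic is the same.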
- Open Targets Evidence
This dataset contains all existing target-disease association evidence curated by Open Targets.
- Public Access: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.09/output/association_by_datasource_indirect
- Dated Open Targets Evidence
This dataset provides time stamps for the evidence, which is crucial for the temporal analysis.
- Public Access: A public version is available at Zenodo: https://zenodo.org/records/15922783
- Disease Therapeutic Area
Therapeutic area (parent ontology ID) for each disease term, as assigned by Open Targets.
- Drug Maximum Clinical Phase
Maximum clinical phase of any drug, if one exists, for the T-D pairs.
This project requires a local copy of the ChEMBL database (version 35) to extract chemical probe synonyms and other relevant metadata.
- Navigate to the ChEMBL 35 release page on the EBI FTP server:
https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_35/
- Download the MySQL dump file: chembl_35_mysql.tar.gz.
- Decompress the .tar.gz file in a local directory of your choice.
- Configure the path in the notebooks: You must provide the local file path to the SQL dump in the following two Jupyter notebooks:
1_pilot/note_ProbesDataset.ipynb
3_probesHQ/probes_hq_dataset.ipynb
In each notebook, find the configuration cell near the top and update the chemblpath variable with the location of the file on your system.
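The configuration cell amounts to a single assignment; the path below is a placeholder for your own system, not a real location:

```python
# Configuration cell (update for your system): path to the decompressed
# ChEMBL 35 MySQL dump downloaded from the EBI FTP server.
chemblpath = "/path/to/your/chembl_35/chembl_35_mysql"
```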
All scripts and notebooks in this project are written in Python. The code was developed and tested specifically with Python version 3.11.7 to ensure reproducibility and compatibility with the required packages. Using a different version may lead to unexpected errors.
Some scripts use PySpark version 3.3.1 to load and work with large datasets.
This pipeline is designed to process large-scale literature and evidence datasets. While PySpark is utilised for its efficiency, the computational and memory requirements remain substantial. Running the full analysis on a standard local machine is likely to be very slow or may fail due to insufficient memory.
For a timely and successful execution, we strongly recommend setting up the environment and running the scripts on a high-performance computing (HPC) cluster.
To provide a benchmark, the analysis for the manuscript was performed using the following resources. Users should aim for a comparable environment:
- Memory (RAM): The PySpark steps are the most memory-intensive stages of the pipeline. A minimum of 250GB is recommended for the script in step 1 (1_get_ner_hq_probes.py).
- CPU Cores: The pipeline was executed using 24 cores for script 1 and 6 for all others. Performance will scale with the number of available cores, particularly for the PySpark jobs.
- Total Runtime: The full systematic pipeline took approximately 1.5 hours to complete on the hardware specified above.
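Purely as an illustration of how those resources map onto a PySpark session (the scripts set their own configuration; the values here simply mirror the benchmark above):

```python
from pyspark.sql import SparkSession

# Illustrative only: a local PySpark session sized to the benchmark above
# (24 cores and generous driver memory for the step-1 literature join).
spark = (
    SparkSession.builder
    .appName("chemical_probes_lit")
    .master("local[24]")                    # CPU cores used for script 1
    .config("spark.driver.memory", "250g")  # memory for the PySpark steps
    .getOrCreate()
)
```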
- Clone the repository in your preferred local path
git clone https://gitlab.ebi.ac.uk/chembl/research/chemical_probes_lit.git
cd chemical_probes_lit
ls
You should see the project contents.
- Create the Python environment
It is recommended to create the Python environment outside the Git project. Go to your preferred path and create the environment.
cd your_path
python -m venv cprobesenv
- Activate the environment
Mac or Linux:
source cprobesenv/bin/activate
Windows:
.\cprobesenv\Scripts\activate.bat
- Install the required packages using the requirements.txt file
pip install -r requirements.txt
The systematic pipeline consists of 7 sequential Python scripts and a final Jupyter Notebook for plotting. It is critical to run these scripts in your Python environment and in the specified order, as each step generates an intermediate file that serves as the input for the next. The pipeline has been split into separate sequential steps to facilitate intermediate analysis and to speed up independent updates or modifications.
- Get NER Data for HQ Probes
Script: 2_probesLit/1_get_ner_hq_probes.py
Description: Processes the raw Open Targets literature data to extract all sentences containing mentions of the High-Quality (HQ) chemical probes.
Input: Raw literature data (--matches) and the HQ chemical probes dataset (--probes).
Output: An intermediate file containing annotated data from articles mentioning at least one HQ probe in the label field (epmc_ner_results_hq_probes.tsv), and a final file containing all annotated sentences from the articles found in the intermediate file (1_epmc_ner_results_hq_probes_all_sent.tsv).
Command to execute:
python chemical_probes_lit/2_probesLit/1_get_ner_hq_probes.py --matches [path_to_matches_folder] --probes [path_to_HQ_chemical_probes_dictionary_dataset]
- Get P-T-D Triples
Script: 2_probesLit/2_get_triplets.py
Description: Takes the output from step 1 and identifies potential probe-target-disease (P-T-D) triples co-occurring within the same sentence.
Input: File from step 1 and the HQ chemical probes dataset file.
Output: An intermediate file containing raw P-T-D triples (2_ner_probes_triplets.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/2_get_triplets.py --input data/1_epmc_ner_results_hq_probes_all_sent.tsv --probes [path_to_HQ_chemical_probes_dictionary_dataset]
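Conceptually, the co-occurrence step groups annotations by sentence and emits a triple whenever a probe, a target, and a disease are mentioned together. A minimal stdlib sketch with invented sentence IDs and entity names:

```python
from collections import defaultdict
from itertools import product

# Hypothetical per-sentence annotations: (sentence_id, entity_type, entity).
annotations = [
    ("s1", "probe",   "SGC-GAK-1"),
    ("s1", "target",  "GAK"),
    ("s1", "disease", "prostate cancer"),
    ("s2", "probe",   "SGC-GAK-1"),   # no disease mention → no triple
    ("s2", "target",  "GAK"),
]

# Group the entities mentioned in each sentence by their type.
by_sentence = defaultdict(lambda: {"probe": [], "target": [], "disease": []})
for sent_id, etype, entity in annotations:
    by_sentence[sent_id][etype].append(entity)

# Emit every probe-target-disease combination co-occurring in one sentence.
triples = [
    (p, t, d)
    for ents in by_sentence.values()
    for p, t, d in product(ents["probe"], ents["target"], ents["disease"])
]
print(triples)  # → [('SGC-GAK-1', 'GAK', 'prostate cancer')]
```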
- Filter Targets
Script: 2_probesLit/3_filter_targets.py
Description: Filters the triples from step 2, retaining only those where the identified target is a known, validated target for the specific chemical probe.
Input: Raw intermediate file with triples from step 2.
Output: An intermediate file with high-confidence triples, filtered to include only the exact validated target for each chemical probe (3_ner_probes_triplets_ptpairs.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/3_filter_targets.py --input data/2_ner_probes_triplets.csv --probes [path_to_HQ_chemical_probes_dictionary_dataset]
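The filtering logic amounts to keeping only triples whose (probe, target) pair appears in the HQ dictionary; a minimal sketch with made-up probe and target names:

```python
# Hypothetical validated probe -> target pairs from the HQ dictionary.
validated_pairs = {("SGC-GAK-1", "GAK"), ("A-485", "EP300")}

# Raw triples from step 2 (probe, target, disease). The second row pairs a
# probe with a target it does not actually modulate, so it is dropped.
raw_triples = [
    ("SGC-GAK-1", "GAK",  "prostate cancer"),
    ("SGC-GAK-1", "EGFR", "lung carcinoma"),
]

filtered = [t for t in raw_triples if (t[0], t[1]) in validated_pairs]
print(filtered)  # → [('SGC-GAK-1', 'GAK', 'prostate cancer')]
```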
- Get Open Targets Evidence
Script: 2_probesLit/4_get_OT_evidence.py
Description: Retrieves all existing evidence for the T-D pairs from the main Open Targets evidence dataset.
Input: Dataset with filtered triples from step 3 and the OT evidence dataset.
Output: An intermediate file merging probe evidence with existing OT evidence (4_ner_probes_triplets_ptpairs_ev.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/4_get_OT_evidence.py --input data/3_ner_probes_triplets_ptpairs.csv --evidence [path_to_OT_evidence_folder]
- Get Dated Open Targets Evidence
Script: 2_probesLit/5_get_OT_dated_evidence.py
Description: Retrieves and merges the time-stamped (dated) evidence for the T-D pairs.
Input: Data from step 4 and the dated OT evidence dataset.
Output: The final, fully annotated dataset ready for analysis (5_ner_probes_triplets_ptpairs_evd.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/5_get_OT_dated_evidence.py --input data/4_ner_probes_triplets_ptpairs_ev.csv --datedevidence [path_to_OT_dated_evidence_folder]
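For intuition, the temporal analysis that this dated evidence enables boils down to checking whether the earliest probe-based literature mention of a T-D pair precedes the earliest dated Open Targets evidence for that same pair. A stdlib sketch with invented years:

```python
# Hypothetical earliest years per (target, disease) pair.
probe_first_year = {("GAK", "prostate cancer"): 2019}
ot_first_year    = {("GAK", "prostate cancer"): 2022}

# Pairs where probe literature predates any other dated OT evidence;
# pairs with no OT evidence at all also count as probe-first.
novel = [
    pair
    for pair, year in probe_first_year.items()
    if year < ot_first_year.get(pair, float("inf"))
]
print(novel)  # → [('GAK', 'prostate cancer')]
```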
- Get the Therapeutic Area of the Disease Term from Open Targets
Script: 2_probesLit/6_map_disease_therapeutic_area.py
Description: Retrieves and merges the therapeutic area for each disease term.
Input: Data from step 5 and the OT disease data.
Output: The annotated dataset with the preferred therapeutic area ID for the disease (6_ner_probes_triplets_ptpairs_evd_ta.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/6_map_disease_therapeutic_area.py --input data/5_ner_probes_triplets_ptpairs_evd.csv --otdisease [path_to_OT_disease_file]
- Get the Maximum Clinical Phase of Drugs (from Open Targets) for the T-D Pairs
Script: 2_probesLit/7_get_drug_max_phase.py
Description: Retrieves and merges the maximum clinical phase for each T-D pair (e.g. 0, 1, 2, 3).
Input: Data from step 6 and the known_drug OT data.
Output: The annotated dataset with the maximum clinical phase found for the T-D pairs (7_ner_probes_triplets_ptpairs_dr.csv).
Command to execute:
python chemical_probes_lit/2_probesLit/7_get_drug_max_phase.py --input data/6_ner_probes_triplets_ptpairs_evd_ta.csv --otdrug [path_to_OT_known_drug_file]
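The max-phase annotation is essentially a group-by-maximum over the known-drugs data; a stdlib sketch with invented rows and phases:

```python
# Hypothetical known-drug rows: (target, disease, clinical_phase).
known_drugs = [
    ("GAK", "prostate cancer", 1),
    ("GAK", "prostate cancer", 3),
    ("EGFR", "lung carcinoma", 4),
]

# Reduce to the maximum phase observed for each (target, disease) pair.
max_phase = {}
for target, disease, phase in known_drugs:
    key = (target, disease)
    max_phase[key] = max(max_phase.get(key, 0), phase)

print(max_phase[("GAK", "prostate cancer")])  # → 3
```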
- Plotting and Visualisation
File: 2_probesLit/8_plotting_data.ipynb
Description: This final Jupyter Notebook takes the processed data from the previous steps and generates the figures and tables presented in the manuscript. To run it, open and execute the cells sequentially in a Jupyter environment.
This project is licensed under the MIT License. This means you are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software for any purpose, including commercial use.
For the full license text, please see the LICENSE file in this repository.