
Chemical Probes in Scientific Literature: A Pipeline for Target-Disease Evidence

This repository contains all the code and scripts necessary to reproduce the analysis presented in the manuscript: "Chemical Probes in Scientific Literature: Expanding and Validating Target-Disease Evidence." Preprint here

Abstract

This project leverages Natural Language Processing (NLP) to systematically analyse open-access scientific literature for evidence of novel target-disease (T-D) associations mediated by chemical probes. The pipeline identifies chemical probe-target-disease (P-T-D) triplets, quantifies their prevalence, and performs a temporal analysis to determine cases where probe-based evidence predates other non-literature evidence streams in platforms like Open Targets. The goal is to highlight the value of chemical probe literature as a resource for early-stage target validation and hypothesis generation in drug discovery.

Project structure

├── 📂 1_pilot/                               # Code and data for the initial pilot study.
│   ├── 📂 data/                              # Data files specific to the pilot.
│   ├── 📂 figs/                              # Figures generated during the pilot.
│   ├── 📜 databases.py                       # Scripts for database interactions in the pilot.
│   └── 📓 note_ProbesDataset.ipynb           # Exploratory notebook for the pilot dataset.
│
├── 📂 2_probesLit/                           # Code and data for the main systematic analysis pipeline.
│   ├── 📂 data/                              # Intermediate and final datasets.
│   ├── 📂 figs/                              # Final figures for the manuscript.
│   ├── 📂 tables/                            # Final tables for the manuscript.
│   ├── 📜 0_download_OT_files.py             # Step 0: Downloads all required Open Targets files.
│   ├── 📜 1_get_ner_hq_probes.py             # Step 1: Fetches NER data for High-Quality (HQ) probes.
│   ├── 📜 2_get_triplets.py                  # Step 2: Extracts P-T-D triples from the NER data.
│   ├── 📜 3_filter_targets.py                # Step 3: Filters triples based on validated targets.
│   ├── 📜 4_get_OT_evidence.py               # Step 4: Retrieves existing evidence from Open Targets.
│   ├── 📜 5_get_OT_dated_evidence.py         # Step 5: Retrieves time-stamped evidence from Open Targets.
│   ├── 📜 6_map_disease_therapeutic_area.py  # Step 6: Maps each disease term to its therapeutic area (parent ontology ID) from Open Targets.
│   ├── 📜 7_get_drug_max_phase.py            # Step 7: Obtains the maximum clinical phase (if any drug is available) for each T-D pair.
│   ├── 📓 8_plotting_data.ipynb              # Jupyter Notebook for final data visualisation.
│   └── 📜 tools.py                           # Helper scripts for handling data.
│
├── 📂 3_probesHQ/                            # Scripts and data to build the High-Quality (HQ) chemical probes dictionary.
│   ├── 📂 files/                             # Raw data files for the HQ probe dictionary.
│   ├── 📓 probes_hq_dataset.ipynb            # Notebook for creating and analysing the HQ probe dataset.
│   └── 📜 tools.py                           # Helper scripts for handling probe data.
│
├── 📜 .gitignore                             # Specifies files and directories to be ignored by Git.
├── ⚖️ LICENCE                                # Licence specification.
├── 📜 requirements.txt                       # Required packages for the Python environment.
└── 📄 README.md                              # This file.

Data Availability

We are committed to transparency and reproducibility. This section outlines the availability of the data for both the main systematic pipeline and the preliminary pilot study.

Systematic Pipeline

The main systematic pipeline is fully reproducible. All scripts and code required to run the analysis are available in this GitLab repository. The public input files are detailed in the Input Files section. Due to their size, the intermediate and final datasets generated by the systematic pipeline are not stored on GitLab. They are publicly available for download from our dedicated repository at: [Insert Zenodo/Figshare/Repository URL here].

Pilot Study

The pilot study served as an initial proof-of-concept. To support our findings, all code, intermediate data, final results, figures, and plots from this preliminary analysis are publicly available directly in the 1_pilot/ directory of this repository. Please note that the initial, dictionary-based annotated dataset used as the starting point for the pilot is not included due to its legacy format. However, the most valuable output of this pilot work, the manually curated bioactivity data, has been formally integrated into the public domain and is now available as part of the ChEMBL 36 database release.

Systematic Pipeline Execution

Prerequisites

Input files

  1. Annotated Literature

This is the primary dataset from the EuropePMC NER pipeline, containing entity matches from scientific articles. The analysis focuses exclusively on successful "matches": data where entities have been grounded to a standard Open Targets identifier.

- Public Access: The required successful matches are available on the Open Targets FTP server:
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/intermediate/literature_match/
- Internal Access (for development): The full dataset including "FailedMatches" (version 24.09) is available to Open Targets partners at
gs://open-targets-pre-data-releases/24.09/output/etl/parquet/literature/matches

  2. Open Targets Evidence

This dataset contains all existing target-disease association evidence curated by Open Targets.

- Public Access: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.09/output/association_by_datasource_indirect

  3. Dated Open Targets Evidence

This dataset provides time stamps for the evidence, which are crucial for the temporal analysis.

- Public Access: A public version is available on Zenodo: https://zenodo.org/records/15922783

  4. Therapeutic Area

The therapeutic area (parent ontology ID) assigned by Open Targets to each disease term.

  5. Maximum Clinical Phase

The maximum clinical phase of any drug known for the T-D pairs.

ChEMBL database

This project requires a local copy of the ChEMBL database (version 35) to extract chemical probe synonyms and other relevant metadata.

  • Navigate to the ChEMBL 35 release page on the EBI FTP server:
https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_35/
  • Download the MySQL dump file: chembl_35_mysql.tar.gz.
  • Decompress the .tar.gz file in a local directory of your choice.
  • Configure the path in the notebooks: You must provide the local file path to the SQL dump in the following two Jupyter notebooks:
1_pilot/note_ProbesDataset.ipynb
3_probesHQ/probes_hq_dataset.ipynb

In each notebook, find the configuration cell near the top and set the chemblpath variable to the location of the file on your system.
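For illustration, that configuration cell might look like the following (the path shown is a hypothetical placeholder, not a path from this repository):

```python
# Hypothetical configuration cell: set `chemblpath` to wherever you
# decompressed the ChEMBL 35 MySQL dump on your system.
chemblpath = "/home/user/data/chembl_35_mysql"  # placeholder; edit for your machine
```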

Python 3.11.7

All scripts and notebooks in this project are written in Python. The code was developed and tested specifically with Python version 3.11.7 to ensure reproducibility and compatibility with the required packages. Using a different version may lead to unexpected errors.

PySpark 3.3.1

Some scripts use PySpark version 3.3.1 to load and work with large datasets.
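As an illustrative configuration sketch only (the path and memory setting are placeholders, not the pipeline's actual values), a Spark session for reading the parquet-format literature matches might be set up like this:

```python
# Illustrative sketch: create a local Spark session and read the
# literature "matches" parquet files. Path and memory values are
# placeholders; tune them for your environment.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("chemical_probes_lit")
    .config("spark.driver.memory", "250g")  # see Computational Resources below
    .getOrCreate()
)

matches = spark.read.parquet("path/to/literature_match/")  # placeholder path
matches.printSchema()
```

Because this is an environment-dependent configuration fragment, treat it as a starting point rather than a runnable recipe.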

Computational Resources

This pipeline is designed to process large-scale literature and evidence datasets. While PySpark is utilised for its efficiency, the computational and memory requirements remain substantial. Running the full analysis on a standard local machine is likely to be very slow or may fail due to insufficient memory.

For a timely and successful execution, we strongly recommend setting up the environment and running the scripts on a high-performance computing (HPC) cluster.

To provide a benchmark, the analysis for the manuscript was performed using the following resources. Users should aim for a comparable environment:

  • Memory (RAM): The PySpark steps are the most memory-intensive stages of the pipeline. A minimum of 250GB is recommended for the script in step 1 (1_get_ner_hq_probes.py).
  • CPU Cores: The pipeline was executed using 24 cores for script 1 and 6 for all others. Performance will scale with the number of available cores, particularly for the PySpark jobs.
  • Total Runtime: The full systematic pipeline took approximately 1.5 hours to complete on the hardware specified above.

Installation

  1. Clone the repository in your preferred local path:

git clone https://gitlab.ebi.ac.uk/chembl/research/chemical_probes_lit.git
cd chemical_probes_lit
ls

You should be able to see the project content.

  2. Create the Python environment

It is recommended to create the Python environment outside the git project. Go to your preferred path and create the environment:

cd your_path
python -m venv cprobesenv

  3. Activate the environment

Mac or Linux:

source cprobesenv/bin/activate

Windows:

.\cprobesenv\Scripts\activate.bat

  4. Install the required packages using the requirements.txt file:

pip install -r requirements.txt

Execution

The systematic pipeline consists of 7 sequential Python scripts and a final Jupyter Notebook for plotting. It is critical to run these scripts in your Python environment and in the specified order, as each step generates an intermediate file that serves as the input for the next. The pipeline is split into separate steps to make intermediate analysis easier and to speed up independent updates or modifications.

Pipeline steps

  1. Get NER Data for HQ Probes

    Script: 2_probesLit/1_get_ner_hq_probes.py.
    Description: Processes the raw Open Targets literature data to extract all sentences containing mentions of the High-Quality (HQ) chemical probes.
    Input: Raw literature data (--matches) and the High-Quality (HQ) chemical probes dataset (--probes).
    Output: An intermediate file containing annotated data from articles mentioning at least one HQ probe in the label field (epmc_ner_results_hq_probes.tsv), and a final file containing all annotated sentences from the articles identified in the intermediate file (1_epmc_ner_results_hq_probes_all_sent.tsv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/1_get_ner_hq_probes.py --matches [path_to_matches_folder] --probes [path_to_HQ_chemical_probes_dictionary_dataset]
    
  2. Get P-T-D Triples

    Script: 2_probesLit/2_get_triplets.py
    Description: Takes the output from step 1 and identifies potential probe-target-disease (P-T-D) triples co-occurring within the same sentence.
    Input: File from step 1 and the HQ chemical probes dataset file.
    Output: An intermediate file containing raw P-T-D triples (2_ner_probes_triplets.csv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/2_get_triplets.py --input data/1_epmc_ner_results_hq_probes_all_sent.tsv --probes [path_to_HQ_chemical_probes_dictionary_dataset]
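The co-occurrence idea behind this step can be sketched in plain Python. The toy sentences and field names below are illustrative only, not the script's actual schema:

```python
# Toy sketch of sentence-level co-occurrence: a P-T-D triple is recorded
# whenever a probe, a target and a disease are all annotated in the same
# sentence. Field names and example entities are illustrative.
from itertools import product

sentences = [
    {"probes": ["SGC-GAK-1"], "targets": ["GAK"], "diseases": ["hepatitis C"]},
    {"probes": ["SGC-GAK-1"], "targets": [], "diseases": ["hepatitis C"]},  # no target: yields nothing
]

triples = []
for s in sentences:
    # only sentences mentioning all three entity types contribute triples
    triples.extend(product(s["probes"], s["targets"], s["diseases"]))

print(triples)  # [('SGC-GAK-1', 'GAK', 'hepatitis C')]
```

The actual script operates on the sentence-level NER output produced in step 1.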
    
  3. Filter Targets

    Script: 2_probesLit/3_filter_targets.py
    Description: Filters the triples from step 2, retaining only those where the identified target is a known, validated target for the specific chemical probe.
    Input: Raw intermediate file with triples from step 2.
    Output: An intermediate file of high-confidence triples, filtered so that each retained target is a validated target of the corresponding chemical probe (3_ner_probes_triplets_ptpairs.csv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/3_filter_targets.py --input data/2_ner_probes_triplets.csv --probes [path_to_HQ_chemical_probes_dictionary_dataset]
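Conceptually, the filter is a membership test against each probe's set of validated targets. A toy sketch (the probe-target dictionary here is illustrative):

```python
# Toy sketch: keep only triples whose target is a validated target
# of the probe, as recorded in the HQ probe dictionary (illustrative data).
validated = {"SGC-GAK-1": {"GAK"}, "A-485": {"EP300", "CREBBP"}}

triples = [
    ("SGC-GAK-1", "GAK", "hepatitis C"),   # kept: GAK is validated for this probe
    ("SGC-GAK-1", "EGFR", "hepatitis C"),  # dropped: incidental co-mention
]

filtered = [t for t in triples if t[1] in validated.get(t[0], set())]
print(filtered)  # [('SGC-GAK-1', 'GAK', 'hepatitis C')]
```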
    
  4. Get Open Targets Evidence

    Script: 2_probesLit/4_get_OT_evidence.py
    Description: This script retrieves all existing evidence for the T-D pairs from the main Open Targets evidence dataset.
    Input: Dataset with filtered triples from step 3 and the OT evidence dataset.
    Output: An intermediate file merging probe evidence with existing OT evidence (4_ner_probes_triplets_ptpairs_ev.csv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/4_get_OT_evidence.py --input data/3_ner_probes_triplets_ptpairs.csv --evidence [path_to_OT_evidence_folder]
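At its core this step is a left join of the probe-derived T-D pairs onto the Open Targets evidence, recording which datasources (if any) already support each pair. A toy sketch with placeholder identifiers:

```python
# Toy sketch: annotate each T-D pair with any existing OT evidence sources.
# All identifiers below are hypothetical placeholders.
ot_evidence = {
    ("ENSG_target_1", "EFO_disease_1"): ["chembl", "europepmc"],
}

pairs = [("ENSG_target_1", "EFO_disease_1"),
         ("ENSG_target_1", "EFO_disease_2")]

annotated = [(t, d, ot_evidence.get((t, d), [])) for t, d in pairs]
# pairs with no existing sources ([]) are candidate novel associations
```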
    
  5. Get Dated Open Targets Evidence

    Script: 2_probesLit/5_get_OT_dated_evidence.py
    Description: This script retrieves and merges the time-stamped (dated) evidence for the T-D pairs.
    Input: Data from step 4 and the dated OT evidence dataset.
    Output: The final, fully annotated dataset ready for analysis (5_ner_probes_triplets_ptpairs_evd.csv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/5_get_OT_dated_evidence.py --input data/4_ner_probes_triplets_ptpairs_ev.csv --datedevidence [path_to_OT_dated_evidence_folder]
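The temporal analysis then reduces to comparing the earliest probe-literature date against the earliest date in each other evidence stream. A toy sketch with hypothetical dates:

```python
# Toy sketch: a T-D pair where the probe literature predates the other
# evidence streams. Dates and stream names are hypothetical.
from datetime import date

probe_lit_date = date(2016, 5, 1)  # earliest probe-mentioning article
other_evidence_dates = {
    "chembl": date(2019, 3, 1),
    "ot_genetics": date(2020, 7, 1),
}

predated = {src: d for src, d in other_evidence_dates.items() if probe_lit_date < d}
print(sorted(predated))  # ['chembl', 'ot_genetics']
```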
    
  6. Get Therapeutic area of disease term from Open Targets

    Script: 2_probesLit/6_map_disease_therapeutic_area.py
    Description: This script retrieves and merges the therapeutic area for the disease term.
    Input: Data from step 5 and the OT disease data.
    Output: The annotated data set with the preferred therapeutic area id for the disease (6_ner_probes_triplets_ptpairs_evd_ta.csv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/6_map_disease_therapeutic_area.py --input data/5_ner_probes_triplets_ptpairs_evd.csv --otdisease [path_to_OT_disease_file]
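Conceptually this step is a lookup from disease term to therapeutic area, built from the OT disease dataset. A toy sketch with placeholder ontology IDs:

```python
# Toy sketch: annotate each triple's disease with its therapeutic area.
# All IDs here are hypothetical placeholders.
disease_to_ta = {"EFO_disease_1": "EFO_TA_oncology"}

triples = [("probe_A", "target_A", "EFO_disease_1"),
           ("probe_A", "target_A", "EFO_disease_2")]

with_ta = [(p, t, d, disease_to_ta.get(d, "unmapped")) for p, t, d in triples]
print(with_ta[0][3], with_ta[1][3])  # EFO_TA_oncology unmapped
```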
    
  7. Get the max clinical phase of drugs (from Open Targets) for the T-D pairs

    Script: 2_probesLit/7_get_drug_max_phase.py
    Description: This script retrieves and merges the maximum clinical phase for the T-D pair (e.g. 0, 1, 2, 3).
    Input: Data from step 6 and the known_drug OT data.
    Output: The annotated data set with the max clinical phase found for the T-D pairs (7_ner_probes_triplets_ptpairs_dr.csv).

    Command to execute:

    python chemical_probes_lit/2_probesLit/7_get_drug_max_phase.py --input data/6_ner_probes_triplets_ptpairs_evd_ta.csv --otdrug [path_to_OT_known_drug_file]
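The aggregation in this step amounts to a group-by over T-D pairs taking the maximum phase across all known drugs. A toy sketch with placeholder rows mimicking the known_drug data:

```python
# Toy sketch: for each T-D pair, keep the maximum clinical phase across
# all known drugs. Rows below are hypothetical placeholders.
from collections import defaultdict

known_drugs = [
    ("ENSG_X", "EFO_Y", "drug_1", 2),
    ("ENSG_X", "EFO_Y", "drug_2", 4),  # an approved drug for the same pair
]

max_phase = defaultdict(int)
for target, disease, _drug, phase in known_drugs:
    key = (target, disease)
    max_phase[key] = max(max_phase[key], phase)

print(dict(max_phase))  # {('ENSG_X', 'EFO_Y'): 4}
```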
    
  8. Plotting and Visualisation

    File: 2_probesLit/8_plotting_data.ipynb
    Description: This final Jupyter Notebook takes the processed data from the previous steps to generate the figures and tables presented in the manuscript. To run it, open and execute the cells sequentially in a Jupyter environment.

Licence

This project is licensed under the MIT License. This means you are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software for any purpose, including commercial use.

For the full license text, please see the LICENSE file in this repository.

Citation
