Skip to content

cns-iu/hra-ftu-rag-supporting-information

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AI-Supported Extraction of Functional Tissue Unit Properties for Human Reference Atlas Construction


The Human Reference Atlas (HRA) effort brings together experts from more than 25 international consortia to capture the multiscale organization of the human body—from large anatomical organ systems (macro) to the single-cell level (micro). Functional tissue units (FTU, meso) in 10 organs have been detailed and 2D illustrations have been created by experts. Comprehensive FTU property characterization is essential for the HRA, but manual review of the vast number of scholarly publications is impractical. Here, we introduce Large-Model Retrieval-Augmented Generation for HRA FTUs (HRAftu-LM-RAG), an AI-driven framework for scalable and automated extraction of FTU-relevant properties from scholarly publications. This validated framework integrates Large Language Models for textual reasoning, Large Vision Models for visual interpretation, and Retrieval Augmented Generation for knowledge grounding, offering a balanced trade-off between accuracy and processing efficiency. We retrieved 244,640 PubMed Central publications containing 1,389,168 figures for 22 FTUs and identified 617,237 figures with microscopy and schematic images. From these images and associated text, we automatically extracted 331,189 scale bars and 1,719,138 biological entity mentions, along with donor metadata such as sex and age.

Introduction

This repository provides the supporting code and data for "AI-Supported Extraction of Functional Tissue Unit Properties for Human Reference Atlas Construction" paper, detailing the robust and scalable computation of scholarly evidence for the size, structure, and demographic differences of FTUs, facilitating the design and approval of future FTU illustrations during HRA construction.

Workflow Diagram

Example Image

Workflow Summary

Workflow Input Algorithm/Script Output
WF1: Image-type categorization - ftu_pub_pmc table: pmcid, graphic, caption, label, file_path
- image_refs table: ref_text
2-3-itype-run.py - Raw outputs from vision models: llama, llava, phi3, phi35, pixtral
- Final image-type labels stored in vision_llm table: micro, statis, schema, 3d, chem, math
WF2: In-image text-term extraction - image_node_lvm_total table: pmcid, graphic, file_path 3-2-lvm-entity-run.py - Extracted in-image text terms stored in image_node_lvm_total table under nodes field
WF3: Scale-bar extraction - ftu_pub_pmc table: pmcid, graphic, caption
- image_refs table: ref_text
- Filter: vision_llm.micro = "Yes"
4-2-sb-run.py - Scale bar information stored in scale_bar_all_info table:
Descriptor Type, Value, Units, Notes, Panel
WF4: Donor-metadata extraction - ftu_pub_pmc table: pmcid, graphic, caption
- publication_summary table: abstract
- Filter: vision_llm.micro = "Yes" or vision_llm.schema = "Yes"
- image_node_lvm_total table: nodes
5-1-donor-run.py - Donor metadata stored in donor_meta_all_info table:
species, sex, age, BMI, height, weight
WF5: AS+CT+B extraction - ftu_pub_pmc table: pmcid, graphic, caption
- image_node_lvm_total table: nodes
- Filter: vision_llm.micro = "Yes" or vision_llm.schema = "Yes"
6-1-bio-onto-run.py - Biological entity terms stored in bio_onto_all_info table:
entity (AS, CT, B)

Repository Structure

├── data      # Input and output datasets, plus test data for validation
├── docs      # Detailed documentation (architecture, deployment, usage)
├── src       # Source code for data fetching, LLM/Vision pipelines, similarity scripts
├── vis       # Generated SVG figures used in the paper
  • data/: Store raw and processed input data, output results, and any test datasets required to reproduce experiments.

  • docs/: Markdown guides covering architecture (architecture.md), installation (installation.md), LLM deployment (llm_deployment.md), vision-language model deployment (lvm_deployment.md), and usage (usage.md).

  • src/: Organized into submodules:

    • data-fetching/: Scripts for retrieving and preprocessing data from BioPortal, OLS, and PMC, including fetching FTU descriptions, downloading and extracting PMC articles, and extracting image paths and metadata.
    • lm-rag/: Integrated Large Language Models (LLMs) and Large Vision Models (LVMs) pipelines covering Retrieval-Augmented Generation (RAG), evaluations and batch running of FTU-related tasks, including image-type classification, image-entity extraction, scale bar extraction, donor metadata extraction and biology entities extraction.
    • process-donor/ & process-scale-bar/: Post-processing scripts for cleaning and standardizing donor metadata metrics and extracted scale bar values.
  • vis/: Contains SVG images from analysis results, presented in the paper.

Installation

For full installation instructions, environment setup, and dependency management, see docs/installation.md.

Quick Start:

  1. Clone the repository:

    git clone https://github.com/cns-iu/hra-ftu-rag-supporting-information.git
    cd HRAftu-LM-RAG
  2. Follow the steps in docs/installation.md to install system dependencies (Docker, Python, CUDA), set up containers, and verify services.

Usage

Usage examples, command-line options, and configuration details are provided in docs/usage.md.

Common Workflows:

  • Data Import: Load PDF, TXT, DOCX files into FastGPT collections.
  • FTU Extraction: Run LLM-RAG pipelines to extract scale bars and biological entities from images.
  • Similarity Evaluation: Execute similarity scripts to compare model outputs with ground truth.

Documentation

Access architectural diagrams, deployment guides, and model-specific instructions in the docs/ folder:

  • Architecture: docs/architecture.md Provides an overview of the system architecture, including module interactions, data flow diagrams, and component responsibilities.
  • LLM Deployment: docs/llm_deployment.md Step-by-step instructions for environment setup, dependency installation, Docker and CUDA configuration, and verification tests.
  • Vision-Language Model Deployment: docs/lvm_deployment.md Demonstrates common workflows with example commands, configuration parameters, and expected outputs for core functionalities such as data import, FTU extraction, and similarity evaluation.
  • Installation Guide: docs/installation.md Covers deployment of Large Language Models, including model selection criteria, API configuration, and performance optimization strategies.
  • Usage Guide: docs/usage.md Explains the vision-language model pipeline setup, detailing image preprocessing, model integration, and inference procedures.

Data

The data/ directory contains the following subdirectories and files:

  • bio-onto/

    • bio-onto-prompt.csv: Prompts for biological ontology extraction.
    • bio-onto-test-answer.csv: Expected answers for ontology tests.
    • selected_prompt.txt: The chosen prompt template.
  • donor-meta/

    • prompt-donor.csv: Prompts for donor metadata extraction.
    • donor-test-answer.csv: Expected donor metadata answers.
    • selected_prompt.txt: Chosen prompt template.
    • age/, age_yearold/, bmi/, sex/, species/: Each contains <metric>_1.csv (sample data) and round.csv (processing scripts for rounding values).
  • emb/

    • test_questions.csv: Embedding-based similarity test questions.
  • img-entity/

    • lvm-entity-prompt.csv: Prompts for LVM-based entity extraction.
    • lvm-entity-testdata.tar.gz: Compressed test images for entity extraction.
    • lvm-test-answer.csv: Expected outputs for image-entity tasks.
    • selected-prompt.txt: The chosen image-entity prompt template.
  • img-type/

    • prompt.txt: Prompt template for image type classification.
    • test_img_info.json.gz: JSON test file with image metadata.
    • img-type-test-answer.csv: Expected classification results.
  • input-data/

    • 0-0-ftu-pmc-total.tar.gz: Complete FTU–PMC dataset archive.
    • 0-1-oa-comm-ftu-pmcid-filepath.tar.gz: Subset with FTU–PMC ID to filepath mappings.
    • ftu-description-from-bioportal.csv: FTU descriptions imported from BioPortal.
    • organ-ftu-uberon.csv: Mapping of organs to UBERON FTU terms.
    • ftu-pmc-manual/: Manually curated PMC search results for 22 FTUs, e.g., pmc_result_<FTU name>.txt.
  • scale-bar/

    • scale-bar-prompts.csv: Prompts for scale bar detection and extraction.
    • scale-bar-sample.csv: Sample output file illustrating extracted scale bar values.
    • selected_prompt.txt: The chosen scale-bar prompt template.
    • units-expression/: CSV files for unit normalization (e.g., um.csv, cm.csv, m.csv, etc.).
  • vis-source-data/

    • Source datasets (CSV, JSON, spreadsheets) used to generate the SVG figures in vis/. Follow the processing pipelines in docs/usage.md to regenerate visualizations.

Note: Due to size constraints, raw data files are not stored in this repository. Please follow the data preparation steps in docs/installation.md to download and extract the required archives.

Visualization

The vis/ directory holds SVG images generated from analysis results, matching the figures in the paper. These include scale bar overlays, FTU schematics, and demographic plots.

License

This project is licensed under the MIT License. See LICENSE for details.

About

Supporting information page for "AI-Driven Functional Tissue Unit Characteristics Discovery for the Human Reference Atlas"

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors