sciknoworg/tabulus

📚 Tabulus: Scientific PDF Table Extraction Pipeline

[Figure: Pipeline]

🔍 Overview

Tabulus is a modular multi-stage pipeline for extracting structured table data from scientific PDF documents.

The system combines document analysis, OCR, bibliography extraction, reference matching, and DOI enrichment into a unified workflow that transforms scientific publications into machine-readable data suitable for further analysis, knowledge graph integration, and research evaluation.

The project was developed as part of a Master's thesis investigating scientific table extraction, OCR benchmarking, bibliography-aware processing, and structured scholarly knowledge extraction.


✨ Features

📄 Scientific Table Extraction

  • Automated table detection from scientific PDFs
  • Table cropping and preprocessing
  • OCR-based table reconstruction
  • Structured CSV generation
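
As an illustration of the last step, here is a minimal sketch of turning OCR output into CSV. It assumes the OCR model returns the table as a Markdown pipe table, which is an assumption for this example and not something this README specifies; `markdown_table_to_csv` is a hypothetical helper, not part of Tabulus.

```python
import csv
import io


def markdown_table_to_csv(markdown: str) -> str:
    """Convert a Markdown pipe table (assumed OCR output format) to CSV text."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    # csv.writer quotes cells that contain commas or quotes.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "| Model | F1 |\n|---|---|\n| A | 0.9 |"
    print(markdown_table_to_csv(sample))
```

Using the standard `csv` module rather than joining cells with commas keeps cells that themselves contain commas intact.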

🔗 Bibliography-Aware Processing

  • Automatic reference table detection
  • Bibliography extraction from full publications
  • Reference matching between tables and bibliography entries
  • DOI enrichment using Crossref
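
To make the DOI enrichment step concrete, the sketch below queries the public Crossref REST API (`/works` with the `query.bibliographic` parameter) for a free-text reference and takes the top-ranked hit. The function names and the `min_score` threshold are illustrative assumptions; how Tabulus itself calls Crossref is documented in the component READMEs.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(reference: str, rows: int = 1) -> str:
    """Build a Crossref bibliographic search URL for a raw reference string."""
    params = {"query.bibliographic": reference, "rows": rows}
    return f"{CROSSREF_WORKS}?{urllib.parse.urlencode(params)}"


def best_doi(response: dict, min_score: float = 0.0) -> Optional[str]:
    """Return the DOI of the top-ranked item, or None if nothing usable came back.

    `min_score` is a hypothetical cut-off on Crossref's relevance score.
    """
    items = response.get("message", {}).get("items", [])
    if not items:
        return None
    top = items[0]
    if top.get("score", 0.0) < min_score:
        return None
    return top.get("DOI")


def resolve_doi(reference: str) -> Optional[str]:
    """Query Crossref over the network and return the best-matching DOI."""
    with urllib.request.urlopen(build_query_url(reference), timeout=30) as resp:
        return best_doi(json.load(resp))
```

Separating URL construction and response parsing from the network call keeps the matching logic testable offline.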

📊 Research & Evaluation

  • OCR benchmarking framework
  • RMS-based table similarity evaluation
  • Precision, Recall, and F1-score analysis
  • Runtime benchmarking
  • Reproducible evaluation workflows
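
For reference, the Precision, Recall, and F1 metrics listed above can be computed from a set of predicted items and a set of ground-truth items as in the following sketch. Comparing exact-match sets (for example, DOIs) is an assumption made for illustration; the evaluation scripts in this repository may define matches differently.

```python
def precision_recall_f1(predicted: set, gold: set) -> tuple:
    """Compute (precision, recall, F1) for exact-match sets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical identifiers, used only to exercise the function.
    predicted = {"doi:a", "doi:b", "doi:c"}
    gold = {"doi:a", "doi:b", "doi:d"}
    print(precision_recall_f1(predicted, gold))
```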

🏗️ System Design

  • Modular microservice architecture
  • REST-based communication
  • Docker deployment
  • GPU-accelerated OCR support
  • Interactive web interface

⚙️ Pipeline Workflow

Scientific PDF
      ↓
MinerU Table Detection
      ↓
Table Cropping
      ↓
OCR Extraction
(PaddleOCR-VL, DeepSeek OCR 2, Chandra OCR, Kreuzberg OCR, NuExtract3)
      ↓
Reference Table Detection
      ↓
Bibliography Extraction
(GROBID / Kreuzberg + Regex)
      ↓
Reference Matching
      ↓
Crossref DOI Resolution
      ↓
Enriched CSV Generation
      ↓
Interactive Visualization UI
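
The reference matching stage in the workflow above can be pictured with a small sketch. It assumes numeric citation markers such as `[12]` in table cells and a bibliography whose entries start with the same bracketed numbers; both are assumptions for illustration, and all function names are hypothetical rather than taken from the Tabulus code.

```python
import re

# An entry starts at the beginning of a line with a bracketed number, e.g. "[3] ".
ENTRY_START = re.compile(r"^\s*\[(\d+)\]\s*", re.MULTILINE)
# A citation marker holds one or more numbers, e.g. "[2]" or "[2, 5]".
CITATION = re.compile(r"\[([\d,\s]+)\]")


def parse_bibliography(text: str) -> dict:
    """Split a numbered reference section into {number: entry text}."""
    parts = ENTRY_START.split(text)
    # With one capture group, split yields [prefix, num1, body1, num2, body2, ...].
    return {
        int(number): " ".join(body.split())
        for number, body in zip(parts[1::2], parts[2::2])
    }


def cited_numbers(cell: str) -> list:
    """Extract all citation numbers from a table cell, in order of appearance."""
    numbers = []
    for group in CITATION.findall(cell):
        numbers.extend(int(n) for n in re.findall(r"\d+", group))
    return numbers


def match_cell(cell: str, bibliography: dict) -> list:
    """Return the bibliography entries cited in a table cell."""
    return [bibliography[n] for n in cited_numbers(cell) if n in bibliography]
```

Each matched entry could then be passed on to the DOI resolution stage; author-year citation styles would need a different matching strategy.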

📁 Repository Structure

tabulus/
│
├── assets/
│   ├── img/
│   └── logo.png
│
├── dataset/
│   └── README.md
│
├── evaluation/
│   ├── deplot/
│   ├── new_results/
│   ├── plots/
│   │   ├── reference_extraction/
│   │   ├── scripts/
│   │   └── table_extraction/
│   ├── scripts/
│   └── README.md
│
├── src/
│   ├── ocr_models/
│   │   ├── components/
│   │   │   ├── deepseekOCR2/
│   │   │   ├── Kreuzberg/
│   │   │   ├── mineru_service/
│   │   │   ├── NuExtract3/
│   │   │   └── paddleOCR_VL/
│   │   ├── KISSKI/
│   │   │   ├── Chandra/
│   │   │   └── NuExtract3/
│   │   ├── runners/
│   │   └── README.md
│   │
│   └── Tabulus/
│       ├── backend/
│       ├── kreuzberg_service/
│       ├── mineru_service/
│       ├── paddleocr_service/
│       ├── ui_input/
│       ├── docker-compose.yml
│       └── README.md
│
├── .gitignore
├── LICENSE
├── README.md
└── requirements.txt

🧩 Main Components

| Component | Purpose |
|---|---|
| src/Tabulus | Complete production pipeline |
| src/ocr_models | OCR services, runners, and benchmarking components |
| evaluation | Evaluation scripts, metrics, and visualizations |
| dataset | Benchmark dataset documentation and ground-truth structure |
| assets | Images and visual resources used in the documentation |

Detailed documentation for each component is available in the corresponding README files.


🤖 OCR Technologies

The project evaluates and integrates multiple OCR and document understanding approaches:

  • MinerU
  • PaddleOCR-VL
  • DeepSeek OCR 2
  • Chandra OCR
  • Kreuzberg OCR
  • NuExtract3
  • GROBID

🗄️ Dataset

The project uses a manually curated evaluation dataset containing:

  • scientific publications,
  • annotated tables,
  • bibliography references,
  • OCR outputs,
  • DOI matching results,
  • evaluation metrics.

The complete dataset exceeds 700 MB and is distributed separately.

See:

dataset/README.md

for details.


📈 Evaluation

A comprehensive evaluation framework is included for analyzing:

  • table extraction quality,
  • OCR robustness,
  • bibliography extraction performance,
  • reference matching accuracy,
  • DOI enrichment quality,
  • runtime efficiency.

Generated benchmark plots and visualizations are available in:

evaluation/plots/

See:

evaluation/README.md

for detailed documentation.


🚀 Running the Pipeline

📋 Prerequisites

  • Docker Desktop
  • Docker Compose
  • Python 3.11+
  • NVIDIA GPU with CUDA support (recommended for running the OCR models)

▶️ Start All Services

Navigate to the production pipeline folder:

cd src/Tabulus

Start the services:

docker compose up --build

This command starts:

  • Frontend UI
  • Backend API
  • MinerU Service
  • PaddleOCR-VL Service
  • Kreuzberg OCR Service

After startup, the web interface can be accessed through the browser.
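
Because the OCR services can take a while to load their models, it may help to wait until the web interface answers before opening it. The helper below is a minimal sketch using only the standard library; `wait_until_ready` is a hypothetical name, and the URL to poll is whatever address `src/Tabulus/docker-compose.yml` publishes for the UI, which this README does not state.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns any HTTP response or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except urllib.error.HTTPError:
            # The server answered with an error status, so it is up.
            return True
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the service is not up yet.
            time.sleep(interval_s)
    return False
```

Call it with the UI address from your compose file; a `False` result means the service did not respond within `timeout_s` seconds.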


📖 Documentation

Additional documentation is available in:

src/Tabulus/README.md
src/ocr_models/README.md
evaluation/README.md
dataset/README.md

Each README contains detailed setup instructions, implementation details, API documentation, evaluation procedures, and usage examples.


🎓 Research Context

This repository accompanies a Master's thesis focused on:

  • scientific table extraction,
  • OCR benchmarking,
  • bibliography-aware table processing,
  • DOI enrichment,
  • structured scientific knowledge extraction,
  • reproducible research workflows.

📑 Citation

If you use this repository in your research, please cite the associated Master's thesis.

Citation information will be added after publication.


📜 License

This project is provided for research and educational purposes.
