A powerful document parsing library that extracts structured content from PDF, DOCX, and Markdown files. DocuMentor uses a layout-based approach with OCR capabilities to parse documents into a unified hierarchical structure.
The DocuMentor library is designed to simplify and automate the parsing and semantic analysis of various types of documents, including PDF, DOCX, and Markdown files.
The library performs the following tasks:
- Data extraction - Extract structured content from documents
- Document structure analysis - Hierarchical analysis of document structure
- Entity recognition - Identify and classify document elements (headers, tables, images, formulas)
- Format conversion - Unified output format across all supported document types
- Multi-format Support: Parse PDF, DOCX, and Markdown documents
- Layout-based Parsing: Uses OCR for intelligent layout detection (default: Dots.OCR)
- Custom OCR Components: Replace any OCR component with your own implementation
- Smart OCR Integration:
  - Different prompts for different PDF types (scanned vs text-extractable)
  - Text extraction via OCR for scanned PDFs (default: Dots OCR)
  - PyMuPDF extraction for text-extractable PDFs
- Structured Output: Unified hierarchical structure across all formats
- Table Extraction:
  - PDF: Parsing from OCR HTML (default: Dots OCR)
  - DOCX: Conversion from XML to Pandas DataFrames
  - Markdown: Automatic conversion to DataFrames
- Formula Extraction: Extracts formulas in LaTeX format (default: Dots OCR)
- Image Handling: Extracts and links images with captions
- Advanced Header Detection:
  - DOCX: OCR + XML + TOC validation + rules-based missing header detection
  - PDF: Layout-based with level analysis
  - Markdown: Regex-based parsing
- Caption Finding: Automatic caption detection for tables and images (DOCX)
- Table Structure Matching: Validates tables by comparing OCR and XML structures (DOCX)
- LangChain Integration: Compatible with LangChain Document format
- Modular Architecture: Easy to extend and customize
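The Markdown table extraction mentioned above rests on a simple idea: pipe-delimited rows become a header plus data rows before any DataFrame conversion. The following is a minimal, self-contained sketch of that idea — not DocuMentor's actual implementation:

```python
import re

def parse_md_table(text):
    """Parse a simple pipe-delimited Markdown table into a header row
    and a list of data rows (a sketch, not DocuMentor's implementation)."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        # Skip the separator row, e.g. | ---- | ----- |
        if re.fullmatch(r"\|[\s:|-]+\|", line):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        rows.append(cells)
    return rows[0], rows[1:]

table = """
| name   | pages |
| ------ | ----- |
| a.pdf  | 10    |
| b.docx | 3     |
"""
header, data = parse_md_table(table)
print(header)  # ['name', 'pages']
print(data)    # [['a.pdf', '10'], ['b.docx', '3']]
```

In the library, rows like these would then be fed into a Pandas DataFrame; this sketch stops at the row level to stay dependency-free.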
To install from source, you need the Poetry package manager:

```bash
poetry install
```

Alternatively, you can use pip:

```bash
pip install -r requirements.txt
```

Before running DocuMentor, you need to:
Create a .env file in the project root (use docs/env.example or examples/env.example as a template) with the following required variables:
Required OCR variables:
- `DOTS_OCR_BASE_URL` - Base URL for the Dots OCR service
- `DOTS_OCR_API_KEY` - API key for authentication
- `DOTS_OCR_MODEL_NAME` - Model name to use
Optional variables:
`DOTS_OCR_TEMPERATURE`, `DOTS_OCR_MAX_TOKENS`, `DOTS_OCR_TIMEOUT`, `OCR_MAX_IMAGE_SIZE`, `OCR_MIN_CONFIDENCE`
Important: Never commit .env to version control. Store API keys securely.
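To make the required/optional split concrete, here is a hedged sketch of reading these variables with `os.getenv`. The fallback values shown for the optional variables are illustrative assumptions, not DocuMentor's actual defaults (the library loads `.env` itself via `documentor/core/load_env.py`):

```python
import os

def load_ocr_settings():
    """Collect Dots OCR settings from the environment.
    Missing required variables fail fast; the defaults for optional
    variables below are assumptions for illustration only."""
    required = ["DOTS_OCR_BASE_URL", "DOTS_OCR_API_KEY", "DOTS_OCR_MODEL_NAME"]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing required OCR variables: {', '.join(missing)}")
    return {
        "base_url": os.environ["DOTS_OCR_BASE_URL"],
        "api_key": os.environ["DOTS_OCR_API_KEY"],
        "model_name": os.environ["DOTS_OCR_MODEL_NAME"],
        "temperature": float(os.getenv("DOTS_OCR_TEMPERATURE", "0.0")),  # assumed default
        "timeout": int(os.getenv("DOTS_OCR_TIMEOUT", "120")),            # assumed default
    }
```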
DocuMentor requires Dots OCR to be deployed as a vLLM server before use. You can deploy it using Docker Compose (see Docker Deployment section below) or manually using vLLM.
Using Docker Compose (Recommended):
```bash
# Use the provided compose.yml from examples/
cd examples
docker-compose -f compose.yml up -d

# Check logs
docker-compose -f compose.yml logs -f dots-ocr
```

See examples/compose.yml for the complete Docker Compose configuration.
Manual vLLM deployment:
```bash
# Example vLLM command
python -m vllm.entrypoints.openai.api_server \
    --model /path/to/your/model \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.184 \
    --max-model-len 65536 \
    --api-key ${DOTS_OCR_API_KEY} \
    --trust-remote-code
```

For detailed vLLM integration instructions, see examples/README_vllm.md.
Make sure the DOTS_OCR_BASE_URL in your .env file points to the running vLLM server.
```python
from langchain_core.documents import Document
from documentor import Pipeline

# Initialize pipeline
pipeline = Pipeline()

# Create a document
doc = Document(
    page_content="",
    metadata={"source": "path/to/document.pdf"}
)

# Parse the document
parsed_doc = pipeline.parse(doc)

# Access elements
for element in parsed_doc.elements:
    print(f"{element.type}: {element.content[:100]}")
```

DocuMentor supports flexible configuration options. You can use:
- Default internal config (used automatically if not specified)
- External config file (recommended for production)
- Config dictionary (useful for programmatic configuration)
Copy example config files from examples/config/ to your project:
```python
from documentor.processing.parsers.pdf import PdfParser
from langchain_core.documents import Document

# Use custom config file
parser = PdfParser(config_path="/path/to/your/config.yaml")
doc = Document(page_content="", metadata={"source": "document.pdf"})
parsed = parser.parse(doc)
```

```python
from documentor.processing.parsers.pdf import PdfParser
from langchain_core.documents import Document

# Use custom config dictionary
parser = PdfParser(config_dict={
    "pdf_parser": {
        "layout_detection": {
            "render_scale": 3.0  # Higher quality OCR
        },
        "processing": {
            "skip_title_page": True
        }
    }
})
doc = Document(page_content="", metadata={"source": "document.pdf"})
parsed = parser.parse(doc)
```

When both config_path and config_dict are provided, config_dict takes priority.
See examples/config/README.md for detailed configuration options.
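The priority rule ("config_dict wins over config_path") is commonly implemented as a recursive dictionary merge. The sketch below illustrates that behavior under the assumption of a deep merge — DocuMentor's actual merge semantics may differ, so treat this as a mental model only:

```python
def deep_merge(base, override):
    """Recursively merge two config dicts; values from `override` win.
    A sketch of the priority rule described above, not DocuMentor's code."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# e.g. settings loaded from config.yaml ...
file_config = {"pdf_parser": {"layout_detection": {"render_scale": 2.0},
                              "processing": {"skip_title_page": False}}}
# ... overridden by a config_dict passed at construction time
dict_config = {"pdf_parser": {"layout_detection": {"render_scale": 3.0}}}

effective = deep_merge(file_config, dict_config)
print(effective["pdf_parser"]["layout_detection"]["render_scale"])  # 3.0
print(effective["pdf_parser"]["processing"]["skip_title_page"])     # False
```

Note how keys absent from the override (here, `processing`) keep their file-config values.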
```python
from documentor.processing.parsers.pdf import PdfParser
from documentor.ocr.base import BaseLayoutDetector
from langchain_core.documents import Document

# Define custom layout detector
class MyLayoutDetector(BaseLayoutDetector):
    def detect_layout(self, image, origin_image=None):
        # Your custom OCR implementation
        return [...]

# Use custom component with custom config
parser = PdfParser(
    layout_detector=MyLayoutDetector(),
    config_path="/path/to/your/config.yaml"
)
doc = Document(page_content="", metadata={"source": "document.pdf"})
parsed = parser.parse(doc)
```

See CUSTOM_COMPONENTS_GUIDE.md for detailed instructions.
For a complete example of custom OCR implementation, see examples/custom_ocr_example.py.
- Pipeline: Main entry point for document processing
- Parsers: Format-specific parsers (PDF, DOCX, Markdown)
- OCR: OCR services integration (Dots.OCR)
- Domain Models: Unified data structures (Element, ParsedDocument)
- PDF Parser:
  - Layout-based parsing with specialized processors
  - Different prompts for scanned vs text-extractable PDFs
  - Table parsing from OCR HTML (default: Dots OCR)
  - Formula extraction in LaTeX format (default: Dots OCR)
  - Specialized processors: layout, text, tables, images, hierarchy
  - Custom components: replace any OCR component with your own implementation
- DOCX Parser:
  - Combined approach (OCR + XML + TOC parsing)
  - Rules-based missing header detection
  - Caption finding for tables and images
  - Table structure matching (OCR vs XML)
- Markdown Parser:
  - Regex-based parsing (no LLM or OCR)
  - Table extraction to DataFrames
  - Nested list support with proper hierarchy
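"Nested list support with proper hierarchy" means indentation levels become parent/child links between elements. Here is a simplified, self-contained sketch of that mapping (fixed two-space indent, dict elements instead of the real `Element` objects) — DocuMentor's Markdown parser handles this internally:

```python
def build_list_hierarchy(lines, indent=2):
    """Assign each list item a parent based on indentation depth.
    A simplified sketch; assumes a fixed indent width."""
    elements = []
    stack = []  # (depth, element_id) of open ancestors
    for i, line in enumerate(lines):
        depth = (len(line) - len(line.lstrip())) // indent
        # Pop ancestors at the same or deeper level
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent_id = stack[-1][1] if stack else None
        elements.append({"id": i,
                         "content": line.strip("- ").strip(),
                         "parent_id": parent_id})
        stack.append((depth, i))
    return elements

items = ["- fruits", "  - apple", "  - pear", "- tools"]
for el in build_list_hierarchy(items):
    print(el)
```

Running this links `apple` and `pear` to `fruits`, while `fruits` and `tools` remain top-level (`parent_id=None`) — the same shape the unified `parent_id` field expresses in `ParsedDocument`.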
- Input: `pdf`, `docx`, `md` (Markdown)
- Output: Structured `ParsedDocument` with hierarchical elements
Note: For DOC files, please convert them to DOCX format first using Microsoft Word, LibreOffice, or online converters before processing.
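A parser is typically selected from the file extension. The sketch below illustrates that dispatch, including a friendly rejection of legacy `.doc` files; the parser-name mapping is illustrative — the real `Pipeline` does its own dispatch internally:

```python
from pathlib import Path

# Illustrative extension-to-parser mapping (names assumed for the sketch)
PARSERS = {".pdf": "PdfParser", ".docx": "DocxParser", ".md": "MdParser"}

def pick_parser(source):
    """Choose a parser name by file extension (a sketch, not Pipeline's code)."""
    ext = Path(source).suffix.lower()
    if ext == ".doc":
        raise ValueError("Legacy .doc is not supported; convert to .docx first "
                         "(e.g. with Microsoft Word or LibreOffice).")
    try:
        return PARSERS[ext]
    except KeyError:
        raise ValueError(f"Unsupported format: {ext}")

print(pick_parser("report.PDF"))  # PdfParser
```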
Configuration files are provided as examples in examples/config/:
- `config.yaml`: Main configuration file (contains `pdf_parser` and `docx_parser` sections)
- `llm_config.yaml`: LLM service configuration
- `ocr_config.yaml`: OCR service configuration
Note: Internal config files in documentor/config/ are kept for backward compatibility but should not be modified. Copy example configs from examples/config/ to your project and pass the path when initializing parsers.
.env is auto-loaded by documentor/core/load_env.py. Use docs/env.example or examples/env.example as a template.
Required OCR variables:
`DOTS_OCR_BASE_URL`, `DOTS_OCR_API_KEY`, `DOTS_OCR_MODEL_NAME`
Optional:
`DOTS_OCR_TEMPERATURE`, `DOTS_OCR_MAX_TOKENS`, `DOTS_OCR_TIMEOUT`, `OCR_MAX_IMAGE_SIZE`, `OCR_MIN_CONFIDENCE`
Important: Never commit .env to version control. Store API keys securely.
```
documentor/
├── documentor/                # Main library package
│   ├── config/                # Internal default config files (do not modify)
│   ├── core/                  # Core utilities (environment loading)
│   ├── domain/                # Domain models (Element, ParsedDocument)
│   ├── exceptions.py          # Custom exceptions
│   ├── ocr/                   # OCR services integration
│   │   ├── base.py            # Base classes for OCR components
│   │   ├── dots_ocr/          # Dots.OCR implementation (default)
│   │   └── manager.py         # DotsOCRManager
│   ├── pipeline.py            # Main pipeline class
│   └── processing/            # Document processing modules
│       ├── loader/            # Document loading utilities
│       └── parsers/           # Format-specific parsers
│           ├── docx/          # DOCX parser modules
│           │   ├── docx_parser.py
│           │   ├── layout_detector.py
│           │   ├── header_processor.py
│           │   ├── header_finder.py
│           │   ├── caption_finder.py
│           │   ├── hierarchy_builder.py
│           │   ├── xml_parser.py
│           │   ├── toc_parser.py
│           │   └── converter_wrapper.py
│           ├── md/            # Markdown parser modules
│           │   ├── md_parser.py
│           │   ├── tokenizer.py
│           │   └── hierarchy.py
│           └── pdf/           # PDF parser modules
│               ├── pdf_parser.py
│               ├── layout_processor.py
│               ├── text_extractor.py
│               ├── table_parser.py
│               ├── image_processor.py
│               ├── hierarchy_builder.py
│               └── ocr/       # OCR components
├── docs/                      # Documentation
├── examples/                  # Example configurations and code
├── images/                    # Diagrams and images
└── experiments/               # Experimental code and metrics
```
If you're using Dots OCR as the default OCR service, you can deploy it using Docker Compose. We provide ready-to-use configuration files in the examples/ directory.
Quick Start:
```bash
# Navigate to examples directory
cd examples

# Start the service
docker-compose -f compose.yml up -d

# Check logs
docker-compose -f compose.yml logs -f dots-ocr

# Stop the service
docker-compose -f compose.yml down
```

Configuration Files:
- Docker Compose: examples/compose.yml - Complete Docker Compose configuration for Dots OCR
- Dockerfile: examples/Dockerfile.dotsocr - Custom Dockerfile for building Dots OCR image
- Entrypoint Script: examples/entrypoint.sh - Entrypoint script for Docker container
Important Security Notes:
- Store API keys in environment variables (`.env` file) or use Docker secrets
- Never commit API keys or sensitive paths to version control
- Adjust GPU settings (`CUDA_VISIBLE_DEVICES`) based on your hardware
- Modify memory limits and GPU utilization based on your system resources
Environment Variables:
Create a .env file in the project root using examples/env.example as a template:
```
DOTS_OCR_BASE_URL=http://localhost:8069/v1
DOTS_OCR_API_KEY=your-secure-api-key-here
DOTS_OCR_MODEL_NAME=/model
```

For detailed vLLM integration and deployment instructions, see examples/README_vllm.md.
All parsers return a unified ParsedDocument structure:
```python
ParsedDocument(
    source: str,
    format: DocumentFormat,
    elements: List[Element],
    metadata: Dict[str, Any]
)
```

Each Element contains:
- `id`: Unique identifier
- `type`: Element type (`HEADER_1-6`, `TEXT`, `TABLE`, `IMAGE`, etc.)
- `content`: Element content
- `parent_id`: Parent element ID (for hierarchy)
- `metadata`: Additional metadata (bbox, page_num, dataframe, etc.)
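Since elements are returned as a flat list linked by `parent_id`, a common task is rebuilding the tree to walk the document outline. A minimal sketch, using plain dicts in place of the real `Element` objects:

```python
from collections import defaultdict

def build_tree(elements):
    """Group flat elements by parent_id so the hierarchy can be walked.
    `elements` are dicts mimicking ParsedDocument.elements; a sketch only."""
    children = defaultdict(list)
    for el in elements:
        children[el["parent_id"]].append(el)
    return children

def print_outline(children, parent_id=None, depth=0):
    """Recursively print the document outline, indenting by depth."""
    for el in children.get(parent_id, []):
        print("  " * depth + f'{el["type"]}: {el["content"]}')
        print_outline(children, el["id"], depth + 1)

elements = [
    {"id": 1, "type": "HEADER_1", "content": "Intro", "parent_id": None},
    {"id": 2, "type": "TEXT", "content": "Hello", "parent_id": 1},
    {"id": 3, "type": "HEADER_1", "content": "Methods", "parent_id": None},
]
print_outline(build_tree(elements))
```

Top-level elements have `parent_id=None`, so the walk starts there and recurses into each element's children.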
- vLLM integration - Detailed guide for vLLM server setup and integration
- Environment template - Example `.env` file with all required variables
- Custom Components Guide - How to create custom OCR components
- Configuration Guide - Detailed configuration options
- Custom OCR Example - Complete example of custom OCR implementation
- Docker Compose - Docker Compose configuration for Dots OCR
- Dockerfile - Dockerfile for building Dots OCR container
By ITMO University, Saint Petersburg, Russia
Questions and suggestions are welcome; please contact the maintainers:
