Automated data extraction from PDF invoices using LayoutLMv3 (Microsoft) -- a multimodal transformer that combines text, spatial layout and visual features for token classification. Also supports LayoutLM v1 for comparison.
The system performs OCR on invoice PDFs and uses a fine-tuned LayoutLM model to identify and extract key fields:
| Field | Description |
|---|---|
| VENDOR | Supplier / issuing company name |
| CUSTOMER | Recipient / billing company name |
| DATE | Invoice date |
| TOTAL | Total amount |
| INVOICE_NUMBER | Invoice reference number |
PDF Invoice
|
v
[pdf2image] --> Page images
|
v
[Tesseract OCR] --> Words + bounding boxes
|
v
[LayoutLM] --> Token classification (B-VENDOR, I-VENDOR, B-TOTAL, ...)
|
v
[Post-processing] --> Structured data (heuristics, validation, label matching)
|
v
Excel output
The extraction pipeline uses a model-first approach:
- Model prediction -- LayoutLM token classification on OCR output (primary source of truth)
- Positional heuristics -- scoring adjustments based on where entities appear in the document
- Fallback strategies (only for fields the model did not find):
- Enhanced total detection near total-related keywords
- Direct label-value matching for known patterns (multilingual)
- Rule-based validation (date, amount, invoice number format checks)
The file invoice-analyzer/invoice_config.json controls paths and hyperparameters:
| Key | Description |
|---|---|
POPPLER_PATH |
Path to Poppler binaries (auto-detected if in PATH) |
TESSERACT_PATH |
Path to Tesseract binary (auto-detected if in PATH) |
BASE_MODEL_DIR |
LayoutLM base model for training |
TRAINED_MODEL_DIR |
Fine-tuned model for inference |
RESULTS_DIR |
Output directory for inference results |
Path auto-detection works on macOS, Linux and Windows -- manual config is only needed if tools are not in system PATH.
pytorch_modular/
├── invoice-analyzer/
│ ├── config.py # Shared configuration, labels, path auto-detection
│ ├── utils.py # Format validators, text similarity, normalization
│ ├── ocr.py # PDF-to-image conversion, Tesseract OCR
│ ├── model.py # LayoutLM model/tokenizer loading (v1 + v3)
│ ├── extraction.py # Entity extraction, postprocessing, heuristics
│ ├── export.py # Excel export, prediction visualization
│ ├── evaluate.py # Model evaluation (precision/recall/F1)
│ ├── invoice_inference.py # Inference entry point
│ ├── train_invoice_model.py # Training entry point (incremental learning)
│ ├── preannotate.py # Model pre-annotation for active learning
│ ├── annotation_tool.py # PyQt6 GUI for annotation (dark theme, zoom, HiDPI)
│ ├── invoice_config.json # Paths and hyperparameters
│ ├── models/
│ │ ├── layoutlmv3-base/ # Base model from HuggingFace (not tracked)
│ │ └── invoice_model_v3/ # Fine-tuned model (not tracked)
│ ├── input_data/ # PDF invoices for training (not tracked)
│ └── output_data/
│ ├── annotations/ # Training annotations (.xlsx)
│ ├── preannotations/ # Model pre-annotations for active learning
│ └── results/ # Inference output (.xlsx)
├── requirements.txt
├── SETUP.md # Detailed setup instructions
└── README.md
macOS:
brew install tesseract tesseract-lang popplerLinux:
sudo apt install tesseract-ocr tesseract-ocr-ita poppler-utilsWindows: see SETUP.md for download links.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtDownload from HuggingFace and place files in invoice-analyzer/models/layoutlmv3-base/.
cd invoice-analyzer
# Create training annotations (GUI)
python annotation_tool.py
# Train the model
python train_invoice_model.py --annotations <annotations.xlsx> --pdfs_dir <pdf_folder>
# Run inference on a new invoice
python invoice_inference.py --pdf <invoice.pdf>Once you have a trained model, use it to speed up annotation of new invoices:
# Model pre-annotates new PDFs (batch or single file)
python preannotate.py --pdf input_data/new_invoice.pdf
python preannotate.py --pdf input_data/ --min_confidence 0.3
# Open annotation tool -> "Importa Pre-annotazioni" -> correct errors -> save
python annotation_tool.py
# Re-train with expanded dataset
python train_invoice_model.py --annotations <combined.xlsx> --continue_from models/invoice_model_v3
# Evaluate improvement
python evaluate.py --output output_data/results/eval_v3_round2.jsonSee SETUP.md for full configuration details and all CLI parameters.
Evaluated on 3 annotated SAGE invoices (50 annotated words total). Training-set metrics to track model improvements.
| Entity | N | Exact Match | F1 (exact) | Avg Similarity |
|---|---|---|---|---|
| VENDOR | 3 | 0/3 | 0.000 | 0.185 |
| CUSTOMER | 3 | 0/3 | 0.000 | 0.288 |
| DATE | 3 | 3/3 | 1.000 | 1.000 |
| TOTAL | 2 | 0/2 | 0.000 | 0.146 |
| INVOICE_NUMBER | 3 | 0/3 | 0.000 | 0.200 |
| OVERALL | 14 | 3/14 | -- | 0.379 |
| Entity | N | Exact Match | F1 (exact) | Avg Similarity |
|---|---|---|---|---|
| DATE | 3 | 2/3 | 0.667 | 0.800 |
| Others | 11 | 0/11 | 0.000 | -- |
| OVERALL | 14 | 2/14 | -- | 0.288 |
v3 vs v1: DATE 67% -> 100% F1, overall similarity 0.288 -> 0.379. More training data needed for VENDOR/CUSTOMER/TOTAL/INVOICE_NUMBER.
Run evaluation:
cd invoice-analyzer
python evaluate.py --output output_data/results/eval_results.json- macOS (Apple Silicon M3 Pro) -- Python 3.14, PyTorch 2.10, Transformers 5.2
- Windows 10/11 -- Python 3.10+
- Model: LayoutLMv3 (multimodal: text + layout + vision)
- Deep Learning: PyTorch
- Training: HuggingFace Trainer + Accelerate
- OCR: Tesseract (via pytesseract)
- PDF rendering: Poppler (via pdf2image)
- Annotation GUI: PyQt6 instead of tkinter -- native HiDPI/Retina support (correct click coordinates on high-DPI displays), proper zoom/pan via QGraphicsView, dark theme
- Data: pandas + openpyxl
MIT