Hybrid ML pipeline for intelligent accounting transaction categorization using FAISS, BM25, and GPT-4.
This engine automatically categorizes financial transactions by matching them against a general ledger using a multi-stage hybrid search approach:
- Fuzzy Matching — Token-based similarity for typo tolerance
- TF-IDF Semantic Search — Traditional text similarity
- Dense Retrieval (FAISS) — OpenAI embeddings with vector similarity
- Sparse Retrieval (BM25) — Keyword-based ranking
- Hybrid Re-ranking — Combines dense + sparse results
- GPT-4 Classification — Final categorization with explanation
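
The list above does not say how the hybrid re-ranking stage merges the dense and sparse result lists. One common option is reciprocal rank fusion (RRF). The sketch below assumes that technique; the function name `reciprocal_rank_fusion`, the constant `k=60`, and the sample ids are illustrative and not taken from this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. This is an assumed fusion rule, not necessarily
    the one used by hybrid_search in this project.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical ledger row ids returned by the two retrievers.
dense_hits = [3, 7, 1, 9]
sparse_hits = [7, 2, 3, 5]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Rank-based fusion avoids having to put cosine similarities and BM25 scores on the same scale, which is why it is a frequent default when the two retrievers return scores in different ranges.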

        Transaction Input
                │
                ▼
       ┌──────────────────┐
       │  Preprocessing   │ ← Clean names, memos, normalize amounts
       └────────┬─────────┘
                │
                ▼
       ┌──────────────────┐     ┌──────────────────┐
       │   Dense Search   │     │  Sparse Search   │
       │  (FAISS + Ada)   │     │ (BM25 + TF-IDF)  │
       └────────┬─────────┘     └────────┬─────────┘
                │                        │
                └──────────┬─────────────┘
                           │
                           ▼
                ┌──────────────────┐
                │  Hybrid Re-rank  │
                └────────┬─────────┘
                         │
                         ▼
                ┌──────────────────┐
                │  GPT-4 Classify  │ ← Final decision with explanation
                └────────┬─────────┘
                         │
                         ▼
                Categorized Transaction

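
The preprocessing box in the diagram mentions cleaning names and memos and normalizing amounts. A minimal sketch of what such a step might look like follows. The function names, the cleaning rules, and the parentheses-as-negative convention are assumptions made for illustration, not the contents of `data_processing.py`.

```python
import re
from decimal import Decimal


def clean_text(value):
    """Uppercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", value or "")
    return re.sub(r"\s+", " ", text).strip().upper()


def normalize_amount(raw):
    """Parse strings such as '$1,234.50' or '(45.00)' into a Decimal.

    Treating parentheses as a negative amount is an accounting convention
    assumed here, not something the source specifies.
    """
    text = str(raw).strip()
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[^0-9.\-]", "", text)
    amount = Decimal(digits)
    return -amount if negative else amount


def preprocess(name, memo, amount):
    """Return a dict with the cleaned fields plus a combined query string."""
    clean_name, clean_memo = clean_text(name), clean_text(memo)
    return {
        "name": clean_name,
        "memo": clean_memo,
        "amount": normalize_amount(amount),
        "query": f"{clean_name} {clean_memo}".strip(),
    }


record = preprocess("Amazon Web Services, Inc.", "inv #4471 - monthly", "$1,234.50")
print(record["query"])
```

Using `Decimal` rather than `float` keeps currency values exact, which matters when amounts are later compared against ledger entries.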
- Multi-Algorithm Matching — Falls back through multiple strategies
- Embedding Cache — Pickle-based caching for corpus embeddings
- Configurable Thresholds — Tune fuzzy/semantic match thresholds
- Explainable Results — GPT provides reasoning for each categorization
- Batch Processing — Process entire transaction files at once
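
The embedding cache listed above is described only as pickle-based. One way such a cache could work is sketched below, keyed on a hash of the corpus so that a changed ledger triggers recomputation. The names `corpus_fingerprint`, `load_or_compute_embeddings`, and `fake_embed` are hypothetical, and the fingerprint check is an assumption rather than documented behavior.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def corpus_fingerprint(corpus):
    """Hash the corpus so a changed ledger invalidates the cache."""
    digest = hashlib.sha256()
    for text in corpus:
        digest.update(text.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] != ["a", "bc"]
    return digest.hexdigest()


def load_or_compute_embeddings(corpus, embed_fn, cache_path):
    """Return cached embeddings when the corpus is unchanged, else recompute."""
    cache_path = Path(cache_path)
    fingerprint = corpus_fingerprint(corpus)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            cached = pickle.load(handle)
        if cached.get("fingerprint") == fingerprint:
            return cached["embeddings"]
    embeddings = embed_fn(corpus)
    with cache_path.open("wb") as handle:
        pickle.dump({"fingerprint": fingerprint, "embeddings": embeddings}, handle)
    return embeddings


calls = []


def fake_embed(texts):
    # Stand-in for the real embedding call; records each invocation.
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.pkl"
    ledger = ["Cloud Services", "Office Supplies"]
    first = load_or_compute_embeddings(ledger, fake_embed, path)
    second = load_or_compute_embeddings(ledger, fake_embed, path)  # served from disk
    third = load_or_compute_embeddings(ledger + ["Travel"], fake_embed, path)
```

Pickle files can execute arbitrary code when loaded, so a cache like this should only ever read files the pipeline wrote itself.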

Install dependencies:

    pip install -r requirements.txt

Set the Azure OpenAI credentials:

    export AZURE_OPENAI_API_KEY="your_api_key"
    export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
    export AZURE_OPENAI_API_VERSION="2024-05-01-preview"

Call the matching functions from Python:

    from matching_algorithms import (
        fuzzy_search,
        semantic_search,
        hybrid_search,
        gpt_categorize,
    )

    # Fuzzy match
    matches = fuzzy_search("AMAZON WEB SERVICES", corpus, threshold=70)

    # Hybrid search (dense + sparse)
    matches = hybrid_search(
        query="AWS monthly subscription",
        corpus=ledger_descriptions,
        corpus_embeddings=embeddings,
        faiss_index=index,
        bm25=bm25_model,
        k=5,
    )

    # GPT categorization
    result = gpt_categorize(transaction, potential_matches, api_key)
    print(result["category"])     # "Cloud Services"
    print(result["explanation"])  # "AWS is a cloud provider..."

Run the pipeline from the entry point:

    python main.py

| Component | Technology |
|---|---|
| Embeddings | Azure OpenAI text-embedding-3-large |
| Vector Index | FAISS (Facebook AI Similarity Search) |
| Sparse Search | BM25Okapi (rank_bm25) |
| Fuzzy Match | FuzzyWuzzy (token_sort_ratio) |
| Classification | GPT-4o-mini |
| Data Processing | Pandas, NumPy |
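
The table names FuzzyWuzzy's `token_sort_ratio` as the fuzzy matcher. To show the idea behind token-sort matching without the dependency, the sketch below approximates it with the standard library's `difflib`. Scores may differ slightly from FuzzyWuzzy's, and `approx_token_sort_ratio` and `fuzzy_candidates` are hypothetical names, not functions from `matching_algorithms.py`.

```python
import re
from difflib import SequenceMatcher


def token_sort_key(text):
    """Lowercase, keep alphanumeric tokens, and sort them alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def approx_token_sort_ratio(a, b):
    """Similarity in 0..100 after token sorting (difflib stand-in)."""
    ratio = SequenceMatcher(None, token_sort_key(a), token_sort_key(b)).ratio()
    return round(100 * ratio)


def fuzzy_candidates(query, corpus, threshold=70):
    """Return (score, text) pairs at or above the threshold, best first."""
    scored = [(approx_token_sort_ratio(query, text), text) for text in corpus]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# Hypothetical ledger descriptions: one reordered, one unrelated, one with a typo.
sample_corpus = ["Web Services Amazon", "Office Supplies", "Amazon Web Servces"]
results = fuzzy_candidates("AMAZON WEB SERVICES", sample_corpus)
print(results)
```

Sorting tokens before comparison makes the score insensitive to word order, so a reordered vendor name still matches exactly, while a single-character typo only lowers the score slightly.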
├── main.py # Entry point
├── matching_algorithms.py # Core search algorithms
├── category_mapping.py # Category mapping logic
├── data_processing.py # Data loading and preprocessing
├── transaction.py # Transaction data model
├── utils.py # Utility functions
├── constants.py # Configuration constants
└── requirements.txt # Dependencies
On a test set of 220 transactions:
- Match Rate: 85% or higher (approximate) with hybrid search
- Embedding Generation: ~0.5s per transaction
- Search Latency: <100ms with cached embeddings
License: MIT