Rushant-123/Casteg

Transaction Categorization Engine

Hybrid ML pipeline for intelligent accounting transaction categorization using FAISS, BM25, and GPT-4.


Overview

This engine automatically categorizes financial transactions by matching them against a general ledger using a multi-stage hybrid search approach:

  1. Fuzzy Matching — Token-based similarity for typo tolerance
  2. TF-IDF Semantic Search — Traditional text similarity
  3. Dense Retrieval (FAISS) — OpenAI embeddings with vector similarity
  4. Sparse Retrieval (BM25) — Keyword-based ranking
  5. Hybrid Re-ranking — Combines dense + sparse results
  6. GPT-4 Classification — Final categorization with explanation
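
As a rough illustration of stage 1, the sketch below scores two strings after sorting their tokens, so word order does not affect the result. It uses the standard library's difflib as a stand-in for FuzzyWuzzy, so scores may differ slightly from the real token_sort_ratio. The helper fuzzy_candidates is a hypothetical name, not the repository's fuzzy_search.

```python
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> int:
    """Approximate token-sort similarity on a 0-100 scale (difflib stand-in)."""
    # Lowercase, split on whitespace, and sort tokens so word order is ignored.
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def fuzzy_candidates(query, corpus, threshold=70):
    """Hypothetical helper: keep corpus entries scoring at or above threshold."""
    scored = [(entry, token_sort_ratio(query, entry)) for entry in corpus]
    kept = [(entry, score) for entry, score in scored if score >= threshold]
    # Best match first.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


ledger = ["Web Services Amazon", "Office Rent"]
matches = fuzzy_candidates("AMAZON WEB SERVICES", ledger, threshold=70)
```

Raising the threshold trades recall for precision: fewer ledger entries survive, but the ones that do are closer to the query.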

Architecture

Transaction Input
       │
       ▼
┌──────────────────┐
│  Preprocessing   │  ← Clean names, memos, normalize amounts
└────────┬─────────┘
         │
         ▼
┌──────────────────┐     ┌──────────────────┐
│  Dense Search    │     │  Sparse Search   │
│ (FAISS + OpenAI) │     │  (BM25 + TF-IDF) │
└────────┬─────────┘     └────────┬─────────┘
         │                        │
         └──────────┬─────────────┘
                    │
                    ▼
         ┌──────────────────┐
         │  Hybrid Re-rank  │
         └────────┬─────────┘
                  │
                  ▼
         ┌──────────────────┐
         │  GPT-4 Classify  │  ← Final decision with explanation
         └────────┬─────────┘
                  │
                  ▼
         Categorized Transaction
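
The diagram does not say how the dense and sparse result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of this repository's re-ranker. The function name, the constant k=60, and the example document ids are all illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so items ranked highly by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


dense_ranking = [3, 1, 7]   # hypothetical ids from the dense (FAISS) search
sparse_ranking = [1, 9, 3]  # hypothetical ids from the sparse (BM25) search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

RRF only needs ranks, not raw scores, which avoids having to put vector distances and BM25 scores on a common scale.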

Features

  • Multi-Algorithm Matching — Falls back through multiple strategies
  • Embedding Cache — Pickle-based caching for corpus embeddings
  • Configurable Thresholds — Tune fuzzy/semantic match thresholds
  • Explainable Results — GPT provides reasoning for each categorization
  • Batch Processing — Process entire transaction files at once
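
A pickle-based embedding cache like the one listed above could look roughly like this. It is a minimal sketch: load_or_compute_embeddings, embed_fn, and the fingerprint check are assumptions for illustration, not the repository's actual code.

```python
import hashlib
import pickle
from pathlib import Path


def load_or_compute_embeddings(corpus, embed_fn, cache_path):
    """Return embeddings for corpus, reusing a pickle cache when it is still valid.

    embed_fn is a hypothetical callable mapping one string to one vector.
    """
    cache_file = Path(cache_path)
    # Fingerprint the corpus so a changed ledger invalidates the cache.
    fingerprint = hashlib.sha256("\n".join(corpus).encode("utf-8")).hexdigest()

    if cache_file.exists():
        with cache_file.open("rb") as fh:
            cached = pickle.load(fh)
        if cached.get("fingerprint") == fingerprint:
            return cached["embeddings"]

    # Cache miss or stale cache: recompute and overwrite the file.
    embeddings = [embed_fn(text) for text in corpus]
    with cache_file.open("wb") as fh:
        pickle.dump({"fingerprint": fingerprint, "embeddings": embeddings}, fh)
    return embeddings
```

Pickle files can execute arbitrary code when loaded, so a cache like this should only ever be read from a location you control.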

Quick Start

Installation

pip install -r requirements.txt

Environment Variables

export AZURE_OPENAI_API_KEY="your_api_key"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export AZURE_OPENAI_API_VERSION="2024-05-01-preview"
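
To fail fast when one of these variables is missing, a small check along the following lines can run at startup. load_azure_settings is a hypothetical helper, not part of the repository.

```python
import os

REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
)


def load_azure_settings(env=None):
    """Read the three Azure OpenAI settings, raising if any is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```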

Usage

from matching_algorithms import (
    fuzzy_search,
    semantic_search,
    hybrid_search,
    gpt_categorize
)

# Fuzzy match
matches = fuzzy_search("AMAZON WEB SERVICES", corpus, threshold=70)

# Hybrid search (dense + sparse)
matches = hybrid_search(
    query="AWS monthly subscription",
    corpus=ledger_descriptions,
    corpus_embeddings=embeddings,
    faiss_index=index,
    bm25=bm25_model,
    k=5
)

# GPT categorization
result = gpt_categorize(transaction, potential_matches, api_key)
print(result["category"])       # "Cloud Services"
print(result["explanation"])    # "AWS is a cloud provider..."

Run Full Pipeline

python main.py

Tech Stack

Component         Technology
Embeddings        Azure OpenAI text-embedding-3-large
Vector Index      FAISS (Facebook AI Similarity Search)
Sparse Search     BM25Okapi (rank_bm25)
Fuzzy Match       FuzzyWuzzy (token_sort_ratio)
Classification    GPT-4o-mini
Data Processing   Pandas, NumPy
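
For readers unfamiliar with the sparse-search component, here is a minimal standard-library sketch of Okapi BM25 scoring. It assumes whitespace tokenization, a non-empty corpus, and a non-negative IDF variant, so its numbers will not match rank_bm25 exactly. bm25_scores is an illustrative name.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in a non-empty corpus against query with Okapi BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + norm)
        scores.append(score)
    return scores


ledger = ["cloud hosting services", "office rent", "cloud storage"]
scores = bm25_scores("cloud hosting", ledger)
```

Rare terms such as "hosting" contribute more than common ones such as "cloud", which is why the first ledger entry scores highest here.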

File Structure

├── main.py                  # Entry point
├── matching_algorithms.py   # Core search algorithms
├── category_mapping.py      # Category mapping logic
├── data_processing.py       # Data loading and preprocessing
├── transaction.py           # Transaction data model
├── utils.py                 # Utility functions
├── constants.py             # Configuration constants
└── requirements.txt         # Dependencies

Performance

On a test set of 220 transactions:

  • Match Rate: roughly 85% or higher with hybrid search
  • Embedding Generation: ~0.5s per transaction
  • Search Latency: <100ms with cached embeddings
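
Figures like these can be reproduced with a small harness such as the one below. It is a sketch under assumptions: matcher is any callable that maps a transaction description to a predicted category, and labelled_transactions is a non-empty list of hypothetical (description, expected_category) pairs.

```python
import time


def evaluate(matcher, labelled_transactions):
    """Return match rate and mean per-transaction latency for a matcher callable."""
    hits = 0
    elapsed = 0.0
    for description, expected in labelled_transactions:
        start = time.perf_counter()
        predicted = matcher(description)
        elapsed += time.perf_counter() - start
        hits += int(predicted == expected)
    n = len(labelled_transactions)
    return {"match_rate": hits / n, "mean_latency_ms": 1000 * elapsed / n}
```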

License

MIT
