
🤖 RAG PDF Question Answering System

A sophisticated Retrieval-Augmented Generation (RAG) application that answers questions based on PDF documents using FastAPI, ChromaDB, and Google Gemini 2.0 Flash. The system automatically processes and indexes PDF documents to provide accurate, context-aware answers.

📋 Table of Contents

  1. Features
  2. System Architecture
  3. Project Structure
  4. Prerequisites
  5. Quick Start Guide
  6. Detailed Setup
  7. Usage Guide
  8. API Documentation
  9. Configuration
  10. Troubleshooting

🚀 Features

Core Capabilities

  • Automated PDF Processing: Pre-loads and processes PDF documents on startup
  • Intelligent Text Chunking: Splits documents with configurable overlap for better context
  • Advanced Tokenization: Uses tiktoken for accurate token counting and processing
  • Semantic Search: Leverages Sentence Transformers for high-quality embeddings
  • Persistent Storage: ChromaDB vector database with automatic persistence
  • AI-Powered Answers: Google Gemini 2.0 Flash for intelligent response generation
  • RESTful API: FastAPI-based API with automatic documentation
  • Web Interface: Beautiful HTML frontend for easy testing
  • Real-time Status: Health monitoring and system status checks

Technical Features

  • Asynchronous Processing: FastAPI async endpoints for better performance
  • Error Handling: Comprehensive error handling and logging
  • CORS Support: Cross-origin resource sharing for web integration
  • Configurable Parameters: Customizable chunk sizes, overlap, and model settings
  • Database Management: Reset, reload, and manage document collections
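
As a minimal sketch of how the async endpoints and CORS support above fit together (illustrative only; the endpoint shape follows the API section below, and the real wiring lives in main.py):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

app = FastAPI(title="RAG PDF QA")

# CORS support so frontend.html can call the API from the browser
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # tighten for production
    allow_methods=["*"],
    allow_headers=["*"],
)

class AskRequest(BaseModel):
    question: str
    top_k: int = 5  # number of sources to retrieve

@app.post("/ask")
async def ask(req: AskRequest):
    # Retrieval and answer generation happen here in the real app
    return {"question": req.question, "answer": "...", "sources": [], "source_count": 0}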

🏗️ System Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PDF Document  │ -> │  Text Extraction│ -> │   Text Chunking │
│ (IN 1501.pdf)   │    │   (PyPDF2)      │    │ (Smart Overlap) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Tokenization  │ -> │Vector Embeddings│ -> │  ChromaDB Store │
│   (tiktoken)    │    │(SentenceTransf.)│    │  (Persistent)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ User Question   │ -> │ Semantic Search │ -> │Context Retrieval│
│   (Frontend)    │    │  (Similarity)   │    │ (Top-K Results) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Answer Gen.   │ <- │ Gemini 2.0 Flash│ <- │ Context + Query │
│   (Response)    │    │      (LLM)      │    │  (Prompt Eng.)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘

📁 Project Structure

rag-pdf-qa/
├── 📄 main.py              # FastAPI application & endpoints
├── ⚙️ config.py             # Configuration settings
├── 📋 pdf_processor.py     # PDF extraction and chunking logic
├── 🧮 embeddings.py        # Tokenization & embedding generation
├── 🗄️ vector_store.py      # ChromaDB integration & management
├── 🤖 llm_service.py       # Gemini LLM integration
├── 🌐 frontend.html        # Web interface for testing
├── 🔧 setup_pdf.py         # PDF preprocessing utility
├── 📦 requirements.txt     # Python dependencies
├── 🔐 .env                 # Environment variables (API keys)
├── 📖 README.md           # This documentation
├── 🧪 test_main.http      # HTTP test requests
├── 📁 data/               # PDF documents folder
│   └── IN 1501 - Signals.pdf
├── 🗃️ chroma_db/          # Vector database (auto-created)
└── 🐍 __pycache__/        # Python cache files

📋 Prerequisites

System Requirements

  • Python: 3.8 or higher
  • RAM: Minimum 4GB (8GB+ recommended for larger documents)
  • Storage: ~500MB for dependencies + document storage
  • OS: Windows, macOS, or Linux

Required Accounts

  • Google AI Studio Account: For Gemini API access
  • Internet Connection: For downloading models and API calls

⚡ Quick Start Guide

1. Clone and Navigate

# Clone the repository and move into it
git clone https://github.com/nethmalgunawardhana/pdf-rag-system.git
cd pdf-rag-system

2. Install Dependencies

pip install -r requirements.txt

3. Configure API Key

# Copy the example environment file
copy .env.example .env

# Edit .env file and add your Google API key:
# GOOGLE_API_KEY=your_actual_api_key_here

4. Start the Application

python main.py
# OR
uvicorn main:app --reload --host 0.0.0.0 --port 8000

5. Access the System

  • API server: http://localhost:8000
  • Interactive API docs: http://localhost:8000/docs
  • Web interface: open frontend.html in your browser

🔧 Detailed Setup

Step 1: Environment Setup

  1. Install Python 3.8+

    python --version  # Check your version
  2. Create Virtual Environment (Recommended)

    python -m venv .venv
    .venv\Scripts\activate  # On Windows
    # source .venv/bin/activate  # On macOS/Linux

Step 2: Install Dependencies

pip install -r requirements.txt

Key Dependencies:

  • fastapi>=0.100.0 - Web framework
  • uvicorn>=0.20.0 - ASGI server
  • PyPDF2>=3.0.0 - PDF processing
  • chromadb>=1.3.0 - Vector database
  • google-generativeai>=0.8.0 - Gemini integration
  • sentence-transformers>=2.2.0 - Embeddings
  • tiktoken>=0.5.0 - Tokenization

Step 3: Get Google Gemini API Key

  1. Visit Google AI Studio: https://makersuite.google.com/app/apikey

  2. Create API Key

    • Click "Create API Key"
    • Choose your project or create a new one
    • Copy the generated API key
  3. Configure Environment

    # Copy example file
    copy .env.example .env
    
    # Edit .env file with your API key
    GOOGLE_API_KEY=your_actual_api_key_here

Step 4: Prepare Your PDF

  1. Add PDF Document

    • Place your PDF in the data/ folder
    • Current document: IN 1501 - Signals.pdf
  2. Verify PDF Location

    dir data\  # Should show your PDF file

Step 5: Start the Application

Method 1: Direct Python

python main.py

Method 2: Using Uvicorn

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Method 3: Background Process

uvicorn main:app --host 0.0.0.0 --port 8000 &

Step 6: Verify Installation

  1. Check Server Status

    curl http://localhost:8000/health  # should report "status": "healthy"
  2. Access API Documentation

    • Open http://localhost:8000/docs in your browser
  3. Open Web Interface

    • Open frontend.html in your browser
    • Should show green status indicator

📚 Usage Guide

Web Interface (Recommended)

  1. Open Frontend

    • Double-click frontend.html or open in browser
    • Verify green status indicator (server must be running)
  2. Ask Questions

    • Use sample questions or type your own
    • Adjust number of sources (1-10)
    • Click "🔍 Ask Question"
    • View answer and source documents
  3. Sample Questions for Testing

    • "What are the main types of signals discussed in the document?"
    • "Explain the difference between analog and digital signals."
    • "What is signal processing and why is it important?"
    • "How are signals used in communication systems?"

API Usage (Advanced)

PowerShell Examples

# Ask a question
$body = @{
    question = "What are digital signals?"
    top_k = 5
} | ConvertTo-Json

$response = Invoke-RestMethod -Uri "http://localhost:8000/ask" -Method POST -Body $body -ContentType "application/json"
Write-Output $response.answer

cURL Examples

# Health check
curl -X GET "http://localhost:8000/health"

# Ask question
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is signal processing?", "top_k": 3}'

# Reset database
curl -X POST "http://localhost:8000/reset"

🔌 API Documentation

Available Endpoints

1. Root Endpoint

GET /

Description: Returns API information and available endpoints
Response: JSON with endpoint descriptions

2. Ask Question (Primary Endpoint)

POST /ask
Content-Type: application/json

Description: Ask a question based on the loaded PDF document

Request Body:

{
  "question": "What are the main types of signals?",
  "top_k": 5  // Optional: number of sources (1-10)
}

Response:

{
  "question": "What are the main types of signals?",
  "answer": "Based on the document, the main types of signals are...",
  "sources": [
    "Text chunk 1 from PDF...",
    "Text chunk 2 from PDF...",
    "Text chunk 3 from PDF..."
  ],
  "source_count": 3
}
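
The same request from Python, as a short sketch using the requests library (assumes the server is running on localhost:8000):

import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What are the main types of signals?", "top_k": 3},
    timeout=60,  # generation can take several seconds
)
resp.raise_for_status()
data = resp.json()
print(data["answer"])
for i, src in enumerate(data["sources"], start=1):
    print(f"Source {i}: {src[:80]}...")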

3. Health Check

GET /health

Description: Check API health and database status

Response:

{
  "status": "healthy",
  "documents_count": 42,
  "embedding_dimension": 384
}

4. Reset Database

POST /reset

Description: Clear database and reload PDF

Response:

{
  "message": "Database reset and PDF reloaded successfully",
  "documents_count": 42
}

5. Reload PDF

POST /reload

Description: Reload PDF documents into database

Response:

{
  "message": "PDF reloaded successfully",
  "documents_count": 42
}

Interactive Documentation

FastAPI generates interactive API docs automatically:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

⚙️ Configuration

Environment Variables (.env)

GOOGLE_API_KEY=your_gemini_api_key_here

Application Settings (config.py)

class Settings:
    GOOGLE_API_KEY: str = "your_api_key"
    CHUNK_SIZE: int = 1000           # Text chunk size
    CHUNK_OVERLAP: int = 200         # Overlap between chunks
    EMBEDDING_MODEL: str = "all-MiniLM-L6-v2"  # Sentence transformer model
    GEMINI_MODEL: str = "gemini-2.0-flash-exp"  # Gemini model version
    CHROMA_PERSIST_DIR: str = "./chroma_db"     # Database directory
    MAX_TOKENS: int = 8192           # Maximum tokens per request
    TEMPERATURE: float = 0.7         # LLM creativity (0.0-1.0)
    TOP_K_RESULTS: int = 5          # Default number of sources
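
A minimal sketch of how these settings might pick up the key from .env (assumes the python-dotenv package; the actual config.py may load it differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

class Settings:
    GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
    CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "1000"))

settings = Settings()
assert settings.GOOGLE_API_KEY, "GOOGLE_API_KEY missing; add it to .env"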

Customization Options

Text Processing:

  • CHUNK_SIZE: Larger chunks = more context, fewer chunks
  • CHUNK_OVERLAP: Higher overlap = better context continuity
  • Modify pdf_processor.py for custom chunking logic

Embedding Model:

  • Current: all-MiniLM-L6-v2 (384 dimensions, fast)
  • Alternative: all-mpnet-base-v2 (768 dimensions, more accurate)
  • Alternative: paraphrase-multilingual-MiniLM-L12-v2 (multilingual)

Gemini Models:

  • gemini-2.0-flash-exp: Latest experimental (fastest)
  • gemini-1.5-flash: Stable and fast
  • gemini-1.5-pro: More capable but slower

🔧 Advanced Configuration

Custom PDF Processing

To use a different PDF or add multiple PDFs:

  1. Add PDF to data folder:

    copy "your-document.pdf" "data\"
  2. Update setup_pdf.py:

    # Change line ~29 in setup_pdf.py
    pdf_path = r"data\your-document.pdf"
  3. Restart application:

    python main.py

Database Management

# In Python console or script
from vector_store import VectorStore

# Create vector store instance
vs = VectorStore()

# Check document count
print(vs.get_collection_count())

# Reset collection
vs.reset_collection()

# Recreate collection
vs.create_collection()

Performance Tuning

For Larger Documents:

# In config.py - increase chunk size
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 400
MAX_TOKENS = 16384

For Better Accuracy:

# In config.py - use better embedding model
EMBEDDING_MODEL = "all-mpnet-base-v2"
TOP_K_RESULTS = 10  # More context
TEMPERATURE = 0.3   # Less creative, more factual

🔍 How It Works (Technical Deep Dive)

1. PDF Processing Pipeline

PDF File → PyPDF2 → Raw Text → Text Cleaning → Smart Chunking
  • PyPDF2: Extracts text from PDF pages
  • Text Cleaning: Removes extra whitespace, normalizes text
  • Smart Chunking: Splits on sentences/paragraphs when possible
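
A stripped-down sketch of this stage (character-based chunking here for brevity; the real pdf_processor.py prefers sentence/paragraph boundaries):

from PyPDF2 import PdfReader

def extract_text(path):
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return " ".join(" ".join(pages).split())  # collapse extra whitespace

def chunk_text(text, size=1000, overlap=200):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by the overlap for continuity
    return chunks

chunks = chunk_text(extract_text("data/IN 1501 - Signals.pdf"))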

2. Embedding Generation

Text Chunks → Tokenization → SentenceTransformer → Vector Embeddings
  • Tokenization: tiktoken converts text to tokens
  • SentenceTransformer: Creates 384-dimensional vectors
  • Normalization: Vectors normalized for cosine similarity
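
A sketch of this stage (the cl100k_base encoding is an illustrative choice, not necessarily the one used in embeddings.py):

import tiktoken
from sentence_transformers import SentenceTransformer

chunks = ["Signals can be analog or digital...", "A digital signal is discrete in time..."]

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(chunks[0])))  # token count for the first chunk

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)  # unit vectors, so cosine = dot product
print(embeddings.shape)  # (num_chunks, 384)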

3. Vector Storage

Embeddings + Metadata → ChromaDB → Persistent Storage
  • ChromaDB: High-performance vector database
  • Metadata: Source info, chunk IDs, timestamps
  • Indexing: HNSW algorithm for fast similarity search
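
A sketch of the storage step (the collection name pdf_chunks and the sample chunks are placeholders; vector_store.py defines the real ones):

import chromadb
from sentence_transformers import SentenceTransformer

chunks = ["Signals can be analog or digital...", "Sampling converts a continuous signal..."]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks, normalize_embeddings=True)

client = chromadb.PersistentClient(path="./chroma_db")  # persists to disk automatically
collection = client.get_or_create_collection(
    name="pdf_chunks",
    metadata={"hnsw:space": "cosine"},  # HNSW index using cosine distance
)
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
    metadatas=[{"source": "IN 1501 - Signals.pdf", "chunk_id": i} for i in range(len(chunks))],
)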

4. Query Processing

User Question → Embedding → Similarity Search → Top-K Results
  • Query Embedding: Same model as document embeddings
  • Cosine Similarity: Measures semantic similarity
  • Ranking: Returns most relevant chunks
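
A sketch of the retrieval step (assumes a collection named pdf_chunks was already populated as above):

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="pdf_chunks")

query_embedding = model.encode(["What are digital signals?"], normalize_embeddings=True)
results = collection.query(query_embeddings=query_embedding.tolist(), n_results=5)
context_chunks = results["documents"][0]  # top-5 most similar chunks, best first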

5. Answer Generation

Question + Context → Prompt Engineering → Gemini 2.0 → Final Answer
  • Context Assembly: Combines relevant chunks
  • Prompt Template: Instructs LLM on response format
  • Generation: Gemini creates contextual answer
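
A sketch of the generation step (the prompt template here is illustrative; the actual one lives in llm_service.py):

import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
llm = genai.GenerativeModel("gemini-2.0-flash-exp")

context_chunks = ["A digital signal takes discrete values..."]  # from the retrieval step
question = "What are digital signals?"
context = "\n\n".join(context_chunks)
prompt = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(llm.generate_content(prompt).text)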

🚨 Troubleshooting

Common Issues and Solutions

1. Server Won't Start

Error: [WinError 10048] Only one usage of each socket address

Solution: Port 8000 is already in use. Free the port or start on a different one:

# Find process using port 8000
netstat -ano | findstr :8000

# Kill the process
taskkill /PID <process_id> /F

# Or use different port
uvicorn main:app --host 0.0.0.0 --port 8080

2. Gemini API Errors

Error: 404 models/gemini-pro is not found

Solution: Update model name in config.py

GEMINI_MODEL = "gemini-2.0-flash-exp"  # Latest model

Error: API key not valid

Solution: Check your .env file

# Verify API key in .env
type .env

# Get new API key from https://makersuite.google.com/app/apikey

3. PDF Processing Errors

Error: No text could be extracted from the PDF

Solutions:

  • Ensure PDF is not password protected
  • Check if PDF contains actual text (not just images)
  • Try a different PDF file

4. ChromaDB Issues

Error: sqlite3.OperationalError: database is locked

Solution: Reset database

# Stop server (Ctrl+C)
# Delete database directory
rmdir /s "chroma_db"
# Restart server
python main.py

5. Frontend Not Working

CORS policy error or network error

Solutions:

  • Ensure server is running on http://localhost:8000
  • Check browser console for detailed errors
  • Try opening frontend.html directly in browser

6. Memory Issues

Error: Out of memory

Solutions:

  • Reduce CHUNK_SIZE in config.py
  • Use smaller embedding model
  • Process smaller PDF files

7. Slow Performance

Questions take too long to answer

Solutions:

  • Reduce TOP_K_RESULTS (fewer sources)
  • Use faster model: gemini-1.5-flash
  • Increase CHUNK_SIZE (fewer chunks to search)

Debug Mode

Enable detailed logging:

# In main.py, change logging level
logging.basicConfig(level=logging.DEBUG)

Health Diagnostics

# Check all endpoints
curl http://localhost:8000/health
curl http://localhost:8000/
curl -X POST http://localhost:8000/reload

📈 Performance Metrics

Typical Response Times

  • PDF Loading: 10-30 seconds (one-time)
  • Simple Questions: 2-5 seconds
  • Complex Questions: 5-10 seconds
  • Database Operations: <1 second

Resource Usage

  • Memory: 2-4 GB during operation
  • Storage: ~100MB for embeddings per 100-page PDF
  • CPU: Moderate during embedding generation

Scaling Considerations

  • Document Size: Optimal for PDFs <500 pages
  • Concurrent Users: 5-10 simultaneous requests
  • Database Size: Up to 10,000 chunks tested

🛡️ Security & Best Practices

API Key Security

  • Never commit .env files to version control
  • Use environment variables in production
  • Rotate API keys regularly

Production Deployment

# For production use:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Data Privacy

  • PDFs are processed locally
  • Only questions sent to Gemini API
  • Vector database stored locally

🏁 Final Checklist

Prerequisites

  • Python 3.8+ installed
  • Google API key obtained
  • PDF document in data/ folder

Installation

  • Dependencies installed: pip install -r requirements.txt
  • Environment configured: .env file with API key
  • Server starts successfully: python main.py

Testing

  • Health check returns "healthy": curl http://localhost:8000/health
  • Web interface shows green status indicator
  • Sample question returns an answer with sources
