
🤖 RAG PDF Question Answering System

A sophisticated Retrieval-Augmented Generation (RAG) application that answers questions based on PDF documents using FastAPI, ChromaDB, and Google Gemini 2.0 Flash. The system automatically processes and indexes PDF documents to provide accurate, context-aware answers.

📋 Table of Contents

  1. Features
  2. System Architecture
  3. Project Structure
  4. Prerequisites
  5. Quick Start Guide
  6. Detailed Setup
  7. Usage Guide
  8. API Documentation
  9. Configuration
  10. Troubleshooting

🚀 Features

Core Capabilities

  • Automated PDF Processing: Pre-loads and processes PDF documents on startup
  • Intelligent Text Chunking: Splits documents with configurable overlap for better context
  • Advanced Tokenization: Uses tiktoken for accurate token counting and processing
  • Semantic Search: Leverages Sentence Transformers for high-quality embeddings
  • Persistent Storage: ChromaDB vector database with automatic persistence
  • AI-Powered Answers: Google Gemini 2.0 Flash for intelligent response generation
  • RESTful API: FastAPI-based API with automatic documentation
  • Web Interface: Beautiful HTML frontend for easy testing
  • Real-time Status: Health monitoring and system status checks

Technical Features

  • Asynchronous Processing: FastAPI async endpoints for better performance
  • Error Handling: Comprehensive error handling and logging
  • CORS Support: Cross-origin resource sharing for web integration
  • Configurable Parameters: Customizable chunk sizes, overlap, and model settings
  • Database Management: Reset, reload, and manage document collections
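
As a minimal sketch of how the async endpoints and CORS support above fit together (illustrative only; the endpoint shape follows the API section below, and the real wiring lives in main.py):

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel

app = FastAPI(title="RAG PDF QA")

# CORS support so frontend.html can call the API from the browser
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # tighten for production
    allow_methods=["*"],
    allow_headers=["*"],
)

class AskRequest(BaseModel):
    question: str
    top_k: int = 5  # number of sources to retrieve

@app.post("/ask")
async def ask(req: AskRequest):
    # Retrieval and answer generation happen here in the real app
    return {"question": req.question, "answer": "...", "sources": [], "source_count": 0}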

🏗️ System Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   PDF Document  │ -> │  Text Extraction│ -> │   Text Chunking │
│ (IN 1501.pdf)   │    │   (PyPDF2)      │    │ (Smart Overlap) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Tokenization  │ -> │Vector Embeddings│ -> │  ChromaDB Store │
│   (tiktoken)    │    │(SentenceTransf.)│    │  (Persistent)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ User Question   │ -> │ Semantic Search │ -> │Context Retrieval│
│   (Frontend)    │    │  (Similarity)   │    │ (Top-K Results) │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Answer Gen.   │ <- │ Gemini 2.0 Flash│ <- │ Context + Query │
│   (Response)    │    │      (LLM)      │    │  (Prompt Eng.)  │
└─────────────────┘    └─────────────────┘    └─────────────────┘

📁 Project Structure

rag-pdf-qa/
├── 📄 main.py              # FastAPI application & endpoints
├── ⚙️ config.py             # Configuration settings
├── 📋 pdf_processor.py     # PDF extraction and chunking logic
├── 🧮 embeddings.py        # Tokenization & embedding generation
├── 🗄️ vector_store.py      # ChromaDB integration & management
├── 🤖 llm_service.py       # Gemini LLM integration
├── 🌐 frontend.html        # Web interface for testing
├── 🔧 setup_pdf.py         # PDF preprocessing utility
├── 📦 requirements.txt     # Python dependencies
├── 🔐 .env                 # Environment variables (API keys)
├── 📖 README.md           # This documentation
├── 🧪 test_main.http      # HTTP test requests
├── 📁 data/               # PDF documents folder
│   └── IN 1501 - Signals.pdf
├── 🗃️ chroma_db/          # Vector database (auto-created)
└── 🐍 __pycache__/        # Python cache files

📋 Prerequisites

System Requirements

  • Python: 3.8 or higher
  • RAM: Minimum 4GB (8GB+ recommended for larger documents)
  • Storage: ~500MB for dependencies + document storage
  • OS: Windows, macOS, or Linux

Required Accounts

  • Google AI Studio Account: For Gemini API access
  • Internet Connection: For downloading models and API calls

⚡ Quick Start Guide

1. Clone and Navigate

# Clone the repository and move into it
git clone https://github.com/nethmalgunawardhana/pdf-rag-system.git
cd pdf-rag-system

2. Install Dependencies

pip install -r requirements.txt

3. Configure API Key

# Copy the example environment file
copy .env.example .env

# Edit .env file and add your Google API key:
# GOOGLE_API_KEY=your_actual_api_key_here

4. Start the Application

python main.py
# OR
uvicorn main:app --reload --host 0.0.0.0 --port 8000

5. Access the System

  • API server: http://localhost:8000
  • Interactive API docs: http://localhost:8000/docs
  • Web interface: open frontend.html in your browser

🔧 Detailed Setup

Step 1: Environment Setup

  1. Install Python 3.8+

    python --version  # Check your version
  2. Create Virtual Environment (Recommended)

    python -m venv .venv
    .venv\Scripts\activate  # On Windows
    # source .venv/bin/activate  # On macOS/Linux

Step 2: Install Dependencies

pip install -r requirements.txt

Key Dependencies:

  • fastapi>=0.100.0 - Web framework
  • uvicorn>=0.20.0 - ASGI server
  • PyPDF2>=3.0.0 - PDF processing
  • chromadb>=1.3.0 - Vector database
  • google-generativeai>=0.8.0 - Gemini integration
  • sentence-transformers>=2.2.0 - Embeddings
  • tiktoken>=0.5.0 - Tokenization

Step 3: Get Google Gemini API Key

  1. Visit Google AI Studio: https://makersuite.google.com/app/apikey

  2. Create API Key

    • Click "Create API Key"
    • Choose your project or create a new one
    • Copy the generated API key
  3. Configure Environment

    # Copy example file
    copy .env.example .env
    
    # Edit .env file with your API key
    GOOGLE_API_KEY=your_actual_api_key_here

Step 4: Prepare Your PDF

  1. Add PDF Document

    • Place your PDF in the data/ folder
    • Current document: IN 1501 - Signals.pdf
  2. Verify PDF Location

    dir data\  # Should show your PDF file

Step 5: Start the Application

Method 1: Direct Python

python main.py

Method 2: Using Uvicorn

uvicorn main:app --reload --host 0.0.0.0 --port 8000

Method 3: Background Process

uvicorn main:app --host 0.0.0.0 --port 8000 &

Step 6: Verify Installation

  1. Check Server Status

    curl http://localhost:8000/health  # should report "status": "healthy"
  2. Access API Documentation

    • Open http://localhost:8000/docs in your browser
  3. Open Web Interface

    • Open frontend.html in your browser
    • Should show green status indicator

📚 Usage Guide

Web Interface (Recommended)

  1. Open Frontend

    • Double-click frontend.html or open in browser
    • Verify green status indicator (server must be running)
  2. Ask Questions

    • Use sample questions or type your own
    • Adjust number of sources (1-10)
    • Click "🔍 Ask Question"
    • View answer and source documents
  3. Sample Questions for Testing

    • "What are the main types of signals discussed in the document?"
    • "Explain the difference between analog and digital signals."
    • "What is signal processing and why is it important?"
    • "How are signals used in communication systems?"

API Usage (Advanced)

PowerShell Examples

# Ask a question
$body = @{
    question = "What are digital signals?"
    top_k = 5
} | ConvertTo-Json

$response = Invoke-RestMethod -Uri "http://localhost:8000/ask" -Method POST -Body $body -ContentType "application/json"
Write-Output $response.answer

cURL Examples

# Health check
curl -X GET "http://localhost:8000/health"

# Ask question
curl -X POST "http://localhost:8000/ask" \
  -H "Content-Type: application/json" \
  -d '{"question": "What is signal processing?", "top_k": 3}'

# Reset database
curl -X POST "http://localhost:8000/reset"

🔌 API Documentation

Available Endpoints

1. Root Endpoint

GET /

Description: Returns API information and available endpoints
Response: JSON with endpoint descriptions

2. Ask Question (Primary Endpoint)

POST /ask
Content-Type: application/json

Description: Ask a question based on the loaded PDF document

Request Body:

{
  "question": "What are the main types of signals?",
  "top_k": 5  // Optional: number of sources (1-10)
}

Response:

{
  "question": "What are the main types of signals?",
  "answer": "Based on the document, the main types of signals are...",
  "sources": [
    "Text chunk 1 from PDF...",
    "Text chunk 2 from PDF...",
    "Text chunk 3 from PDF..."
  ],
  "source_count": 3
}
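
The same request from Python, as a short sketch using the requests library (assumes the server is running on localhost:8000):

import requests

resp = requests.post(
    "http://localhost:8000/ask",
    json={"question": "What are the main types of signals?", "top_k": 3},
    timeout=60,  # generation can take several seconds
)
resp.raise_for_status()
data = resp.json()
print(data["answer"])
for i, src in enumerate(data["sources"], start=1):
    print(f"Source {i}: {src[:80]}...")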

3. Health Check

GET /health

Description: Check API health and database status

Response:

{
  "status": "healthy",
  "documents_count": 42,
  "embedding_dimension": 384
}

4. Reset Database

POST /reset

Description: Clear database and reload PDF

Response:

{
  "message": "Database reset and PDF reloaded successfully",
  "documents_count": 42
}

5. Reload PDF

POST /reload

Description: Reload PDF documents into database

Response:

{
  "message": "PDF reloaded successfully",
  "documents_count": 42
}

Interactive Documentation

FastAPI generates interactive API docs automatically:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

⚙️ Configuration

Environment Variables (.env)

GOOGLE_API_KEY=your_gemini_api_key_here

Application Settings (config.py)

class Settings:
    GOOGLE_API_KEY: str = "your_api_key"
    CHUNK_SIZE: int = 1000           # Text chunk size
    CHUNK_OVERLAP: int = 200         # Overlap between chunks
    EMBEDDING_MODEL: str = "all-MiniLM-L6-v2"  # Sentence transformer model
    GEMINI_MODEL: str = "gemini-2.0-flash-exp"  # Gemini model version
    CHROMA_PERSIST_DIR: str = "./chroma_db"     # Database directory
    MAX_TOKENS: int = 8192           # Maximum tokens per request
    TEMPERATURE: float = 0.7         # LLM creativity (0.0-1.0)
    TOP_K_RESULTS: int = 5          # Default number of sources
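
A minimal sketch of how these settings might pick up the key from .env (assumes the python-dotenv package; the actual config.py may load it differently):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

class Settings:
    GOOGLE_API_KEY: str = os.getenv("GOOGLE_API_KEY", "")
    CHUNK_SIZE: int = int(os.getenv("CHUNK_SIZE", "1000"))

settings = Settings()
assert settings.GOOGLE_API_KEY, "GOOGLE_API_KEY missing; add it to .env"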

Customization Options

Text Processing:

  • CHUNK_SIZE: Larger chunks = more context, fewer chunks
  • CHUNK_OVERLAP: Higher overlap = better context continuity
  • Modify pdf_processor.py for custom chunking logic

Embedding Model:

  • Current: all-MiniLM-L6-v2 (384 dimensions, fast)
  • Alternative: all-mpnet-base-v2 (768 dimensions, more accurate)
  • Alternative: paraphrase-multilingual-MiniLM-L12-v2 (multilingual)

Gemini Models:

  • gemini-2.0-flash-exp: Latest experimental (fastest)
  • gemini-1.5-flash: Stable and fast
  • gemini-1.5-pro: More capable but slower

🔧 Advanced Configuration

Custom PDF Processing

To use a different PDF or add multiple PDFs:

  1. Add PDF to data folder:

    copy "your-document.pdf" "data\"
  2. Update setup_pdf.py:

    # Change line ~29 in setup_pdf.py
    pdf_path = r"data\your-document.pdf"
  3. Restart application:

    python main.py

Database Management

# In Python console or script
from vector_store import VectorStore

# Create vector store instance
vs = VectorStore()

# Check document count
print(vs.get_collection_count())

# Reset collection
vs.reset_collection()

# Recreate collection
vs.create_collection()

Performance Tuning

For Larger Documents:

# In config.py - increase chunk size
CHUNK_SIZE = 2000
CHUNK_OVERLAP = 400
MAX_TOKENS = 16384

For Better Accuracy:

# In config.py - use better embedding model
EMBEDDING_MODEL = "all-mpnet-base-v2"
TOP_K_RESULTS = 10  # More context
TEMPERATURE = 0.3   # Less creative, more factual

🔍 How It Works (Technical Deep Dive)

1. PDF Processing Pipeline

PDF File → PyPDF2 → Raw Text → Text Cleaning → Smart Chunking
  • PyPDF2: Extracts text from PDF pages
  • Text Cleaning: Removes extra whitespace, normalizes text
  • Smart Chunking: Splits on sentences/paragraphs when possible
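
A stripped-down sketch of this stage (character-based chunking here for brevity; the real pdf_processor.py prefers sentence/paragraph boundaries):

from PyPDF2 import PdfReader

def extract_text(path):
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return " ".join(" ".join(pages).split())  # collapse extra whitespace

def chunk_text(text, size=1000, overlap=200):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by the overlap for continuity
    return chunks

chunks = chunk_text(extract_text("data/IN 1501 - Signals.pdf"))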

2. Embedding Generation

Text Chunks → Tokenization → SentenceTransformer → Vector Embeddings
  • Tokenization: tiktoken converts text to tokens
  • SentenceTransformer: Creates 384-dimensional vectors
  • Normalization: Vectors normalized for cosine similarity
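
A sketch of this stage (the cl100k_base encoding is an illustrative choice, not necessarily the one used in embeddings.py):

import tiktoken
from sentence_transformers import SentenceTransformer

chunks = ["Signals can be analog or digital...", "A digital signal is discrete in time..."]

enc = tiktoken.get_encoding("cl100k_base")
print(len(enc.encode(chunks[0])))  # token count for the first chunk

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)  # unit vectors, so cosine = dot product
print(embeddings.shape)  # (num_chunks, 384)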

3. Vector Storage

Embeddings + Metadata → ChromaDB → Persistent Storage
  • ChromaDB: High-performance vector database
  • Metadata: Source info, chunk IDs, timestamps
  • Indexing: HNSW algorithm for fast similarity search
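
A sketch of the storage step (the collection name pdf_chunks and the sample chunks are placeholders; vector_store.py defines the real ones):

import chromadb
from sentence_transformers import SentenceTransformer

chunks = ["Signals can be analog or digital...", "Sampling converts a continuous signal..."]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks, normalize_embeddings=True)

client = chromadb.PersistentClient(path="./chroma_db")  # persists to disk automatically
collection = client.get_or_create_collection(
    name="pdf_chunks",
    metadata={"hnsw:space": "cosine"},  # HNSW index using cosine distance
)
collection.add(
    ids=[f"chunk_{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
    metadatas=[{"source": "IN 1501 - Signals.pdf", "chunk_id": i} for i in range(len(chunks))],
)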

4. Query Processing

User Question → Embedding → Similarity Search → Top-K Results
  • Query Embedding: Same model as document embeddings
  • Cosine Similarity: Measures semantic similarity
  • Ranking: Returns most relevant chunks
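
A sketch of the retrieval step (assumes a collection named pdf_chunks was already populated as above):

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the indexing model
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection(name="pdf_chunks")

query_embedding = model.encode(["What are digital signals?"], normalize_embeddings=True)
results = collection.query(query_embeddings=query_embedding.tolist(), n_results=5)
context_chunks = results["documents"][0]  # top-5 most similar chunks, best first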

5. Answer Generation

Question + Context → Prompt Engineering → Gemini 2.0 → Final Answer
  • Context Assembly: Combines relevant chunks
  • Prompt Template: Instructs LLM on response format
  • Generation: Gemini creates contextual answer
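
A sketch of the generation step (the prompt template here is illustrative; the actual one lives in llm_service.py):

import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
llm = genai.GenerativeModel("gemini-2.0-flash-exp")

context_chunks = ["A digital signal takes discrete values..."]  # from the retrieval step
question = "What are digital signals?"
context = "\n\n".join(context_chunks)
prompt = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(llm.generate_content(prompt).text)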

🚨 Troubleshooting

Common Issues and Solutions

1. Server Won't Start

Error: [WinError 10048] Only one usage of each socket address

Solution: Port 8000 is already in use. Free the port or start on a different one:

# Find process using port 8000
netstat -ano | findstr :8000

# Kill the process
taskkill /PID <process_id> /F

# Or use different port
uvicorn main:app --host 0.0.0.0 --port 8080

2. Gemini API Errors

Error: 404 models/gemini-pro is not found

Solution: Update model name in config.py

GEMINI_MODEL = "gemini-2.0-flash-exp"  # Latest model

Error: API key not valid

Solution: Check your .env file

# Verify API key in .env
type .env

# Get new API key from https://makersuite.google.com/app/apikey

3. PDF Processing Errors

Error: No text could be extracted from the PDF

Solutions:

  • Ensure PDF is not password protected
  • Check if PDF contains actual text (not just images)
  • Try a different PDF file

4. ChromaDB Issues

Error: sqlite3.OperationalError: database is locked

Solution: Reset database

# Stop server (Ctrl+C)
# Delete database directory
rmdir /s "chroma_db"
# Restart server
python main.py

5. Frontend Not Working

CORS policy error or network error

Solutions:

  • Ensure server is running on http://localhost:8000
  • Check browser console for detailed errors
  • Try opening frontend.html directly in browser

6. Memory Issues

Error: Out of memory

Solutions:

  • Reduce CHUNK_SIZE in config.py
  • Use smaller embedding model
  • Process smaller PDF files

7. Slow Performance

Questions take too long to answer

Solutions:

  • Reduce TOP_K_RESULTS (fewer sources)
  • Use faster model: gemini-1.5-flash
  • Increase CHUNK_SIZE (fewer chunks to search)

Debug Mode

Enable detailed logging:

# In main.py, change logging level
logging.basicConfig(level=logging.DEBUG)

Health Diagnostics

# Check all endpoints
curl http://localhost:8000/health
curl http://localhost:8000/
curl -X POST http://localhost:8000/reload

📈 Performance Metrics

Typical Response Times

  • PDF Loading: 10-30 seconds (one-time)
  • Simple Questions: 2-5 seconds
  • Complex Questions: 5-10 seconds
  • Database Operations: <1 second

Resource Usage

  • Memory: 2-4 GB during operation
  • Storage: ~100MB for embeddings per 100-page PDF
  • CPU: Moderate during embedding generation

Scaling Considerations

  • Document Size: Optimal for PDFs <500 pages
  • Concurrent Users: 5-10 simultaneous requests
  • Database Size: Up to 10,000 chunks tested

🛡️ Security & Best Practices

API Key Security

  • Never commit .env files to version control
  • Use environment variables in production
  • Rotate API keys regularly

Production Deployment

# For production use:
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 4

Data Privacy

  • PDFs are processed locally
  • Only questions sent to Gemini API
  • Vector database stored locally

🏁 Final Checklist

Prerequisites

  • Python 3.8+ installed
  • Google API key obtained
  • PDF document in data/ folder

Installation

  • Dependencies installed: pip install -r requirements.txt
  • Environment configured: .env file with API key
  • Server starts successfully: python main.py

Testing

  • Health check returns "healthy": curl http://localhost:8000/health
  • Web interface shows green status indicator
  • Sample question returns an answer with sources
