Interact with your PDF documents using open-source large language models locally and securely. This project leverages Retrieval-Augmented Generation (RAG) and Ollama to let you upload a PDF and chat with its contents in natural language.
No APIs. No internet. 100% local. 100% open-source.
- 🔍 PDF understanding via document chunking and embeddings
- 🧠 Local LLMs using Ollama
- 🧩 RAG pipeline with Chroma vector database
- 🤖 Conversational UI using Streamlit
- ⚡ Toggle between basic and enhanced retrieval modes
- 🗂️ Automatic model + PDF change detection and re-indexing
| Tool | Purpose |
|---|---|
| Streamlit | Web UI for chatting |
| LangChain | LLM and RAG pipeline orchestration |
| Ollama | Running local LLMs (e.g., gemma, qwen) |
| Chroma | Vector storage and similarity search |
| PyPDFLoader | PDF content extraction |
| nomic-embed-text | Embedding model for vectorization |
You can run the following models locally via Ollama:
- `arnold`: A custom LLM
- `gemma3:4b`: Lightweight, fast LLM from Google
- `qwen3:8b`: High-performance multilingual model from Alibaba
- `nomic-embed-text`: A high-performing open embedding model with a large token context window
Note: All models are downloaded and managed via `ollama pull`.
The `arnold` model was created using a separate script (`custom_llm_arnold.py`), which defines how to:

- Load a base model (e.g., `gemma3:4b`)
- Fine-tune or modify its behavior with prompt templates or adapters
- Serve it through Ollama as a personalized model (a rough sketch of this pattern follows below)
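The script itself isn't reproduced here, but the usual way to register a personalized model with Ollama is a Modelfile built on top of a base model and installed with `ollama create`. The sketch below is illustrative only: the system prompt, temperature, and file names are assumptions, not the actual contents of `custom_llm_arnold.py`.

```python
# build_arnold.py - illustrative sketch only; custom_llm_arnold.py may differ.
import subprocess
from pathlib import Path

# A Modelfile layers a system prompt (and optional sampling parameters)
# on top of a base model that has already been pulled into Ollama.
MODELFILE = '''\
FROM gemma3:4b
SYSTEM """You are Arnold, a concise assistant that answers questions about uploaded PDFs."""
PARAMETER temperature 0.7
'''

Path("Modelfile").write_text(MODELFILE, encoding="utf-8")

# Register the personalized model with the local Ollama server.
subprocess.run(["ollama", "create", "arnold", "-f", "Modelfile"], check=True)
```

Once created, the model appears in `ollama list` and can be selected in the app like any other local model.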
1. **PDF Upload**: The user uploads a `.pdf` file.
2. **Document Loading**: `PyPDFLoader` extracts the content.
3. **Chunking**: The text is split into overlapping segments using `RecursiveCharacterTextSplitter`.
4. **Embedding**: Each chunk is embedded locally with `nomic-embed-text` via Ollama.
5. **Vector Store**: The chunks are stored in a ChromaDB instance.
6. **Retrieval**:
   - Standard: basic similarity search.
   - Enhanced: multi-query expansion for more robust semantic matching.
7. **LLM Response**: The selected LLM answers based on the retrieved chunks (see the sketch after this list).
8. **Chat History**: Messages persist during the session for context.
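To make the flow concrete, here is a minimal, self-contained sketch of steps 2–7 using LangChain's community integrations. It is not the project's actual code: the file name, chunk sizes, `k` value, and prompt wording are assumptions, and import paths vary slightly between LangChain versions.

```python
# rag_sketch.py - minimal sketch of the pipeline described above (assumed names/values).
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_query import MultiQueryRetriever

PDF_PATH = "example.pdf"   # hypothetical input file
LLM_MODEL = "gemma3:4b"    # any model already pulled into Ollama

# Steps 2-3: load the PDF and split the text into overlapping chunks.
docs = PyPDFLoader(PDF_PATH).load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# Steps 4-5: embed each chunk with nomic-embed-text (served by Ollama) and store in Chroma.
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectordb = Chroma.from_documents(chunks, embedding=embeddings)

# Step 6: standard similarity search, or multi-query expansion for the "enhanced" mode,
# which rephrases the question several ways before searching.
llm = ChatOllama(model=LLM_MODEL)
standard_retriever = vectordb.as_retriever(search_kwargs={"k": 4})
enhanced_retriever = MultiQueryRetriever.from_llm(retriever=standard_retriever, llm=llm)

# Step 7: stuff the retrieved chunks into a prompt and ask the local LLM.
question = "What is this document about?"
context = "\n\n".join(doc.page_content for doc in enhanced_retriever.invoke(question))
answer = llm.invoke(
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(answer.content)
```

In the full app, the index is rebuilt only when the PDF or the selected model changes, and Streamlit's session state keeps the chat history between turns.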
This project has been tested locally using the following hardware configuration:
| Component | Model | Utilization |
|---|---|---|
| 🧠 CPU | AMD Ryzen 7 6800H | ~54% usage during inference |
| 🎮 GPU | NVIDIA GeForce RTX 3050 | ~40% usage with nomic-embed-text and LLMs |
⚡ Result: Smooth interaction, real-time responses, and fast embedding/answering — all on local resources.
✅ These results indicate the app runs efficiently on mid-tier hardware with no cloud dependency.
Make sure you have Python 3.10 and Ollama installed. Then:
```bash
# Install required packages
pip install -r requirements.txt

# Pull the necessary models (can take a few minutes the first time)
ollama pull nomic-embed-text
ollama pull arnold
ollama pull qwen3:8b
ollama pull gemma3:4b

# Run the app
streamlit run main.py
```