A Streamlit-based application that performs Retrieval-Augmented Generation (RAG) to answer questions based on uploaded documents. It also provides document summarization capabilities.
- Document Ingestion: Upload text-based documents (PDF, TXT, MD, HTML, DOCX).
- Automatic Summarization: Instantly generates a summary of the uploaded document using `MarkItDown` and GenAI.
- RAG Chatbot: Ask questions about the uploaded content. The app uses ChromaDB for vector storage and retrieval to provide accurate, context-aware answers.
- Persistent Storage: Vector embeddings are stored locally in `chroma_db`, so indexed content remains retrievable across sessions.
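
The ingest-then-retrieve flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in `chroma_services.py`: the function names, the id scheme, and the default collection name `documents` are assumptions, and it relies on Chroma's default embedding function, which the app may override.

```python
from typing import Any, List


def open_collection(path: str = "chroma_db", name: str = "documents") -> Any:
    """Open (or create) a collection backed by on-disk storage at `path`."""
    import chromadb  # imported lazily so the helpers below work without it

    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)


def ingest_chunks(collection: Any, chunks: List[str], source: str) -> List[str]:
    """Store text chunks under ids derived from the source file name (assumed scheme)."""
    ids = [f"{source}-{i}" for i in range(len(chunks))]
    collection.add(
        documents=chunks,
        ids=ids,
        metadatas=[{"source": source} for _ in chunks],
    )
    return ids


def retrieve(collection: Any, question: str, k: int = 3) -> List[str]:
    """Return the text of the k stored chunks closest to the question."""
    result = collection.query(query_texts=[question], n_results=k)
    # query() returns one list of documents per query text; we sent exactly one.
    return result["documents"][0]
```

Because `PersistentClient` writes to the given directory, a collection opened with the same path and name in a later session should still contain the chunks ingested earlier.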
- Frontend: Streamlit
- Vector Database: ChromaDB
- LLM Integration: OpenAI (via `tiktoken` for tokenization and standard API calls)
- Document Processing: `MarkItDown` for converting various file formats to text.
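
One common use of a tokenizer in a stack like this is splitting converted text into token-bounded chunks before embedding. Whether the app chunks this way is an assumption; the sketch below only shows the idea. `chunk_by_tokens`, `tiktoken_codec`, and `whitespace_codec` are hypothetical names, and the whitespace codec is a stand-in for trying the logic without `tiktoken` installed.

```python
from typing import Callable, List, Sequence, Tuple


def chunk_by_tokens(
    text: str,
    max_tokens: int,
    overlap: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> List[str]:
    """Split text into windows of at most max_tokens tokens, sharing `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    tokens = list(encode(text))
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def tiktoken_codec(encoding_name: str = "cl100k_base") -> Tuple[Callable, Callable]:
    """Encode/decode pair backed by tiktoken (the encoding name is an assumption)."""
    import tiktoken  # lazy import: only needed when this codec is requested

    enc = tiktoken.get_encoding(encoding_name)
    return enc.encode, enc.decode


def whitespace_codec() -> Tuple[Callable, Callable]:
    """Stand-in codec that treats each whitespace-separated word as one token."""
    return str.split, " ".join
```

For example, `chunk_by_tokens(text, 500, 50, *tiktoken_codec())` would produce chunks of up to 500 tokens with a 50-token overlap between neighbours.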
- Python 3.8+
- An OpenAI API Key
- Clone the repository:
    git clone https://github.com/abd-RAHEEM/RAGx.git
    cd RAGx
- Create and activate a virtual environment:
  Windows:
    create_venv.bat
  Or manually:
    python -m venv .venv
    .venv\Scripts\activate
  Linux/Mac:
    python3 -m venv .venv
    source .venv/bin/activate
- Install dependencies:
    pip install -r requirements.txt
- Set up Environment Variables: Create a `.env` file in the root directory (copy from a template if available, and keep secrets out of git):
    MODEL_API_KEY=your_openai_api_key_here
    CHROMA_COLLECTION_NAME=your_collection_name
- Run the application:
    streamlit run main.py
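
If the app fails at startup, a quick way to check the environment variables step is to load the two keys yourself. The sketch below is a standard-library stand-in, not the project's loader (which may well use a package such as `python-dotenv`); `parse_env` and `load_settings` are hypothetical helper names.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def load_settings(env_path: str = ".env") -> Dict[str, str]:
    """Return the two settings named in this README, failing loudly if one is missing."""
    path = Path(env_path)
    file_values = parse_env(path.read_text(encoding="utf-8")) if path.exists() else {}
    settings = {}
    for key in ("MODEL_API_KEY", "CHROMA_COLLECTION_NAME"):
        # A variable already set in the process environment wins over the file.
        value = os.environ.get(key) or file_values.get(key)
        if not value:
            raise RuntimeError(f"{key} is not set; add it to {env_path}")
        settings[key] = value
    return settings
```

Running `load_settings()` from the project root should either return both values or name the one that is missing.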
- Navigate to the App: Open your browser at http://localhost:8501.
- Ingest Documents:
  - Go to the Ingest page.
  - Upload a file.
  - View the summary.
  - Click "Upload & Ingest to Chroma DB" to index the content.
- Chat:
  - Switch to the Chatbot page.
  - Ask questions related to the document you just uploaded.
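
Behind the Chatbot page, a RAG app typically combines the retrieved passages with the user's question before calling the model. The sketch below shows one plausible way to assemble such a prompt. The function name `build_prompt`, the prompt wording, and the character budget are all assumptions, not taken from `genai_services.py`.

```python
from typing import List


def build_prompt(question: str, passages: List[str], max_chars: int = 4000) -> str:
    """Combine retrieved passages and the question into one grounded prompt."""
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        # Stop adding passages once the (assumed) character budget is exhausted.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts) if context_parts else "(no passages retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )
```

Numbering the passages makes it easier to ask the model to cite which passage supports its answer, and the budget keeps the prompt from growing without bound when many chunks are retrieved.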
RAGx/
├── .env # Environment variables (not contained in repo)
├── .gitignore # Git ignore file
├── activate.bat # Helper script to activate venv
├── create_venv.bat # Helper script to create venv
├── chroma_db/ # Persistent Vector DB storage
├── chroma_services.py # ChromaDB interaction logic
├── data.txt # Sample data file
├── genai_services.py # AI/LLM service logic
├── main.py # Entry point for Streamlit app
├── pages/ # Streamlit pages
│ ├── chatbot_page.py
│ └── ingest_page.py
└── requirements.txt # Python dependencies
- API Keys: Ensure your `.env` file is never pushed to public repositories. This project is configured with a `.gitignore` that excludes `.env` and the `chroma_db` directory.