CrawlAI RAG is an AI-powered website intelligence platform that allows users to crawl entire websites, index their content, and ask natural-language questions using Retrieval-Augmented Generation (RAG).
It transforms static websites into queryable knowledge bases.
- Crawls all internal pages of a website
- Extracts clean, readable text
- Uses vector embeddings for retrieval and an LLM for answer generation
- Answers are grounded in website content
- Reduces hallucinations by restricting answers to retrieved content
- Index multiple websites
- All content stored in a shared vector database
- Built with FastAPI
- ChromaDB for vector storage
- Built with Streamlit
- Clean, single-query interface
- Environment variables via `.env`
- API keys are never committed to GitHub
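As a rough illustration of the configuration approach, the sketch below reads a secret from the process environment and fails early if it is missing. It uses only the standard library; `require_env` is a hypothetical helper and `GROQ_API_KEY` is an assumed variable name, neither taken from the project itself.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# With python-dotenv installed, calling load_dotenv() (from the `dotenv`
# package) before this point copies entries from a local .env file into
# os.environ. GROQ_API_KEY is an assumed name, not one the project documents.
# api_key = require_env("GROQ_API_KEY")
```

Keeping the `.env` file listed in `.gitignore` is what keeps the key out of the repository; the code only ever sees the value through the environment.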
| Layer | Technology |
|---|---|
| Backend | FastAPI |
| Frontend | Streamlit |
| AI / RAG | LangChain |
| Vector Database | ChromaDB |
| Embeddings | Sentence-Transformers |
| LLM | Groq (LLaMA 3.3 70B) |
| Web Scraping | BeautifulSoup4 & Playwright |
| Configuration | python-dotenv |
- Enter a website URL
- Click Index Website
- Website content is crawled, chunked, and embedded
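The crawl step has to decide which links on a page belong to the same website. The sketch below shows one way to do that with the standard library only; it stands in for the BeautifulSoup4 and Playwright tooling the project lists, and the names `LinkCollector` and `internal_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url: str, html: str) -> list[str]:
    """Return unique absolute http(s) links that stay on page_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    site = urlparse(page_url).netloc
    found = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" suffixes.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        same_site = parts.scheme in ("http", "https") and parts.netloc == site
        if same_site and absolute not in found:
            found.append(absolute)
    return found
```

A crawler built on this would keep a queue of unvisited URLs, fetch each one, and feed the page back into `internal_links` until the queue is empty.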
Ask natural-language questions such as:
- What is this website about?
- List all services mentioned
- Who is the author?
The system answers using only the indexed website content, so responses stay grounded in what the site actually says.
- Website is crawled and text is extracted
- Text is split into manageable chunks
- Embeddings are generated and stored in ChromaDB
- User query retrieves the most relevant chunks
- LLM generates an answer using retrieved context
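Because retrieval happens before generation, the LLM answers from the retrieved chunks rather than from its training data alone. This is the Retrieval-Augmented Generation (RAG) pattern.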
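The steps above can be sketched end to end in a few lines. This is a toy version under stated assumptions: word-count vectors stand in for Sentence-Transformers embeddings, an in-memory list stands in for ChromaDB, and the final LLM call is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

In a real deployment the `embed` and `retrieve` functions would be replaced by an embedding model and a vector store query, but the flow of chunk, embed, retrieve, then prompt stays the same.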
- Website documentation Q&A
- Internal knowledge bases
- Research and analysis
- Client website intelligence
- Portfolio / demo RAG application
CrawlAI RAG
Built by Ankit Kumar Nayak
If you like this project:
- Give it a star
- Fork it
- Contribute or suggest improvements
