- Introduction & Project Philosophy
- Directory Structure
- System Architecture & Data Flow
- Agent Details
- Technology Stack
- Visual Documentation
- Local Installation Guide
- Roadmap
- Author
- License
Working with data usually means dealing with scattered sources: PDFs, wiki pages, SQL databases, CSVs, and the internet. Answering a real question often means pulling something from each of them, understanding the context, and then combining it all into a clear answer.
The Agentic Data Pipeline is built around this idea. Instead of relying on a single “super model,” the system uses multiple small, focused AI agents, each good at one thing—document retrieval, SQL queries, web search, or reasoning. Their outputs are combined by a final synthesizer agent that produces a single coherent answer.
The goal is simple:
Break the problem into specialized steps, run them in parallel, and combine the results intelligently.
- Frontend UI: https://agenticdatapipeline.lovable.app
- Backend API (Swagger): https://rauhan-agenticdatapipeline.hf.space/docs
Only core project files are included — unnecessary system files (.DS_Store, .venv/, caches) have been removed.
```
AgenticDataPipeline/
├── LICENSE
├── README.md
├── requirements.txt
├── Dockerfile
├── prompts.yaml
├── config.ini
├── .gitignore
├── .python-version
├── main.py
│
├── demo/
│   ├── demo1.png
│   ├── demo2.png
│   ├── demo3.png
│   ├── demo4.png
│   ├── demo5.png
│   ├── workflowDiagram.png
│   ├── frontend.png
│   ├── fastapiSwaggerUI.png
│   ├── langsmithDashboard.png
│   ├── langsmithTracing.png
│   ├── langgrapphMermaidExport.png
│   ├── postgreSQLAgentData.png
│   └── qdrantVectoDB.png
│
├── internalDocs/
│   ├── ai_job_dataset.csv
│   ├── Algorithm - Wikipedia.pdf
│   ├── Computer science - Wikipedia.pdf
│   ├── Data structure - Wikipedia.pdf
│   ├── Database - Wikipedia.pdf
│   ├── Operating system - Wikipedia.pdf
│   └── Programming language - Wikipedia.pdf
│
├── utils/
│   ├── __init__.py
│   ├── logger.py
│   ├── exceptions.py
│   └── initMethods.py
│
├── api/
│   ├── __init__.py
│   ├── models.py
│   └── services.py
│
├── notebooks/
│   ├── VectorDBPopulator.ipynb
│   ├── SQLPoplulator.ipynb
│   └── IndividualAgents.ipynb
│
└── src/
    ├── __init__.py
    ├── workflows/
    │   ├── __init__.py
    │   └── workflow.py
    └── components/
        ├── __init__.py
        ├── ragAgent.py
        ├── sqlAgent.py
        ├── reasoningAgent.py
        ├── internetSearchAgent.py
        └── synthesizerAgent.py
```
The system is implemented as a stateful Directed Acyclic Graph (DAG) using LangGraph, with FastAPI as the public-facing interface.
All agents exchange information through a shared state object:
```python
from typing import Optional, TypedDict

class AgentState(TypedDict):
    query: str
    ragResults: Optional[str]
    sqlResults: Optional[str]
    webResults: Optional[str]
    reasoningResults: Optional[str]
    finalAnswer: Optional[str]
```
1. The `/answerQuery` API receives a query.
2. A fresh `AgentState` is created.
3. LangGraph fans out into four parallel agents:
   - RAG
   - PostgreSQL
   - Internet Search
   - Reasoning
4. Each agent writes back into the state.
5. LangGraph waits until all four complete.
6. The Synthesizer Agent merges all outputs.
7. The final answer is returned.
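The fan-out/fan-in pattern above can be sketched with the standard library alone; the agent functions below are placeholders for the real components in `src/components/`, not the project's actual implementations.

```python
from concurrent.futures import ThreadPoolExecutor

def rag_agent(state):        # placeholder for ragAgent.py
    return {"ragResults": f"docs for: {state['query']}"}

def sql_agent(state):        # placeholder for sqlAgent.py
    return {"sqlResults": f"rows for: {state['query']}"}

def web_agent(state):        # placeholder for internetSearchAgent.py
    return {"webResults": f"links for: {state['query']}"}

def reasoning_agent(state):  # placeholder for reasoningAgent.py
    return {"reasoningResults": f"reasoning for: {state['query']}"}

def answer_query(query: str) -> dict:
    # A fresh state per request, mirroring AgentState.
    state = {"query": query, "ragResults": None, "sqlResults": None,
             "webResults": None, "reasoningResults": None, "finalAnswer": None}
    agents = [rag_agent, sql_agent, web_agent, reasoning_agent]
    # Fan out: run all four agents in parallel; fan in: wait for all to finish.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for partial in pool.map(lambda fn: fn(state), agents):
            state.update(partial)
    # Synthesizer step: merge everything into one answer.
    state["finalAnswer"] = " | ".join(
        state[k] for k in ("ragResults", "sqlResults", "webResults", "reasoningResults"))
    return state
```

In the real system this orchestration is handled by the compiled LangGraph in `src/workflows/workflow.py`, which gives the same fan-out/fan-in guarantees plus tracing.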
**RAG Agent**
- Model: Llama 3.3 70B
- Works with internal PDFs, wikis, and other documents
- Retrieval:
  - Dense: `BAAI/bge-large-en-v1.5` → Qdrant
  - Sparse: BM25
- Output → `ragResults`
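The idea behind combining dense and sparse retrieval can be illustrated with a toy score-fusion function. This is a sketch only: the actual agent queries Qdrant, and the `alpha` weighting shown here is an assumption, not the project's fusion method.

```python
def hybrid_merge(dense_scores: dict, sparse_scores: dict, alpha: float = 0.5) -> list:
    """Combine dense (embedding) and sparse (BM25) scores by weighted sum.

    Scores are assumed pre-normalized to [0, 1]; `alpha` weights the dense side.
    """
    doc_ids = set(dense_scores) | set(sparse_scores)
    fused = {
        doc: alpha * dense_scores.get(doc, 0.0)
             + (1 - alpha) * sparse_scores.get(doc, 0.0)
        for doc in doc_ids
    }
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)

ranking = hybrid_merge(
    dense_scores={"algorithms.pdf": 0.9, "databases.pdf": 0.4},
    sparse_scores={"databases.pdf": 0.8, "os.pdf": 0.3},
)
```

A document that scores moderately on both signals can outrank one that scores high on only one, which is the practical benefit of hybrid retrieval.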
**SQL Agent**
- Model: zai-glm-4.6
- Natural-language-to-SQL translation
- Uses SQLAlchemy + LangChain SQL toolkit
- Output → `sqlResults`
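Conceptually, the agent turns a natural-language question into SQL and executes it. The sketch below uses an in-memory SQLite table as a stand-in for the project's PostgreSQL database; the table name, columns, and generated query are illustrative assumptions.

```python
import sqlite3

# In-memory stand-in for the project's PostgreSQL job dataset (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ai_jobs (title TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO ai_jobs VALUES (?, ?)",
    [("Data Scientist", "CA"), ("ML Engineer", "CA"), ("Data Scientist", "NY")],
)

# The SQL agent would produce a query like this from a question such as
# "How many data scientist jobs are available in California?"...
generated_sql = (
    "SELECT COUNT(*) FROM ai_jobs "
    "WHERE title = 'Data Scientist' AND state = 'CA'"
)

# ...then execute it and write the result back into the shared state.
count = conn.execute(generated_sql).fetchone()[0]
sqlResults = f"{count} matching jobs"
```

In the real pipeline, SQLAlchemy handles the connection and the LangChain SQL toolkit supplies the schema context the model needs to write valid queries.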
**Reasoning Agent**
- Model: qwen-3-32b
- Uses chain-of-thought reasoning
- Output → `reasoningResults`
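A chain-of-thought agent is mostly a matter of prompting. The template below is a hypothetical example of the pattern; the project's actual prompts live in `prompts.yaml`.

```python
# Illustrative chain-of-thought template (NOT the project's real prompt).
COT_TEMPLATE = (
    "You are a reasoning agent.\n"
    "Question: {query}\n"
    "Think step by step, then state your conclusion on the final line."
)

def build_reasoning_prompt(query: str) -> str:
    # Fill the template with the user's query from the shared state.
    return COT_TEMPLATE.format(query=query)

prompt = build_reasoning_prompt("Why run agents in parallel?")
```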
**Internet Search Agent**
- Uses Google Serper API
- Output → `webResults`
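A Serper search call is a POST with the query in a JSON body and the key in an `X-API-KEY` header. The sketch below only builds the request object without sending it; the API key is a placeholder.

```python
import json
import urllib.request

def build_serper_request(query: str, api_key: str) -> urllib.request.Request:
    # Endpoint and header names follow Serper's public HTTP API.
    return urllib.request.Request(
        "https://google.serper.dev/search",
        data=json.dumps({"q": query}).encode(),
        headers={"X-API-KEY": api_key, "Content-Type": "application/json"},
        method="POST",
    )

req = build_serper_request(
    "data scientist jobs California", "YOUR_SERPER_API_KEY"  # placeholder key
)
```

Sending it with `urllib.request.urlopen(req)` (or `requests`) returns JSON search results the agent summarizes into `webResults`.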
**Synthesizer Agent**
- Model: gpt-oss-120b
- Merges everything from:
  - Internal docs + SQL
  - Internet Search
  - Reasoning
- Output → `finalAnswer`
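The synthesis step boils down to assembling every populated field of the shared state into one prompt for the final model. The function below is a minimal sketch of that assembly; the section headings and wording are assumptions, as the real prompt lives in `prompts.yaml`.

```python
# Illustrative synthesis-prompt builder (NOT the project's real prompt).
def build_synthesis_prompt(state: dict) -> str:
    sections = [
        ("Internal documents", state.get("ragResults")),
        ("SQL results", state.get("sqlResults")),
        ("Web search", state.get("webResults")),
        ("Reasoning", state.get("reasoningResults")),
    ]
    # Skip agents that produced nothing, so the model only sees real evidence.
    body = "\n\n".join(f"## {name}\n{text}" for name, text in sections if text)
    return f"Question: {state['query']}\n\n{body}\n\nWrite one coherent answer."

prompt = build_synthesis_prompt({
    "query": "What is an algorithm?",
    "ragResults": "An algorithm is a finite procedure...",
    "sqlResults": None,
    "webResults": "Top result: Wikipedia on algorithms",
    "reasoningResults": "Step 1: define terms...",
})
```

Dropping empty sections keeps the synthesizer from hallucinating around missing evidence when an upstream agent returns nothing.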
| Layer | Technology | Reason |
|---|---|---|
| API | FastAPI, Uvicorn | Fast, async, clean docs |
| Workflow | LangGraph | Parallel stateful orchestration |
| VectorDB | Qdrant | Dense + sparse hybrid retrieval |
| DB | PostgreSQL | Reliable relational backend |
| LLM Hosting | Cerebras | Fast and affordable inference |
| Containers | Docker | Reproducible deployments |
These screenshots show how the system responds to different user queries.
- Python 3.12+
- Docker Engine
```bash
git clone https://github.com/RauhanAhmed/AgenticDataPipeline.git
cd AgenticDataPipeline
```

You'll need keys for the following, stored in a `.env` file:
- QDRANT_API_KEY
- QDRANT_URL
- POSTGRE_CONNECTION_STRING
- CEREBRAS_API_KEY
- SERPER_API_KEY
- Additional LLM keys as needed
Create a virtual environment and install dependencies:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Build and run the container:

```bash
docker build -t agentic-data-pipeline .
docker run -p 7860:7860 -d --env-file .env agentic-data-pipeline
```

Populate the data stores by running the notebooks:

- notebooks/VectorDBPopulator.ipynb
- notebooks/SQLPoplulator.ipynb
Then query the API:

```bash
curl -X POST http://localhost:7860/answerQuery \
  -H "Content-Type: application/json" \
  -d '{"query": "How many data scientist jobs are available in California?"}'
```

- Add Reciprocal Rank Fusion (RRF)
- Feedback-driven fine-tuning loop
- New specialized agents (filesystem, JIRA, etc.)
- Optional Streamlit/Gradio UI
Built by Rauhan Ahmed. Portfolio: https://rauhanahmed.in
Contributions welcome — feel free to open an issue or PR.
MIT License. See LICENSE.