A high-performance, production-ready Retrieval-Augmented Generation (RAG) system built with FastAPI, featuring hybrid search, semantic reranking, and bilingual support (French/Arabic).
Built with FastAPI + Python + DeepSeek/Gemini + FAISS + BGE-M3.
Flask-Forsa is a production-grade RAG system designed for high-accuracy question answering with strict source attribution.
- Hybrid Retrieval: Semantic + lexical search fusion with MinMax normalization
- Smart Reranking: BGE-reranker-v2-m3 cross-encoder for precision
- Bilingual Support: Native French & Arabic processing
- Parallel Processing: Handle 50+ questions concurrently via ThreadPoolExecutor
- JSON Mode: Structured, reliable LLM outputs with validation
- Answer Validation: Grounding checks & hallucination detection
- Source Attribution: Evidence-based answers with score-based citations
- Document Processing: PDF, DOCX, TXT support with Gemini semantic chunking
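Retrieval, embedding, and indexing run locally, giving you full control over your data pipeline; LLM generation (DeepSeek) and document chunking (Gemini) call external APIs.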
Request flow, from top to bottom:
- FastAPI REST API Endpoint (structured chat + asyncio.gather parallelism): dispatches to the Query Service and the RAG Service
- Query Service (enhancement): normalize, extract, variations; feeds the Embedding Service
- RAG Service (hybrid): semantic, lexical, merge; feeds the Embedding Service and the Vector Store
- Embedding Service (BGE-M3, 1024d): GPU accelerated, batch processing
- Vector Store (FAISS index): flat index, cosine similarity; feeds the Reranker Service
- Reranker Service (BGE Reranker V2): cross-encoder, score override; feeds Validation Utils
- Validation Utils: scope filter, grounding check; feeds the LLM Service
- LLM Service: DeepSeek V3.1, JSON mode, retry logic
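The "JSON mode" and "retry logic" items of the LLM Service can be illustrated with a minimal sketch. It assumes the model call is wrapped in a plain function that returns raw text; `call_with_json_retry`, `call_llm`, and the `"answer"` key are hypothetical names chosen for this example, not the project's actual API.

```python
import json
import time
from typing import Callable


def call_with_json_retry(
    call_llm: Callable[[str], str],          # hypothetical wrapper around the LLM API
    prompt: str,
    required_keys: tuple[str, ...] = ("answer",),  # assumed response schema
    max_attempts: int = 3,
    backoff_s: float = 0.0,
) -> dict:
    """Call the model until it returns a JSON object with the required keys."""
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            missing = [key for key in required_keys if key not in parsed]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return parsed
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            last_error = exc
            if attempt < max_attempts:
                time.sleep(backoff_s * attempt)  # linear backoff between attempts
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

Validating the parsed object before returning it keeps malformed model output from reaching the response builder; the caller only ever sees a dict with the expected keys or an exception.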
| Feature | Algorithm | Description |
|---|---|---|
| Hybrid Search | Weighted Semantic + Lexical | MinMax normalization with 0.65/0.35 weights |
| Reranking | BGE Cross-Encoder Scoring | Neural reranking with score replacement |
| Query Enhancement | Multi-pass Normalization | Key term extraction + variations |
| Scope Filtering | Entity + Family Detection | Pre-filters chunks by question context |
| Answer Validation | Grounding + Citation Analysis | Checks answer support in sources |
| Deduplication | Hash-based Chunk Keys | SHA1 content hashing prevents duplicates |
| Document Chunking | Gemini Semantic Splitting | Context-aware chunk boundaries |
| Parallel Execution | asyncio.gather + ThreadPool | 50 concurrent workers for I/O operations |
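The hybrid search row can be made concrete with a short sketch of MinMax normalization followed by a 0.65/0.35 weighted sum. The function names and the dictionary-based score format are assumptions for illustration; the actual logic in `rag_service.py` may differ.

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score list maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def fuse(
    semantic: dict[str, float],
    lexical: dict[str, float],
    w_sem: float = 0.65,   # HYBRID_W_SEM
    w_lex: float = 0.35,   # HYBRID_W_LEX
) -> list[tuple[str, float]]:
    """Combine normalized semantic and lexical scores, best first."""
    sem_n, lex_n = minmax(semantic), minmax(lexical)
    fused = {
        chunk_id: w_sem * sem_n.get(chunk_id, 0.0) + w_lex * lex_n.get(chunk_id, 0.0)
        for chunk_id in set(sem_n) | set(lex_n)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Chunk "b" is second semantically but best lexically, so it overtakes "a".
ranked = fuse({"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "c": 3.0})
```

Normalizing each score list separately matters because cosine similarities and lexical scores live on different scales; without it the weights would not mean what they say.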
- Python 3.10+
- CUDA-compatible GPU (recommended for embeddings)
# Clone repository
git clone <repository-url>
cd flask-forsa
# Create virtual environment
python -m venv venv
# Activate (Windows)
venv\Scripts\activate
# Activate (Linux/Mac)
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt

Create .env.local:
# LLM API Keys
DEEPSEEK_TOKEN=your_deepseek_api_token
DEEPSEEK_API_URL=https://api.modelarts-maas.com/v2/chat/completions
GEMINI_API_KEY=your_gemini_api_key

# Add documents to data/ folder
mkdir -p data
# Copy your PDFs, DOCX files here
# Run preprocessing with Gemini
python preprocess.py

# Index your preprocessed chunks
python build_index.py

# Start FastAPI server
python main.py
# Server runs at: http://localhost:8000
# API Docs: http://localhost:8000/docs

# In a new terminal
python gradio_app.py
# UI runs at: http://localhost:7860

Endpoint: POST /api/chat/
{
"equipe": "GhostTruth",
"question": {
"category_1": {
"q1": "Quel est le dossier administratif requis pour Idoom?",
"q2": "Quels sont les avantages de l'offre Idoom Fibre?"
},
"category_2": {
"q1": "Comment souscrire à Idoom 4G LTE?"
}
},
"include_sources": false
}

Response:
{
"equipe": "GhostTruth",
"reponses": {
"category_1": {
"q1": "Le dossier administratif comprend...",
"q2": "Les avantages de l'offre Idoom Fibre incluent..."
},
"category_2": {
"q1": "Pour souscrire à Idoom 4G LTE, vous devez..."
}
},
"sources": null
}

Set "include_sources": true to receive the supporting sources for each answer:
{
"sources": {
"category_1": {
"q1": [
{
"file": "guide_idoom_2024.pdf",
"score": 0.8543,
"evidence": "Le dossier administratif requis comprend une copie de la CIN..."
},
{
"file": "procedures_algerie_telecom.pdf",
"score": 0.7821,
"evidence": "Documents nécessaires : justificatif de domicile, CIN..."
}
]
}
}
}

Edit config.py to customize behavior:
TOP_K_SEMANTIC = 20 # Semantic search candidates
TOP_K_LEXICAL = 20 # Lexical search candidates
HYBRID_W_SEM = 0.65 # Semantic weight (0-1)
HYBRID_W_LEX = 0.35 # Lexical weight (0-1)
SIMILARITY_THRESHOLD = 0 # Minimum hybrid score (0 = disabled)

USE_RERANKER = False # Enable/disable reranking
RERANKER_MODEL = "BAAI/bge-reranker-v2-m3" # Cross-encoder model

ENABLE_ANSWER_FIXER = False # Strict RAG post-processing (not used)
ANSWER_FIXER_TEMPERATURE = 0.1 # LLM temperature for fixes

EMBEDDING_MODEL_NAME = "BAAI/bge-m3" # Embedding model
EMBEDDING_DIMENSION = 1024 # Embedding dimensions
INDEX_TYPE = "Flat" # FAISS index type
FAISS_INDEX_PATH = "db/faiss_index.bin"

MIN_SOURCES = 1 # Minimum sources to return
MAX_SOURCES = 6 # Maximum sources to return

MODEL_NAME = "deepseek-v3.1" # DeepSeek model
GEMINI_MODEL = "gemini-2.5-flash" # Gemini model for chunking
ENABLE_LLM_WARMUP = False # Warmup on startup

| Metric | Performance |
|---|---|
| Parallel Questions | Up to 50 concurrent |
| Avg Latency (uncached) | ~45-65s per question |
| Retrieval Speed | ~50s (hybrid + rerank) |
| Reranking Speed | ~40-60s (BGE reranker) |
| LLM Generation | 1-3s (DeepSeek) |
| Embedding Batch (GPU) | ~100 docs/s |
| Index Search (FAISS) | <1s |
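Note: The current configuration disables reranking (USE_RERANKER = False) and sets SIMILARITY_THRESHOLD = 0, which turns off score filtering. The reranking timings above apply only when reranking is enabled.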
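The "up to 50 concurrent" figure rests on combining asyncio.gather with a thread pool. A minimal sketch of that pattern follows; `answer_question` is a hypothetical placeholder for the blocking retrieve-and-generate pipeline, and the nested category/question layout mirrors the request format of POST /api/chat/.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 50  # matches the concurrency figure quoted in this README


def answer_question(question: str) -> str:
    """Hypothetical stand-in for the blocking retrieval + LLM call."""
    return f"answer to: {question}"


async def answer_all(questions: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Answer every question in parallel and rebuild the nested structure."""
    loop = asyncio.get_running_loop()
    keys = [(cat, qid) for cat, group in questions.items() for qid in group]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        tasks = [
            loop.run_in_executor(pool, answer_question, questions[cat][qid])
            for cat, qid in keys
        ]
        results = await asyncio.gather(*tasks)  # results keep the order of `tasks`
    answers: dict[str, dict[str, str]] = {}
    for (cat, qid), answer in zip(keys, results):
        answers.setdefault(cat, {})[qid] = answer
    return answers
```

Running the blocking work in threads keeps the event loop free to accept other requests while the I/O-bound calls are in flight.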
Automatically processes queries via query_service:
- Normalization: Lowercase, diacritics removal
- Key Terms: Extracts important words for lexical search
- Variations: Generates alternative phrasings
- Type Detection: Identifies question category
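The normalization and key-term steps above could look roughly like the sketch below. It uses only the standard library; the stopword list is a tiny hypothetical sample, and the real `query_service` may use different rules.

```python
import re
import unicodedata

# Hypothetical, deliberately small stopword list (already lowercased, no accents).
STOPWORDS = {
    "le", "la", "les", "l", "de", "des", "du", "un", "une", "est", "sont",
    "quel", "quels", "quelle", "quelles", "pour", "a", "comment",
}


def normalize(text: str) -> str:
    """Lowercase, strip diacritics, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", stripped).strip()


def key_terms(text: str) -> list[str]:
    """Keep the content words that are useful for lexical search."""
    tokens = re.findall(r"\w+", normalize(text))
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


terms = key_terms("Quel est le dossier administratif requis pour Idoom?")
# terms == ["dossier", "administratif", "requis", "idoom"]
```

Because diacritics are removed before tokenizing, "à" and "a" are treated the same, which is why the stopword list stores unaccented forms.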
Intelligently filters chunks via validation_utils:
- Entity Detection: Companies, products, services
- Document Families: Categorizes by topic
- Question Type: Matches chunk relevance
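As a rough illustration of entity-based scope filtering, the sketch below keeps only chunks that mention the product named in the question. `KNOWN_ENTITIES` is a hypothetical catalog, and simple substring matching stands in for whatever detection `validation_utils` actually performs.

```python
# Hypothetical entity catalog; a real one would be derived from the corpus.
KNOWN_ENTITIES = ("idoom fibre", "idoom 4g lte", "idoom")


def detect_entities(text: str) -> set[str]:
    """Return the most specific known entities mentioned in the text."""
    lowered = text.lower()
    found = {entity for entity in KNOWN_ENTITIES if entity in lowered}
    # Drop a generic entity when a more specific one containing it was found.
    return {e for e in found if not any(e != other and e in other for other in found)}


def scope_filter(question: str, chunks: list[str]) -> list[str]:
    """Keep chunks whose entities cover the question's entities."""
    wanted = detect_entities(question)
    if not wanted:
        return chunks  # nothing to filter on
    kept = [
        chunk for chunk in chunks
        if any(q in c for q in wanted for c in detect_entities(chunk))
    ]
    return kept or chunks  # fall back rather than return nothing
```

The fallback to the unfiltered list is a design choice in this sketch: an over-strict filter that returns no chunks would otherwise force a "no information" answer.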
Multi-layer validation in validation_utils:
- Grounding Check: Verifies answer in sources
- Hallucination Detection: Flags unsupported claims
- Availability Check: Detects "no info" responses
- Citation Accuracy: Validates source references
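A grounding check can be approximated with token overlap between the answer and the retrieved sources. This is a simplified sketch under that assumption; the function names and the 0.6 threshold are hypothetical, and the project's own check may be more elaborate.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def grounding_score(answer: str, sources: list[str]) -> float:
    """Fraction of the answer's longer words that appear in any source."""
    answer_tokens = {t for t in _tokens(answer) if len(t) > 3}  # skip short function words
    if not answer_tokens:
        return 0.0
    source_tokens = set().union(*(_tokens(s) for s in sources))
    return len(answer_tokens & source_tokens) / len(answer_tokens)


def is_grounded(answer: str, sources: list[str], threshold: float = 0.6) -> bool:
    """Flag answers whose wording is mostly unsupported by the sources."""
    return grounding_score(answer, sources) >= threshold
```

Token overlap is cheap and language-agnostic, but it cannot detect a claim that reuses source vocabulary while changing its meaning, so it works best as a first filter.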
Advanced chunking with Gemini in preprocess.py:
- Semantic Boundaries: Context-aware splits
- Metadata Extraction: File, page, section info
- Multi-format: PDF, DOCX, TXT support via document_processor
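Once chunks are produced, the hash-based deduplication listed in the algorithm table (SHA1 content hashing) can be sketched as follows. The whitespace and case canonicalization and the `"text"`/`"key"` field names are assumptions made for this example.

```python
import hashlib


def chunk_key(text: str) -> str:
    """SHA1 of the chunk text after collapsing whitespace and lowercasing."""
    canonical = " ".join(text.split()).lower()
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


def deduplicate(chunks: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct chunk, preserving order."""
    seen: set[str] = set()
    unique: list[dict] = []
    for chunk in chunks:
        key = chunk_key(chunk["text"])
        if key not in seen:
            seen.add(key)
            unique.append({**chunk, "key": key})  # attach the key for later lookups
    return unique
```

SHA1 is used here only as a content fingerprint, not for security, so its known collision weaknesses are not a concern at this scale.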
flask-forsa/
├── api/
│   ├── chat.py # Main structured chat endpoint
│   ├── embeddings.py # Embedding utilities
│   └── cache.py # Cache management endpoints
├── services/
│   ├── rag_service.py # Hybrid retrieval engine
│   ├── llm_service.py # LLM integration (DeepSeek)
│   ├── embedding_service.py # BGE-M3 embeddings
│   ├── vector_store.py # FAISS vector operations
│   ├── reranker_service.py # BGE reranker (optional)
│   ├── query_service.py # Query enhancement
│   └── document_processor.py # PDF/DOCX extraction
├── utils/
│   ├── validation_utils.py # Answer validation & scope filtering
│   ├── answer_fixer.py # Post-processing (not used)
│   └── logging_utils.py # Performance tracking decorator
├── data/ # Document storage
├── db/ # FAISS index + metadata
├── schemas.py # Pydantic models
├── config.py # Configuration
├── prompts.py # System prompts
├── preprocess.py # Gemini-based document chunking
├── build_index.py # FAISS index builder
├── main.py # FastAPI app
├── gradio_app.py # Gradio UI
├── test_preprocess.py # Gemini test suite
└── README.md # This file
python test_preprocess.py

curl -X POST http://localhost:8000/api/chat/ \
-H "Content-Type: application/json" \
-d '{
"equipe": "TestTeam",
"question": {
"test": {
"q1": "Comment souscrire à Idoom?"
}
},
"include_sources": true
}'

# Terminal 1: Start API
python main.py
# Terminal 2: Start Gradio
python gradio_app.py

Visit http://localhost:7860 for the web interface.
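The same chat request can be sent from Python instead of curl. The sketch below uses only the standard library and assumes the API is running at the default address shown above; `build_payload` and `post_chat` are hypothetical helper names, and the generous timeout is a guess based on the latency figures in this README.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/chat/"


def build_payload(team: str, questions: dict[str, dict[str, str]],
                  include_sources: bool = False) -> dict:
    """Assemble the request body expected by POST /api/chat/."""
    return {"equipe": team, "question": questions, "include_sources": include_sources}


def post_chat(payload: dict, url: str = API_URL, timeout: float = 300.0) -> dict:
    """Send the payload as JSON and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_payload("TestTeam", {"test": {"q1": "Comment souscrire à Idoom?"}}, True)
# With the server running: answers = post_chat(payload)["reponses"]
```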
In config.py:
DEBUG_LOG_RETRIEVAL = True

- Retrieval Transparency: Hybrid scores, semantic/lexical breakdown
- Query Enhancement: Normalization steps, key terms, variations
- Validation Results: Scope filtering, grounding checks
- Performance Metrics: Timing for each pipeline stage via @log_execution_time
- Token Usage: DeepSeek input/output tokens
Logs are configured via logger_config.py and output to console with INFO level.
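A timing decorator in the spirit of @log_execution_time might look like the sketch below. The decorator name comes from this README, but the body, the logger name, and the message format are assumptions; the version in `logging_utils.py` may differ.

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline")  # hypothetical logger name


def log_execution_time(func):
    """Log how long the wrapped function takes, at INFO level."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call succeeds or raises, so failures are timed too.
            elapsed = time.perf_counter() - start
            logger.info("%s took %.3fs", func.__name__, elapsed)
    return wrapper


@log_execution_time
def retrieve(query: str) -> list[str]:
    """Hypothetical pipeline stage used to demonstrate the decorator."""
    return [query.lower()]
```

Using `time.perf_counter` rather than `time.time` gives a monotonic, high-resolution clock, which is the usual choice for measuring elapsed intervals.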
This is a private project. Unauthorized use, reproduction, or distribution is prohibited.
For authorized contributors:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit changes (git commit -m 'Add AmazingFeature')
- Push to branch (git push origin feature/AmazingFeature)
- Open a Pull Request
© 2025. All rights reserved.
Unauthorized use, reproduction, modification, or distribution
of this project or its source code is strictly prohibited.
AI & ML Models:
- BAAI/bge-m3: Multilingual embeddings (1024d)
- BAAI/bge-reranker-v2-m3: Cross-encoder reranking
- DeepSeek V3.1: Advanced language model with JSON mode
- Google Gemini 2.5 Flash: Semantic document chunking
Infrastructure:
- FAISS: Facebook AI vector similarity search
- FastAPI: Modern async Python web framework
- Gradio: ML web interfaces
- PyMuPDF: PDF processing and text extraction
- Transformers: HuggingFace ML model library
Built with ❤️ for production-grade RAG applications
Powered by hybrid search, semantic reranking, and bilingual intelligence.