Summary
The current codebase has no retrieval layer. This issue tracks the implementation of a hybrid retrieval engine that fetches relevant GA4GH policy chunks from the existing ChromaDB vector store for
each compliance check.
What Will Be Implemented
- Semantic search via ChromaDB (meaning + paraphrasing)
- BM25 keyword search via rank-bm25 (exact legal terms)
- EnsembleRetriever combining both 50/50
- Category and subcategory filtering to scope results per check type
- Tests to verify retrieval correctness
Why Hybrid Search
Pure semantic search misses exact legal terms like "opt-out", "withdrawal", "de-identification" that are critical for compliance checking. BM25 handles this. Combined they cover both meaning and exact terminology.
Builds On
#10 - document ingestion pipeline