A fast web-search RAG workbench inspired by ChatGPT Search:
- rewrites the user question into targeted search queries
- searches multiple providers concurrently
- fetches and extracts source pages in parallel
- uses a lightweight query planner before answer generation
- exposes search controls for domains, recency, locale, and citation verifier choice
- evaluates retrieval quality with a CRAG-style corrective pass
- runs multi-step retrieval in Deep Research mode
- integrates with Chromium through a local search URL and unpacked extension
- fuses multi-query search results with reciprocal rank fusion before fetching
- ranks passages with source-aware contextual BM25 signals
- packs answer context with query-aware compression and long-context reordering
- reranks passages with hybrid lexical scoring and source-quality signals
- generates a cited answer from retrieved evidence and returns claim-level citation checks
- falls back to extractive answers when no LLM API key is configured
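
The reciprocal rank fusion step listed above can be illustrated with a short sketch. This is not SignalRAG's own code: the function name `reciprocal_rank_fusion`, the constant `k=60` (a value commonly used with RRF), and the example URLs are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked URL lists into one ranking.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a
    documented SignalRAG setting.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        seen = set()
        for rank, url in enumerate(ranking, start=1):
            if url in seen:  # count each URL once per list
                continue
            seen.add(url)
            scores[url] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by URL for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    # One result list per query rewrite (hypothetical URLs).
    per_query_results = [
        ["a.example/docs", "b.example/blog", "c.example/faq"],
        ["b.example/blog", "a.example/docs"],
        ["b.example/blog", "d.example/news"],
    ]
    print(reciprocal_rank_fusion(per_query_results))
```

A URL that appears in several rewrites accumulates score from each list, so it is fetched before a URL that ranks first for only one rewrite.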
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python -m fast_rag.app

Open http://127.0.0.1:8000.
SignalRAG exposes a search-engine-compatible URL:
http://127.0.0.1:8000/engine?q=%s&mode=pro
It also includes a Manifest V3 extension in
extensions/signalrag-chromium with:
- `sr` omnibox keyword search.
- selected-text context menu search.
- browser side panel search.
- extension options for the local API URL and default mode.
Load it from chrome://extensions with Developer mode and "Load unpacked".
See extensions/signalrag-chromium/README.md.
Set these environment variables before starting the server:
export OPENAI_API_KEY="..."
export OPENAI_MODEL="gpt-4.1-mini"
export DEEPSEEK_API_KEY="..."
export DEEPSEEK_MODEL="deepseek-v4-flash"
export DEEPSEEK_PLANNER_MODEL="deepseek-v4-flash"
export DEEPSEEK_VERIFIER_MODEL="deepseek-v4-flash"
export BRAVE_API_KEY="..."

Without API keys, the app still works using DuckDuckGo, Bing, and Yahoo HTML search fallbacks plus an extractive cited answer.
If both DeepSeek and OpenAI keys are present, DeepSeek is used first by default. Override with:
export LLM_PROVIDER="openai" # or deepseek / auto

curl -X POST http://127.0.0.1:8000/api/search \
-H "Content-Type: application/json" \
-d '{
"query":"how does ChatGPT search work",
"mode":"pro",
"lens":"official",
"max_results":10,
"include_domains":["openai.com","help.openai.com"],
"exclude_domains":["medium.com"],
"recency":"year",
"country":"us",
"language":"en",
"citation_verifier":"auto"
}'

Modes:
- fast: low latency, fewer pages, shorter timeouts.
- pro: balanced mode for fresher, comparative, or multi-hop questions.
- deep: Deep Research mode. It builds several focused research steps, runs them in parallel, dedupes the evidence, and returns research_trace.
The API response includes query_plan, which reports the planner's inferred intent,
freshness need, search depth, and DeepSeek reasoning effort:
- none: thinking disabled for simple lookups.
- high: thinking enabled for comparisons, recommendations, API/code guidance, multi-hop synthesis, or uncertainty.
- max: thinking enabled with max effort only for deep research, long-horizon tasks, formal proof, or many constraints.
Deep Research mode always enables DeepSeek thinking mode with at least
reasoning_effort: high, and uses max only when the planner marks the query
as a complex long-horizon or max-effort task. This keeps Deep Research stable
while still using DeepSeek's OpenAI-format controls:
{"thinking":{"type":"enabled"}} plus reasoning_effort of high or max.
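
As a rough sketch of the rule above, the planner's effort level could be mapped to request fields like this. The function name `thinking_controls` is hypothetical, and the `{"type": "disabled"}` payload for the `none` level is an assumption; only the enabled form and the `high`/`max` values are stated in this README.

```python
EFFORT_ORDER = {"none": 0, "high": 1, "max": 2}


def thinking_controls(planner_effort, mode):
    """Map the planner's effort level to DeepSeek-style request fields."""
    if planner_effort not in EFFORT_ORDER:
        raise ValueError(f"unknown reasoning effort: {planner_effort!r}")
    effort = planner_effort
    # Deep Research never drops below high effort.
    if mode == "deep" and EFFORT_ORDER[effort] < EFFORT_ORDER["high"]:
        effort = "high"
    if effort == "none":
        # Assumption: thinking is switched off with type "disabled".
        return {"thinking": {"type": "disabled"}}
    return {"thinking": {"type": "enabled"}, "reasoning_effort": effort}


if __name__ == "__main__":
    print(thinking_controls("none", "fast"))
    print(thinking_controls("none", "deep"))
    print(thinking_controls("max", "deep"))
```

Clamping only upward in `deep` mode keeps the planner's `max` decision intact while guaranteeing the `high` floor.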
The request can include:
- lens: web, official, academic, forums, news, pdf, or finance. Lenses add intent-specific query rewrites while still respecting explicit domain, recency, country, and language controls.
- include_domains / exclude_domains: allowlist or denylist domains.
- recency: any, day, week, month, or year.
- country / language: two-letter locale hints for supported providers.
- citation_verifier: auto, lexical, or deepseek.
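
A client can check a payload against the documented values before sending it. The sketch below is a hypothetical client-side helper (`validate_search_request` is not part of SignalRAG), and the rule that a domain may not be both included and excluded is an assumption rather than documented server behavior.

```python
LENSES = {"web", "official", "academic", "forums", "news", "pdf", "finance"}
RECENCY = {"any", "day", "week", "month", "year"}
VERIFIERS = {"auto", "lexical", "deepseek"}


def validate_search_request(payload):
    """Return a list of problems found in a search payload (empty if none)."""
    problems = []
    if not str(payload.get("query", "")).strip():
        problems.append("query is required")
    checks = (("lens", LENSES), ("recency", RECENCY), ("citation_verifier", VERIFIERS))
    for field, allowed in checks:
        value = payload.get(field)
        if value is not None and value not in allowed:
            problems.append(f"{field} must be one of {sorted(allowed)}")
    for field in ("country", "language"):
        value = payload.get(field)
        is_code = isinstance(value, str) and len(value) == 2 and value.isalpha()
        if value is not None and not is_code:
            problems.append(f"{field} must be a two-letter code")
    # Assumed rule: the same domain should not be allowlisted and denylisted.
    include = set(payload.get("include_domains") or [])
    exclude = set(payload.get("exclude_domains") or [])
    overlap = sorted(include & exclude)
    if overlap:
        problems.append(f"domains both included and excluded: {overlap}")
    return problems


if __name__ == "__main__":
    print(validate_search_request({"query": "git rebase vs merge", "lens": "official"}))
    print(validate_search_request({"query": "", "recency": "decade"}))
```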
The response includes:
- crag: retrieval quality before and, if needed, after a corrective search.
- research_trace: the per-step trace used by Deep Research mode.
- context_packing: answer-context compression stats, including strategy, packed evidence count, budget, packed characters, and compression ratio.
- candidate_citations[].signals.trust_tier: source credibility tier used by ranking and CRAG, such as government, academic, standards, official_docs, medical, news_wire, reference, general, or low_signal.
- claim_citations: per-claim citation trace. With DeepSeek configured, auto uses a judge model for supported/weak/contradicted/insufficient decisions; otherwise it falls back to the fast lexical verifier.
SignalRAG scores source credibility before final passage ranking. The trust tiers are intentionally conservative:
- government: .gov, .mil, .edu-adjacent public institutions, major public agencies, regulators, and intergovernmental institutions such as CDC, FDA, NIH, NIST, SEC, WHO, UN, IMF, OECD, and World Bank.
- academic: research repositories, journals, and scholarly publishers such as arXiv, ACL Anthology, Nature, Science, NEJM, JAMA, BMJ, PubMed/NCBI, Cell, Springer, and ScienceDirect.
- standards: standards and security bodies such as W3C, IETF, ISO, OWASP, MITRE, NIST CSRC, and CISA.
- official_docs: first-party product or developer documentation, including OpenAI, DeepSeek, Anthropic, Google, Microsoft, AWS, GitHub, Python, Perplexity, Tavily, LlamaIndex, and Ragas docs.
- medical: evidence-oriented public medical references such as MedlinePlus, Mayo Clinic, Cleveland Clinic, MSD Manuals, and NCI.
- news_wire: high-accountability news and public media sources such as AP, Reuters, BBC, and NPR.
- reference: broad reference sources such as Britannica and Wikipedia. These are useful for orientation but are boosted less than primary sources.
- low_signal: social, forum, or open publishing domains such as Reddit, Quora, Medium, Substack, and Pinterest. These can still be useful for experience-oriented queries, but are not treated as authoritative evidence.
These tiers are based on source-evaluation principles from Google Search's
E-E-A-T guidance, academic credibility guidance that prioritizes .edu, .gov,
and peer-reviewed evidence, and NCI guidance that health information should
come from government agencies, hospitals, universities, medical journals, and
professional societies.
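
A minimal sketch of how a URL might be mapped to one of these tiers follows. The lookup tables hold only a small illustrative subset of the sources named above, and the names `TIER_BY_DOMAIN`, `TIER_BY_SUFFIX`, and `trust_tier` are hypothetical; the project's real mapping and matching rules may differ.

```python
from urllib.parse import urlparse

# Illustrative subset of the tiers described above (assumed mapping).
TIER_BY_DOMAIN = {
    "cdc.gov": "government",
    "arxiv.org": "academic",
    "w3.org": "standards",
    "docs.python.org": "official_docs",
    "medlineplus.gov": "medical",
    "reuters.com": "news_wire",
    "wikipedia.org": "reference",
    "reddit.com": "low_signal",
}
TIER_BY_SUFFIX = {".gov": "government", ".mil": "government"}


def trust_tier(url):
    """Return the trust tier for a URL, defaulting to 'general'."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Most specific match first: try the host, then each parent domain.
    labels = host.split(".")
    for i in range(len(labels) - 1):
        candidate = ".".join(labels[i:])
        if candidate in TIER_BY_DOMAIN:
            return TIER_BY_DOMAIN[candidate]
    # Fall back to top-level suffix rules.
    for suffix, tier in TIER_BY_SUFFIX.items():
        if host.endswith(suffix):
            return tier
    return "general"


if __name__ == "__main__":
    for url in ("https://medlineplus.gov/x", "https://www.nist.gov/", "https://example.com/"):
        print(url, trust_tier(url))
```

Checking explicit domains before suffix rules lets a site such as MedlinePlus land in `medical` even though its host also ends in `.gov`.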
This roadmap is based on current RAG research and search-product patterns:
- Source lenses and search controls: advanced search products expose domain/date/location/source controls and reusable lenses. SignalRAG now supports first-pass source lenses for web, official, academic, forums, news, PDFs, and finance. Next step: make lenses editable and persist custom lens presets in the UI.
  - Reference: Kagi Lenses
  - Reference: Perplexity Search Filters
  - Reference: Tavily Search API
- Contextual retrieval: enrich extracted passages with source title, domain, and snippet context before ranking. SignalRAG now applies source-aware contextual BM25 signals before hybrid scoring. Next step: persist section/date metadata and add optional embeddings plus a cross-encoder reranker. Anthropic reports that contextual BM25 + contextual embeddings + reranking substantially reduces retrieval misses.
  - Reference: Anthropic Contextual Retrieval
- Stronger reranking: keep the current fast BM25-style scorer for first pass, then add an optional cross-encoder or LLM reranker for Pro/Deep modes. Use this only after broad recall, so latency stays controlled.
- Query decomposition and RAG-Fusion: generate multiple targeted queries for complex questions, then fuse ranked results with reciprocal rank fusion. SignalRAG now applies RRF before page fetching, so URLs that appear across several query rewrites are prioritized before extraction and passage ranking. Next step: add LLM-generated subquestions for high-complexity queries and tune fusion weights by lens/provider.
  - Reference: LlamaIndex Query Transformations
  - Reference: RAG-Fusion
- Self-reflection and correction loop: extend the current CRAG pass so the model can decide whether retrieval is needed, whether evidence is sufficient, and whether the draft answer needs another retrieval pass.
  - Reference: Self-RAG
  - Reference: Corrective RAG
- Citation and faithfulness evaluation: keep claim-level citation checks, then add offline regression metrics for context relevance, groundedness, answer relevance, faithfulness, and contextual recall.
  - Reference: TruLens RAG Triad
  - Reference: DeepEval RAG Evaluation
- Context packing: avoid dumping too much evidence into the final prompt. SignalRAG now uses query-aware extractive compression, adds source context to every packed passage, and reorders evidence in a "sandwich" pattern so strong sources appear near the beginning and end of the model context. This follows the same practical lesson as LongLLMLingua and lost-in-the-middle research: key information density and position matter, even with long-context models. Next step: add optional LLMLingua-style small-model compression for very large reports.
  - Reference: Lost in the Middle
  - Reference: LongLLMLingua
- Deep Research UX and reasoning: expose a visible plan, progress trace, source controls, exportable report, table of contents, and source list for review. SignalRAG now runs deeper research steps, including countercheck and synthesis, and uses adaptive DeepSeek thinking for final synthesis: high by default, max for long-horizon tasks.
  - Reference: ChatGPT Deep Research
  - Reference: Perplexity Sonar Deep Research
  - Reference: DeepSeek Thinking Mode
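
The "sandwich" reordering mentioned under context packing can be sketched as follows. This is one simple way to place strong evidence at both ends of the context, not necessarily the ordering SignalRAG uses; the function name `sandwich_order` and the `(score, text)` tuple shape are assumptions.

```python
def sandwich_order(passages):
    """Reorder (score, text) passages so the strongest sit at both ends.

    Passages are ranked by score, then dealt alternately to the front and
    the back of the context, leaving the weakest evidence in the middle.
    """
    ranked = sorted(passages, key=lambda p: p[0], reverse=True)
    front, back = [], []
    for index, passage in enumerate(ranked):
        (front if index % 2 == 0 else back).append(passage)
    # The back half is reversed so its best passage ends up last.
    return front + back[::-1]


if __name__ == "__main__":
    scored = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D"), (0.5, "E")]
    print([text for _, text in sandwich_order(scored)])
    # → ['A', 'C', 'E', 'D', 'B']
```

The top passage opens the context and the second-best closes it, which targets the positions that lost-in-the-middle research found models attend to most reliably.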
These examples were run locally on 2026-05-12 with DeepSeek enabled, HTML search fallbacks, and no Brave API key. They are the best demo cases because they use official sources, produce inline citations, and exercise different parts of the retrieval stack.
| Use case | Query | Mode | Observed result |
|---|---|---|---|
| Fast official API lookup | DeepSeek API chat completion base URL model name and first API call | fast | 2 official DeepSeek citations, 3 supported claims, ~4.0s |
| Product search explanation | How does ChatGPT search work and how does it cite sources? | pro | 2 official OpenAI citations, 3 supported claims, ~9.2s |
| API docs with source controls | OpenAI web search API citations and domain filtering | pro | OpenAI developer citation, planner chose high reasoning, ~12.6s |
| Deep Research trace | Explain ChatGPT search for Enterprise and Edu data sharing and source citations. | deep | 2 official OpenAI citations, 3 research steps, 5 supported claims, ~19.2s |
curl -X POST http://127.0.0.1:8000/api/search \
-H "Content-Type: application/json" \
-d '{
"query":"DeepSeek API chat completion base URL model name and first API call",
"mode":"fast",
"max_results":8,
"include_domains":["api-docs.deepseek.com"],
"recency":"year",
"country":"us",
"language":"en",
"citation_verifier":"auto"
}'

Why this is a strong demo: it shows the lightweight planner choosing
reasoning_effort: none, keeps latency low, and returns only official
DeepSeek API documentation as citations.
curl -X POST http://127.0.0.1:8000/api/search \
-H "Content-Type: application/json" \
-d '{
"query":"How does ChatGPT search work and how does it cite sources?",
"mode":"pro",
"max_results":10,
"include_domains":["openai.com","help.openai.com"],
"recency":"year",
"country":"us",
"language":"en",
"citation_verifier":"auto"
}'

Why this is a strong demo: it exercises official-source prioritization, answer citations, and claim-level verification against OpenAI Help Center and OpenAI announcement pages.
curl -X POST http://127.0.0.1:8000/api/search \
-H "Content-Type: application/json" \
-d '{
"query":"OpenAI web search API citations and domain filtering",
"mode":"pro",
"max_results":10,
"include_domains":["developers.openai.com","platform.openai.com","openai.com"],
"recency":"year",
"country":"us",
"language":"en",
"citation_verifier":"auto"
}'

Why this is a strong demo: it shows include-domain controls, freshness-aware planning for API documentation, and citation grounding from developer docs.
curl -X POST http://127.0.0.1:8000/api/search \
-H "Content-Type: application/json" \
-d '{
"query":"Explain ChatGPT search for Enterprise and Edu data sharing and source citations.",
"mode":"deep",
"max_results":12,
"include_domains":["help.openai.com","openai.com"],
"recency":"year",
"country":"us",
"language":"en",
"citation_verifier":"auto"
}'

Why this is a strong demo: it runs Deep Research mode, returns a multi-step
research_trace, and verifies claims across workspace policy and citation
behavior sources.
python -m fast_rag.eval --mode fast --top-k 5
python -m fast_rag.eval --mode pro --top-k 8
python -m fast_rag.eval --mode deep --top-k 10

The evaluator reports recall@k, hit rate, MRR, and latency over a small set of known-answer web-search cases.
For RAG systems, do not rely on a single benchmark number. A useful eval suite should cover both retrieval and generation:
- Retrieval: recall@k, hit rate, MRR/nDCG, source diversity, latency.
- Generation: answer relevance, faithfulness/groundedness, citation coverage, supported-claim rate, contradiction rate, fallback rate, and cost/latency.
- Dataset size: use 20-50 golden queries for early development, 100-300 for a release gate, and 500+ mixed production traces once real usage exists. Keep separate slices for factual lookup, API docs, fresh/news, comparison, multi-hop, Deep Research, and adversarial/no-answer cases.
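
For reference, the retrieval metrics named above can be computed as in the sketch below. It uses the standard definitions and is not taken from `fast_rag.eval`; the function name `retrieval_metrics` and the `(ranked_urls, expected_urls)` case shape are assumptions.

```python
def retrieval_metrics(cases, k):
    """Compute recall@k, hit rate, and MRR over evaluation cases.

    Each case is (ranked_urls, expected_urls); expected_urls must be non-empty.
    """
    recalls, hits, reciprocal_ranks = [], [], []
    for ranked, expected in cases:
        expected = set(expected)
        top_k = ranked[:k]
        found = expected.intersection(top_k)
        recalls.append(len(found) / len(expected))
        hits.append(1.0 if found else 0.0)
        # Reciprocal rank of the first relevant result within the top k.
        rr = 0.0
        for rank, url in enumerate(top_k, start=1):
            if url in expected:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    n = len(cases)
    return {
        "recall_at_k": sum(recalls) / n,
        "hit_rate": sum(hits) / n,
        "mrr": sum(reciprocal_ranks) / n,
    }


if __name__ == "__main__":
    cases = [
        (["a", "b", "c"], ["b", "x"]),  # one of two expected sources, at rank 2
        (["d", "e"], ["z"]),            # expected source never retrieved
    ]
    print(retrieval_metrics(cases, k=3))
```

Recall@k rewards finding every expected source, hit rate only asks for one, and MRR adds sensitivity to how high the first relevant result ranks, which is why the three are reported together.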
SignalRAG includes a small end-to-end benchmark runner. There are two useful 50-case suites:
- extended: a golden regression suite. It intentionally repeats known topics and expected sources, so it is good for catching regressions but should not be presented as representative user traffic.
- realistic: a short-query suite with no include-domain allowlists. It uses everyday search-box phrasing such as python read json file, tsa liquids rule carry on, and git rebase vs merge. This is still hand-curated, not production telemetry.
For query distributions that reflect real usage, use anonymized logs or public datasets such as MS MARCO, whose questions come from anonymized Bing queries; Natural Questions, whose questions come from anonymized, aggregated Google queries; and BEIR, which combines diverse retrieval tasks.
Golden regression run:
python -m fast_rag.benchmark \
--api-base http://127.0.0.1:8000 \
--suite extended \
--clear-response-cache \
--timeout 220 \
--output benchmark_results/signalrag-benchmark-2026-05-12-50cases.json

Latest golden 50-case local run with DeepSeek enabled, HTML search fallbacks, no Brave API key, and response cache cleared at the start:
| Metric | Result |
|---|---|
| Cases | 50 |
| Expected source recall | 0.9067 |
| Used source recall | 0.7567 |
| Answer term coverage | 0.7533 |
| Citation coverage | 0.8139 |
| Supported claim rate | 0.7906 |
| Review claim rate | 0.2042 |
| CRAG sufficient rate | 0.9600 |
| Fallback rate | 0.0400 |
| Cache hit rate | 0.0400 |
| Average latency | 15.4s |
| P95 latency | 41.3s |
Interpretation: the larger suite keeps P95 latency at the previous 6-case level while lowering average latency to 15.4s. Expected-source recall stays above 0.90, citation coverage remains above 0.81, and only 2 of 50 cases fell back to extractive synthesis. Cache hits are intentionally low in this cold-start extended suite because the cases are broader paraphrases rather than repeated queries. A warm-cache repeat of the same 50 cases reached 100% response-cache hit rate with about 4ms average API wall time.
Realistic short-query run:
python -m fast_rag.benchmark \
--api-base http://127.0.0.1:8000 \
--suite realistic \
--clear-response-cache \
--timeout 220 \
--output benchmark_results/signalrag-benchmark-2026-05-12-realistic-50cases.json

Latest realistic 50-case local run with DeepSeek enabled, DuckDuckGo/Bing/Yahoo HTML search fallbacks, no Brave API key, and response cache cleared at the start:
| Metric | Result |
|---|---|
| Cases | 50 |
| Source-scored cases | 50 |
| Expected source recall | 0.9733 |
| Used source recall | 0.9467 |
| Answer term coverage | 0.9800 |
| Citation coverage | 0.8818 |
| Supported claim rate | 0.8656 |
| Review claim rate | 0.1320 |
| CRAG sufficient rate | 0.9800 |
| Fallback rate | 0.0000 |
| Cache hit rate | 0.0000 |
| Average latency | 7.4s |
| P95 latency | 11.7s |
Interpretation: the realistic suite is harsher and more useful for product work. It exposed that DuckDuckGo-only HTML search can return zero results or bad results for short real queries, so SignalRAG now queries Bing and Yahoo HTML fallbacks in parallel and avoids truncating one provider's results before fusion. SignalRAG also applies authority-aware query rewrites, trust-aware pre-fetch reranking, and a small high-confidence official source router for navigational documentation queries. The source router now covers vertical authority sources such as Git, PostgreSQL, AWS, IETF/OAuth, W3C, OWASP, FTC, CFPB, SEC, NIH, NASA, NOAA, and Perplexity docs. This lifted expected-source recall from 0.60 to 0.9733. The answer layer now prioritizes primary/official evidence in the final answer context and conservatively augments citations when a primary source directly supports a cited claim. That lifted used-source recall from 0.32 to 0.9467 and supported-claim rate from 0.7230 to 0.8656 on the realistic suite.
SignalRAG uses three cache layers:
- Page cache: fetched web pages are persisted in SQLite for reuse during retrieval and reranking.
- Planner cache: query plans are cached by normalized query and mode, so repeat requests do not call the lightweight planner model again.
- Response cache: final API responses are persisted in SQLite and reused for exact or safe fuzzy matches.
The response cache is deliberately conservative about what counts as a match, while still aiming for a high hit rate:
- Canonical exact hits ignore case, punctuation, whitespace, and domain-list ordering.
- Safe fuzzy hits allow light rewording such as singular/plural variants and reordered wording, but only when mode, lens, filters, locale, verifier, intent tags, and numeric tokens still match.
- Fresh queries, day/week recency, and queries containing current/latest/today wording do not use fuzzy cache matches.
- Cached responses return meta.cache_hit, meta.cache_strategy, meta.cache_similarity, meta.cache_age_seconds, and meta.cache_source_query.
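
The canonical exact-hit rule can be sketched as a normalized cache key. This is a simplified illustration: the function name `canonical_cache_key` and the SHA-256 hashing are assumptions, and a real key would also need to cover lens, recency, locale, and verifier.

```python
import hashlib
import json
import re
import string

# Replace every ASCII punctuation character with a space.
_PUNCT = str.maketrans({ch: " " for ch in string.punctuation})


def canonical_cache_key(query, mode, include_domains=(), exclude_domains=()):
    """Build a key that ignores case, punctuation, whitespace, and domain order."""
    normalized = re.sub(r"\s+", " ", query.lower().translate(_PUNCT)).strip()
    payload = {
        "query": normalized,
        "mode": mode,
        "include_domains": sorted(d.lower() for d in include_domains),
        "exclude_domains": sorted(d.lower() for d in exclude_domains),
    }
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


if __name__ == "__main__":
    a = canonical_cache_key(
        "How does ChatGPT search work?", "pro", ["openai.com", "help.openai.com"]
    )
    b = canonical_cache_key(
        "  how does chatgpt   search work ", "pro", ["help.openai.com", "openai.com"]
    )
    print(a == b)
```

Because mode and filters are part of the key, two requests that differ only in mode never share a cached response.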
Accuracy comes from grounding every answer in retrieved passages, checking retrieval quality before generation, fusing multi-query search results with RRF, ranking with contextual BM25 signals, packing answer context with query-aware compression, inheriting paragraph-level citations during claim checks, and returning only used citations by default.

Speed comes from short timeouts, request concurrency, page caching, persistent smart response caching, planner caching, adaptive citation judging, early reranking, compression before generation, and the planner choosing the cheapest mode that fits the query.

For production, use a paid search API such as Brave or Tavily, add an embedding or cross-encoder reranker, and persist traces for evaluation.