feat: Optional pgvector hybrid/semantic search with embedding support #713
pandor4u wants to merge 3 commits into moqui:master
…rnative
Simplified PR addressing review feedback:
- Consolidated duplicated INSERT...ON CONFLICT SQL into shared constant
- Removed fork-specific files (AGENTS.md, CLAUDE.md, GEMINI.md, docker compose, .gitignore)
- Moved PostgreSQL tests to separate opt-in suite (MoquiSuite untouched)
- Updated ElasticRequestLogFilter to use ElasticClient interface type
- 83 tests: 46 unit (query translation + SQL injection) + 37 integration
New files:
- PostgresElasticClient.groovy: Full ElasticClient impl using JSONB/tsvector
- ElasticQueryTranslator.groovy: ES Query DSL to PostgreSQL SQL
- PostgresSearchLogger.groovy: Log4j2 appender for PostgreSQL
- SearchEntities.xml: Entity definitions for search tables
- PostgresSearchSuite.groovy: Separate JUnit test suite
- PostgresSearchTranslatorTests.groovy: Unit tests
- PostgresElasticClientTests.groovy: Integration tests
Modified files:
- ElasticFacadeImpl.groovy: type=postgres routing
- ElasticRequestLogFilter.groovy: Interface type usage
- MoquiDefaultConf.xml: Postgres config + entity load
- moqui-conf-3.xsd: type attribute with elastic/postgres enum
- build.gradle: Test dependencies and suite include
…ne, parameterized queries
- Route HTTP logs to dedicated moqui_http_log table with typed columns instead of generic moqui_document JSONB storage
- Route deleteByQuery to dedicated tables (moqui_http_log, moqui_logs) with proper timestamp range extraction for nightly cleanup jobs
- Replace Java regex highlights with PostgreSQL ts_headline() for accurate, index-aware snippet generation
- Fix update() to read-merge-extract pattern so content_text stays consistent with extractContentText() used by index()
- Add searchHttpLogTable() for searching against dedicated HTTP log table
- Improve guessCastType() to inspect actual values (epoch millis, decimals, ISO dates) when field name heuristics are ambiguous
- Parameterize exists query (use ?? operator) to prevent SQL injection
- Handle indexExists/createIndex/count for dedicated table names
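To illustrate the value-inspection idea behind the guessCastType() change, here is a minimal sketch in Java. It is not the PR's code: the class name, method name, and the returned cast names (bigint, numeric, timestamptz, timestamp, date, text) are all assumptions chosen for illustration, and the real method may use different rules and thresholds.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not taken from the PR: picks a PostgreSQL cast
// by looking at a sample value when the field name gives no hint.
final class CastTypeSketch {

    static String guessCastFromValue(Object value) {
        if (value == null) return "text";

        // Integral numbers (including epoch milliseconds) compare as bigint.
        if (value instanceof Long || value instanceof Integer
                || value instanceof Short || value instanceof BigInteger) {
            return "bigint";
        }
        // Any other numeric type (Double, Float, BigDecimal) compares as numeric.
        if (value instanceof Number) return "numeric";

        String s = value.toString().trim();
        if (s.matches("-?\\d+")) return "bigint";        // e.g. "1705312200000"
        if (s.matches("-?\\d+\\.\\d+")) return "numeric"; // e.g. "12.50"

        // ISO-8601 forms, tried from most to least specific.
        if (parses(() -> OffsetDateTime.parse(s))) return "timestamptz";
        if (parses(() -> LocalDateTime.parse(s))) return "timestamp";
        if (parses(() -> LocalDate.parse(s))) return "date";

        return "text"; // assumed fallback: compare as plain text
    }

    // Returns true when the given parse attempt succeeds.
    private static boolean parses(Runnable attempt) {
        try {
            attempt.run();
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }
}
```

With these assumed rules, a Long such as 1705312200000L maps to bigint, the string "12.50" to numeric, and "2024-01-15T10:30:00Z" to timestamptz.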
When the pgvector extension is installed and an embedding API is configured,
enables vector-based document search alongside traditional keyword search:
Schema:
- Adds 'embedding vector(N)' column to moqui_document (configurable dimensions)
- Creates DiskANN index (pgvectorscale) or HNSW index (pgvector fallback)
Methods:
- hybridSearch(): Reciprocal Rank Fusion combining keyword (ts_rank_cd) and
semantic (cosine distance) results in a single SQL query with configurable
keyword/vector weights
- vectorSearch(): Pure semantic search using pgvector cosine distance
- generateEmbedding(): Calls OpenAI-compatible embedding API endpoint
- Embedding auto-generation on index() when vector search is enabled
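The fusion step in hybridSearch() can be illustrated with a small in-memory sketch of weighted Reciprocal Rank Fusion. The PR performs this inside a single SQL query; the Java below only shows the arithmetic. The class and method names are hypothetical, and applying the weights as multipliers on each 1/(k + rank) term is an assumption about how the keyword/vector weights are used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: weighted Reciprocal Rank Fusion over two ranked ID lists.
final class RrfSketch {
    static final int K = 60; // RRF smoothing constant

    // Each list is ordered best-first. A document missing from one list
    // simply receives no contribution from that list.
    static Map<String, Double> fuse(List<String> keywordRanked, List<String> vectorRanked,
                                    double keywordWeight, double vectorWeight) {
        Map<String, Double> scores = new LinkedHashMap<>();
        accumulate(scores, keywordRanked, keywordWeight);
        accumulate(scores, vectorRanked, vectorWeight);
        return scores;
    }

    private static void accumulate(Map<String, Double> scores, List<String> ranked, double weight) {
        for (int i = 0; i < ranked.size(); i++) {
            int rank = i + 1; // ranks are 1-based
            scores.merge(ranked.get(i), weight / (K + rank), Double::sum);
        }
    }

    // Returns the IDs with the highest fused scores, best first.
    static List<String> topIds(Map<String, Double> scores, int limit) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(limit, entries.size()); i++) {
            ids.add(entries.get(i).getKey());
        }
        return ids;
    }
}
```

For example, with keyword ranking [A, B, C], vector ranking [C, A, D], and both weights set to 1.0, document A scores 1/61 + 1/62 and C scores 1/63 + 1/61, so A ranks first, then C, then B, then D. Documents found by both searches rise above those found by only one.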
Configuration (MoquiConf.xml):
<cluster name="default" type="postgres" url="transactional"
         embedding-url="http://localhost:11434/v1/embeddings"
         embedding-model="nomic-embed-text"
         embedding-dimensions="768"/>
XSD schema updated with embedding-url, embedding-model, embedding-dimensions
attributes. Falls back gracefully to keyword-only search when pgvector or
the embedding API is not available.
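The fallback behaviour described above can be sketched as follows. This is an assumption about the shape of the logic, not the PR's code: the Mode enum, the select() and withFallback() names, and the choice to catch RuntimeException are all hypothetical.

```java
import java.util.function.Supplier;

// Illustrative only: decide whether vector search is usable, and fall back
// to the keyword path if the vector path fails at runtime.
final class SearchModeSketch {
    enum Mode { KEYWORD_ONLY, HYBRID }

    // Vector search needs both the pgvector extension and a configured embedding URL.
    static Mode select(boolean pgvectorInstalled, String embeddingUrl) {
        boolean hasUrl = embeddingUrl != null && !embeddingUrl.isBlank();
        return (pgvectorInstalled && hasUrl) ? Mode.HYBRID : Mode.KEYWORD_ONLY;
    }

    // Runs the vector path; if it throws (for example because the embedding
    // API is unreachable), runs the keyword path instead.
    static <T> T withFallback(Supplier<T> vectorPath, Supplier<T> keywordPath) {
        try {
            return vectorPath.get();
        } catch (RuntimeException e) {
            return keywordPath.get();
        }
    }
}
```

The point of the two-level check is that a missing extension or missing configuration is detected once up front, while a temporarily unavailable embedding API is handled per request.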
Detailed Change Walkthrough
This PR adds optional vector/semantic search using pgvector + pgvectorscale, enabling hybrid search that combines keyword matching with semantic similarity via Reciprocal Rank Fusion (RRF).
Why this change?
PostgreSQL's full-text search is purely keyword-based: it matches documents containing the exact stems of query terms. This creates a fundamental relevance gap.
Semantic/vector search solves this by representing both queries and documents as high-dimensional vectors (embeddings) where meaning determines proximity, not exact words. Combining keyword + semantic search via hybrid fusion gives the best of both worlds: exact matches rank high, and semantically related content fills in gaps.
What changed and why
Configuration (
pgvector Hybrid & Semantic Search
Depends on: #700
Problem
PostgreSQL full-text search is keyword-based, so it misses semantically related content (e.g., searching "car" won't find documents about "automobile"). There is no support for similarity-based retrieval.
Solution
When pgvector is available and an embedding API is configured, automatically:
- Adds an embedding vector(N) column to moqui_document (dimensions configurable, default 1536)
- Creates a diskann index for fast approximate nearest-neighbor search (falls back to HNSW if pgvectorscale is not available)
- Generates embeddings on index() via a configurable OpenAI-compatible /v1/embeddings endpoint
- hybridSearch(index, query, limit, keywordWeight, vectorWeight) combines keyword (tsvector) and semantic (cosine distance) results using Reciprocal Rank Fusion (RRF, k=60)
- vectorSearch(index, query, limit) for semantic-only retrieval
Configuration (moqui-conf.xml)
Falls back gracefully to keyword-only search when pgvector is not installed or no embedding URL is configured. Compatible with any OpenAI-compatible embeddings API (OpenAI, Ollama, vLLM, etc.).
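For readers unfamiliar with the OpenAI-compatible embeddings format, here is a rough sketch of building the request and formatting the resulting vector for pgvector. It assumes the usual request shape, a JSON object with "model" and "input" fields posted to the configured URL, and pgvector's bracketed text form for vectors. The class and method names are hypothetical, and the PR's generateEmbedding() may be implemented differently.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: builds (but does not send) an embeddings request and
// formats a float vector in pgvector's text form.
final class EmbeddingRequestSketch {

    // Minimal JSON string escaping for quotes, backslashes, and control characters.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Request body with the assumed "model" and "input" fields.
    static String requestBody(String model, String text) {
        return "{\"model\":\"" + jsonEscape(model) + "\",\"input\":\"" + jsonEscape(text) + "\"}";
    }

    // POST request to the configured embedding URL.
    static HttpRequest build(String embeddingUrl, String model, String text) {
        return HttpRequest.newBuilder(URI.create(embeddingUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(model, text)))
                .build();
    }

    // Formats a vector as pgvector text input, e.g. [0.5,-1.0,2.25].
    static String toPgVectorLiteral(float[] vector) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < vector.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(vector[i]);
        }
        return sb.append(']').toString();
    }
}
```

A caller would send the request with java.net.http.HttpClient, read the embedding array from the JSON response, and bind the formatted literal as a query parameter rather than concatenating it into SQL.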