
feat: Optional pgvector hybrid/semantic search with embedding support#713

Open
pandor4u wants to merge 3 commits into moqui:master from pandor4u:feature/pgvector-hybrid-search

Conversation

@pandor4u

pgvector Hybrid & Semantic Search

Depends on: #700

Problem

PostgreSQL full-text search is keyword-based: it misses semantically related content (e.g., searching "car" won't find documents about "automobile") and offers no support for similarity-based retrieval.

Solution

When pgvector is available and an embedding API is configured, automatically:

  • Vector column: Adds embedding vector(N) column to moqui_document (dimensions configurable, default 1536)
  • DiskANN index: Creates diskann index for fast approximate nearest-neighbor search (falls back to HNSW if pgvectorscale not available)
  • Auto-embedding: Documents are automatically embedded on index() via a configurable OpenAI-compatible /v1/embeddings endpoint
  • Hybrid search: hybridSearch(index, query, limit, keywordWeight, vectorWeight) combines keyword (tsvector) and semantic (cosine distance) results using Reciprocal Rank Fusion (RRF, k=60)
  • Pure vector search: vectorSearch(index, query, limit) for semantic-only retrieval

Configuration (moqui-conf.xml)

<database-conf>
  <cluster name="search" type="postgres"
           embedding-url="http://localhost:11434/v1/embeddings"
           embedding-model="text-embedding-3-small"
           embedding-dimensions="1536" />
</database-conf>

Falls back gracefully to keyword-only search when pgvector is not installed or no embedding URL is configured. Compatible with any OpenAI-compatible embeddings API (OpenAI, Ollama, vLLM, etc.).

…rnative

Simplified PR addressing review feedback:
- Consolidated duplicated INSERT...ON CONFLICT SQL into shared constant
- Removed fork-specific files (AGENTS.md, CLAUDE.md, GEMINI.md, docker compose, .gitignore)
- Moved PostgreSQL tests to separate opt-in suite (MoquiSuite untouched)
- Updated ElasticRequestLogFilter to use ElasticClient interface type
- 83 tests: 46 unit (query translation + SQL injection) + 37 integration

New files:
- PostgresElasticClient.groovy: Full ElasticClient impl using JSONB/tsvector
- ElasticQueryTranslator.groovy: ES Query DSL to PostgreSQL SQL
- PostgresSearchLogger.groovy: Log4j2 appender for PostgreSQL
- SearchEntities.xml: Entity definitions for search tables
- PostgresSearchSuite.groovy: Separate JUnit test suite
- PostgresSearchTranslatorTests.groovy: Unit tests
- PostgresElasticClientTests.groovy: Integration tests

Modified files:
- ElasticFacadeImpl.groovy: type=postgres routing
- ElasticRequestLogFilter.groovy: Interface type usage
- MoquiDefaultConf.xml: Postgres config + entity load
- moqui-conf-3.xsd: type attribute with elastic/postgres enum
- build.gradle: Test dependencies and suite include
…ne, parameterized queries

- Route HTTP logs to dedicated moqui_http_log table with typed columns
  instead of generic moqui_document JSONB storage
- Route deleteByQuery to dedicated tables (moqui_http_log, moqui_logs)
  with proper timestamp range extraction for nightly cleanup jobs
- Replace Java regex highlights with PostgreSQL ts_headline() for
  accurate, index-aware snippet generation
- Fix update() to read-merge-extract pattern so content_text stays
  consistent with extractContentText() used by index()
- Add searchHttpLogTable() for searching against dedicated HTTP log table
- Improve guessCastType() to inspect actual values (epoch millis,
  decimals, ISO dates) when field name heuristics are ambiguous
- Parameterize exists query (use ?? operator) to prevent SQL injection
- Handle indexExists/createIndex/count for dedicated table names

When the pgvector extension is installed and an embedding API is configured,
enables vector-based document search alongside traditional keyword search:

Schema:
- Adds 'embedding vector(N)' column to moqui_document (configurable dimensions)
- Creates DiskANN index (pgvectorscale) or HNSW index (pgvector fallback)

Methods:
- hybridSearch(): Reciprocal Rank Fusion combining keyword (ts_rank_cd) and
  semantic (cosine distance) results in a single SQL query with configurable
  keyword/vector weights
- vectorSearch(): Pure semantic search using pgvector cosine distance
- generateEmbedding(): Calls OpenAI-compatible embedding API endpoint
- Embedding auto-generation on index() when vector search is enabled

Configuration (MoquiConf.xml):
  <cluster name="default" type="postgres" url="transactional"
           embedding-url="http://localhost:11434/v1/embeddings"
           embedding-model="nomic-embed-text"
           embedding-dimensions="768"/>

XSD schema updated with embedding-url, embedding-model, embedding-dimensions
attributes. Falls back gracefully to keyword-only search when pgvector or
the embedding API is not available.
@pandor4u
Author

Detailed Change Walkthrough

This PR adds optional vector/semantic search using pgvector + pgvectorscale, enabling hybrid search that combines keyword matching with semantic similarity via Reciprocal Rank Fusion (RRF).

Why this change?

PostgreSQL's full-text search is purely keyword-based — it matches documents containing the exact stems of query terms. This creates a fundamental relevance gap:

  • Searching "car" won't find documents about "automobile" or "vehicle"
  • Searching "how to fix memory issues" won't match "troubleshooting RAM problems"
  • Searching "machine learning" won't find related content about "neural networks" or "deep learning"

Semantic/vector search solves this by representing both queries and documents as high-dimensional vectors (embeddings) where meaning determines proximity, not exact words. Combining keyword + semantic search via hybrid fusion gives the best of both worlds: exact matches rank high, and semantically related content fills in gaps.

What changed and why

Configuration (moqui-conf-3.xsd + MoquiDefaultConf.xml)

Three new optional attributes on the <cluster> element:

<cluster name="default" type="postgres" url="transactional"
         embedding-url="http://localhost:11434/v1/embeddings"
         embedding-model="nomic-embed-text"
         embedding-dimensions="768"/>
  • embedding-url: OpenAI-compatible /v1/embeddings endpoint. This is the gate — if not set, all vector functionality is disabled. Supports OpenAI, Ollama, vLLM, LiteLLM, or any compatible API.
  • embedding-model: Model name sent in the API request (default: text-embedding-3-small).
  • embedding-dimensions: Vector dimensions (default: 1536). Must match the model's output.

The XSD includes full documentation annotations for each attribute.

PostgresElasticClient.groovy — Schema changes

Extension detection:

if (embeddingUrl) {
    stmt.execute("CREATE EXTENSION IF NOT EXISTS vector")
    hasPgVector = true
    try { stmt.execute("CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE") }
    catch (Exception e) { /* HNSW fallback */ }
}
  • Only attempts pgvector if embeddingUrl is configured (no point installing the extension without an embedding source).
  • pgvectorscale is optional — provides DiskANN indexes (faster than HNSW for large datasets). Falls back to HNSW gracefully.

Vector column + index:

ALTER TABLE moqui_document ADD COLUMN IF NOT EXISTS embedding vector(1536)
CREATE INDEX IF NOT EXISTS idx_mq_doc_embed ON moqui_document USING diskann(embedding)
-- or HNSW fallback:
CREATE INDEX IF NOT EXISTS idx_mq_doc_embed ON moqui_document USING hnsw(embedding vector_cosine_ops)
  • vector(N) is pgvector's native type — stores N-dimensional float arrays efficiently.
  • DiskANN (preferred): Microsoft's graph-based ANN algorithm, available via pgvectorscale. Offers better build times and lower memory usage than HNSW on large datasets.
  • HNSW (fallback): Hierarchical Navigable Small World graph — pgvector's built-in ANN index. Excellent recall at the cost of higher memory.
  • vector_cosine_ops specifies cosine distance for similarity comparison.

Auto-embedding on index()

void index(String index, String _id, Map document) {
    // ... existing upsert logic ...
    if (hasPgVector && embeddingUrl) updateDocumentEmbedding(prefixedIndex, _id, contentText)
}

Every document indexed through the standard index() API automatically gets an embedding generated and stored. This ensures the vector column stays in sync with document content without requiring callers to change their code.

updateDocumentEmbedding(): Calls generateEmbedding() then runs:

UPDATE moqui_document SET embedding = ?::vector WHERE index_name = ? AND doc_id = ?

Wrapped in try/catch — embedding failures don't break the index operation. The ::vector cast converts the string representation to pgvector's native type.
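The PR doesn't show how the embedding array is serialized for that parameter, but pgvector's `::vector` cast accepts the text form `[v1,v2,...]`. A minimal sketch of such a helper (the name `toVectorLiteral` is hypothetical, not from the PR):

```java
import java.util.StringJoiner;

public class VectorLiteral {
    // Hypothetical helper: renders a float[] embedding as pgvector's
    // text form "[v1,v2,...]", which the ?::vector cast then parses.
    static String toVectorLiteral(float[] embedding) {
        StringJoiner sj = new StringJoiner(",", "[", "]");
        for (float v : embedding) sj.add(Float.toString(v));
        return sj.toString();
    }

    public static void main(String[] args) {
        // e.g. [0.1,-0.5,2.0]
        System.out.println(toVectorLiteral(new float[]{0.1f, -0.5f, 2.0f}));
    }
}
```

Passing the literal as a bound `?` parameter and casting server-side keeps the statement parameterized, consistent with the PR's SQL-injection hardening elsewhere.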

generateEmbedding(text) — Embedding API client

float[] generateEmbedding(String text) {
    Map requestBody = [input: text, model: embeddingModel]
    // POST to embeddingUrl with JSON body
    // Parse OpenAI-compatible response: { data: [{ embedding: [...] }] }
}
  • Uses java.net.HttpURLConnection directly (no external HTTP client dependency).
  • 10s connect timeout, 30s read timeout — embedding APIs can be slow for long texts.
  • Returns null on any failure (logged at WARN level).
  • Parses the standard OpenAI response format (data[0].embedding).
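As a rough illustration of the request side described above (this is a sketch, not the PR's Groovy code; `buildRequestBody` and `openConnection` are hypothetical names):

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class EmbeddingClient {
    // Request body field names follow the OpenAI embeddings API:
    // {"input": "...", "model": "..."}
    static String buildRequestBody(String text, String model) {
        return "{\"input\":" + quote(text) + ",\"model\":" + quote(model) + "}";
    }

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // POST setup mirroring the timeouts the PR describes.
    static HttpURLConnection openConnection(String embeddingUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(embeddingUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setConnectTimeout(10_000);  // 10s connect timeout
        conn.setReadTimeout(30_000);     // 30s read timeout
        conn.setDoOutput(true);
        return conn;
    }
}
```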

hybridSearch() — RRF fusion

This is the main new search method. It combines keyword and vector results using Reciprocal Rank Fusion (RRF), the standard algorithm used by Elasticsearch 8.x, Pinecone, Weaviate, and others.

Algorithm:

RRF_score(doc) = kw_weight * 1/(k + keyword_rank) + vec_weight * 1/(k + vector_rank)

Where k=60 is the standard RRF constant (dampens the impact of top ranks).

SQL implementation:

WITH keyword AS (
    SELECT doc_id, ...,
           ROW_NUMBER() OVER (ORDER BY ts_rank_cd(content_tsv, websearch_to_tsquery(?, ...)) DESC) as rank
    FROM moqui_document WHERE content_tsv @@ websearch_to_tsquery(?)
    LIMIT {limit*3}
),
semantic AS (
    SELECT doc_id, ...,
           ROW_NUMBER() OVER (ORDER BY embedding <=> ?::vector) as rank
    FROM moqui_document WHERE embedding IS NOT NULL
    LIMIT {limit*3}
)
SELECT d.doc_id, d.document,
       COALESCE(kw_weight * 1.0/(60 + kw.rank), 0) +
       COALESCE(vec_weight * 1.0/(60 + sem.rank), 0) AS _score
FROM moqui_document d
LEFT JOIN keyword kw ON ...
LEFT JOIN semantic sem ON ...
WHERE kw.doc_id IS NOT NULL OR sem.doc_id IS NOT NULL
ORDER BY _score DESC LIMIT ?

Key design decisions:

  • CTEs: Both keyword and semantic searches run as CTEs with LIMIT {limit*3} to get a broader candidate set before RRF fusion narrows to the final limit.
  • <=> operator: pgvector's cosine distance operator — uses the DiskANN/HNSW index for O(log n) approximate nearest neighbor lookup.
  • Configurable weights: keywordWeight and vectorWeight (default 0.5/0.5) let callers tune the balance. Setting vectorWeight=0 gives pure keyword search; keywordWeight=0 gives pure semantic search.
  • LEFT JOIN fusion: Documents appearing in only one result set still get scored (the missing rank contributes 0).
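As a sanity check on the fusion math, the RRF scoring the SQL computes can be sketched outside the database (a simplified model, not the PR's code; documents missing from one list contribute 0 for that term, matching the COALESCE behavior):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Rrf {
    static final int K = 60;  // standard RRF constant, as in the PR's SQL

    // Fuse two ranked id lists (best first, rank starting at 1)
    // with configurable keyword/vector weights.
    static Map<String, Double> fuse(List<String> keyword, List<String> semantic,
                                    double kwWeight, double vecWeight) {
        Map<String, Double> scores = new HashMap<>();
        for (int i = 0; i < keyword.size(); i++)
            scores.merge(keyword.get(i), kwWeight / (K + i + 1), Double::sum);
        for (int i = 0; i < semantic.size(); i++)
            scores.merge(semantic.get(i), vecWeight / (K + i + 1), Double::sum);
        return scores;
    }
}
```

A document ranked first in both lists scores `w_kw/61 + w_vec/61`; one appearing only in the keyword list at rank 2 scores `w_kw/62`, so cross-list agreement dominates, which is the point of RRF.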

vectorSearch() — Pure semantic search

Map vectorSearch(String index, String queryText, int limit = 10)

For cases where only semantic similarity matters (no keyword matching). Uses 1.0 - (embedding <=> ?::vector) as the score so higher = more similar (cosine similarity rather than cosine distance).
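The distance-to-similarity flip is just this (a plain-Java restatement of what `<=>` computes, for clarity; pgvector does this natively in the index):

```java
public class Cosine {
    // pgvector's <=> operator returns cosine distance in [0, 2];
    // vectorSearch() reports 1 - distance so higher means more similar.
    static double cosineDistance(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```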

isVectorSearchEnabled()

Convenience method for callers to check if hybrid/vector search is available before calling hybridSearch().

What is NOT changed

  • Existing search() method is unchanged — all current search behavior works exactly as before. hybridSearch() and vectorSearch() are new additional methods.
  • No breaking API changes — the ElasticFacade.ElasticClient interface is not modified.
  • Documents without embeddings are silently skipped in vector queries (WHERE embedding IS NOT NULL).

Security

  • The embedding API URL is configured in moqui-conf.xml (server-side only, never exposed to clients).
  • No API keys are stored in code — the current implementation supports key-less local APIs (Ollama, vLLM). For OpenAI, an Authorization header would need to be added.
  • User input goes through ElasticQueryTranslator.cleanLuceneQuery() before being used in SQL.

Performance notes

  • Embedding generation adds latency to index() (~50-200ms per document for local Ollama, ~100-500ms for OpenAI API). This happens synchronously — a future enhancement could use async/background embedding.
  • Vector index (DiskANN/HNSW) provides O(log n) approximate nearest neighbor search — not a sequential scan.
  • The LIMIT {limit*3} in CTEs bounds the fusion work to at most 6 * limit candidate documents.
  • Storage: a 1536-dimensional vector adds ~6KB per document.
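The ~6KB figure follows from pgvector's documented storage of 4 bytes per dimension plus 8 bytes of per-vector overhead:

```java
public class VectorStorage {
    // Per the pgvector README: each vector takes 4 * dimensions + 8 bytes.
    static long bytesPerVector(int dimensions) {
        return 4L * dimensions + 8;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerVector(1536)); // 6152 bytes, ~6 KB
        System.out.println(bytesPerVector(768));  // 3080 bytes, ~3 KB
    }
}
```

So the smaller 768-dimensional models (e.g. nomic-embed-text, as in the second config example) halve the storage cost as well as the index size.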

Depends on

#700
