feat: Optional BM25 ranking via pg_textsearch extension by pandor4u · Pull Request #711 · moqui/moqui-framework

pandor4u · 2026-04-10T23:30:16Z

BM25 Ranking via pg_textsearch

Depends on: #700

Problem

The current PostgreSQL search backend scores results with ts_rank_cd(), which is called here without a normalization option and therefore applies no document-length normalization: a 10,000-word document mentioning "database" once scores similarly to a 50-word document mentioning it once.

Solution

When the pg_textsearch extension (v1.0.0, PostgreSQL 17+) is available, automatically:

- Detect and enable the extension at startup
- Create a BM25 index on moqui_document.content_text with english text config
- Use BM25 scoring (content_text <@> to_bm25query()) instead of ts_rank_cd() for document search, providing proper document-length normalization with configurable k1/b parameters
- Rely on the index's Block-Max WAND optimization for fast top-k queries

Falls back gracefully to ts_rank_cd() when pg_textsearch is not installed. Reports BM25 availability in getServerInfo() response.

…rnative

Simplified PR addressing review feedback:
- Consolidated duplicated INSERT...ON CONFLICT SQL into shared constant
- Removed fork-specific files (AGENTS.md, CLAUDE.md, GEMINI.md, docker compose, .gitignore)
- Moved PostgreSQL tests to separate opt-in suite (MoquiSuite untouched)
- Updated ElasticRequestLogFilter to use ElasticClient interface type
- 83 tests: 46 unit (query translation + SQL injection) + 37 integration

New files:
- PostgresElasticClient.groovy: Full ElasticClient impl using JSONB/tsvector
- ElasticQueryTranslator.groovy: ES Query DSL to PostgreSQL SQL
- PostgresSearchLogger.groovy: Log4j2 appender for PostgreSQL
- SearchEntities.xml: Entity definitions for search tables
- PostgresSearchSuite.groovy: Separate JUnit test suite
- PostgresSearchTranslatorTests.groovy: Unit tests
- PostgresElasticClientTests.groovy: Integration tests

Modified files:
- ElasticFacadeImpl.groovy: type=postgres routing
- ElasticRequestLogFilter.groovy: Interface type usage
- MoquiDefaultConf.xml: Postgres config + entity load
- moqui-conf-3.xsd: type attribute with elastic/postgres enum
- build.gradle: Test dependencies and suite include

…ne, parameterized queries

- Route HTTP logs to dedicated moqui_http_log table with typed columns instead of generic moqui_document JSONB storage
- Route deleteByQuery to dedicated tables (moqui_http_log, moqui_logs) with proper timestamp range extraction for nightly cleanup jobs
- Replace Java regex highlights with PostgreSQL ts_headline() for accurate, index-aware snippet generation
- Fix update() to read-merge-extract pattern so content_text stays consistent with extractContentText() used by index()
- Add searchHttpLogTable() for searching against dedicated HTTP log table
- Improve guessCastType() to inspect actual values (epoch millis, decimals, ISO dates) when field name heuristics are ambiguous
- Parameterize exists query (use ?? operator) to prevent SQL injection
- Handle indexExists/createIndex/count for dedicated table names

When the pg_textsearch extension is available (PostgreSQL 17+), automatically:
- Detect and enable the extension at startup
- Create a BM25 index on moqui_document.content_text with english config
- Use BM25 scoring (content_text <@> to_bm25query()) instead of ts_rank_cd() for document search queries, providing proper document-length normalization

Falls back gracefully to ts_rank_cd() when pg_textsearch is not installed. Reports BM25 availability in getServerInfo() response.

The BM25 index uses Block-Max WAND optimization for fast top-k queries and configurable parameters (k1=1.2, b=0.75 defaults).

pandor4u · 2026-04-10T23:34:56Z

Detailed Change Walkthrough

This PR adds optional BM25 ranking via the pg_textsearch extension (from Tiger Data, formerly Timescale). When installed, document search uses BM25 scoring instead of PostgreSQL's built-in ts_rank_cd().

Why this change?

PostgreSQL's native ts_rank_cd() uses cover density ranking: it rewards query terms that appear close together, but it has no term-frequency saturation, applies no document-length normalization unless a normalization option is passed, and does not use inverse document frequency. In practice this means:

- A document mentioning "database" 50 times is not scored on a saturating, length-aware curve the way BM25 would score it (no BM25-style TF)
- Common words like "system" get the same weight as rare domain terms like "moqui" (no IDF)

BM25 (Best Match 25) is the default ranking function in ElasticSearch/Lucene and a strong, widely used baseline for text relevance. It handles TF saturation (diminishing returns for repeated terms), IDF (rare terms get boosted), and document-length normalization. This generally makes search results more relevant, especially when the document corpus is large and document lengths vary.
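To make TF saturation, IDF, and length normalization concrete, here is a minimal sketch of the textbook BM25 score for a single query term, using the Lucene-style IDF. It is written in Java for self-containment (the PR's code is Groovy) and is not pg_textsearch's implementation. The class name, method names, and sample corpus numbers are hypothetical; k1 = 1.2 and b = 0.75 are the defaults quoted in this PR.

```java
// Illustrative only: textbook BM25 for one query term, not pg_textsearch internals.
final class Bm25Sketch {
    static final double K1 = 1.2;   // controls term-frequency saturation
    static final double B = 0.75;   // controls strength of length normalization

    /** Lucene-style IDF: ln(1 + (N - n + 0.5) / (n + 0.5)). */
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log(1.0 + (totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    /** BM25 contribution of a term occurring tf times in a document of docLen words. */
    static double score(int tf, int docLen, double avgDocLen, long totalDocs, long docsWithTerm) {
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        double tfPart = (tf * (K1 + 1.0)) / (tf + K1 * lengthNorm);
        return idf(totalDocs, docsWithTerm) * tfPart;
    }

    public static void main(String[] args) {
        // Hypothetical corpus: 1,000 documents, 100 contain the term, average length 500 words.
        double shortDoc = score(1, 50, 500.0, 1000, 100);      // one mention in 50 words
        double longDoc = score(1, 10_000, 500.0, 1000, 100);   // one mention in 10,000 words
        double repeated = score(50, 500, 500.0, 1000, 100);    // 50 mentions, average length
        System.out.printf("short=%.3f long=%.3f repeated=%.3f%n", shortDoc, longDoc, repeated);
    }
}
```

With these inputs the short document outscores the long one for the same single mention, and 50 mentions score higher than one mention but stay below (k1 + 1) times the IDF, which is the saturation ceiling when the document has average length.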

What changed and why

All changes are in PostgresElasticClient.groovy — this is a scoring-layer change, not a query-syntax change.

Extension detection (`initSchema`)

stmt.execute("CREATE EXTENSION IF NOT EXISTS pg_textsearch")
hasBm25Extension = true

Wrapped in try/catch — if pg_textsearch isn't installed, hasBm25Extension stays false and everything falls back to ts_rank_cd(). This is the same pattern used for pg_trgm detection.

BM25 index creation

CREATE INDEX IF NOT EXISTS idx_mq_doc_bm25
ON moqui_document USING bm25(content_text)
WITH (text_config='english')

pg_textsearch provides a custom index access method (USING bm25) that builds an inverted index optimized for BM25 scoring. The text_config='english' parameter applies the same English stemmer/stopwords as our existing tsvector config, ensuring consistent query behavior. If the index creation fails (e.g., the extension doesn't support the table layout), hasBm25Extension is set back to false.
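The two-step detection and fallback described above can be sketched as follows. This is a hedged Java illustration, not the PR's actual initSchema code: the class name, the SqlRunner interface, and the detectBm25 method are hypothetical stand-ins so the flow can be exercised without a database. Only the two SQL statements are taken from this PR.

```java
import java.sql.SQLException;

// Hypothetical sketch of the detect-then-fall-back flow; not the PR's initSchema code.
final class Bm25SetupSketch {
    /** Minimal stand-in for java.sql.Statement#execute so no database is needed. */
    interface SqlRunner {
        void execute(String sql) throws SQLException;
    }

    static final String CREATE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS pg_textsearch";
    static final String CREATE_INDEX =
            "CREATE INDEX IF NOT EXISTS idx_mq_doc_bm25 "
            + "ON moqui_document USING bm25(content_text) "
            + "WITH (text_config='english')";

    /** Returns true only if both the extension and the BM25 index are usable. */
    static boolean detectBm25(SqlRunner stmt) {
        try {
            stmt.execute(CREATE_EXTENSION);
        } catch (SQLException e) {
            return false; // extension not installed: keep using ts_rank_cd()
        }
        try {
            stmt.execute(CREATE_INDEX);
        } catch (SQLException e) {
            return false; // index could not be built: fall back as well
        }
        return true;
    }
}
```

With a real JDBC Statement this could be called as `Bm25SetupSketch.detectBm25(statement::execute)`. The sketch assumes the statements run in autocommit mode, since a failed statement inside an open PostgreSQL transaction aborts that transaction.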

Score selection (`buildScoreSelect`)

The method was refactored to accept a useBm25 parameter:

With BM25:

-(content_text <@> to_bm25query(?, 'idx_mq_doc_bm25'))

- <@> is pg_textsearch's BM25 distance operator (lower = better match)
- to_bm25query() converts search text to a BM25 query bound to the specific index
- The negation (-) flips the score so higher = better match, consistent with ES conventions
- The bind parameter receives the raw search text (not a tsquery expression)

Without BM25 (fallback):

ts_rank_cd(content_tsv, websearch_to_tsquery('english', ?))

Unchanged from PR #700.
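As a rough illustration of how the two score expressions could be selected and placed into a query, here is a Java sketch. It is an assumption-laden outline, not the PR's buildScoreSelect: the class name, the buildSearchSql helper, the doc_id column, the _score alias, and the LIMIT placeholder are all hypothetical. The two score expressions and the tsvector WHERE condition are the ones quoted in this PR.

```java
// Hypothetical sketch of the score-expression switch; the real buildScoreSelect may differ.
final class ScoreSelectSketch {
    static final String BM25_INDEX = "idx_mq_doc_bm25";

    /** Returns the SQL expression used as the relevance score. */
    static String buildScoreSelect(boolean useBm25) {
        if (useBm25) {
            // <@> is distance-like (lower is better), so negate it to sort descending
            return "-(content_text <@> to_bm25query(?, '" + BM25_INDEX + "'))";
        }
        return "ts_rank_cd(content_tsv, websearch_to_tsquery('english', ?))";
    }

    /** Wraps the score in a SELECT; doc_id, _score and LIMIT are illustrative assumptions. */
    static String buildSearchSql(boolean useBm25) {
        return "SELECT doc_id, " + buildScoreSelect(useBm25) + " AS _score "
             + "FROM moqui_document "
             + "WHERE content_tsv @@ websearch_to_tsquery('english', ?) "
             + "ORDER BY _score DESC LIMIT ?";
    }
}
```

Note that only the score expression changes between the two modes. The WHERE condition stays on content_tsv in both, so the set of matching documents is the same and only their order differs.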

Parameter routing (`search`)

BM25 and ts_rank_cd have different parameter needs:

- BM25: to_bm25query(?, 'idx_mq_doc_bm25') takes the raw search text as a single ? param
- ts_rank_cd: Takes tsqueryParams which may include multiple parameters for the tsquery expression

The routing logic:

if (useBm25 && tq.tsqueryParams) mainParams.add(tq.tsqueryParams[0])
else if (tq.tsqueryExpr) mainParams.addAll(tq.tsqueryParams)

This ensures the correct parameter is bound for whichever scoring method is active.
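The routing rule can be written out in Java as below, with Groovy's truthiness checks made explicit. This is a sketch under assumptions: TranslatedQuery is a hypothetical stand-in for the PR's `tq` object, scoreParams is a hypothetical method name, and the sketch assumes the first entry of tsqueryParams is the raw search text, as the BM25 branch implies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the routing rule; TranslatedQuery stands in for the PR's `tq`.
final class ParamRoutingSketch {
    /** Assumed shape: a tsquery SQL expression plus its bind values, raw text first. */
    static final class TranslatedQuery {
        final String tsqueryExpr;
        final List<Object> tsqueryParams;

        TranslatedQuery(String tsqueryExpr, List<Object> tsqueryParams) {
            this.tsqueryExpr = tsqueryExpr;
            this.tsqueryParams = tsqueryParams;
        }
    }

    /** Returns the bind values for the score expression of the active ranking method. */
    static List<Object> scoreParams(boolean useBm25, TranslatedQuery tq) {
        List<Object> mainParams = new ArrayList<>();
        // Groovy treats null and empty collections/strings as false; spell that out here.
        boolean hasParams = tq.tsqueryParams != null && !tq.tsqueryParams.isEmpty();
        boolean hasExpr = tq.tsqueryExpr != null && !tq.tsqueryExpr.isEmpty();
        if (useBm25 && hasParams) {
            mainParams.add(tq.tsqueryParams.get(0));   // BM25: raw search text only
        } else if (hasExpr && hasParams) {
            mainParams.addAll(tq.tsqueryParams);       // ts_rank_cd: every tsquery parameter
        }
        return mainParams;
    }
}
```

For a query with parameters ["moqui search", "extra"], the BM25 branch binds only "moqui search", while the fallback branch binds both values in order.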

Server info

getServerInfo() now reports features: [bm25: true/false] so callers can discover whether BM25 is active.

What is NOT changed

- The WHERE clause is unchanged: filtering still uses content_tsv @@ websearch_to_tsquery(). BM25 only affects scoring/ranking, not which documents match.
- No new configuration is required: BM25 activates automatically when pg_textsearch is installed.
- No schema migration is needed: the BM25 index is additive (CREATE INDEX IF NOT EXISTS).

Performance notes

- pg_textsearch builds and maintains the BM25 index transparently on INSERT/UPDATE.
- BM25 scoring via the <@> operator uses the index directly, so content_text is not scanned at query time.
- The index adds an estimated ~30% storage overhead on content_text but eliminates the need for runtime TF-IDF computation.

Depends on

PR #700, "feat: PostgreSQL-backed search/document storage as ElasticSearch alternative" (PostgreSQL search backend, base moqui_document table and search infrastructure)

pandor4u added 3 commits April 10, 2026 17:46

pandor4u wants to merge 3 commits into moqui:master from pandor4u:feature/pg-textsearch-bm25
