feat: Optional BM25 ranking via pg_textsearch extension #711

Open
pandor4u wants to merge 3 commits into moqui:master from pandor4u:feature/pg-textsearch-bm25

@pandor4u

BM25 Ranking via pg_textsearch

Depends on: #700

Problem

The current PostgreSQL search backend scores results with ts_rank_cd(), which is called without a normalization flag and therefore ignores document length: a 10,000-word document mentioning "database" once scores similarly to a 50-word document mentioning it once.

Solution

When the pg_textsearch extension (v1.0.0, PostgreSQL 17+) is available, automatically:

  • Detect and enable the extension at startup
  • Create a BM25 index on moqui_document.content_text with english text config
  • Use BM25 scoring (content_text <@> to_bm25query()) instead of ts_rank_cd() for document search, providing proper document-length normalization with configurable k1/b parameters
  • Rely on the index's Block-Max WAND optimization for fast top-k queries

Falls back gracefully to ts_rank_cd() when pg_textsearch is not installed. Reports BM25 availability in getServerInfo() response.
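
For reference, the document-length normalization and the k1/b parameters mentioned above come from the textbook Okapi BM25 formula. The form below is the standard one with a Lucene-style IDF; whether pg_textsearch uses exactly this IDF variant is an assumption, not something this PR states.

```latex
\[
\mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{IDF}(q)\,
  \frac{f(q, D)\,(k_1 + 1)}
       {f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{IDF}(q) = \ln\!\left(1 + \frac{N - n(q) + 0.5}{n(q) + 0.5}\right)
\]
```

Here f(q, D) is the number of occurrences of term q in document D, |D| is the document length, avgdl is the average document length in the index, N is the number of documents, and n(q) is the number of documents containing q. Setting b = 0 disables length normalization; b = 1 applies it fully.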

…rnative

Simplified PR addressing review feedback:
- Consolidated duplicated INSERT...ON CONFLICT SQL into shared constant
- Removed fork-specific files (AGENTS.md, CLAUDE.md, GEMINI.md, docker compose, .gitignore)
- Moved PostgreSQL tests to separate opt-in suite (MoquiSuite untouched)
- Updated ElasticRequestLogFilter to use ElasticClient interface type
- 83 tests: 46 unit (query translation + SQL injection) + 37 integration

New files:
- PostgresElasticClient.groovy: Full ElasticClient impl using JSONB/tsvector
- ElasticQueryTranslator.groovy: ES Query DSL to PostgreSQL SQL
- PostgresSearchLogger.groovy: Log4j2 appender for PostgreSQL
- SearchEntities.xml: Entity definitions for search tables
- PostgresSearchSuite.groovy: Separate JUnit test suite
- PostgresSearchTranslatorTests.groovy: Unit tests
- PostgresElasticClientTests.groovy: Integration tests

Modified files:
- ElasticFacadeImpl.groovy: type=postgres routing
- ElasticRequestLogFilter.groovy: Interface type usage
- MoquiDefaultConf.xml: Postgres config + entity load
- moqui-conf-3.xsd: type attribute with elastic/postgres enum
- build.gradle: Test dependencies and suite include
…ne, parameterized queries

- Route HTTP logs to dedicated moqui_http_log table with typed columns
  instead of generic moqui_document JSONB storage
- Route deleteByQuery to dedicated tables (moqui_http_log, moqui_logs)
  with proper timestamp range extraction for nightly cleanup jobs
- Replace Java regex highlights with PostgreSQL ts_headline() for
  accurate, index-aware snippet generation
- Fix update() to read-merge-extract pattern so content_text stays
  consistent with extractContentText() used by index()
- Add searchHttpLogTable() for searching against dedicated HTTP log table
- Improve guessCastType() to inspect actual values (epoch millis,
  decimals, ISO dates) when field name heuristics are ambiguous
- Parameterize exists query (use ?? operator) to prevent SQL injection
- Handle indexExists/createIndex/count for dedicated table names
When the pg_textsearch extension is available (PostgreSQL 17+), automatically:
- Detect and enable the extension at startup
- Create a BM25 index on moqui_document.content_text with english config
- Use BM25 scoring (content_text <@> to_bm25query()) instead of ts_rank_cd()
  for document search queries, providing proper document-length normalization

Falls back gracefully to ts_rank_cd() when pg_textsearch is not installed.
Reports BM25 availability in getServerInfo() response.

The BM25 index uses Block-Max WAND optimization for fast top-k queries
and configurable parameters (k1=1.2, b=0.75 defaults).
@pandor4u

Detailed Change Walkthrough

This PR adds optional BM25 ranking via the pg_textsearch extension (from Tiger Data, formerly Timescale). When installed, document search uses true BM25 scoring instead of PostgreSQL's built-in ts_rank_cd().

Why this change?

PostgreSQL's native ts_rank_cd() uses cover density ranking — it rewards terms appearing close together but doesn't account for term frequency normalization or inverse document frequency. In practice this means:

  • A document mentioning "database" 50 times scores the same as one mentioning it once (no TF)
  • Common words like "system" get the same weight as rare domain terms like "moqui" (no IDF)

BM25 (Best Matching 25) is the default ranking function in Elasticsearch/Lucene and a strong, widely used baseline for text relevance. It handles TF saturation (diminishing returns for repeated terms) and IDF (rare terms are boosted), and it normalizes for document length. These properties matter most when the corpus is large and document lengths vary widely.
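
To make the saturation and length effects concrete, here is a minimal Java sketch of the term-frequency part of textbook BM25 for a single query term. It is illustrative only and is not code from this PR or from pg_textsearch; the class name, method name, and the average document length of 500 words are all hypothetical.

```java
// Illustrative only: textbook BM25 term-frequency component, not pg_textsearch code.
final class Bm25TermFrequencyDemo {

    /**
     * TF part of BM25 for one term: tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)).
     * The IDF factor is deliberately left out so only saturation and length effects show.
     */
    static double tfComponent(double termFreq, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthNorm = 1.0 - b + b * (docLen / avgDocLen);
        return termFreq * (k1 + 1.0) / (termFreq + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        double k1 = 1.2;       // defaults quoted in the PR
        double b = 0.75;
        double avgLen = 500.0; // hypothetical corpus average

        // Same single mention, very different document lengths.
        System.out.println("short doc, 1 mention:  " + tfComponent(1, 50, avgLen, k1, b));
        System.out.println("long doc, 1 mention:   " + tfComponent(1, 10_000, avgLen, k1, b));

        // Repeated mentions saturate: the value can never exceed k1 + 1.
        System.out.println("avg doc, 1 mention:    " + tfComponent(1, avgLen, avgLen, k1, b));
        System.out.println("avg doc, 50 mentions:  " + tfComponent(50, avgLen, avgLen, k1, b));
    }
}
```

The short document scores higher than the long one for the same single mention, and 50 mentions score higher than one mention but stay below the k1 + 1 ceiling.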

What changed and why

All changes are in PostgresElasticClient.groovy — this is a scoring-layer change, not a query-syntax change.

Extension detection (initSchema)

stmt.execute("CREATE EXTENSION IF NOT EXISTS pg_textsearch")
hasBm25Extension = true

Wrapped in try/catch — if pg_textsearch isn't installed, hasBm25Extension stays false and everything falls back to ts_rank_cd(). This is the same pattern used for pg_trgm detection.

BM25 index creation

CREATE INDEX IF NOT EXISTS idx_mq_doc_bm25
ON moqui_document USING bm25(content_text)
WITH (text_config='english')

pg_textsearch provides a custom index access method (USING bm25) that builds an inverted index optimized for BM25 scoring. The text_config='english' parameter applies the same English stemmer/stopwords as our existing tsvector config, ensuring consistent query behavior. If the index creation fails (e.g., the extension doesn't support the table layout), hasBm25Extension is set back to false.
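
The detect-then-index flow described above can be sketched as follows. This is a hedged Java approximation, not the Groovy code in PostgresElasticClient.groovy: the SqlRunner interface and the detectBm25 name are hypothetical, introduced so the logic can be exercised without a database. The two SQL statements are the ones quoted in this walkthrough.

```java
import java.sql.SQLException;

// Sketch of the fallback pattern; names other than the SQL text are hypothetical.
final class Bm25Detection {

    /** Minimal stand-in for java.sql.Statement#execute(String). */
    interface SqlRunner {
        void execute(String sql) throws SQLException;
    }

    /**
     * Returns true only if both the extension and the BM25 index are usable.
     * Any SQL failure leaves BM25 disabled so callers fall back to ts_rank_cd().
     */
    static boolean detectBm25(SqlRunner runner) {
        try {
            runner.execute("CREATE EXTENSION IF NOT EXISTS pg_textsearch");
            runner.execute("CREATE INDEX IF NOT EXISTS idx_mq_doc_bm25 "
                    + "ON moqui_document USING bm25(content_text) "
                    + "WITH (text_config='english')");
            return true;
        } catch (SQLException e) {
            // Extension missing or index creation rejected: stay on the fallback path.
            return false;
        }
    }

    public static void main(String[] args) {
        // Dry run with a runner that only prints the statements it would execute.
        boolean enabled = detectBm25(sql -> System.out.println("would run: " + sql));
        System.out.println("BM25 enabled: " + enabled);
    }
}
```

With a real JDBC connection, a `java.sql.Statement` can be passed as `stmt::execute`, since the boolean return value is simply discarded.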

Score selection (buildScoreSelect)

The method was refactored to accept a useBm25 parameter:

With BM25:

-(content_text <@> to_bm25query(?, 'idx_mq_doc_bm25'))
  • <@> is pg_textsearch's BM25 distance operator (lower = better match)
  • to_bm25query() converts search text to a BM25 query bound to the specific index
  • The negation (-) flips the score so higher = better match, consistent with ES conventions
  • The bind parameter receives the raw search text (not a tsquery expression)

Without BM25 (fallback):

ts_rank_cd(content_tsv, websearch_to_tsquery('english', ?))

Unchanged from PR #700.
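
As a rough illustration of the switch, the two score expressions above could be selected like this. The Java class and the single-boolean signature are assumptions made for the sketch; the real buildScoreSelect in the PR is Groovy and may take additional arguments.

```java
// Sketch only: mirrors the two SQL fragments quoted above, with a hypothetical signature.
final class ScoreSelectSketch {

    static final String BM25_INDEX = "idx_mq_doc_bm25";

    /** Returns the SQL score expression; each variant contains exactly one bind parameter. */
    static String buildScoreSelect(boolean useBm25) {
        if (useBm25) {
            // <@> yields a distance (lower is better), so negate it to get higher-is-better.
            return "-(content_text <@> to_bm25query(?, '" + BM25_INDEX + "'))";
        }
        // Fallback: cover-density ranking over the existing tsvector column.
        return "ts_rank_cd(content_tsv, websearch_to_tsquery('english', ?))";
    }

    public static void main(String[] args) {
        System.out.println(buildScoreSelect(true));
        System.out.println(buildScoreSelect(false));
    }
}
```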

Parameter routing (search)

BM25 and ts_rank_cd have different parameter needs:

  • BM25: to_bm25query(?, 'idx_mq_doc_bm25') takes the raw search text as a single ? param
  • ts_rank_cd: Takes tsqueryParams which may include multiple parameters for the tsquery expression

The routing logic:

if (useBm25 && tq.tsqueryParams) mainParams.add(tq.tsqueryParams[0])
else if (tq.tsqueryExpr) mainParams.addAll(tq.tsqueryParams)

This ensures the correct parameter is bound for whichever scoring method is active.
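
A Java rendering of the same routing, spelled out with explicit null and empty checks in place of Groovy truthiness, might look like the sketch below. The method name and the flattened arguments (standing in for the `tq` object) are hypothetical, and the extra null guard in the fallback branch is an addition of this sketch rather than something the PR code is known to do.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the score-parameter routing; argument shapes are assumptions.
final class ScoreParamRouting {

    static List<Object> scoreParams(boolean useBm25, String tsqueryExpr,
                                    List<Object> tsqueryParams) {
        List<Object> mainParams = new ArrayList<>();
        boolean hasParams = tsqueryParams != null && !tsqueryParams.isEmpty();
        if (useBm25 && hasParams) {
            // BM25: to_bm25query(?, ...) has one placeholder, the raw search text.
            mainParams.add(tsqueryParams.get(0));
        } else if (tsqueryExpr != null && !tsqueryExpr.isEmpty()) {
            // ts_rank_cd: the tsquery expression may carry several placeholders.
            if (tsqueryParams != null) {
                mainParams.addAll(tsqueryParams);
            }
        }
        return mainParams;
    }

    public static void main(String[] args) {
        List<Object> params = List.of("postgres search", "extra");
        System.out.println("bm25:       " + scoreParams(true, "expr", params));
        System.out.println("ts_rank_cd: " + scoreParams(false, "expr", params));
    }
}
```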

Server info

getServerInfo() now reports features: [bm25: true/false] so callers can discover whether BM25 is active.
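
A caller could check the flag along these lines. Only the `features` and `bm25` keys come from the description above; the rest of the response shape, and both helper names, are assumptions of this Java sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: models only the features.bm25 entry of the server info map.
final class ServerInfoSketch {

    /** Builds a minimal info map containing just the BM25 feature flag. */
    static Map<String, Object> serverInfo(boolean hasBm25Extension) {
        Map<String, Object> features = new LinkedHashMap<>();
        features.put("bm25", hasBm25Extension);
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("features", features);
        return info;
    }

    /** Defensive read: a missing or malformed entry is treated as "BM25 not active". */
    static boolean bm25Active(Map<String, Object> info) {
        Object features = info.get("features");
        return features instanceof Map
                && Boolean.TRUE.equals(((Map<?, ?>) features).get("bm25"));
    }

    public static void main(String[] args) {
        System.out.println(bm25Active(serverInfo(true)));
        System.out.println(bm25Active(serverInfo(false)));
    }
}
```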

What is NOT changed

  • The WHERE clause is unchanged — filtering still uses content_tsv @@ websearch_to_tsquery(). BM25 only affects scoring/ranking, not which documents match.
  • No new configuration required — BM25 activates automatically when pg_textsearch is installed.
  • No schema migration — the BM25 index is additive (CREATE INDEX IF NOT EXISTS).
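
To show how the unchanged filter and the switchable score fit together, here is a hedged Java sketch that assembles a full query. The `_score` alias, the `SELECT *` column list, and the ORDER BY/LIMIT tail are assumptions for illustration; the PR's actual statement may differ.

```java
// Sketch only: combines the quoted score expressions with the unchanged tsvector filter.
final class SearchSqlSketch {

    static String buildSearchSql(boolean useBm25) {
        String score = useBm25
                ? "-(content_text <@> to_bm25query(?, 'idx_mq_doc_bm25'))"
                : "ts_rank_cd(content_tsv, websearch_to_tsquery('english', ?))";
        // The WHERE clause is identical in both modes: matching is always tsvector-based.
        return "SELECT *, " + score + " AS _score"
                + " FROM moqui_document"
                + " WHERE content_tsv @@ websearch_to_tsquery('english', ?)"
                + " ORDER BY _score DESC LIMIT ?";
    }

    public static void main(String[] args) {
        System.out.println(buildSearchSql(true));
        System.out.println(buildSearchSql(false));
    }
}
```

In this layout the score placeholder precedes the filter placeholder, so the score parameter has to be bound first, followed by the tsquery text and the limit.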

Performance notes

  • The BM25 index is built and maintained by pg_textsearch transparently on INSERT/UPDATE.
  • BM25 scoring via <@> operator uses the index directly, so it's not scanning content_text at query time.
  • The index adds ~30% storage overhead on content_text but eliminates the need for runtime TF-IDF computation.

Depends on
