feat: Optional BM25 ranking via pg_textsearch extension #711
pandor4u wants to merge 3 commits into moqui:master from …
…rnative

Simplified PR addressing review feedback:
- Consolidated duplicated INSERT...ON CONFLICT SQL into shared constant
- Removed fork-specific files (AGENTS.md, CLAUDE.md, GEMINI.md, docker compose, .gitignore)
- Moved PostgreSQL tests to separate opt-in suite (MoquiSuite untouched)
- Updated ElasticRequestLogFilter to use ElasticClient interface type
- 83 tests: 46 unit (query translation + SQL injection) + 37 integration

New files:
- PostgresElasticClient.groovy: Full ElasticClient impl using JSONB/tsvector
- ElasticQueryTranslator.groovy: ES Query DSL to PostgreSQL SQL
- PostgresSearchLogger.groovy: Log4j2 appender for PostgreSQL
- SearchEntities.xml: Entity definitions for search tables
- PostgresSearchSuite.groovy: Separate JUnit test suite
- PostgresSearchTranslatorTests.groovy: Unit tests
- PostgresElasticClientTests.groovy: Integration tests

Modified files:
- ElasticFacadeImpl.groovy: type=postgres routing
- ElasticRequestLogFilter.groovy: Interface type usage
- MoquiDefaultConf.xml: Postgres config + entity load
- moqui-conf-3.xsd: type attribute with elastic/postgres enum
- build.gradle: Test dependencies and suite include
…ne, parameterized queries

- Route HTTP logs to dedicated moqui_http_log table with typed columns instead of generic moqui_document JSONB storage
- Route deleteByQuery to dedicated tables (moqui_http_log, moqui_logs) with proper timestamp range extraction for nightly cleanup jobs
- Replace Java regex highlights with PostgreSQL ts_headline() for accurate, index-aware snippet generation
- Fix update() to use a read-merge-extract pattern so content_text stays consistent with extractContentText() used by index()
- Add searchHttpLogTable() for searching against dedicated HTTP log table
- Improve guessCastType() to inspect actual values (epoch millis, decimals, ISO dates) when field name heuristics are ambiguous
- Parameterize exists query (use ?? operator) to prevent SQL injection
- Handle indexExists/createIndex/count for dedicated table names
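The value-inspection idea behind the guessCastType() change can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the class name, the returned type names, and the epoch-millis range are all assumptions, and the field-name heuristics that run first in the PR are left out.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeParseException;

// Hypothetical sketch: infer a PostgreSQL cast type from a sample value.
final class CastTypeGuess {
    static String guessCastType(Object value) {
        if (value instanceof Long || value instanceof Integer) {
            long n = ((Number) value).longValue();
            // Assumption: 13-digit integers (roughly years 2001 to 2286) are epoch millis.
            return (n >= 1_000_000_000_000L && n < 10_000_000_000_000L) ? "timestamp" : "bigint";
        }
        if (value instanceof Number) return "numeric"; // Double, BigDecimal, ...
        if (value instanceof CharSequence) {
            String s = value.toString().trim();
            if (isIsoDate(s)) return "timestamp";
            if (isDecimal(s)) return "numeric";
        }
        return "text"; // safe default when nothing else matches
    }

    // Accepts ISO-8601 date-times with offset, local date-times, and plain dates.
    private static boolean isIsoDate(String s) {
        try { OffsetDateTime.parse(s); return true; } catch (DateTimeParseException ignored) { }
        try { LocalDateTime.parse(s); return true; } catch (DateTimeParseException ignored) { }
        try { LocalDate.parse(s); return true; } catch (DateTimeParseException ignored) { }
        return false;
    }

    private static boolean isDecimal(String s) {
        try { new BigDecimal(s); return true; } catch (NumberFormatException e) { return false; }
    }
}
```

Falling back to text keeps the comparison valid, at the cost of lexical rather than numeric ordering, when a value cannot be classified.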
When the pg_textsearch extension is available (PostgreSQL 17+), automatically:
- Detect and enable the extension at startup
- Create a BM25 index on moqui_document.content_text with english config
- Use BM25 scoring (content_text <@> to_bm25query()) instead of ts_rank_cd() for document search queries, providing proper document-length normalization

Falls back gracefully to ts_rank_cd() when pg_textsearch is not installed. Reports BM25 availability in getServerInfo() response.

The BM25 index uses Block-Max WAND optimization for fast top-k queries and configurable parameters (k1=1.2, b=0.75 defaults).
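To make the effect of k1 and b concrete, here is a small sketch of the textbook BM25 term score. It assumes the common Lucene-style IDF and is not necessarily the exact variant pg_textsearch computes; the corpus numbers in main (1,000 documents, 100 containing the term, average length 500 words) are made up for illustration.

```java
// Textbook BM25 for a single query term; an illustration, not pg_textsearch's code.
final class Bm25Sketch {
    // Lucene-style IDF: rarer terms get a larger weight and the value stays positive.
    static double idf(long docCount, long docsWithTerm) {
        return Math.log(1.0 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
    }

    // k1 controls term-frequency saturation; b controls document-length normalization.
    static double termScore(double tf, double docLen, double avgDocLen,
                            double idf, double k1, double b) {
        double lengthNorm = k1 * (1.0 - b + b * docLen / avgDocLen);
        return idf * (tf * (k1 + 1.0)) / (tf + lengthNorm);
    }

    public static void main(String[] args) {
        double idf = idf(1000, 100);
        // One occurrence of the term in a 50-word and in a 10,000-word document.
        double shortDoc = termScore(1, 50, 500, idf, 1.2, 0.75);
        double longDoc = termScore(1, 10_000, 500, idf, 1.2, 0.75);
        System.out.printf("short=%.3f long=%.3f%n", shortDoc, longDoc);
    }
}
```

With b = 0.75 the short document scores higher than the long one for the same single match. Setting b = 0 disables length normalization, so both documents score the same.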
Detailed Change Walkthrough

This PR adds optional BM25 ranking via the pg_textsearch extension (from Tiger Data, formerly Timescale). When installed, document search uses true BM25 scoring instead of PostgreSQL's built-in …

Why this change?

PostgreSQL's native …

BM25 (Best Match 25) is the standard ranking function used by Elasticsearch/Lucene and a strong, widely used baseline for text relevance. It handles TF saturation (diminishing returns for repeated terms) and IDF (rare terms get boosted). This typically improves relevance on large corpora whose documents vary widely in length.

What changed and why

All changes are in …

Extension detection (…
BM25 Ranking via pg_textsearch

Depends on: #700

Problem

The current PostgreSQL search backend uses ts_rank_cd() for scoring, which lacks document-length normalization: a 10,000-word document mentioning "database" once scores similarly to a 50-word document mentioning it once.

Solution

When the pg_textsearch extension (v1.0.0, PostgreSQL 17+) is available, automatically:
- Detect and enable the extension at startup
- Create a BM25 index on moqui_document.content_text with english text config
- Use BM25 scoring (content_text <@> to_bm25query()) instead of ts_rank_cd() for document search, providing proper document-length normalization with configurable k1/b parameters

Falls back gracefully to ts_rank_cd() when pg_textsearch is not installed. Reports BM25 availability in the getServerInfo() response.
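The fallback behavior can be pictured with a sketch like the one below. Everything in it is illustrative rather than taken from the PR: the class name, the content_tsv column, the single-argument form of to_bm25query(), and the keys in the server-info map are assumptions. Only the pg_available_extensions view, the CREATE EXTENSION IF NOT EXISTS syntax, and ts_rank_cd() are standard PostgreSQL.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of choosing a score expression once BM25 availability is known.
final class RankingSelector {
    // Standard catalog view: lists extensions that could be installed on this server.
    static final String EXTENSION_AVAILABLE_SQL =
            "SELECT 1 FROM pg_available_extensions WHERE name = 'pg_textsearch'";
    static final String ENABLE_EXTENSION_SQL =
            "CREATE EXTENSION IF NOT EXISTS pg_textsearch";

    private final boolean bm25Available;

    // The flag would be set at startup from the result of the detection query.
    RankingSelector(boolean bm25Available) {
        this.bm25Available = bm25Available;
    }

    // Returns the SQL fragment used to score a document; '?' is a bind parameter.
    String scoreExpression() {
        return bm25Available
                // Assumed argument list; the PR text only shows to_bm25query().
                ? "content_text <@> to_bm25query(?)"
                // content_tsv is a hypothetical tsvector column name.
                : "ts_rank_cd(content_tsv, plainto_tsquery('english', ?))";
    }

    // Mirrors the idea of reporting BM25 availability; key names are made up.
    Map<String, Object> serverInfo() {
        Map<String, Object> info = new LinkedHashMap<>();
        info.put("bm25Available", bm25Available);
        info.put("ranking", bm25Available ? "bm25" : "ts_rank_cd");
        return info;
    }
}
```

Deciding once at startup keeps the per-query path free of extension checks, and callers only see a different score expression.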