
Boost exact-character matches over ASCII-folded matches #1026

Draft
henrik242 wants to merge 1 commit into komoot:master from entur:henrik/boost-exact-matches

Conversation

@henrik242
Contributor

Summary

  • Adds a non-folded raw sub-field on collector.name that indexes names with only lowercasing (no ASCII folding, no German normalization)
  • Adds boosted should clauses in both short and full search queries so documents whose unfolded name matches the query score higher
  • Fixes ranking where e.g. searching "Lærdal" would return "Lardal" with equal or higher prominence, because both collapsed to the same token after ASCII folding

Behavior

  • "Lærdal" → "Lærdal" ranks above "Lardal" (raw field match on the query)
  • "Lardal" → "Lardal" ranks above "Lærdal" (raw field match on the query)
  • "lardal" → still finds both via the folded fields (broad recall preserved)

No existing behaviour changes. german_normalization is left in place. The raw field only adds score - it never removes matches.
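
The intended ranking can be illustrated with a small toy model. This is only a sketch, not Photon's analyzer chain or query code: `FOLD_MAP`, `folded`, `raw`, `score`, `search` and the `raw_boost` value are all hypothetical names and numbers chosen for illustration. The folded comparison decides whether a document matches at all, and the raw comparison only adds score on top.

```python
# Toy model only: not Photon's analyzers or queries.
# FOLD_MAP is a hypothetical stand-in for the combined effect of
# ASCII folding and normalization on a few Norwegian letters.
FOLD_MAP = str.maketrans({"æ": "a", "ø": "o", "å": "a"})


def raw(text):
    """Lowercase only, as the proposed raw sub-field would index names."""
    return text.lower()


def folded(text):
    """Lowercase and collapse the mapped characters onto plain ASCII."""
    return text.lower().translate(FOLD_MAP)


def score(query, name, raw_boost=2.0):
    """Return None when the folded forms differ, else a base score
    plus an optional boost when the unfolded forms also agree."""
    if folded(query) != folded(name):
        return None  # no match: the raw field never removes or adds hits
    total = 1.0
    if raw(query) == raw(name):
        total += raw_boost  # plays the role of the boosted should clause
    return total


def search(query, names):
    """Return matching names, best score first (stable for ties)."""
    hits = [(score(query, n), n) for n in names]
    hits = [(s, n) for s, n in hits if s is not None]
    hits.sort(key=lambda hit: -hit[0])
    return [n for _, n in hits]
```

With these assumptions, `search("Lærdal", ["Lardal", "Lærdal"])` puts "Lærdal" first, `search("Lardal", ...)` puts "Lardal" first, and `search("lardal", ...)` still returns both names.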

Test plan

  • New test testExactDiacriticsOverFolded validates ranking directly
  • All existing tests pass (./gradlew test)

Add a non-folded raw sub-field on collector.name that indexes names
with only lowercasing. Boosted should clauses in search queries reward
documents whose unfolded name matches the query, so that e.g. searching
"Lærdal" ranks "Lærdal" above "Lardal" while preserving broad recall.
@lonvia
Collaborator

lonvia commented Mar 2, 2026

This only solves the issue on the full names, not during search-as-you-type (try searching for Lård). You'd need yet another secondary index for the prefix matchers. That's quite expensive and the reason I rejected this approach.

In general, a second index is only required when the expected results don't show up in the result list at all. If it is more a matter of the order in which the results are shown, then reranking results is the cheaper approach.
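
A minimal sketch of what such a reranking step could look like, assuming the folded search has already returned an ordered list of names. The function `rerank` and its tiering are hypothetical and are not taken from Photon. Because it compares the unfolded query against the start of each unfolded name, the same step also covers search-as-you-type prefixes without an extra index.

```python
def rerank(query, results):
    """Reorder already-retrieved names so that unfolded matches come first.

    Tiers: 0 = exact match on the lowercased name,
           1 = lowercased name starts with the lowercased query,
           2 = everything else (matched only via folding).
    The original position breaks ties, so the engine's order is kept
    within each tier.
    """
    q = query.lower()

    def key(item):
        position, name = item
        n = name.lower()
        if n == q:
            tier = 0
        elif n.startswith(q):
            tier = 1
        else:
            tier = 2
        return (tier, position)

    return [name for _, name in sorted(enumerate(results), key=key)]
```

For example, `rerank("Lær", ["Lardal", "Lærdal"])` moves "Lærdal" to the front, while a list with no unfolded matches is returned in its original order.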

@henrik242 henrik242 marked this pull request as draft March 3, 2026 14:37
