fix: Handle zero-context matches in hyphenated words. by jbaiter · Pull Request #507 · dbmdz/solr-ocrhighlighting

jbaiter · 2026-02-11T09:13:13Z

Previously, for a match in a hyphenated word with zero context, we would only return the first half of the word and not expand the snippet region to contain the second half.

While fixing this, we also discovered a related second bug: if the second part of a hyphenated word was the last word in a fragment, it would be skipped when the fragment was parsed.

Both of these issues have been fixed.

Copilot

Pull request overview

Fixes OCR snippet/highlight generation for matches spanning hyphenated words when the configured snippet context is zero, and addresses a related parser edge case where the last word of a fragment could be skipped.

Changes:

- Add logic in `OcrPassageFormatter` to expand a passage when the last token is a highlighted hyphenation start so the continuation is included.
- Adjust `OcrParser` iteration to rely on `readNext(...)` returning `null` to signal end-of-stream, avoiding missed final tokens.
- Add an hOCR fixture and a regression test covering a hyphenation match at a line break with `hl.ocr.contextSize=0`.
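The iteration change can be illustrated with a minimal sketch. It assumes a parser whose `readNext()` returns `null` once the input is consumed, so the iterator decides about end-of-stream only from that return value and the final token is never dropped. `NullTerminatedParser` and `WordParser` are hypothetical names invented for this illustration; they are not the project's `OcrParser`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical stand-in for a streaming parser; not the project's OcrParser. */
abstract class NullTerminatedParser<T> implements Iterator<T> {
  private T lookahead;       // token fetched ahead of time, or null if nothing is buffered
  private boolean exhausted; // set once readNext() has returned null

  /** Assumed contract: returns the next token, or null once the input is fully consumed. */
  protected abstract T readNext();

  @Override
  public boolean hasNext() {
    // Only a null from readNext() ends iteration; no separate "is the input empty?" check
    // that could fire while the last token is still waiting to be emitted.
    if (lookahead == null && !exhausted) {
      lookahead = readNext();
      exhausted = (lookahead == null);
    }
    return lookahead != null;
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T out = lookahead;
    lookahead = null;
    return out;
  }
}

/** Toy parser that splits a fragment on whitespace (illustration only). */
class WordParser extends NullTerminatedParser<String> {
  private final String[] words;
  private int pos = 0;

  WordParser(String fragment) {
    String trimmed = fragment.trim();
    this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
  }

  @Override
  protected String readNext() {
    return pos < words.length ? words[pos++] : null;
  }

  /** Collects every token of the fragment, including the last one. */
  static List<String> parseAll(String fragment) {
    List<String> out = new ArrayList<>();
    new WordParser(fragment).forEachRemaining(out::add);
    return out;
  }
}
```

With this shape, a fragment that ends in the second half of a hyphenated word still yields that word, because nothing but the `null` return can terminate the loop.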

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| src/test/resources/data/hocr_hyphen.html | Adds hOCR sample containing a line-break hyphenation used by a new regression test. |
| src/test/java/com/github/dbmdz/solrocr/solr/HocrTest.java | Adds a test asserting passage expansion and correct highlights for a hyphenated match with zero context. |
| src/main/java/com/github/dbmdz/solrocr/lucene/OcrPassageFormatter.java | Implements passage expansion for highlighted hyphen-start-at-end cases; refactors fragment building to work from parsed boxes. |
| src/main/java/com/github/dbmdz/solrocr/formats/OcrParser.java | Changes parsing loop behavior and documents `readNext(...)` contract more explicitly. |
| pom.xml | Bumps project version to `0.9.6-SNAPSHOT`. |


schmika · 2026-02-11T12:54:52Z

+      if (hyphenEnd.isHyphenated()
+          && !hyphenEnd.isHyphenStart()
+          && hyphenEnd.getDehyphenatedForm().equals(hyphenStart.getDehyphenatedForm())) {


Could this check be a method of OcrBox? It might make it easier to understand at a glance what is being checked here.

Good idea, I'll add OcrBox::isHyphenEndOf(OcrBox other) 👍🏾
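A minimal sketch of what such a helper might look like follows. `SketchOcrBox` is a simplified, hypothetical stand-in for the project's `OcrBox`; only the three accessors visible in the diff above are taken from the source, and the extra checks on `other` are an assumption added so the method is safe to call on arbitrary pairs of boxes.

```java
/** Simplified, hypothetical stand-in for the project's OcrBox (illustration only). */
final class SketchOcrBox {
  private final String text;
  private final String dehyphenatedForm; // null when the word is not part of a hyphenation
  private final boolean hyphenStart;     // true for the half before the line break

  SketchOcrBox(String text, String dehyphenatedForm, boolean hyphenStart) {
    this.text = text;
    this.dehyphenatedForm = dehyphenatedForm;
    this.hyphenStart = hyphenStart;
  }

  String getText() {
    return text;
  }

  boolean isHyphenated() {
    return dehyphenatedForm != null;
  }

  boolean isHyphenStart() {
    return hyphenStart;
  }

  String getDehyphenatedForm() {
    return dehyphenatedForm;
  }

  /**
   * True if this box is the second half of the hyphenated word that {@code other} starts.
   * The conditions on this box mirror the inline check from the diff; the conditions on
   * {@code other} are an assumption of this sketch.
   */
  boolean isHyphenEndOf(SketchOcrBox other) {
    return other != null
        && other.isHyphenated()
        && other.isHyphenStart()
        && this.isHyphenated()
        && !this.isHyphenStart()
        && this.dehyphenatedForm.equals(other.getDehyphenatedForm());
  }
}
```

The call site would then read as `hyphenEnd.isHyphenEndOf(hyphenStart)`, which states the intent directly instead of spelling out three conditions.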


jbaiter requested review from Copilot and schmika February 11, 2026 09:13

Copilot started reviewing on behalf of jbaiter February 11, 2026 09:13

Copilot AI reviewed Feb 11, 2026


schmika reviewed Feb 11, 2026


Comment thread src/main/java/com/github/dbmdz/solrocr/lucene/OcrPassageFormatter.java Outdated

Comment thread src/main/java/com/github/dbmdz/solrocr/lucene/OcrPassageFormatter.java

jbaiter force-pushed the fix-hyphen-match-bug branch from 9f51164 to 91634b8 Compare February 11, 2026 12:21

schmika reviewed Feb 11, 2026


jbaiter force-pushed the fix-hyphen-match-bug branch from 91634b8 to 40f65e3 Compare February 11, 2026 16:23

schmika approved these changes Feb 11, 2026


jbaiter merged commit 338cec5 into main Feb 11, 2026
6 checks passed

jbaiter deleted the fix-hyphen-match-bug branch February 13, 2026 15:34

jbaiter merged 1 commit into main from fix-hyphen-match-bug
