fix: Handle zero-context matches in hyphenated words. #507

Merged: jbaiter merged 1 commit into main from fix-hyphen-match-bug on Feb 11, 2026

Conversation

@jbaiter (Member) commented Feb 11, 2026

Previously, when a match fell on the first half of a hyphenated word and the snippet context was zero, we returned only that first half and did not expand the snippet region to include the second half.

While fixing this, we discovered a second, related bug: if the second part of a hyphenated word was the last word in a fragment, it was skipped when the fragment was parsed.

Both issues are fixed in this PR.
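
To make the first fix concrete, here is a minimal sketch of the expansion rule described above. It is not the plugin's code: `PassageExpansionSketch`, `Word`, and `expandEnd` are hypothetical names, and words are modelled as a plain list rather than parsed OCR boxes. The idea is only that a passage ending on a highlighted hyphen start gets extended by one word when the next word completes it.

```java
import java.util.List;

final class PassageExpansionSketch {
  /** Hypothetical word model; a stand-in for illustration, not the plugin's OcrBox. */
  static final class Word {
    final String text;
    final String dehyphenated; // full word if this is one half of a hyphenated word, else null
    final boolean hyphenStart; // true for the half that ends with the hyphen
    final boolean highlighted; // true if the word is part of a match

    Word(String text, String dehyphenated, boolean hyphenStart, boolean highlighted) {
      this.text = text;
      this.dehyphenated = dehyphenated;
      this.hyphenStart = hyphenStart;
      this.highlighted = highlighted;
    }
  }

  /**
   * Returns the exclusive end index of the passage. The end moves one word
   * further if the passage stops on a highlighted hyphen start and the
   * following word is the matching second half.
   */
  static int expandEnd(List<Word> words, int endExclusive) {
    if (endExclusive <= 0 || endExclusive >= words.size()) {
      return endExclusive; // empty passage, or nothing left to expand into
    }
    Word last = words.get(endExclusive - 1);
    Word next = words.get(endExclusive);
    boolean lastIsHighlightedStart =
        last.highlighted && last.hyphenStart && last.dehyphenated != null;
    boolean nextCompletesIt =
        next.dehyphenated != null
            && !next.hyphenStart
            && next.dehyphenated.equals(last.dehyphenated);
    return (lastIsHighlightedStart && nextCompletesIt) ? endExclusive + 1 : endExclusive;
  }
}
```

With zero context, a passage would end directly on the matched word, which is why the missing second half only became visible in that configuration.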

Copilot AI (Contributor) left a comment

Pull request overview

Fixes OCR snippet/highlight generation for matches spanning hyphenated words when the configured snippet context is zero, and addresses a related parser edge case where the last word of a fragment could be skipped.

Changes:

  • Add logic in OcrPassageFormatter to expand a passage when the last token is a highlighted hyphenation start so the continuation is included.
  • Adjust OcrParser iteration to rely on readNext(...) returning null to signal end-of-stream, avoiding missed final tokens.
  • Add an hOCR fixture and a regression test covering a hyphenation match at a line break with hl.ocr.contextSize=0.
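
The second bullet describes a common iteration contract: the read method itself signals end-of-stream by returning null, so the loop cannot stop one token early. The following is a generic sketch of that pattern only. `NullTerminatedReaderSketch` and `readAll` are hypothetical names, tokens are plain strings, and the real `OcrParser.readNext(...)` takes arguments that are not modelled here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class NullTerminatedReaderSketch {
  private final Iterator<String> source;

  NullTerminatedReaderSketch(List<String> tokens) {
    this.source = tokens.iterator();
  }

  /** Returns the next token, or null once the input is exhausted. */
  String readNext() {
    return source.hasNext() ? source.next() : null;
  }

  /** Drains the reader, using the null return value as the only stop condition. */
  static List<String> readAll(NullTerminatedReaderSketch reader) {
    List<String> out = new ArrayList<>();
    String token;
    while ((token = reader.readNext()) != null) {
      out.add(token); // the final token is collected like any other
    }
    return out;
  }
}
```

Because the loop has a single exit condition, there is no separate "has more input" check that could turn false while the last token is still pending.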

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/test/resources/data/hocr_hyphen.html | Adds hOCR sample containing a line-break hyphenation used by a new regression test. |
| src/test/java/com/github/dbmdz/solrocr/solr/HocrTest.java | Adds a test asserting passage expansion and correct highlights for a hyphenated match with zero context. |
| src/main/java/com/github/dbmdz/solrocr/lucene/OcrPassageFormatter.java | Implements passage expansion for highlighted hyphen-start-at-end cases; refactors fragment building to work from parsed boxes. |
| src/main/java/com/github/dbmdz/solrocr/formats/OcrParser.java | Changes parsing loop behavior and documents readNext(...) contract more explicitly. |
| pom.xml | Bumps project version to 0.9.6-SNAPSHOT. |

@jbaiter force-pushed the fix-hyphen-match-bug branch from 9f51164 to 91634b8 on February 11, 2026 12:21

Comment on lines +275 to +277
if (hyphenEnd.isHyphenated()
&& !hyphenEnd.isHyphenStart()
&& hyphenEnd.getDehyphenatedForm().equals(hyphenStart.getDehyphenatedForm())) {

@schmika (Contributor) commented Feb 11, 2026

Could this check be a method of OcrBox? It might make it easier to understand at a glance what is being checked here.

Reply from the PR author (Member):

Good idea, I'll add OcrBox::isHyphenEndOf(OcrBox other) 👍🏾
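
A sketch of what such a helper could look like, mirroring the three conditions quoted in the review comment. `BoxSketch` is a hypothetical stand-in that models only the accessors visible in the quoted snippet; the fields, the constructor, and the null guard are assumptions, and the method that was actually merged may differ.

```java
/** Minimal stand-in for OcrBox; only the accessors used in the quoted check are modelled. */
final class BoxSketch {
  private final String text;
  private final boolean hyphenStart;     // true for the first half of a hyphenated word
  private final String dehyphenatedForm; // full word, or null if the box is not hyphenated

  BoxSketch(String text, boolean hyphenStart, String dehyphenatedForm) {
    this.text = text;
    this.hyphenStart = hyphenStart;
    this.dehyphenatedForm = dehyphenatedForm;
  }

  String getText() {
    return text;
  }

  boolean isHyphenated() {
    return dehyphenatedForm != null;
  }

  boolean isHyphenStart() {
    return hyphenStart;
  }

  String getDehyphenatedForm() {
    return dehyphenatedForm;
  }

  /**
   * True if this box is the second half of the hyphenated word that `other`
   * starts. Same three conditions as the inline check, plus a null guard
   * (the guard is an addition made for this sketch).
   */
  boolean isHyphenEndOf(BoxSketch other) {
    return other != null
        && this.isHyphenated()
        && !this.isHyphenStart()
        && this.getDehyphenatedForm().equals(other.getDehyphenatedForm());
  }
}
```

The call site would then read `hyphenEnd.isHyphenEndOf(hyphenStart)`, which states the intent directly instead of spelling out the individual conditions.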

@jbaiter force-pushed the fix-hyphen-match-bug branch from 91634b8 to 40f65e3 on February 11, 2026 16:23
@jbaiter merged commit 338cec5 into main on Feb 11, 2026
6 checks passed
@jbaiter deleted the fix-hyphen-match-bug branch on February 13, 2026 15:34