Create pain points before running associator to resolve #3892 by Balearica · Pull Request #3893 · tesseract-ocr/tesseract

Balearica · 2022-08-06T03:52:11Z

amitdo · 2022-10-09T21:08:28Z

The problem is that we don't know whether there are cases where this patch will cause worse results.

@stweil, @zdenop, maybe we can accept this addition if it is made optional, controlled by a config variable that is turned off by default?

Balearica · 2022-10-09T21:52:57Z

@amitdo Do you have a specific scenario in mind where this change would plausibly cause worse results? Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?

The behavior this PR addresses (fully documented in #3892) is clearly a bug, as there is no reason why the associator should randomly skip letters at the end of certain words. I don't think it makes sense to avoid changing this behavior (using the default settings) in the absence of specific concerns regarding this fix. As stated above, I am happy to run an accuracy benchmark if one already exists.
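A minimal sketch of how such an empirical comparison could be scored, assuming each benchmark page comes with ground-truth text plus OCR output from a baseline build and a patched build. The function names (`edit_distance`, `char_error_rate`, `compare_builds`) and the per-page tuple layout are hypothetical and not part of Tesseract or the UNLV tools; the metric is a plain character error rate based on Levenshtein distance.

```python
from typing import Dict, Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, truth: str) -> float:
    """Edit distance normalised by the length of the ground truth."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    return edit_distance(ocr_text, truth) / len(truth)


def compare_builds(pages: Iterable[Tuple[str, str, str]]) -> Dict[str, int]:
    """Count pages where the patched build is better, worse, or equal.

    Each page is (truth, baseline_output, patched_output); this layout is
    an assumption made for illustration.
    """
    counts = {"improved": 0, "regressed": 0, "unchanged": 0}
    for truth, baseline, patched in pages:
        before = char_error_rate(baseline, truth)
        after = char_error_rate(patched, truth)
        if after < before:
            counts["improved"] += 1
        elif after > before:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    return counts
```

Reporting the regressed pages separately, rather than only an average error rate, would speak directly to the concern that the patch might make some cases worse.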

amitdo · 2022-10-09T23:45:26Z

> Do you have a specific scenario in mind where this change would plausibly cause worse results?

No.

Don't assume that the few currently active developers deeply know all the algorithms used in Tesseract.

> Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically?

In the past the UNLV dataset and tools were used for testing Tesseract's accuracy.

See:

But the UNLV dataset contains only English and Spanish texts. Do you think your patch is fine for all the scripts that Tesseract supports?

Related issue: #3402.

amitdo · 2024-08-23T15:25:57Z

@stweil,

I think we should trust @Balearica and accept this PR.

stweil · 2024-08-23T15:56:57Z

Yes. @Balearica, do you want to rebase the commits in this pull request and fix the author name, which is currently set to "Your Name"?

Your Name added 2 commits August 5, 2022 20:49

Create pain points before running associator to resolve tesseract-ocr…

98aca4b

…#3892

Minor change

1146c3a

amitdo added the legacy label Aug 23, 2024

Balearica wants to merge 2 commits into tesseract-ocr:main from scribeocr:legacy_segsearch_fix
