Create pain points before running associator to resolve #3892 #3893
Balearica wants to merge 2 commits into tesseract-ocr:main
@amitdo Do you have a specific scenario in mind where this change would plausibly cause worse results? Alternatively, is there some corpus of benchmark documents where we could assess the impact of this change empirically? The behavior this PR addresses (fully documented in #3892) is clearly a bug, as there is no reason why the associator should randomly skip letters at the end of certain words. I don't think it makes sense to avoid changing this behavior (using the default settings) in the absence of specific concerns regarding this fix. As stated above, I am happy to run an accuracy benchmark if one already exists.
No. Don't assume that the few currently active developers deeply know all the algorithms used in Tesseract.
In the past, the UNLV dataset and tools were used for testing Tesseract's accuracy. See: But the UNLV dataset has only English and Spanish texts. Do you think your patch is fine for all the scripts that Tesseract supports? Related issue: #3402.
I think we should trust @Balearica and accept this PR.
Yes. @Balearica, do you want to rebase the commits in this pull request and fix the author name, which is currently set to "Your Name"?
See #3892