
Switch from pdfminer to paves to improve robustness and use multiple CPUs #4067

Open
dhdaines wants to merge 12 commits into Unstructured-IO:main from dhdaines:switch_from_pdfminer_to_paves


@dhdaines
Contributor

PLAYA-PDF is a fork of pdfminer.six with a focus on robustness and efficiency. (full disclosure: it's my fork of pdfminer.six)

Unfortunately, PLAYA ain't a LAYout Analyzer, so it cannot replace pdfminer.six directly.

But never fear: on top of PLAYA there is PAVÉS, which, among other things, implements the pdfminer.six layout analysis algorithms closely enough to be a mostly (but not entirely) drop-in replacement. It is not actually faster than pdfminer.six, for various reasons, but it does let you distribute PDF parsing across multiple CPUs, which may help.

Because I am a bit tired of having to pin versions of pdfminer.six due to bugs and parsing failures, here is a PR that does exactly that: PAVÉS dropped in to replace pdfminer.six. I also removed the pikepdf "repairing" code, since PAVÉS is in general much more robust, but perhaps you would like to put it back!
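
For readers unfamiliar with the multi-CPU idea mentioned above, here is a rough, standard-library-only sketch of splitting a document's pages across worker processes. It does not use the PLAYA or PAVÉS API and is not how this PR implements it; `split_pages`, `parse_in_parallel`, and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in for real per-page parsing.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, List, Tuple

Job = Tuple[str, range]  # (path to the PDF, page indices to handle)


def split_pages(n_pages: int, n_workers: int) -> List[range]:
    """Split page indices 0..n_pages-1 into contiguous, near-equal ranges."""
    if n_pages <= 0:
        return []
    n_workers = max(1, min(n_workers, n_pages))
    size, extra = divmod(n_pages, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        # The first `extra` chunks take one additional page each.
        stop = start + size + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def fake_parse(job: Job) -> List[str]:
    """Hypothetical worker: a real one would open `path` and parse `pages`."""
    path, pages = job
    return [f"{path}#page={i}" for i in pages]


def parse_in_parallel(
    path: str,
    n_pages: int,
    worker: Callable[[Job], List[str]],
    n_workers: int = 2,
    executor_cls=ProcessPoolExecutor,
) -> List[str]:
    """Run `worker` on each page range and return results in page order.

    ProcessPoolExecutor suits CPU-bound parsing; ThreadPoolExecutor can be
    passed instead for quick experiments.
    """
    jobs = [(path, chunk) for chunk in split_pages(n_pages, n_workers)]
    if not jobs:
        return []
    results: List[str] = []
    with executor_cls(max_workers=len(jobs)) as pool:
        # Executor.map yields results in the order the jobs were submitted.
        for part in pool.map(worker, jobs):
            results.extend(part)
    return results
```

Each worker receives only a path and a page range, so nothing large has to be pickled on the way in; the per-page results still have to be sent back to the parent process, which is one reason parallel parsing does not always yield a proportional speed-up.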
