Description
I'm trying to convert a PDF to Markdown, and my PDF file contains several tables. The issue I'm facing is that some of the tables are extracted properly, whereas others are treated as pictures.
I have attached the PDF file I used:
Different Table formats.pdf
Some tables are extracted properly, but others seem to be treated as pictures. For those, the table content is enclosed in the placeholders below:
----- Start of picture text -----
My table content here
----- End of picture text -----
Table text treated as a picture
Screenshot of the first page of the PDF, attached for convenience
Correct parsing of table text
Screenshot of page 2 of the PDF

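Code used
import pymupdf.layout  # imported before pymupdf4llm to activate PyMuPDF Layout
import pymupdf4llm

# Path to the attached PDF (assumed to be in the working directory)
input_path = "Different Table formats.pdf"
md_text = pymupdf4llm.to_markdown(input_path, footer=False)
print(md_text)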
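To make it easier to see which tables are affected, here is a minimal sketch that pulls out every block enclosed in the picture-text placeholders. The marker strings are copied from the output shown above. The helper name `find_picture_text_blocks` and the `sample` string are hypothetical and only for illustration, and the sketch assumes the markers appear in the Markdown exactly as shown.

```python
import re

# Marker strings copied from the output shown in this issue (assumed exact).
START = "----- Start of picture text -----"
END = "----- End of picture text -----"


def find_picture_text_blocks(md_text: str) -> list[str]:
    """Return the text found between each start/end picture-text marker pair."""
    pattern = re.compile(
        re.escape(START) + r"(.*?)" + re.escape(END),
        re.DOTALL,  # let "." span the line breaks inside a block
    )
    return [block.strip() for block in pattern.findall(md_text)]


# Hypothetical sample: one real Markdown table, then one picture-text block.
sample = (
    "| a | b |\n|---|---|\n| 1 | 2 |\n\n"
    "----- Start of picture text -----\n"
    "My table content here\n"
    "----- End of picture text -----\n"
)

blocks = find_picture_text_blocks(sample)
print(len(blocks), "picture-text block(s) found")
for block in blocks:
    print(block)
```

Running the same helper on the `md_text` returned by `to_markdown` should list the tables that ended up as picture text.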
Version
pymupdf-layout : 1.26.6
pymupdf4llm : 0.2.9
tesseract : 5.5.1
opencv-python : 4.13.0.90

