
Table is considered as picture #359

@GayuTamil

I'm trying to convert a PDF to Markdown, and my PDF file contains several tables. The issue I'm facing is that some of the tables are extracted properly, whereas others are treated as pictures.

Attached is the PDF file I used:
Different Table formats.pdf

Some tables are extracted properly, but others seem to be treated as pictures. In those cases the table content is enclosed in the placeholders below:
----- Start of picture text -----

My table content here
----- End of picture text -----
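To see which parts of the output were wrapped this way, the Markdown string can be scanned for the two marker lines. This is a minimal sketch that assumes the markers appear exactly as quoted above; `find_picture_text_blocks` is a hypothetical helper written for illustration, not part of pymupdf4llm.

```python
import re

# Marker lines copied from the output quoted in this report.
START = "----- Start of picture text -----"
END = "----- End of picture text -----"

# Non-greedy match so that each start marker pairs with the nearest end marker.
_PICTURE_TEXT = re.compile(
    re.escape(START) + r"(.*?)" + re.escape(END),
    flags=re.DOTALL,
)


def find_picture_text_blocks(md_text: str) -> list[str]:
    """Return the text found between each start/end marker pair."""
    return [m.group(1).strip() for m in _PICTURE_TEXT.finditer(md_text)]


# Small made-up sample: one picture-text block followed by a normal table.
sample = (
    "Some paragraph\n\n"
    f"{START}\n\nMy table content here\n{END}\n\n"
    "|a|b|\n|---|---|\n|1|2|\n"
)

for i, block in enumerate(find_picture_text_blocks(sample), 1):
    print(f"picture-text block {i}: {block!r}")
```

Running the helper on the real `md_text` lists every region that was emitted as picture text, which makes it easier to match them against the tables in the PDF.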

Table text treated as a picture
Screenshot of the first page of the PDF, attached for convenience:

Image

Parsed text I got:
Image

Table text parsed correctly
Screenshot of page 2 of the PDF:
Image

Correctly parsed table:
Image

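Code used:
import pymupdf.layout  # imported before pymupdf4llm so the layout module is active
import pymupdf4llm

# Path to the attached PDF; adjust to wherever the file is stored.
input_path = "Different Table formats.pdf"

# Convert the whole document to Markdown, omitting page footers.
md_text = pymupdf4llm.to_markdown(input_path, footer=False)
print(md_text)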

Version
pymupdf-layout : 1.26.6
pymupdf4llm : 0.2.9
tesseract : 5.5.1
opencv-python : 4.13.0.90
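The Python package versions above can be collected programmatically when reproducing the report. The following is a minimal sketch using the standard library; the distribution names are taken from the list above, `installed_version` is a hypothetical helper, and tesseract is left out because it is a separate binary rather than a Python distribution.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as listed in this report (tesseract is not a Python package).
PACKAGES = ["pymupdf-layout", "pymupdf4llm", "opencv-python"]


def installed_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or a note if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


for name in PACKAGES:
    print(f"{name} : {installed_version(name)}")
```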


Labels: upstream (Caused outside this package)
