Description
omml_multi_equation_paragraph.docx
Bug
When a single <w:p> paragraph contains multiple <m:oMath> sibling elements, docling
concatenates all of them into one $$ display block instead of emitting each as a separate
equation. The equations are also output in an unexpected order.
Given a paragraph with three sibling display equations (a = b, c = d, e = f), the
expected output is three separate display blocks:
$$a = b$$
$$c = d$$
$$e = f$$
Actual output is one block with all content merged (and order scrambled):
$$c=de=fa=b$$
This likely originates in the paragraph-level equation handling in
docling/backend/docx/ms_word_backend.py, where sibling <m:oMath> nodes within one <w:p>
are not iterated and split into individual equation items.
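A minimal sketch of the per-equation split described above, not the actual docling code: it walks the direct children of one <w:p> in document order and emits one display block per <m:oMath>. The function name `split_paragraph_equations` is hypothetical, and joining the <m:t> text runs is only a stand-in for real OMML-to-LaTeX conversion.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def split_paragraph_equations(paragraph):
    """Return one string per direct <m:oMath> child, in document order.

    Hypothetical helper for illustration; not part of docling.
    """
    equations = []
    # Iterating an Element yields its direct children in document order.
    for child in paragraph:
        if child.tag == f"{{{M_NS}}}oMath":
            # Stand-in for OMML-to-LaTeX conversion: join the <m:t> text runs.
            text = "".join(t.text or "" for t in child.iter(f"{{{M_NS}}}t"))
            equations.append(text)
    return equations


# Simplified paragraph mirroring the structure in this report.
SAMPLE = f"""<w:p xmlns:w="{W_NS}" xmlns:m="{M_NS}">
  <m:oMath><m:r><m:t>a=b</m:t></m:r></m:oMath>
  <m:oMath><m:r><m:t>c=d</m:t></m:r></m:oMath>
  <m:oMath><m:r><m:t>e=f</m:t></m:r></m:oMath>
</w:p>"""

paragraph = ET.fromstring(SAMPLE)
blocks = [f"$${eq}$$" for eq in split_paragraph_equations(paragraph)]
print("\n".join(blocks))
```

Treating each sibling as its own item keeps both the separation and the source order, which are the two properties the current output loses.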
Steps to reproduce
- Download the attached DOCX file.
- Run:
  docling --from docx --to md --output . omml_multi_equation_paragraph.docx
- Inspect omml_multi_equation_paragraph.md. The three equations (a = b, c = d, e = f)
  appear concatenated in a single $$ block rather than as three separate blocks.
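The inspection step can be scripted. The sketch below counts `$$ ... $$` display blocks in a Markdown string; the helper name `display_blocks` is hypothetical, and the two sample strings are the expected and actual outputs quoted in this report.

```python
import re

# Non-greedy match of everything between a pair of $$ delimiters.
DISPLAY_BLOCK = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)


def display_blocks(markdown):
    """Return the contents of every $$...$$ block, in order of appearance."""
    return [body.strip() for body in DISPLAY_BLOCK.findall(markdown)]


expected_md = "$$a = b$$\n$$c = d$$\n$$e = f$$"
actual_md = "$$c=de=fa=b$$"

print(len(display_blocks(expected_md)))  # 3
print(len(display_blocks(actual_md)))    # 1
```

Running `display_blocks` on the contents of omml_multi_equation_paragraph.md should return three entries once the bug is fixed.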
The DOCX contains a single paragraph with three sibling <m:oMath> elements:
<w:p>
<m:oMath> <!-- a = b --> </m:oMath>
<m:oMath> <!-- c = d --> </m:oMath>
<m:oMath> <!-- e = f --> </m:oMath>
</w:p>
This structure is produced naturally by Microsoft Word when a user places multiple display
equations in the same paragraph (e.g., by pressing Enter within an equation block and
continuing to type).
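To confirm that a given DOCX has this structure, the main document part can be read directly with the standard library. This is a rough sketch: the function name `count_omath_per_paragraph` is hypothetical, and it only counts <m:oMath> elements that are direct children of a <w:p>, as in the snippet above.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def count_omath_per_paragraph(docx_file):
    """Return, for each <w:p>, how many direct <m:oMath> children it has.

    `docx_file` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        # The body of a DOCX lives in word/document.xml inside the zip archive.
        root = ET.fromstring(archive.read("word/document.xml"))
    counts = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A path without a leading ".//" matches direct children only.
        counts.append(len(paragraph.findall(f"{{{M_NS}}}oMath")))
    return counts
```

Calling `count_omath_per_paragraph("omml_multi_equation_paragraph.docx")` on the attachment should include a paragraph with a count of 3 if the file matches the structure described here.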
Docling version
Docling version: 2.79.0
Docling Core version: 2.69.0
Docling IBM Models version: 3.12.0
Docling Parse version: 5.5.0
Python: cpython-312 (3.12.12)
Platform: Linux-5.14.0-611.16.1.el9_7.x86_64-x86_64-with-glibc2.34
Python version
Python 3.12.12
Attachments
omml_multi_equation_paragraph.docx: minimal three-equation paragraph that triggers the bug