The goal of this project is to provide a simple, local tool to convert engineering reports between Microsoft Word (.docx) and LaTeX (.tex) formats. It avoids complex cloud dependencies and focuses on the core structure of academic documents.
- DOCX Parsing: Uses
python-docxto iterate through the XML-based structure of Word files. It specifically looks forCT_P(Paragraphs) andCT_Tbl(Tables) in the document body. - LaTeX Parsing: Uses regular expressions and string analysis to identify common LaTeX environments like
\section,\textbf, andtabular.
- Styles: Maps Word heading styles (Heading 1, 2, 3) to LaTeX section levels (
\section,\subsection, etc.). - Formatting: Handles inline formatting (bold, italic, underline) by wrapping text in corresponding LaTeX commands.
- Tables: Converts Word tables into LaTeX
tabularenvironments inside atablefloat, automatically calculating column alignments. - Images: Extracts embedded images from DOCX files, saves them to a local directory, and includes them using the
graphicxpackage.
The DocTeXConverter class acts as the central hub. It:
- Detects the conversion direction based on file extensions.
- Manages
ConversionOptions(margins, font size, etc.). - Handles temporary files and directories.
- Input Stage: The user provides a file via CLI or the Web UI.
- Setup: The converter initializes and validates options.
- Execution:
- For DOCX to LaTeX: The
LatexGeneratorwalks the document, extracts text/runs, escapes special characters, and builds a .tex file. - For LaTeX to DOCX: The
DocxGeneratorreads the .tex file and builds a Word document using high-levelpython-docxcalls.
- For DOCX to LaTeX: The
- Post-Processing: (Optional) Bibliography extraction and image optimization.
- Output: The final file is saved to the specified path.
- Loose Coupling: The conversion logic is separated from the interface (CLI/Web), making it easy to embed in other tools.
- Error Handling: Custom exception classes (
ConversionError,InvalidFileFormatError) provide meaningful feedback to the user. - Casual Codebase: Comments are written in a casual, student-friendly style to make the codebase approachable for peers.