pyxmlcheck is a developer-centric CLI tool that validates the well-formedness of XML and HTML files. Unlike standard parsers that halt at the first error, pyxmlcheck uses an incremental re-parsing strategy to scan the entire document and report all structural issues (e.g., mismatched or unclosed tags) in a single run, with zero external dependencies.
A key feature of this tool is its DOCTYPE handling. Many HTML files contain entities (like &nbsp; and &copy;) that standard XML parsers do not recognize, leading to false-positive errors. pyxmlcheck solves this by injecting a custom XHTML DOCTYPE with these entity definitions into the in-memory string before parsing, without modifying the original file or shifting the reported error line numbers.
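The DOCTYPE handling described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the names `CUSTOM_DOCTYPE`, `inject_doctype`, and `first_error` are hypothetical, and the entity table is cut down to two entries, whereas the real DOCTYPE presumably defines many more.

```python
import re
import xml.parsers.expat

# Hypothetical, abbreviated single-line DOCTYPE with an internal subset
# that declares two common HTML entities.
CUSTOM_DOCTYPE = (
    '<!DOCTYPE html ['
    '<!ENTITY nbsp "&#160;">'
    '<!ENTITY copy "&#169;">'
    ']>'
)

DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]*>')


def inject_doctype(text):
    """Return text with its DOCTYPE replaced by CUSTOM_DOCTYPE, line count unchanged."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        # No DOCTYPE: prepend on the first line without adding a newline.
        return CUSTOM_DOCTYPE + text
    # Re-add as many newlines as the original declaration spanned, so every
    # later line keeps its original line number.
    padding = '\n' * match.group(0).count('\n')
    return text[:match.start()] + CUSTOM_DOCTYPE + padding + text[match.end():]


def first_error(text):
    """Parse text with expat; return (line, message) for the first error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as exc:
        return exc.lineno, xml.parsers.expat.ErrorString(exc.code)
    return None
```

Under these assumptions, a document such as `<p>&nbsp;</p>` fails in `first_error` when parsed as-is because the entity is undefined, but parses cleanly once it has been passed through `inject_doctype`. Genuine structural errors are still reported on their original lines.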
- Python 3.6+
- No external dependencies (uses only Python standard library modules)
Clone the repository:
# Clone the repository
git clone https://github.com/rasyaakbar-dev/pyxmlchecker.git
cd pyxmlchecker

No additional installation steps are needed; pyxmlcheck uses only Python standard library modules.
Run the script directly via Python, passing the target file as an argument:
python xml_checker.py <filename>

$ python3 xml_checker.py index.html
Testing for well-formedness: index.html ...
Error: Found 2 error(s) in index.html.
1. Line 45, Column 8: Opening and ending tag mismatch: p line 42 and div
2. Line 89, Column 1: Premature end of data in tag html line 1

From a technical standpoint, pyxmlcheck performs the following workflow:
- File Ingestion: Reads the target file and decodes it as UTF-8 into an in-memory string.
- Intelligent DOCTYPE Substitution:
  - Uses a regular expression (r'<!DOCTYPE[^>]*>') to locate any existing DOCTYPE declaration.
  - If found, it is replaced with a comprehensive XHTML DOCTYPE containing common HTML entity definitions.
  - Line Number Preservation: The script counts the newlines in the original DOCTYPE and appends them to the injected single-line DOCTYPE, so the line numbers reported by the parser match the original file.
  - If no DOCTYPE is found, it prepends the custom DOCTYPE on the very first line without a trailing newline.
- Incremental Parsing and Error Recovery: Feeds the modified string to Python's built-in xml.parsers.expat parser. On each error, it records the error details (line, column, message), then creates a fresh parser and feeds the remaining content (from after the error line) to discover additional errors.
- Reporting: Iterates through the collected errors, deduplicates them, and outputs color-coded, exact line-and-column error messages to the console.
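The error-recovery step in the workflow above can be approximated with the sketch below. It is an illustration under stated assumptions rather than the tool's real code: `collect_errors` and its `max_errors` cap are hypothetical names, and the restart is deliberately naive. In particular, a fresh parser has no memory of tags opened before the restart point, so some follow-on errors may be artifacts of the missing context; how pyxmlcheck filters or deduplicates those is not shown here.

```python
import xml.parsers.expat


def collect_errors(text, max_errors=50):
    """Return (line, offset, message) tuples found by restarting expat after each error.

    Lines are 1-based and refer to the whole document; offset is expat's
    character offset into that line (the first column is numbered 0).
    """
    lines = text.splitlines(keepends=True)
    errors = []
    start = 0  # index of the first line given to the current parser
    while start < len(lines) and len(errors) < max_errors:
        parser = xml.parsers.expat.ParserCreate()  # fresh parser per attempt
        try:
            parser.Parse(''.join(lines[start:]), True)
            break  # the remaining content parsed without error
        except xml.parsers.expat.ExpatError as exc:
            # exc.lineno is relative to the chunk; map it to document numbering.
            line = start + exc.lineno
            message = xml.parsers.expat.ErrorString(exc.code)
            errors.append((line, exc.offset, message))
            start = line  # resume with the line after the one that failed
    return errors
```

Because each restart begins strictly after the previous error line, the loop always makes progress and terminates, and the reported line numbers come out in increasing order.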
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
If you discover a security vulnerability, please follow the instructions in SECURITY.md.
This project adheres to the Contributor Covenant Code of Conduct.
Distributed under the MIT License. See LICENSE for more information.