pyxmlcheck

Description

pyxmlcheck is a developer-centric CLI tool that validates the well-formedness of XML and HTML files. Unlike standard parsers, which halt at the first error, pyxmlcheck uses an incremental re-parsing strategy to scan the entire document and report every structural issue it finds (e.g., mismatched or unclosed tags) in a single run, with zero external dependencies.

A key feature of this tool is its DOCTYPE handling. Many HTML files contain entities (like &nbsp; and &copy;) that standard XML parsers do not recognize, leading to false-positive errors. pyxmlcheck avoids this by injecting a custom XHTML DOCTYPE with these entity definitions into the in-memory string before parsing, without modifying the original file or shifting the reported error line numbers.
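The following is a minimal sketch of that idea, not the tool's actual code. The names inject_doctype, first_error, and CUSTOM_DOCTYPE are hypothetical, and the DOCTYPE is abbreviated to two entities for illustration. It shows expat rejecting an undefined entity, then accepting the same document once the DOCTYPE is swapped in with newline padding so that line numbers stay aligned.

```python
import re
import xml.parsers.expat

# Hypothetical, abbreviated DOCTYPE; a real one would define many more entities.
CUSTOM_DOCTYPE = '<!DOCTYPE html [<!ENTITY nbsp "&#160;"><!ENTITY copy "&#169;">]>'
DOCTYPE_RE = re.compile(r'<!DOCTYPE[^>]*>')


def inject_doctype(text):
    """Return text with the entity-defining DOCTYPE, keeping the line count."""
    match = DOCTYPE_RE.search(text)
    if match is None:
        # No DOCTYPE: prepend on the first line, without adding a newline.
        return CUSTOM_DOCTYPE + text
    # Pad with as many newlines as the original DOCTYPE contained.
    padding = "\n" * match.group(0).count("\n")
    return text[:match.start()] + CUSTOM_DOCTYPE + padding + text[match.end():]


def first_error(text):
    """Return expat's message for the first error, or None if well-formed."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)
    return None


good = "<!DOCTYPE\n  html\n>\n<p>&copy;&nbsp;2024</p>\n"
bad = "<!DOCTYPE\n  html\n>\n<p>&copy;</div>\n"

print(first_error(good))                  # an "undefined entity" error
print(first_error(inject_doctype(good)))  # None: the entities are now defined
print(first_error(inject_doctype(bad)))   # the tag mismatch, still on line 4
```

The input string is never written back to disk, so the original file is untouched. Error messages here use expat's own wording.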

Requirements

  • Python 3.6+
  • No external dependencies (uses only Python standard library modules)

Installation

Clone the repository:

# Clone the repository
git clone https://github.com/rasyaakbar-dev/pyxmlchecker.git
cd pyxmlchecker

No additional installation steps are needed — pyxmlcheck uses only Python standard library modules.

Usage

Run the script directly via Python, passing the target file as an argument:

python xml_checker.py <filename>

Example

$ python3 xml_checker.py index.html

Testing for well-formedness: index.html ...

Error: Found 2 error(s) in index.html.
  1. Line 45, Column 8: Opening and ending tag mismatch: p line 42 and div
  2. Line 89, Column 1: Premature end of data in tag html line 1

How It Works

From a technical standpoint, pyxmlcheck performs the following workflow:

  1. File Ingestion: Reads the target file and decodes it as UTF-8 into an in-memory string.
  2. Intelligent DOCTYPE Substitution:
    • Uses a regular expression (r'<!DOCTYPE[^>]*>') to locate any existing DOCTYPE declaration.
    • If found, it is replaced with a comprehensive XHTML DOCTYPE containing common HTML entity definitions.
    • Line Number Preservation: Crucially, the script counts the newlines in the original DOCTYPE and appends them to the injected single-line DOCTYPE. This guarantees that the line numbers reported by the parser perfectly match the original file.
    • If no DOCTYPE is found, it prepends the custom DOCTYPE on the very first line without a trailing newline.
  3. Incremental Parsing and Error Recovery: Feeds the modified string to Python's built-in xml.parsers.expat parser. On each error, it records the error details (line, column, message), then creates a fresh parser and feeds the remaining content (from after the error line) to discover additional errors.
  4. Reporting: Iterates through the collected errors, deduplicates them, and outputs color-coded, exact line-and-column error messages to the console.
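The restart loop in step 3 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the project's implementation: collect_errors and max_errors are hypothetical names, and the sketch omits DOCTYPE injection, deduplication, and colored output.

```python
import xml.parsers.expat


def collect_errors(text, max_errors=50):
    """Collect (line, column_offset, message) tuples by restarting after each error."""
    lines = text.splitlines(keepends=True)
    skipped = 0  # number of original lines already consumed
    found = []
    while skipped < len(lines) and len(found) < max_errors:
        remainder = "".join(lines[skipped:])
        if not remainder.strip():
            break  # only whitespace left, nothing more to check
        parser = xml.parsers.expat.ParserCreate()  # fresh parser per attempt
        try:
            parser.Parse(remainder, True)
            break  # the remainder parsed cleanly
        except xml.parsers.expat.ExpatError as exc:
            # exc.lineno is 1-based within the remainder; exc.offset is
            # expat's 0-based column offset on that line.
            message = xml.parsers.expat.ErrorString(exc.code)
            found.append((skipped + exc.lineno, exc.offset, message))
            skipped += exc.lineno  # resume on the line after the error
    return found


sample = "<root>\n<p>one</div>\n<b>two</i>\n</root>\n"
for line, column, message in collect_errors(sample):
    print(f"Line {line}, Column {column}: {message}")
```

Adding the count of skipped lines to each relative line number maps errors back to positions in the original file. Because every restart parses a fragment rather than a whole document, a naive loop like this can also report follow-on errors, for example at a closing tag whose opening tag was skipped. How pyxmlcheck handles such cases is not described here.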

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Security

If you discover a security vulnerability, please follow the instructions in SECURITY.md.

Code of Conduct

This project adheres to the Contributor Covenant Code of Conduct.

License

Distributed under the MIT License. See LICENSE for more information.
