Releases: agude/SWITRS-to-SQLite

v4.5.0: Header-based column resolution for robust CSV parsing

20 Jan 02:48

Release Notes:

This release refactors the CSV parsing system to use dynamic header-based column resolution instead of
hardcoded indices. This makes the parser resilient to column reordering in future SWITRS data releases and
includes several performance optimizations for processing large files.

There are no breaking changes to the CLI or database schema in this release.

What's New

  • Header-Based Column Resolution: The parser now reads CSV headers at runtime to determine column
    positions, rather than relying on hardcoded indices. This ensures compatibility if CHP reorders columns in
    future SWITRS data exports.

  • Duplicate Header Detection: Added validation that fails fast with a clear error message if a CSV file
    contains duplicate column headers, preventing subtle data ingestion bugs.

  • Performance Optimizations: Column indices are now pre-calculated during file initialization, eliminating
    per-row dictionary lookups and iteration. For multi-million-row SWITRS files, this reduces overhead in the
    hot path of __set_values and the date conversion methods.

  • Automatic BOM Handling: Switched file reading to use utf-8-sig encoding, which automatically strips
    the byte-order mark if present. This is more idiomatic than removing the BOM through manual string
    manipulation.

  • Empty File Handling: Added graceful handling for empty input files (or files containing only a BOM),
    which previously caused an unhandled StopIteration exception.

  • Code Clarity: Renamed RowClass to row_parser in main.py to accurately reflect that these are
    CSVParser instances, not class types.
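
A minimal sketch of how header-based resolution, duplicate detection, BOM stripping, and empty-file handling can fit together. This is not the project's actual code: the function names resolve_columns, read_rows, and open_text are hypothetical, and the column names in the usage note are only illustrative.

```python
import csv
import io


def resolve_columns(header: list[str], wanted: list[str]) -> dict[str, int]:
    """Map each wanted column name to its position in the header row."""
    seen: set[str] = set()
    for name in header:
        if name in seen:
            # Fail fast: a duplicated header makes name-based lookup ambiguous.
            raise ValueError(f"Duplicate column header: {name!r}")
        seen.add(name)
    positions = {name: i for i, name in enumerate(header)}
    # Raises KeyError if a wanted column is missing from the file.
    return {name: positions[name] for name in wanted}


def read_rows(stream, wanted: list[str]) -> list[tuple[str, ...]]:
    """Read the wanted columns from a CSV stream, in the order requested."""
    reader = csv.reader(stream)
    header = next(reader, None)  # None instead of StopIteration on empty input
    if header is None:
        return []
    # Resolve indices once, so the per-row loop does no name lookups.
    indices = list(resolve_columns(header, wanted).values())
    return [tuple(row[i] for i in indices) for row in reader]


def open_text(data: bytes) -> io.TextIOWrapper:
    """Decode bytes as utf-8-sig, which drops a leading BOM if present."""
    return io.TextIOWrapper(io.BytesIO(data), encoding="utf-8-sig", newline="")
```

With this sketch, a file whose header reads COLLISION_DATE,CASE_ID yields the same tuples as one whose header reads CASE_ID,COLLISION_DATE, as long as the caller asks for the columns by name. An input consisting only of a BOM returns an empty list.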

Release 4.4.0: `src` Layout Migration & Integration Tests

19 Jan 00:27

Release Notes:

This release modernizes the repository structure by migrating to the industry-standard src layout, ensuring better isolation between the development environment and installed package. It also introduces a comprehensive end-to-end integration testing suite using golden snapshots to guarantee database consistency across versions.

There are no breaking changes to the CLI or database schema in this release.

What's New

  • src Layout Migration: Moved the package source code into a src/ subdirectory. This prevents common "double import" errors where tests run against the local directory instead of the installed package, and aligns the project with modern Python packaging standards (supported natively by uv and hatch).

  • Golden Snapshot Integration Tests: Added a new end-to-end integration test suite (tests/test_integration.py) that converts raw CSV data to SQLite and verifies the result against a golden_snapshot.json. This ensures that changes to parsers or converters do not silently alter the resulting database structure or content.

  • Test Data Extraction Tools: Added scripts/extract_test_rows.py, a utility that analyzes massive SWITRS datasets to greedily select a small subset of rows that maximize coverage of all internal value mappings (enums). This allows for high-coverage testing with minimal file size.

  • Type Hinting Improvements: Updated strict type checking configurations in mypy and added explicit return type annotations (-> None) to the test suite.
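
The greedy row selection described for scripts/extract_test_rows.py can be illustrated roughly as follows. This is a sketch under assumptions, not the script itself: the function name greedy_cover and the dict-per-row input format are hypothetical, and the real utility may choose rows differently.

```python
def greedy_cover(rows: list[dict[str, str]], columns: list[str]) -> list[int]:
    """Return indices of rows that together cover every (column, value) pair."""
    # Every distinct value seen in each column must appear in the chosen subset.
    uncovered = {(c, row[c]) for row in rows for c in columns}
    chosen: list[int] = []
    while uncovered:
        best_index, best_gain = -1, 0
        for i, row in enumerate(rows):
            # Gain is the number of still-uncovered pairs this row would add.
            gain = sum((c, row[c]) in uncovered for c in columns)
            if gain > best_gain:
                best_index, best_gain = i, gain
        chosen.append(best_index)
        uncovered -= {(c, rows[best_index][c]) for c in columns}
    return chosen
```

Each pass picks the row that covers the most values not yet seen, so the loop stops once every value of every column appears in at least one chosen row. This is the standard greedy heuristic for set cover. It does not guarantee the smallest possible subset, but it tends to give a small one.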

v4.3.0: Schema Dataclass Refactor

18 Jan 04:01

Release Notes:

This release refactors the internal schema definition system from a tuple-based DSL to typed dataclasses, improving code readability, type safety, and maintainability. Converter functions now use explicit signatures instead of **kwargs.

There are no breaking changes to the CLI or database schema in this release.

What's New

  • Column Dataclass Schema: Replaced the tuple-based "mystery meat" DSL in row_types.py with a frozen Column dataclass in the new schema.py module. Field definitions are now self-documenting with named attributes (index, name, sql_type, nulls, converter, mapping) instead of positional tuple elements.

  • Explicit Converter Signatures: Refactored all converter functions from **kwargs to explicit (val, dtype, nulls) parameters. This enables IDE autocompletion, catches typos at call sites, and makes function contracts clear.

  • Default Identity Converter: Added an identity() converter function as the default for Column. The parser now unconditionally calls col.converter() without None checks, simplifying the parsing loop.

  • Set-Based Null Checks: Changed DEFAULT_NULLS from a list to a set for O(1) membership lookups. Custom null collections now use set union (|) instead of list concatenation.

  • Modern Type Annotations: Updated type hints to use collections.abc imports (Callable, Collection, Mapping) instead of typing module equivalents. The nulls parameter now accepts any Collection[str], allowing sets, tuples, or lists.

  • Improved Type Safety: The Column.mapping field uses Mapping[str, str | None] (covariant) instead of dict, allowing the existing value maps to pass type checking without modification.
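
To make the shape of this refactor concrete, here is a rough sketch of a frozen Column dataclass with a default identity converter and set-based nulls. The field names follow the list above, but everything else is an assumption: the contents of DEFAULT_NULLS, the to_int converter, the parse_value helper, passing sql_type as the dtype argument, and the example mapping are all hypothetical. DEFAULT_NULLS is a frozenset here because dataclasses reject a plain set as a field default.

```python
from __future__ import annotations

from collections.abc import Callable, Collection, Mapping
from dataclasses import dataclass

# Illustrative null markers only; the project's real values may differ.
DEFAULT_NULLS: frozenset[str] = frozenset({"", "-"})


def identity(val: str, dtype: str, nulls: Collection[str]) -> str:
    """Default converter: return the raw value unchanged."""
    return val


def to_int(val: str, dtype: str, nulls: Collection[str]) -> int | None:
    """Hypothetical converter with the explicit (val, dtype, nulls) signature."""
    return None if val in nulls else int(val)


@dataclass(frozen=True)
class Column:
    index: int
    name: str
    sql_type: str
    nulls: Collection[str] = DEFAULT_NULLS
    converter: Callable[[str, str, Collection[str]], object] = identity
    mapping: Mapping[str, str | None] | None = None


def parse_value(col: Column, raw: str) -> object:
    """Convert one raw cell; no None check is needed on col.converter."""
    value = col.converter(raw, col.sql_type, col.nulls)
    if col.mapping is not None and isinstance(value, str):
        # Assumption: codes missing from the mapping become None.
        value = col.mapping.get(value)
    return value


case_id = Column(0, "case_id", "INTEGER", converter=to_int)
# Custom nulls extend the defaults with set union instead of list concatenation.
party_age = Column(1, "party_age", "INTEGER", nulls=DEFAULT_NULLS | {"998"}, converter=to_int)
weather = Column(2, "weather", "TEXT", mapping={"A": "clear", "B": "cloudy"})
```

Because every Column carries a converter, the parsing loop can call parse_value on each cell without branching on whether a converter was supplied.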

v4.2.0: Modernize Build System and Development Tooling

18 Jan 00:37

Release Notes:

This release modernizes the project's build system and development tooling. It migrates from legacy setup.py to modern Python packaging standards, adds automated linting and formatting, and updates CI/CD workflows.

There are no breaking changes to the CLI or database schema in this release.

What's New

  • Modern Python Packaging: Migrated from setup.py to pyproject.toml using hatchling as the build backend. This removes the pypandoc dependency and enables native markdown README support on PyPI.

  • UV Package Manager: Adopted UV for fast, reproducible dependency management. A uv.lock file is now included for consistent environments across development and CI.

  • Ruff Linting and Formatting: Added Ruff for linting and code formatting. The codebase has been reformatted for consistency, imports are sorted, and string formatting modernized to f-strings.

  • Updated CI/CD Workflows: GitHub Actions workflows now use UV for installation and include Ruff checks. The release workflow uses reusable workflows to eliminate duplication. All actions updated to current versions (checkout v6, setup-uv v7, pypi-publish v1.13).

  • Dependabot Configuration: Added Dependabot for automated monthly updates to GitHub Actions and UV dependencies.

  • Justfile Task Runner: Added a Justfile for common development commands. Run just to see available tasks including test, lint, format, build, and check.

  • Python Version Support: Updated supported Python versions to 3.10–3.14. Dropped Python 3.8 and 3.9, which are in security-only maintenance.

4.1.3

02 Feb 23:48

Fix a couple of vehicle makes that were mapped incorrectly.

Full Changelog: 4.1.2...4.1.3

4.1.2

08 Oct 22:09

Update packaging.

4.1.1

08 Oct 21:29

Fix error in pypandoc that prevented package publishing.

4.1.0

08 Oct 21:14

Minor version change. Includes:

  • Fixed a bug when running directly from the checked-out repo (#13)
  • Fixed a bug when parsing non-UTF-8 text found in newer versions of the raw data (#14)

Thanks a ton to @gonewest818 for catching and fixing the bugs!

4.0.0

28 Aug 21:06

The first major version change in almost a year! This one includes:

  • Almost every categorical column has been mapped to a human-readable string. chp_road_type is one exception, because I cannot find information about what the road types are.
  • Closed #7: There is now a human-readable county_location column.
  • Closed #3: bicycle_collision is now True/False instead of True/NULL.
  • Closed #6: The most common makes are now normalized, but there are still a bunch of missing ones. See new bug: #12

3.0.12

08 Aug 18:05

No code changes, just CI/CD stuff.