
provero: A vendor-neutral, declarative data quality engine #290

Submitting Author: (@andreahlert)
All current maintainers: (@andreahlert)
Package Name: provero
One-Line Description of Package: A vendor-neutral, declarative data quality engine for validating datasets using YAML-based check definitions.
Repository Link: https://github.com/provero-org/provero
Version submitted: 0.1.1
EiC: TBD
Editor: TBD
Reviewer 1: TBD
Reviewer 2: TBD
Archive: TBD
JOSS DOI: TBD
Version accepted: TBD
Date accepted (month/day/year): TBD


Code of Conduct & Commitment to Maintain Package

Description

Provero is a vendor-neutral, declarative data quality engine. Users define quality checks in a simple provero.yaml file and run them against any supported data source (DuckDB, PostgreSQL, Pandas/Polars DataFrames) with a single CLI command. The package provides:

  • 14 built-in check types (not_null, unique, completeness, range, freshness, anomaly detection, custom SQL, and more)
  • a SQL batch optimizer that compiles multiple checks into a single query for performance
  • data contracts with schema validation and SLA enforcement
  • statistical anomaly detection using Z-score, MAD, and IQR methods, without requiring scipy
  • HTML report generation and webhook alerts for Slack/PagerDuty
  • a SQLite-based result store with time-series history
  • a data profiler that auto-generates check suggestions

Provero also ships an Airflow provider and a Flyte plugin for orchestrated pipeline integration.
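To illustrate the declarative style, a minimal provero.yaml might look like the sketch below. The field names, check keys, and connection string are assumptions based on the check types listed above, not the package's documented schema:

```yaml
# Hypothetical provero.yaml -- field names are illustrative only.
source: duckdb://warehouse.db
checks:
  - type: not_null
    column: order_id
  - type: unique
    column: order_id
  - type: range
    column: amount
    min: 0
    max: 10000
  - type: freshness
    column: updated_at
    max_age: 24h
```

The suite would then run with a single CLI invocation (the exact command name is likewise an assumption, e.g. `provero run provero.yaml`).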

Scope

  • Please indicate which category or categories this package falls under.
    Check out our package scope page to learn more about our scope.
    (If you are unsure which category fits, we suggest you make a pre-submission inquiry):

    • Data retrieval
    • Data extraction
    • Data processing/munging
    • Data deposition
    • Data validation and testing
    • Data visualization¹
    • Workflow automation
    • Citation management and bibliometrics
    • Scientific software wrappers
    • Database interoperability

Domain Specific

  • Geospatial
  • Education

Community Partnerships

If your package is associated with an existing community, please check below:

  • For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):

    • Who is the target audience and what are scientific applications of this package?

      The target audience is data engineers, data scientists, and researchers who need to validate datasets before analysis or model training. Provero is useful in scientific workflows where data integrity is critical, such as verifying that experimental datasets meet expected schema and value constraints before running statistical analyses or ML pipelines.
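The scipy-free statistical checks mentioned in the description (Z-score, MAD, IQR) can be sketched in pure Python using only the standard library. These function names and signatures are illustrative, not provero's actual API:

```python
import statistics

# Illustrative scipy-free outlier detectors; these names are NOT provero's API.

def zscore_outliers(values, threshold=3.0):
    """Flag values whose absolute Z-score exceeds the threshold."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    if sd == 0:
        return []
    return [v for v in values if abs((v - mean) / sd) > threshold]

def mad_outliers(values, threshold=3.5):
    """Flag values by modified Z-score: 0.6745 * (x - median) / MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]
```

On a small sample such as `[10, 12, 11, 13, 12, 11, 100]`, the MAD and IQR rules flag 100 while the plain Z-score at the default threshold does not (the outlier inflates the standard deviation), which is why robust methods are often preferred for validating small datasets.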

    • Are there other Python packages that accomplish the same thing? If so, how does yours differ?

      Great Expectations and Soda Core are the main alternatives. Provero differs by being fully declarative (YAML-only configuration, no Python code required for common checks), vendor-neutral with a plugin architecture, and lightweight with zero heavy dependencies. It also includes a SQL batch optimizer that compiles N checks into a single query, built-in anomaly detection without scipy, data contracts with SLA enforcement, and native Airflow/Flyte integrations as first-class packages.
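The batch-optimizer idea, compiling N checks into a single query, can be sketched as follows. The check format and function name are hypothetical, not provero's actual API, and an in-memory SQLite table stands in for the DuckDB/PostgreSQL sources named above:

```python
import sqlite3

def compile_batch(table, checks):
    """Fold N checks into a single table scan: each check becomes
    one SUM(CASE ...) column counting the rows that fail it."""
    cols = ",\n  ".join(
        f"SUM(CASE WHEN NOT ({c['expr']}) THEN 1 ELSE 0 END) AS {c['name']}"
        for c in checks
    )
    return f"SELECT\n  {cols}\nFROM {table}"

# Hypothetical check specs; provero's real check schema may differ.
checks = [
    {"name": "id_not_null",   "expr": "id IS NOT NULL"},
    {"name": "price_range",   "expr": "price BETWEEN 0 AND 1000"},
    {"name": "sku_not_empty", "expr": "length(sku) > 0"},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, sku TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "A1", 9.5), (2, "", 20.0), (None, "B2", -3.0)],
)

# One round trip evaluates all three checks.
row = conn.execute(compile_batch("products", checks)).fetchone()
failures = dict(zip([c["name"] for c in checks], row))
print(failures)  # one failing row per check in this toy table
```

The design point is that failure counts are aggregates, so N checks cost one scan plus N cheap CASE expressions instead of N separate queries.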

    • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted:

      N/A

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

  • does not violate the Terms of Service of any service it interacts with.
  • uses an OSI approved license.
  • contains a README with instructions for installing the development version.
  • includes documentation with examples for all functions.
  • contains a tutorial with examples of its essential functions and uses.
  • has a test suite.
  • has continuous integration setup, such as GitHub Actions, CircleCI, and/or others.

Publication Options

JOSS Checks
  • The package has an obvious research application according to JOSS's definition in their submission requirements. Be aware that completing the pyOpenSci review process does not guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
  • The package is not a "minor utility" as defined by JOSS's submission requirements: "Minor 'utility' packages, including 'thin' API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
  • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
  • The package is deposited in a long-term repository with the DOI:

Note: JOSS accepts our review as theirs. You will NOT need to go through another full review. JOSS will only review your paper.md file. Be sure to link to this pyOpenSci issue when a JOSS issue is opened for your package. Also be sure to tell the JOSS editor that this is a pyOpenSci reviewed package once you reach this step.

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option allows reviewers to open smaller issues that can then be linked to PRs, rather than submitting a denser, text-based review. It also allows you to demonstrate addressing each issue via PR links.

  • Yes I am OK with reviewers submitting requested changes as issues to my repo. Reviewers will then link to the issues in their submitted review.

Confirm each of the following by checking the box.

  • I have read the author guide.
  • I expect to maintain this package for at least 2 years and can help find a replacement for the maintainer (team) if needed.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

Footnotes

  1. Please fill out a pre-submission inquiry before submitting a data visualization package.
