Dataset Forge

v0.1 alpha -- implements the v1 Inspect slice.

Dataset Forge inspects image datasets for GPT-style artifacts and produces explainable, evidence-backed findings.

It is designed for practitioners preparing LoRA training datasets who want to understand what is wrong with their data -- and what should be left alone -- before doing anything to it.

v0.1 alpha is analysis only. It reads your dataset. It does not touch your images. Cleanup, UI, plugins, and additional analyzers are not part of this release.

What it does

Dataset Forge builds a statistical picture of your dataset, then runs independent analyzers that compare each image against that baseline. Each finding explains what was measured, how confident the analyzer is, and what action (if any) is warranted.

A healthy dataset can legitimately produce zero findings. That is a valid and correct result, not a failure.

Analyzers in v1:

Analyzer	What it detects	Status
`texture_analyzer/v1`	Elevated microtexture density relative to dataset baseline	First-pass; uncalibrated
`crystalline_faceting_analyzer/v1`	Angular micro-polygon shading on surfaces	First-pass; uncalibrated

Both analyzers are conservative. Until calibration against labeled ground truth is complete, confidence is capped at 0.70 for `texture_analyzer/v1` and 0.45 for `crystalline_faceting_analyzer/v1`. Treat findings as candidates for human review, not automated decisions.
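
As a rough illustration of how such a cap behaves, the sketch below clamps a raw score to a per-analyzer ceiling. The analyzer IDs and cap values are taken from this README; the `CONFIDENCE_CAPS` mapping and the `capped_confidence` function are hypothetical names for illustration and are not the project's API.

```python
# Hypothetical sketch: clamp a raw confidence to a per-analyzer ceiling.
# Analyzer IDs and cap values come from this README; all names are illustrative.
CONFIDENCE_CAPS = {
    "texture_analyzer/v1": 0.70,
    "crystalline_faceting_analyzer/v1": 0.45,
}


def capped_confidence(analyzer_id: str, raw_confidence: float) -> float:
    """Return raw_confidence limited to the range [0, cap] for this analyzer."""
    cap = CONFIDENCE_CAPS[analyzer_id]
    return max(0.0, min(raw_confidence, cap))


if __name__ == "__main__":
    # A raw score above the cap is pulled down to the cap.
    print(capped_confidence("texture_analyzer/v1", 0.93))
    # A raw score below the cap passes through unchanged.
    print(capped_confidence("crystalline_faceting_analyzer/v1", 0.30))
```

Under this assumption, a reported confidence equal to the cap only means "at least this confident by the uncalibrated measure", which is one more reason to review such findings by hand.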

Who it is for

Dataset Forge is for people who:

Train LoRA models and suspect their image dataset carries GPT fingerprints
Want to know what is wrong before they change anything
Work with images from AI image generators (DALL-E, Midjourney, Ideogram, etc.)
Need findings they can audit, not opaque scores

It is not a general image quality tool, an upscaler, or a cleanup utility. The first reference use case is watercolor and colored-pencil anthropomorphic character datasets with GPT-style artifacts including crystalline microtexture, glitter-like speckle, periodic frequency contamination, oversharpening, and edge halos.

Current limitations (v0.1 alpha)

Analyzers are not calibrated to published ground truth. Thresholds were derived from an initial labeled review of one private dataset. Precision and recall are known for that dataset but should not be assumed to hold for other datasets. Treat findings as informed candidates for human review, not certified detections.
Two analyzers ship in v0.1 alpha. Three planned families (speck/glitter, periodic frequency noise, oversharpening/halo) are not yet implemented. Research probes for two of these are complete, but the analyzers themselves are deferred.
No cleanup. v0.1 alpha is read-only. Cleanup is planned for v2 and will require human approval at every step. See ROADMAP.md. Code for future cleanup phases exists in the repository but is not active or supported in v0.1 alpha.
No UI. Dataset Forge is a CLI tool. Report output is JSON and plain text.
z-score findings require dataset context. texture_analyzer/v1 uses dataset-relative z-scores. On a dataset of fewer than five images the baseline statistics are not meaningful.
Most scripts are internal development utilities. The public scripts are run_benchmarks.py, generate_crystalline_fixtures.py, generate_texture_fixtures.py, and generate_benchmark_defects.py. All other files in scripts/ -- whether prefixed with _ or not -- are internal calibration, diagnostic, or research tools and are not part of the public API. scripts/research/ holds artifact-family research probes.

Requirements

Python 3.11 or newer
uv (recommended) or pip

Runtime dependencies (installed automatically):

Pillow >= 10.0
opencv-python >= 4.10

Install

git clone https://github.com/surrealbydesign/dataset-forge.git
cd dataset-forge
uv sync

Or with pip:

pip install -e .

First run

Point it at a folder of images:

uv run dataset-forge inspect path/to/your/dataset/

With an explicit output directory:

uv run dataset-forge inspect path/to/your/dataset/ --output path/to/report/

Terminal output:

Dataset Forge Inspect
=====================
Dataset:  path/to/your/dataset/
Output:   path/to/your/dataset/inspect_output

Images:   100
Analyzed: 100
Errors:   0

Summary
-------
Total findings:  19
  HIGH severity:  2
  MEDIUM severity: 11
  LOW severity:   6

Images with findings:  15 / 100
Images with no issues: 85 / 100

85 images require no action.
15 images have findings. Review report for details.

Report written:
  path/to/your/dataset/inspect_output/inspection_report.json
  path/to/your/dataset/inspect_output/inspection_report.txt

Reports are written to the output directory (default: a folder named inspect_output/ inside your dataset directory). Source images are not touched.
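
The summary counts findings and images separately, which is why the number of findings can exceed the number of images with findings: one image may carry several. The sketch below shows one way such a tally could be computed. The `summarize` function and its input shape are hypothetical and say nothing about the real report schema.

```python
from collections import Counter


def summarize(findings: list[tuple[str, str]], total_images: int) -> dict:
    """Tally findings given (image_name, severity) pairs.

    An image may appear in more than one pair.
    """
    by_severity = Counter(severity for _, severity in findings)
    flagged_images = {name for name, _ in findings}
    return {
        "total_findings": len(findings),
        "by_severity": dict(by_severity),
        "images_with_findings": len(flagged_images),
        "images_with_no_issues": total_images - len(flagged_images),
    }


if __name__ == "__main__":
    # Illustrative data only: three findings spread over two of five images.
    sample = [("a.png", "HIGH"), ("a.png", "LOW"), ("b.png", "MEDIUM")]
    print(summarize(sample, total_images=5))
```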

Optional: inspection gallery

uv run dataset-forge inspect path/to/dataset/ --gallery

Writes inspection_gallery.png -- a contact sheet with findings grouped by severity alongside clean reference images.

Reading the report

Each finding in inspection_report.txt looks like this:

image_023.png
  [MEDIUM] artifact.crystalline_faceting  --  confidence 0.45 (FP rate ~28%)
  Benchmark: uncalibrated
  Evidence: pencil_grain_score=64.2, watercolor_smoothness_score=36.6, microtexture_density_score=65.8
  Why: pencil_grain=64.2 is above the detection threshold. Crystalline
       surface faceting detected based on mid-frequency texture pattern.
  Action: Candidate for review. Do not apply cleanup without inspecting
          the image.

Every finding includes:

Severity (LOW / MEDIUM / HIGH / CRITICAL)
Confidence and estimated false-positive rate
Benchmark version -- uncalibrated means thresholds have not been validated against published ground truth for your dataset type
Raw evidence -- the measurements that produced the finding
A plain-language explanation of why the finding was made
A recommended action, which may be "leave alone"

Images with no findings are listed separately. They are not an afterthought.

Safety guarantees

Source images are read-only. Dataset Forge never writes to your image files. No move, rename, modify, or delete operation is performed on source images.
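Reports are written separately. All output goes to the output directory: inspect_output/ inside your dataset directory by default, or the path given with --output. Nothing is written over your image files.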
Cleanup is not implemented in v1. There is no flag or command that modifies images in any way. This is by design.
Every finding is explainable. No finding is emitted without an evidence dict, a human-readable explanation, and a recommendation. No black-box scores.
Healthy images produce no findings. The tool does not generate recommendations for images that do not warrant them.

Benchmarks

Analyzer thresholds are validated against committed synthetic fixtures. The public benchmark suite runs without any setup from a fresh clone:

uv run python scripts/run_benchmarks.py

Current public coverage: 10 expectations across TextureAnalyzer and CrystallineFacetingAnalyzer. All 10 pass. See benchmarks/README.md for the full manifest description.

Internal measurement cache

The disk-backed measurement cache is internal and opt-in. It is disabled by default, stores measurements only, and has no CLI flags.

DATASET_FORGE_MEASUREMENT_CACHE_DIR=/path/to/cache enables the cache.
DATASET_FORGE_DISABLE_MEASUREMENT_CACHE=1 bypasses cache reads and writes.

Tests

uv run pytest tests/

648 tests passing. Tests cover the full v1 pipeline: Finding, DatasetContext, Analyzer contracts, report writers, CLI, inspect runner, gallery, benchmark framework, committed fixtures, and public CLI surface.

License

MIT. See LICENSE.

Architecture and project docs

Document	Contents
PROJECT_BIBLE.md	Project constitution -- read before changing anything
ARCHITECTURE.md	v1 pipeline structure, Finding schema, artifact family model
WHY.md	Reasoning behind major design decisions
DIRECTION.md	Current milestone and scope
ROADMAP.md	v1 -> v2 -> v3 milestone plan
CURRENT_STATUS.md	Implementation status; resume from here
CLI_OUTPUT.md	Acceptance criteria for terminal and report output
benchmarks/README.md	Benchmark manifests and fixture inventory
