SurrealByDesign/dataset-forge

Dataset Forge

v0.1 alpha -- implements the v1 Inspect slice.

Dataset Forge inspects image datasets for GPT-style artifacts and produces explainable, evidence-backed findings.

It is designed for practitioners preparing LoRA training datasets who want to understand what is wrong with their data -- and what should be left alone -- before doing anything to it.

v0.1 alpha is analysis only. It reads your dataset. It does not touch your images. Cleanup, UI, plugins, and additional analyzers are not part of this release.


What it does

Dataset Forge builds a statistical picture of your dataset, then runs independent analyzers that compare each image against that baseline. Each finding explains what was measured, how confident the analyzer is, and what action (if any) is warranted.

A healthy dataset can legitimately produce zero findings. That is a valid and correct result, not a failure.
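To make the shape of a finding concrete, here is a minimal sketch. The `Finding` class and its field names are hypothetical, chosen only to mirror the description above (what was measured, how confident the analyzer is, what action is warranted). The real schema is documented in ARCHITECTURE.md and may differ.

```python
from dataclasses import dataclass

# Hypothetical illustration only: this is not the project's actual Finding schema.
SEVERITIES = ("LOW", "MEDIUM", "HIGH", "CRITICAL")


@dataclass(frozen=True)
class Finding:
    image: str          # file the finding refers to
    artifact: str       # artifact family, e.g. "artifact.crystalline_faceting"
    severity: str       # one of SEVERITIES
    confidence: float   # analyzer confidence in [0, 1]
    evidence: dict      # raw measurements that produced the finding
    explanation: str    # plain-language "why"
    action: str         # recommended action, which may be "leave alone"

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # Refuse to build a finding that cannot explain itself.
        if not self.evidence or not self.explanation.strip() or not self.action.strip():
            raise ValueError("a finding needs evidence, an explanation, and an action")


# A dataset with nothing wrong is simply an empty list of findings.
healthy_dataset_findings: list[Finding] = []
```

Rejecting incomplete findings at construction time is one way to keep unexplained scores out of a report. Whether the project enforces this in the same place is an assumption.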

Analyzers in v1:

| Analyzer | What it detects | Status |
| --- | --- | --- |
| texture_analyzer/v1 | Elevated microtexture density relative to dataset baseline | First-pass; uncalibrated |
| crystalline_faceting_analyzer/v1 | Angular micro-polygon shading on surfaces | First-pass; uncalibrated |

Both analyzers are conservative. Confidence values are capped at 0.70 and 0.45 respectively until calibration against labeled ground truth is complete. Treat findings as candidates for human review, not automated decisions.
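The capping rule can be sketched as a simple clamp. The cap values come from the paragraph above; the `CONFIDENCE_CAPS` mapping and the `capped_confidence` function are hypothetical names used for illustration, not part of the Dataset Forge API.

```python
# Caps from the text above; names are hypothetical, not the project's API.
CONFIDENCE_CAPS = {
    "texture_analyzer/v1": 0.70,
    "crystalline_faceting_analyzer/v1": 0.45,
}


def capped_confidence(analyzer_id: str, raw_confidence: float) -> float:
    """Clamp a raw confidence into [0, cap] for the given analyzer."""
    cap = CONFIDENCE_CAPS[analyzer_id]
    return max(0.0, min(raw_confidence, cap))
```

Under this sketch, a confidence reported at exactly the cap means "at least this confident by the analyzer's own measure", not a calibrated probability.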


Who it is for

Dataset Forge is for people who:

  • Train LoRA models and suspect their image dataset carries GPT fingerprints
  • Want to know what is wrong before they change anything
  • Work with images generated by AI image tools (DALL-E, Midjourney, Ideogram, etc.)
  • Need findings they can audit, not opaque scores

It is not a general image quality tool, an upscaler, or a cleanup utility. The first reference use case is watercolor and colored-pencil anthropomorphic character datasets with GPT-style artifacts including crystalline microtexture, glitter-like speckle, periodic frequency contamination, oversharpening, and edge halos.


Current limitations (v0.1 alpha)

  • Analyzers are not calibrated to published ground truth. Thresholds were derived from an initial labeled review of one private dataset. Precision and recall are known for that dataset but are not general. Treat findings as informed candidates for human review, not certified detections.

  • Two analyzers ship in v0.1 alpha. Three planned families (speck/glitter, periodic frequency noise, oversharpening/halo) are not yet implemented. Research probes for two of these have been completed and deferred.

  • No cleanup. v0.1 alpha is read-only. Cleanup is planned for v2 and will require human approval at every step. See ROADMAP.md. Code for future cleanup phases exists in the repository but is not active or supported in v0.1 alpha.

  • No UI. Dataset Forge is a CLI tool. Report output is JSON and plain text.

  • Z-score findings require dataset context. texture_analyzer/v1 uses dataset-relative z-scores, so on a dataset of fewer than five images the baseline statistics are not meaningful.

  • Most scripts are internal development utilities. The public scripts are run_benchmarks.py, generate_crystalline_fixtures.py, generate_texture_fixtures.py, and generate_benchmark_defects.py. All other files in scripts/ -- whether prefixed with _ or not -- are internal calibration, diagnostic, or research tools and are not part of the public API. scripts/research/ holds artifact-family research probes.


Requirements

  • Python 3.11 or newer
  • uv (recommended) or pip

Runtime dependencies (installed automatically):

  • Pillow >= 10.0
  • opencv-python >= 4.10

Install

git clone https://github.com/surrealbydesign/dataset-forge.git
cd dataset-forge
uv sync

Or with pip:

pip install -e .

First run

Point it at a folder of images:

uv run dataset-forge inspect path/to/your/dataset/

With an explicit output directory:

uv run dataset-forge inspect path/to/your/dataset/ --output path/to/report/

Terminal output:

Dataset Forge Inspect
=====================
Dataset:  path/to/your/dataset/
Output:   path/to/your/dataset/inspect_output

Images:   100
Analyzed: 100
Errors:   0

Summary
-------
Total findings:  19
  HIGH severity:  2
  MEDIUM severity: 11
  LOW severity:   6

Images with findings:  15 / 100
Images with no issues: 85 / 100

85 images require no action.
15 images have findings. Review report for details.

Report written:
  path/to/your/dataset/inspect_output/inspection_report.json
  path/to/your/dataset/inspect_output/inspection_report.txt

Reports are written to the output directory (default: a folder named inspect_output/ inside your dataset directory). Source images are not touched.

Optional: inspection gallery

uv run dataset-forge inspect path/to/dataset/ --gallery

Writes inspection_gallery.png -- a contact sheet with findings grouped by severity alongside clean reference images.


Reading the report

Each finding in inspection_report.txt looks like this:

image_023.png
  [MEDIUM] artifact.crystalline_faceting  --  confidence 0.45 (FP rate ~28%)
  Benchmark: uncalibrated
  Evidence: pencil_grain_score=64.2, watercolor_smoothness_score=36.6, microtexture_density_score=65.8
  Why: pencil_grain=64.2 is above the detection threshold. Crystalline
       surface faceting detected based on mid-frequency texture pattern.
  Action: Candidate for review. Do not apply cleanup without inspecting
          the image.

Every finding includes:

  • Severity (LOW / MEDIUM / HIGH / CRITICAL)
  • Confidence and estimated false-positive rate
  • Benchmark version -- uncalibrated means thresholds have not been validated against published ground truth for your dataset type
  • Raw evidence -- the measurements that produced the finding
  • A plain-language explanation of why the finding was made
  • A recommended action, which may be "leave alone"

Images with no findings are listed separately. They are not an afterthought.


Safety guarantees

  • Source images are read-only. Dataset Forge never writes to your image files. No move, rename, modify, or delete operation is performed on source images.
  • Reports are written separately. All output goes to the output directory (by default inspect_output/ inside your dataset directory, or the path passed with --output), never into your image files.
  • Cleanup is not implemented in v1. There is no flag or command that modifies images in any way. This is by design.
  • Every finding is explainable. No finding is emitted without an evidence dict, a human-readable explanation, and a recommendation. No black-box scores.
  • Healthy images produce no findings. The tool does not generate recommendations for images that do not warrant them.
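If you would like to check the read-only guarantee yourself, one approach is to hash the dataset before and after a run and compare. The helpers below are a hypothetical sketch using only the standard library; they are not part of Dataset Forge.

```python
import hashlib
from pathlib import Path


def snapshot(dataset_dir: str) -> dict[str, str]:
    """Map each file's path (relative to dataset_dir) to its SHA-256 digest."""
    root = Path(dataset_dir)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Files from `before` that are missing or have different content in `after`."""
    return sorted(name for name, digest in before.items() if after.get(name) != digest)
```

Take one snapshot before running inspect and one after; an empty result from `changed_files` means every pre-existing file is byte-identical. New report files do not count as changes because only files present in the first snapshot are compared. If a previous run already left reports inside the dataset directory, those may show up as changed, so pointing --output elsewhere keeps the check clean.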

Benchmarks

Analyzer thresholds are validated against committed synthetic fixtures. The public benchmark suite runs without any setup from a fresh clone:

uv run python scripts/run_benchmarks.py

Current public coverage: 10 expectations across TextureAnalyzer and CrystallineFacetingAnalyzer. All 10 pass. See benchmarks/README.md for the full manifest description.

Internal measurement cache

The disk-backed measurement cache is internal and opt-in. It is disabled by default, stores measurements only, and has no CLI flags.

  • DATASET_FORGE_MEASUREMENT_CACHE_DIR=/path/to/cache enables the cache.
  • DATASET_FORGE_DISABLE_MEASUREMENT_CACHE=1 bypasses cache reads and writes.

Tests

uv run pytest tests/

648 tests passing. Tests cover the full v1 pipeline: Finding, DatasetContext, Analyzer contracts, report writers, CLI, inspect runner, gallery, benchmark framework, committed fixtures, and public CLI surface.


License

MIT. See LICENSE.


Architecture and project docs

| Document | Contents |
| --- | --- |
| PROJECT_BIBLE.md | Project constitution -- read before changing anything |
| ARCHITECTURE.md | v1 pipeline structure, Finding schema, artifact family model |
| WHY.md | Reasoning behind major design decisions |
| DIRECTION.md | Current milestone and scope |
| ROADMAP.md | v1 -> v2 -> v3 milestone plan |
| CURRENT_STATUS.md | Implementation status; resume from here |
| CLI_OUTPUT.md | Acceptance criteria for terminal and report output |
| benchmarks/README.md | Benchmark manifests and fixture inventory |
