PeachTree

Part of the CyberViser AI ecosystem

PeachTree is the recursive learning-tree dataset engine for CyberViser / 0AI.

It turns owned repositories, documentation, tests, fuzz reports, issue notes, telemetry summaries, and architecture plans into traceable, safe, deduplicated JSONL datasets for model training and security-assurance workflows.

🌐 Company: cyberviserai.com
📚 Documentation: 0ai-cyberviser.github.io/PeachTree
📦 Repository: github.com/0ai-Cyberviser/PeachTree

PeachTree is designed as a shared dependency for:

PeachFuzz — fuzzing findings, crash triage, and regression corpora
See it in action at cyberviserai.com
Hancock — cybersecurity LLM agent datasets
CactusFuzz — adversarial testing and guardrail validation
0AI Portfolio — broader CyberViser / 0AI ecosystem coordination

What PeachTree does

PeachTree provides the review-first data layer between raw project material and downstream model/fuzzing workflows.

flowchart TD
    A[Training Goal] --> B[Recursive Learning Tree]
    B --> C[Owned Source Collection]
    C --> D[Safety + License Gate]
    D --> E[Dataset Builder]
    E --> F[JSONL Training Dataset]
    E --> G[Manifest + Provenance]
    E --> H[PeachFuzz Seeds]
    G --> I[Gap Analysis]
    I --> B

Mission

PeachTree helps CyberViser build datasets that are:

traceable — every record keeps source and provenance metadata
safe — secrets, tokens, private keys, and unsafe records are blocked or flagged
deduplicated — deterministic dataset reduction avoids repeated examples
reviewable — plans, manifests, diffs, model cards, and release bundles are generated before publication
local-first — no broad public scraping or automatic training launches by default

Safety defaults

PeachTree does not blindly scrape GitHub.

local/owned repository ingestion is enabled first
public GitHub collection is disabled by default
public collection requires explicit opt-in, license allowlists, rate limits, and provenance
secret/token/private-key patterns are blocked
provenance metadata is attached to every record

Quick Start

Installation

# From PyPI
pip install peachtree-ai

# From source
git clone https://github.com/0ai-Cyberviser/PeachTree.git
cd PeachTree
pip install -e ".[dev]"

Build Your First Dataset

# Plan dataset structure
peachtree plan --source /path/to/repo --goal "Security training data"

# Ingest repository
peachtree ingest --repo /path/to/repo --output data/

# Build JSONL dataset
peachtree build --input data/ --output dataset.jsonl

# Validate dataset
peachtree audit --dataset dataset.jsonl

# Check policy compliance
peachtree policy --dataset dataset.jsonl --pack safety

Features

Core Capabilities

Recursive Learning Trees — Define dataset structures with branching, filtering, and composition
Safety Gates — Automatic secret detection, license filtering, content validation
Policy Packs — Composable compliance rules for quality, deduplication, safety
JSONL Datasets — Full provenance: source, path, commit digest, timestamps
Deduplication — Content-hash, semantic, and fuzzy matching (3 methods)
Quality Scoring — Automated metrics for dataset quality and completeness
Release Bundles — SBOM, signatures, model cards, trainer handoff manifests

Ecosystem Integration

Hancock — Cybersecurity LLM agent datasets
PeachFuzz — Fuzzing corpora and crash triage
CyberViser — Project hub and documentation surface
0AI — Broader ecosystem coordination

Documentation

Full documentation is available at:

📖 Official Docs — Complete user guide, API reference, architecture
Getting Started — Installation, quickstart, tutorials
User Guide — CLI reference, workflows, safety gates, policy packs
Architecture — Design, data models, components, JSONL format
API Reference — Detailed API documentation
Contributing — Development guide, testing, code quality

Deployment

GitHub Pages

Documentation is automatically deployed to GitHub Pages on every push to main:

https://0ai-cyberviser.github.io/PeachTree/

Docker

Run PeachTree in Docker:

# Build image
docker build -t peachtree:latest .

# Development environment with bash
docker-compose run --rm peachtree-dev

# Run tests
docker-compose run --rm peachtree-test

# Serve documentation locally
docker-compose run --rm -p 8000:8000 peachtree-docs
# Visit http://localhost:8000

Release Process

Releases are automated via GitHub Actions:

Update version in pyproject.toml
Update CHANGELOG.md
Create git tag: git tag v0.10.0
Push tag: git push origin v0.10.0
GitHub Actions will automatically:
- Run tests
- Build packages
- Create GitHub Release
- Publish to PyPI

See DEPLOYMENT.md for full deployment guide.

Quality

129 unit tests — 91% code coverage
Type checking — mypy with 0 errors
Linting — ruff with 0 violations
Pre-commit hooks — Automatic quality checks
CI/CD — GitHub Actions workflows for testing and deployment

Project Structure

peachtree/
├── src/peachtree/          # Core library
│   ├── builder.py          # DatasetBuilder
│   ├── models.py           # SourceDocument, DatasetRecord, DatasetManifest
│   ├── safety.py           # SafetyGate
│   ├── policy.py           # PolicyPack evaluation
│   ├── registry.py         # Registry and artifacts
│   └── ...
├── tests/                  # Unit tests (129 tests)
├── docs/                   # Documentation (42 markdown files)
├── .github/workflows/      # CI/CD workflows
├── Dockerfile              # Docker image
├── mkdocs.yml              # Documentation config
└── pyproject.toml          # Project metadata

Contributing

Contributions are welcome! See CONTRIBUTING.md for:

Development setup
Testing guidelines
Code quality standards
Pull request process
Development workflow

License

PeachTree is part of the 0AI / CyberViser project. See LICENSE for details.

Support

Issues — GitHub Issues for bug reports and feature requests
Discussions — GitHub Discussions for questions
Documentation — https://0ai-cyberviser.github.io/PeachTree/
Contributing — See CONTRIBUTING.md

Roadmap

LoRA fine-tuning support
Dataset versioning system
Web UI for dataset management
Integration with Weights & Biases
Advanced deduplication algorithms
Performance benchmarking
Extended ecosystem integrations

Status: Active Development | Version: 0.9.0 | Python: 3.10+

Built for the 0AI ecosystem by CyberViser

generated datasets are ignored by default until reviewed
trainer handoff commands are dry-run unless explicitly promoted outside PeachTree

Quick start

python3 -m venv ~/venvs/peachtree
source ~/venvs/peachtree/bin/activate
python -m pip install -e ".[dev]"

pytest -q

peachtree policy
peachtree plan --goal "Build PeachFuzz training data" --project peachfuzz
peachtree ingest-local --repo . --repo-name peachtree --output data/raw/peachtree.jsonl
peachtree build --source data/raw/peachtree.jsonl --dataset data/datasets/peachtree.jsonl --manifest data/manifests/peachtree.json --domain peachtree
peachtree audit --dataset data/datasets/peachtree.jsonl

Core workflows

Integrate with PeachFuzz

peachtree ingest-local --repo ~/peachfuzz --repo-name peachfuzz --output data/raw/peachfuzz.jsonl
peachtree build --source data/raw/peachfuzz.jsonl --dataset data/datasets/peachfuzz-instruct.jsonl --manifest data/manifests/peachfuzz.json --domain peachfuzz

Use this path for fuzz harness notes, crash triage reports, minimized reproducers, coverage findings, and safe corpus descriptions.

Integrate with Hancock

peachtree ingest-local --repo ~/Hancock --repo-name hancock --output data/raw/hancock.jsonl
peachtree build --source data/raw/hancock.jsonl --dataset data/datasets/hancock-instruct.jsonl --manifest data/manifests/hancock.json --domain hancock

Use this path for Hancock modes, API examples, SOC/PICERL triage records, Sigma/YARA examples, CISO summaries, and fuzzing-specialist training records.

Build from owned GitHub inventory

peachtree github-owned --owner 0ai-Cyberviser --limit 25 --output data/manifests/owned.jsonl
peachtree github-plan --inventory data/manifests/owned.jsonl
bash scripts/clone_owned_repos.sh
bash scripts/build_owned_datasets.sh

The connector inventories access-authorized repositories and generates reviewable scripts. Public GitHub-wide collection remains disabled by default.

Capability roadmap

Version	Capability	Status
v0.1.0	local recursive dataset engine	Complete
v0.2.x	review-first owned GitHub connector	Complete
v0.3.0	dependency graphs and lineage maps	Complete
v0.4.0	ChatML, Alpaca, and ShareGPT exporters	Complete
v0.5.0	scheduled dataset update PR workflow	Complete
v0.6.0	quality scoring, deduplication, readiness checks	Complete
v0.7.0	policy packs, license gates, model-card generation	Complete
v0.8.0	registries, SBOM/provenance, release bundles	Complete
v0.9.0	trainer handoff manifests and LoRA dry-run plans	Complete

Dependency graphs and lineage maps

PeachTree v0.3.0 adds local-only graph and lineage reports.

peachtree graph --inventory data/manifests/owned.jsonl --format mermaid --output reports/ecosystem-graph.mmd
peachtree lineage --dataset data/datasets/peachfuzz-instruct.jsonl --format markdown --output reports/peachfuzz-lineage.md
peachtree ecosystem --inventory data/manifests/owned.jsonl --output reports/ecosystem.json

These commands read local inventory, datasets, and manifests. They do not contact GitHub or train models.

Model exporter profiles

PeachTree v0.4.0 exports reviewed PeachTree datasets into ChatML, Alpaca, and ShareGPT JSONL.

peachtree export-formats
peachtree export --source data/datasets/peachfuzz-instruct.jsonl --format chatml --output data/exports/peachfuzz-chatml.jsonl
peachtree validate-export --format chatml --path data/exports/peachfuzz-chatml.jsonl

Exporters are local-only and preserve provenance metadata by default.

Scheduled dataset update PR workflow

PeachTree v0.5.0 adds review-first scheduled update tooling.

peachtree update-plan --repo ~/peachfuzz --repo-name 0ai-Cyberviser/peachfuzz --output data/manifests/update-plan.json
peachtree diff --baseline data/baseline/old.jsonl --candidate data/datasets/new.jsonl --format markdown
peachtree review-report --plan data/manifests/update-plan.json --output reports/update-review.json

The included GitHub Actions workflow opens pull requests for dataset updates. It does not train models, upload datasets, or push directly to main.

Dataset quality gates

PeachTree v0.6.0 adds quality scoring, deterministic deduplication, and training readiness checks.

peachtree score --dataset data/datasets/peachfuzz-instruct.jsonl --markdown-output reports/quality.md
peachtree dedup --source data/datasets/peachfuzz-instruct.jsonl --output data/datasets/peachfuzz-deduped.jsonl
peachtree readiness --dataset data/datasets/peachfuzz-deduped.jsonl --output reports/readiness.json

These commands are local-only and do not train models or upload datasets.

Dataset policy packs

PeachTree v0.7.0 adds policy-pack evaluation, license/compliance gates, and model-card generation.

peachtree policy-pack --list
peachtree license-gate --dataset data/datasets/peachfuzz-deduped.jsonl --markdown-output reports/license-gate.md
peachtree model-card --dataset data/datasets/peachfuzz-deduped.jsonl --model-name PeachFuzz-Dataset-v1 --output reports/model-card.md

These commands are local-only and generate review artifacts before downstream model training.

Dataset release bundles

PeachTree v0.8.0 adds dataset registries, artifact signing metadata, SBOM/provenance manifests, and release bundle creation.

peachtree registry data/datasets reports --output reports/registry.json
peachtree sbom --registry reports/registry.json --output reports/sbom.json
peachtree bundle data/datasets/example.jsonl reports/model-card.md --output dist/example-release.zip

These commands are local-only and do not train models, upload datasets, or scrape public GitHub.

Trainer handoff manifests

PeachTree v0.9.0 adds trainer handoff manifests, LoRA job cards, and dry-run training launch plans.

peachtree handoff --dataset data/exports/example-chatml.jsonl --model-name Example-Lora --base-model mistralai/Mistral-7B-Instruct-v0.3 --output reports/handoff.json
peachtree lora-card --dataset data/exports/example-chatml.jsonl --job-name example-lora --base-model mistralai/Mistral-7B-Instruct-v0.3 --output-dir outputs/example --output reports/lora-job-card.json
peachtree train-plan --job-card reports/lora-job-card.json --output reports/dry-run-training-plan.json

These commands are dry-run only and do not launch training.

Repository creation reference

git init
git branch -M main
git add .
git commit -m "feat: initial PeachTree recursive dataset engine"
gh repo create 0ai-Cyberviser/PeachTree --public --source=. --remote=origin --push

Name		Name	Last commit message	Last commit date
Latest commit History 58 Commits
.github		.github
config		config
data		data
datasets/seed		datasets/seed
docs		docs
examples		examples
scripts		scripts
src/peachtree		src/peachtree
tests		tests
.coverage		.coverage
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
0AI-PORTFOLIO-ADDITION.md		0AI-PORTFOLIO-ADDITION.md
0AI-PORTFOLIO-INTEGRATION.md		0AI-PORTFOLIO-INTEGRATION.md
===" >&2		===" >&2
ADD-TO-PORTFOLIO-GUIDE.md		ADD-TO-PORTFOLIO-GUIDE.md
AGENTS.md		AGENTS.md
APRIL-26-EOD-ACTION-CHECKLIST.md		APRIL-26-EOD-ACTION-CHECKLIST.md
BADGES.md		BADGES.md
CHANGELOG.md		CHANGELOG.md
CLI-REFERENCE.md		CLI-REFERENCE.md
CLOUD-DEPLOYMENT.md		CLOUD-DEPLOYMENT.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
COMPLETE-DEPLOYMENT-DOCUMENTATION-INDEX.md		COMPLETE-DEPLOYMENT-DOCUMENTATION-INDEX.md
COMPLETE-STATUS-REPORT.md		COMPLETE-STATUS-REPORT.md
COMPLETION-SUMMARY.md		COMPLETION-SUMMARY.md
CONTRIBUTING-ADVANCED.md		CONTRIBUTING-ADVANCED.md
CONTRIBUTING.md		CONTRIBUTING.md
COUNTDOWN-EXECUTION-REFERENCE.md		COUNTDOWN-EXECUTION-REFERENCE.md
CYBERVISERAI-INTEGRATION.md		CYBERVISERAI-INTEGRATION.md
DATASET-BUILD-COMPLETE.md		DATASET-BUILD-COMPLETE.md
DATASET-REGISTRY.md		DATASET-REGISTRY.md
DEPLOYMENT-PLAYBOOKS.md		DEPLOYMENT-PLAYBOOKS.md
DEPLOYMENT.md		DEPLOYMENT.md
DEVELOPMENT.md		DEVELOPMENT.md
DUAL-PROJECT-COORDINATION-APRIL-27-28.md		DUAL-PROJECT-COORDINATION-APRIL-27-28.md
Dockerfile		Dockerfile
ECOSYSTEM-ENHANCEMENTS-SUMMARY.md		ECOSYSTEM-ENHANCEMENTS-SUMMARY.md
EMAIL-DISTRIBUTION-TEMPLATES.md		EMAIL-DISTRIBUTION-TEMPLATES.md
EXECUTE-TODAY-APRIL-26.md		EXECUTE-TODAY-APRIL-26.md
EXECUTIVE-DEPLOYMENT-READY-DASHBOARD.md		EXECUTIVE-DEPLOYMENT-READY-DASHBOARD.md
FIX-GITHUB-AUTH.md		FIX-GITHUB-AUTH.md
INTEGRATION-TESTING.md		INTEGRATION-TESTING.md
MASTER-DEPLOYMENT-STATUS.md		MASTER-DEPLOYMENT-STATUS.md
MASTER-DEPLOYMENT-WEEKEND-PLAN.md		MASTER-DEPLOYMENT-WEEKEND-PLAN.md
MASTER-EXECUTION-CHECKLIST.md		MASTER-EXECUTION-CHECKLIST.md
MODEL-CARD-SECURITY-DATASET.md		MODEL-CARD-SECURITY-DATASET.md
MRCLEAN-PR6-REVIEW.md		MRCLEAN-PR6-REVIEW.md
MULTI-ORG-DATASET-README.md		MULTI-ORG-DATASET-README.md
MULTI-ORG-INTEGRATION.md		MULTI-ORG-INTEGRATION.md
MULTI-ORG-STATUS-REPORT.md		MULTI-ORG-STATUS-REPORT.md
NEXT-STEPS.md		NEXT-STEPS.md
PEACHTREE-CLI-COMMANDS.md		PEACHTREE-CLI-COMMANDS.md
PERFORMANCE-OPTIMIZATION.md		PERFORMANCE-OPTIMIZATION.md
PORTFOLIO-CARD.md		PORTFOLIO-CARD.md
PRINT-THIS-TONIGHT-QUICK-REFERENCE.md		PRINT-THIS-TONIGHT-QUICK-REFERENCE.md
PROJECT-STATUS-REPORT.md		PROJECT-STATUS-REPORT.md
QUICK-REFERENCE.md		QUICK-REFERENCE.md
QUICKSTART-SECURITY-DATASET.md		QUICKSTART-SECURITY-DATASET.md
README.md		README.md
RELEASE-NOTES-TEMPLATE.md		RELEASE-NOTES-TEMPLATE.md
REPOSITORY-SETTINGS.md		REPOSITORY-SETTINGS.md
RIGHT-NOW-ACTION.md		RIGHT-NOW-ACTION.md
ROADMAP.md		ROADMAP.md
SECURITY.md		SECURITY.md
STAKEHOLDER-COMMUNICATION-PACKAGE.md		STAKEHOLDER-COMMUNICATION-PACKAGE.md
TODAY-QUICK-START.md		TODAY-QUICK-START.md
TROUBLESHOOTING-PLAYBOOK.md		TROUBLESHOOTING-PLAYBOOK.md
TROUBLESHOOTING.md		TROUBLESHOOTING.md
VISUAL-DEPLOYMENT-TIMELINE.md		VISUAL-DEPLOYMENT-TIMELINE.md
WORKFLOW-HARNESS-GUIDE.md		WORKFLOW-HARNESS-GUIDE.md
build		build
commit-all.sh		commit-all.sh
commit_workflow.py		commit_workflow.py
coverage.json		coverage.json
docker-compose.yml		docker-compose.yml
e HEAD		e HEAD
mkdocs.yml		mkdocs.yml
mypy-report.json		mypy-report.json
on		on
pyproject.toml		pyproject.toml
ruff-report.json		ruff-report.json
ts('mkdocs.yml'):		ts('mkdocs.yml'):
v1.0-ENHANCEMENTS.md		v1.0-ENHANCEMENTS.md

Folders and files

Latest commit

History

Repository files navigation

PeachTree

What PeachTree does

Mission

Safety defaults

Quick Start

Installation

Build Your First Dataset

Features

Core Capabilities

Ecosystem Integration

Documentation

Deployment

GitHub Pages

Docker

Release Process

Quality

Project Structure

Contributing

License

Support

Roadmap

Quick start

Core workflows

Integrate with PeachFuzz

Integrate with Hancock

Build from owned GitHub inventory

Capability roadmap

Dependency graphs and lineage maps

Model exporter profiles

Scheduled dataset update PR workflow

Dataset quality gates

Dataset policy packs

Dataset release bundles

Trainer handoff manifests

Repository creation reference

About

Resources

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages