Part of the CyberViser AI ecosystem
PeachTree is the recursive learning-tree dataset engine for CyberViser / 0AI.
It turns owned repositories, documentation, tests, fuzz reports, issue notes, telemetry summaries, and architecture plans into traceable, safe, deduplicated JSONL datasets for model training and security-assurance workflows.
🌐 Company: cyberviserai.com
📚 Documentation: 0ai-cyberviser.github.io/PeachTree
📦 Repository: github.com/0ai-Cyberviser/PeachTree
PeachTree is designed as a shared dependency for:
- PeachFuzz — fuzzing findings, crash triage, and regression corpora
See it in action at cyberviserai.com - Hancock — cybersecurity LLM agent datasets
- CactusFuzz — adversarial testing and guardrail validation
- 0AI Portfolio — broader CyberViser / 0AI ecosystem coordination
PeachTree provides the review-first data layer between raw project material and downstream model/fuzzing workflows.
flowchart TD
A[Training Goal] --> B[Recursive Learning Tree]
B --> C[Owned Source Collection]
C --> D[Safety + License Gate]
D --> E[Dataset Builder]
E --> F[JSONL Training Dataset]
E --> G[Manifest + Provenance]
E --> H[PeachFuzz Seeds]
G --> I[Gap Analysis]
I --> B
PeachTree helps CyberViser build datasets that are:
- traceable — every record keeps source and provenance metadata
- safe — secrets, tokens, private keys, and unsafe records are blocked or flagged
- deduplicated — deterministic dataset reduction avoids repeated examples
- reviewable — plans, manifests, diffs, model cards, and release bundles are generated before publication
- local-first — no broad public scraping or automatic training launches by default
PeachTree does not blindly scrape GitHub.
- local/owned repository ingestion is enabled first
- public GitHub collection is disabled by default
- public collection requires explicit opt-in, license allowlists, rate limits, and provenance
- secret/token/private-key patterns are blocked
- provenance metadata is attached to every record
# From PyPI
pip install peachtree-ai
# From source
git clone https://github.com/0ai-Cyberviser/PeachTree.git
cd PeachTree
pip install -e ".[dev]"# Plan dataset structure
peachtree plan --source /path/to/repo --goal "Security training data"
# Ingest repository
peachtree ingest --repo /path/to/repo --output data/
# Build JSONL dataset
peachtree build --input data/ --output dataset.jsonl
# Validate dataset
peachtree audit --dataset dataset.jsonl
# Check policy compliance
peachtree policy --dataset dataset.jsonl --pack safety- Recursive Learning Trees — Define dataset structures with branching, filtering, and composition
- Safety Gates — Automatic secret detection, license filtering, content validation
- Policy Packs — Composable compliance rules for quality, deduplication, safety
- JSONL Datasets — Full provenance: source, path, commit digest, timestamps
- Deduplication — Content-hash, semantic, and fuzzy matching (3 methods)
- Quality Scoring — Automated metrics for dataset quality and completeness
- Release Bundles — SBOM, signatures, model cards, trainer handoff manifests
- Hancock — Cybersecurity LLM agent datasets
- PeachFuzz — Fuzzing corpora and crash triage
- CyberViser — Project hub and documentation surface
- 0AI — Broader ecosystem coordination
Full documentation is available at:
- 📖 Official Docs — Complete user guide, API reference, architecture
- Getting Started — Installation, quickstart, tutorials
- User Guide — CLI reference, workflows, safety gates, policy packs
- Architecture — Design, data models, components, JSONL format
- API Reference — Detailed API documentation
- Contributing — Development guide, testing, code quality
Documentation is automatically deployed to GitHub Pages on every push to main:
https://0ai-cyberviser.github.io/PeachTree/Run PeachTree in Docker:
# Build image
docker build -t peachtree:latest .
# Development environment with bash
docker-compose run --rm peachtree-dev
# Run tests
docker-compose run --rm peachtree-test
# Serve documentation locally
docker-compose run --rm -p 8000:8000 peachtree-docs
# Visit http://localhost:8000Releases are automated via GitHub Actions:
- Update version in
pyproject.toml - Update
CHANGELOG.md - Create git tag:
git tag v0.10.0 - Push tag:
git push origin v0.10.0 - GitHub Actions will automatically:
- Run tests
- Build packages
- Create GitHub Release
- Publish to PyPI
See DEPLOYMENT.md for full deployment guide.
- 129 unit tests — 91% code coverage
- Type checking — mypy with 0 errors
- Linting — ruff with 0 violations
- Pre-commit hooks — Automatic quality checks
- CI/CD — GitHub Actions workflows for testing and deployment
peachtree/
├── src/peachtree/ # Core library
│ ├── builder.py # DatasetBuilder
│ ├── models.py # SourceDocument, DatasetRecord, DatasetManifest
│ ├── safety.py # SafetyGate
│ ├── policy.py # PolicyPack evaluation
│ ├── registry.py # Registry and artifacts
│ └── ...
├── tests/ # Unit tests (129 tests)
├── docs/ # Documentation (42 markdown files)
├── .github/workflows/ # CI/CD workflows
├── Dockerfile # Docker image
├── mkdocs.yml # Documentation config
└── pyproject.toml # Project metadata
Contributions are welcome! See CONTRIBUTING.md for:
- Development setup
- Testing guidelines
- Code quality standards
- Pull request process
- Development workflow
PeachTree is part of the 0AI / CyberViser project. See LICENSE for details.
- Issues — GitHub Issues for bug reports and feature requests
- Discussions — GitHub Discussions for questions
- Documentation — https://0ai-cyberviser.github.io/PeachTree/
- Contributing — See CONTRIBUTING.md
- LoRA fine-tuning support
- Dataset versioning system
- Web UI for dataset management
- Integration with Weights & Biases
- Advanced deduplication algorithms
- Performance benchmarking
- Extended ecosystem integrations
Status: Active Development | Version: 0.9.0 | Python: 3.10+
Built for the 0AI ecosystem by CyberViser
- generated datasets are ignored by default until reviewed
- trainer handoff commands are dry-run unless explicitly promoted outside PeachTree
python3 -m venv ~/venvs/peachtree
source ~/venvs/peachtree/bin/activate
python -m pip install -e ".[dev]"
pytest -q
peachtree policy
peachtree plan --goal "Build PeachFuzz training data" --project peachfuzz
peachtree ingest-local --repo . --repo-name peachtree --output data/raw/peachtree.jsonl
peachtree build --source data/raw/peachtree.jsonl --dataset data/datasets/peachtree.jsonl --manifest data/manifests/peachtree.json --domain peachtree
peachtree audit --dataset data/datasets/peachtree.jsonlpeachtree ingest-local --repo ~/peachfuzz --repo-name peachfuzz --output data/raw/peachfuzz.jsonl
peachtree build --source data/raw/peachfuzz.jsonl --dataset data/datasets/peachfuzz-instruct.jsonl --manifest data/manifests/peachfuzz.json --domain peachfuzzUse this path for fuzz harness notes, crash triage reports, minimized reproducers, coverage findings, and safe corpus descriptions.
peachtree ingest-local --repo ~/Hancock --repo-name hancock --output data/raw/hancock.jsonl
peachtree build --source data/raw/hancock.jsonl --dataset data/datasets/hancock-instruct.jsonl --manifest data/manifests/hancock.json --domain hancockUse this path for Hancock modes, API examples, SOC/PICERL triage records, Sigma/YARA examples, CISO summaries, and fuzzing-specialist training records.
peachtree github-owned --owner 0ai-Cyberviser --limit 25 --output data/manifests/owned.jsonl
peachtree github-plan --inventory data/manifests/owned.jsonl
bash scripts/clone_owned_repos.sh
bash scripts/build_owned_datasets.shThe connector inventories access-authorized repositories and generates reviewable scripts. Public GitHub-wide collection remains disabled by default.
| Version | Capability | Status |
|---|---|---|
| v0.1.0 | local recursive dataset engine | Complete |
| v0.2.x | review-first owned GitHub connector | Complete |
| v0.3.0 | dependency graphs and lineage maps | Complete |
| v0.4.0 | ChatML, Alpaca, and ShareGPT exporters | Complete |
| v0.5.0 | scheduled dataset update PR workflow | Complete |
| v0.6.0 | quality scoring, deduplication, readiness checks | Complete |
| v0.7.0 | policy packs, license gates, model-card generation | Complete |
| v0.8.0 | registries, SBOM/provenance, release bundles | Complete |
| v0.9.0 | trainer handoff manifests and LoRA dry-run plans | Complete |
PeachTree v0.3.0 adds local-only graph and lineage reports.
peachtree graph --inventory data/manifests/owned.jsonl --format mermaid --output reports/ecosystem-graph.mmd
peachtree lineage --dataset data/datasets/peachfuzz-instruct.jsonl --format markdown --output reports/peachfuzz-lineage.md
peachtree ecosystem --inventory data/manifests/owned.jsonl --output reports/ecosystem.jsonThese commands read local inventory, datasets, and manifests. They do not contact GitHub or train models.
PeachTree v0.4.0 exports reviewed PeachTree datasets into ChatML, Alpaca, and ShareGPT JSONL.
peachtree export-formats
peachtree export --source data/datasets/peachfuzz-instruct.jsonl --format chatml --output data/exports/peachfuzz-chatml.jsonl
peachtree validate-export --format chatml --path data/exports/peachfuzz-chatml.jsonlExporters are local-only and preserve provenance metadata by default.
PeachTree v0.5.0 adds review-first scheduled update tooling.
peachtree update-plan --repo ~/peachfuzz --repo-name 0ai-Cyberviser/peachfuzz --output data/manifests/update-plan.json
peachtree diff --baseline data/baseline/old.jsonl --candidate data/datasets/new.jsonl --format markdown
peachtree review-report --plan data/manifests/update-plan.json --output reports/update-review.jsonThe included GitHub Actions workflow opens pull requests for dataset updates. It does not train models, upload datasets, or push directly to main.
PeachTree v0.6.0 adds quality scoring, deterministic deduplication, and training readiness checks.
peachtree score --dataset data/datasets/peachfuzz-instruct.jsonl --markdown-output reports/quality.md
peachtree dedup --source data/datasets/peachfuzz-instruct.jsonl --output data/datasets/peachfuzz-deduped.jsonl
peachtree readiness --dataset data/datasets/peachfuzz-deduped.jsonl --output reports/readiness.jsonThese commands are local-only and do not train models or upload datasets.
PeachTree v0.7.0 adds policy-pack evaluation, license/compliance gates, and model-card generation.
peachtree policy-pack --list
peachtree license-gate --dataset data/datasets/peachfuzz-deduped.jsonl --markdown-output reports/license-gate.md
peachtree model-card --dataset data/datasets/peachfuzz-deduped.jsonl --model-name PeachFuzz-Dataset-v1 --output reports/model-card.mdThese commands are local-only and generate review artifacts before downstream model training.
PeachTree v0.8.0 adds dataset registries, artifact signing metadata, SBOM/provenance manifests, and release bundle creation.
peachtree registry data/datasets reports --output reports/registry.json
peachtree sbom --registry reports/registry.json --output reports/sbom.json
peachtree bundle data/datasets/example.jsonl reports/model-card.md --output dist/example-release.zipThese commands are local-only and do not train models, upload datasets, or scrape public GitHub.
PeachTree v0.9.0 adds trainer handoff manifests, LoRA job cards, and dry-run training launch plans.
peachtree handoff --dataset data/exports/example-chatml.jsonl --model-name Example-Lora --base-model mistralai/Mistral-7B-Instruct-v0.3 --output reports/handoff.json
peachtree lora-card --dataset data/exports/example-chatml.jsonl --job-name example-lora --base-model mistralai/Mistral-7B-Instruct-v0.3 --output-dir outputs/example --output reports/lora-job-card.json
peachtree train-plan --job-card reports/lora-job-card.json --output reports/dry-run-training-plan.jsonThese commands are dry-run only and do not launch training.
git init
git branch -M main
git add .
git commit -m "feat: initial PeachTree recursive dataset engine"
gh repo create 0ai-Cyberviser/PeachTree --public --source=. --remote=origin --push