Revise README with updated project details #4
Updated the README to reflect new features and descriptions for DeepZero, including changes to the pipeline architecture and installation instructions.
Pull request overview
Updates the project README to reflect DeepZero’s current positioning (agentic vulnerability research pipeline), pipeline architecture primitives, CLI usage, installation extras, and processor authoring guidance.
Changes:
- Rewrites the top-level project description and architecture section (Ingest/Map/BulkMap/Reduce).
- Expands installation and CLI usage examples (run/resume/status/validate/list-processors/init/interactive/serve).
- Adds detailed YAML pipeline anatomy, processor reference formats, built-in processor list, and repository structure.
```shell
deepzero interactive -m openai/gpt-4o

# Start the REST API server
deepzero serve --host 127.0.0.1 --port 8420 -w work/
```
The `deepzero serve` example uses `-w work/`, but run state is written under the pipeline-specific work dir (`settings.work_dir` + `/` + `pipeline.name`). With the current engine, `work/` will typically not contain `run.json`, so the API will show no data. Update the README example to point at the pipeline work dir (e.g. `-w work/loldrivers`).
Suggested change:
```diff
-deepzero serve --host 127.0.0.1 --port 8420 -w work/
+deepzero serve --host 127.0.0.1 --port 8420 -w work/loldrivers
```
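The path layout the comment relies on can be sketched as follows. This is a minimal illustration, assuming the engine composes the run-state path as the comment describes; `run_state_path` is a hypothetical helper, not part of DeepZero's actual API:

```python
from pathlib import Path

def run_state_path(work_dir: str, pipeline_name: str) -> Path:
    # Hypothetical sketch: run.json lives under the pipeline-specific
    # work dir (settings.work_dir / pipeline.name), so the API server
    # must be pointed there rather than at the parent work/ directory.
    return Path(work_dir) / pipeline_name / "run.json"

# -w work/ would look for work/run.json, which is never written;
# -w work/loldrivers resolves to work/loldrivers/run.json.
print(run_state_path("work", "loldrivers"))
```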
```yaml
# Stage 3: external processor from processors/ directory
- name: decompile
  processor: my_decompiler/my_decompiler.py
  parallel: 0  # 0 = use max_workers from settings
```
In the YAML example, `parallel: 0` is documented as "use max_workers from settings", but the runner treats `parallel <= 0` as auto-scaling to `os.cpu_count()` (and does not use `settings.max_workers` for stage parallelism). Please adjust the README comment (and/or the example) so it matches the actual behavior.
Suggested change:
```diff
-  parallel: 0  # 0 = use max_workers from settings
+  parallel: 0  # 0 or less = auto-scale to available CPU count
```
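The behavior the comment attributes to the runner can be sketched like this. `resolve_parallelism` is a hypothetical name for illustration, assuming the runner's logic matches the description above:

```python
import os

def resolve_parallelism(parallel: int) -> int:
    # Hypothetical sketch of the described runner behavior: any value
    # <= 0 auto-scales to the available CPU count; settings.max_workers
    # is not consulted for stage parallelism.
    if parallel > 0:
        return parallel
    return os.cpu_count() or 1
```

Under this reading, `parallel: 0` and `parallel: -1` behave identically, which is why the suggested comment says "0 or less".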
```diff
   processor: file_discovery  # bare name = built-in processor
   config:
-    extensions: [".exe", ".sys"]
+    extensions: ["*"]
```
The YAML example sets `extensions: ["*"]` for `file_discovery`, but the current implementation treats each entry as a file extension and prefixes a dot when missing (so `*` becomes `.*` and the glob becomes `*.*`, which won't match files without a dot). If the intent is "all files", omit `extensions` or set it to an empty list instead.
Suggested change:
```diff
-    extensions: ["*"]
+    extensions: []
```
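The normalization pitfall the comment describes can be reproduced in a few lines. `to_glob` is a hypothetical helper, assuming `file_discovery` normalizes entries and builds globs as described:

```python
import fnmatch

def to_glob(ext: str) -> str:
    # Hypothetical sketch: entries missing a leading dot get one
    # prefixed, then the glob is "*" + extension. A "*" entry thus
    # becomes ".*", yielding the glob "*.*".
    if not ext.startswith("."):
        ext = "." + ext
    return "*" + ext

# "*.*" requires a dot somewhere in the filename, so extensionless
# files (e.g. "README") are silently skipped.
print(to_glob("*"))
print(fnmatch.fnmatch("README", to_glob("*")))
print(fnmatch.fnmatch("driver.sys", to_glob("sys")))
```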
```yaml
require:
  is_executable: true
```
The YAML example's `metadata_filter` stage requires `is_executable: true`, but the preceding `file_discovery` ingest stage only emits `sha256` and `size_bytes` (no `is_executable`). As written, this filter will exclude every sample. Consider changing the example to filter on fields that actually exist (e.g. `min_size_bytes`) or switching the ingest stage to one that provides `is_executable` metadata.
Suggested change:
```diff
 require:
-  is_executable: true
+  min_size_bytes: 1
```
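Why the filter excludes every sample can be shown with a small sketch. `passes_require` is a hypothetical stand-in for the `require` matching logic, assuming an absent metadata key fails the check:

```python
def passes_require(metadata: dict, require: dict) -> bool:
    # Hypothetical sketch: every required key must be present in the
    # sample's metadata with the expected value. A key the ingest stage
    # never emits therefore fails for every sample.
    return all(metadata.get(key) == value for key, value in require.items())

# file_discovery emits only sha256 and size_bytes.
sample = {"sha256": "abc123", "size_bytes": 4096}
print(passes_require(sample, {"is_executable": True}))  # key never emitted
print(passes_require(sample, {"size_bytes": 4096}))     # field exists
```

A threshold field like `min_size_bytes` would compare against the emitted `size_bytes` value instead of requiring a field that was never produced.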