# v5 Model Selection Requirements

This sheet defines requirements for revisiting DataFog's optional model stack before
locking the v5 core API around specific NLP/OCR backends. It is intentionally a
requirements document, not a model recommendation list.

## Decision Goals

- Pick models that improve adoption by making the first successful result easy,
  trustworthy, and local by default.
- Keep the core SDK fast and lightweight; model-backed engines remain optional.
- Make model behavior explicit enough that users can defend it in privacy,
  security, and compliance reviews.
- Preserve a clean path for future backend swaps without breaking the top-level
  v5 API.

## Must-Haves

### Runtime And Packaging

- No model downloads during import, install, or ordinary SDK calls.
- All model downloads must be explicit CLI/API actions or user-provided local
  paths.
- The core install must not require ML, OCR, Torch, TensorFlow, Java, Spark, or
  system OCR binaries.
- Optional extras must map cleanly to real imports:
  - `nlp` for lightweight NLP engines.
  - `nlp-advanced` for heavier ML NER engines.
  - `ocr` for local image/OCR processing.
  - `distributed` for Spark-style processing.
- Missing dependency and missing model errors must explain the exact install or
  download command.
- Python 3.10, 3.11, and 3.12 must be supported for advertised optional model
  profiles. Python 3.13 support should be advertised only after explicit profile
  validation.
- Models must work in offline mode after explicit download/cache preparation.

### Privacy And Trust

- No network access during inference.
- No telemetry, remote callbacks, model hub lookups, or license checks during
  inference.
- No raw PII should be written to logs, cache names, telemetry, exceptions, or
  debug traces by default.
- Model metadata exposed by DataFog should identify model name/version/source
  without storing detected raw PII.
- Reversible workflows must be opt-in and clearly separated from ordinary
  redaction.
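
One way to satisfy the no-raw-PII rule for exceptions is to report positions rather than content; `SpanError` below is a hypothetical sketch, not DataFog's actual exception type:

```python
class SpanError(RuntimeError):
    """Error that names where a problem occurred, never the matched text."""

    def __init__(self, reason: str, start: int, end: int):
        # Only offsets reach the message, so tracebacks stay PII-free.
        super().__init__(f"{reason} at offsets {start}-{end}")
        self.start = start
        self.end = end
```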

### Detection Contract

- Model outputs must include enough structure for the public result contract:
  entity type, text/span, start/end offsets, confidence when available, and
  engine/source.
- Spans must be deterministic for the same model, text, and settings.
- Entity labels must be mappable into DataFog's canonical entity taxonomy without
  surprising users.
- Model-backed engines must compose with regex detection without duplicating or
  overwriting high-confidence structured entities.
- Failure modes must be predictable: unsupported language, missing model, missing
  optional dependency, and low-confidence results should all be distinguishable.

### Quality Gates

- Candidate models must be benchmarked on DataFog's target corpora before
  adoption.
- Benchmarks must include precision/recall by entity type, not only aggregate F1.
- Structured PII such as email, phone, IP address, SSN, credit cards, dates, and
  ZIP/postal codes should remain regex/validator-first unless a model clearly
  improves quality.
- NER-style entities such as person, organization, location, address, and
  domain-specific identifiers need regression tests with realistic app/log data.
- OCR models must be evaluated separately for text extraction quality and PII
  extraction quality after OCR.
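
As an example of the regex/validator-first rule, credit card candidates can be matched by pattern and then confirmed with a Luhn checksum, which filters out most random digit runs. This is a sketch: the pattern below is deliberately simple and is not DataFog's production regex.

```python
import re

# Digit-anchored on both ends so a match never trails a separator.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(candidate: str) -> bool:
    """Checksum validation that keeps random digit runs out of the results."""
    digits = [int(d) for d in re.sub(r"\D", "", candidate)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0


def find_cards(text: str) -> list[str]:
    """Regex proposes candidates; the validator confirms them."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

The same propose-then-validate shape applies to other structured types such as SSNs and IP addresses.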

### Operational Fit

- CPU inference must be acceptable for the default advertised workflow.
- GPU-only models are not acceptable as default engines.
- Model size, cold-start time, memory use, and cache footprint must be measured.
- The model must have a usable open license for commercial SDK users.
- The model or provider must have credible maintenance signals and versioned
  artifacts.

## Nice-To-Haves

- Strong multilingual support with per-language quality reporting.
- Quantized or small variants that keep local inference practical.
- ONNX or other portable runtime support for future non-Torch deployments.
- Streaming/chunked inference support or predictable behavior across chunk
  boundaries.
- Custom entity hints or user-provided label sets.
- Confidence calibration good enough to expose threshold controls.
- Batch inference APIs for logs, CSV, and JSONL workflows.
- Clear model cards with training data notes, limitations, and intended use.
- Support for local cache directories that can be controlled by environment
  variable or explicit config.
- Graceful operation on Apple Silicon and common Linux CI runners.

## Disqualifiers

- Requires network access for inference.
- Downloads weights implicitly from ordinary SDK calls.
- License is unclear, non-commercial, or incompatible with SDK distribution.
- Requires a hosted API for core value.
- Requires GPU for reasonable first-use behavior.
- Cannot return stable spans or forces only label-level output.
- Emits raw text or entities through logging, telemetry, or callbacks.
- Adds heavyweight dependencies to the core install.
- Breaks Python version support we already advertise.

## Evaluation Matrix

Each candidate backend should be scored before adoption:

| Area | Required Evidence |
| --- | --- |
| Install footprint | Extra name, package deps, wheel size impact, system deps |
| Runtime footprint | Cold start, warm latency, memory, CPU/GPU requirements |
| Offline behavior | Explicit download path, local cache path, no-network test |
| Quality | Precision/recall by entity type on DataFog corpora |
| Span quality | Offset correctness and deduplication behavior |
| Privacy | No raw PII logs/cache/telemetry, safe error messages |
| Licensing | Model license, dependency licenses, commercial use notes |
| Maintenance | Release cadence, Python compatibility, issue activity |
| API fit | Entity taxonomy mapping, confidence support, batch/chunk support |
| Docs fit | Model card, limitations, user-facing setup instructions |
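
The "no-network test" row can be enforced in CI by patching out socket connections so any implicit model hub lookup fails loudly. This is one common approach, shown here without any test-framework dependency:

```python
import socket


class NetworkBlocked(RuntimeError):
    """Raised when code under test tries to open a connection."""


def block_network():
    """Replace socket.connect with a guard; returns the original for restore."""

    def guard(self, *args, **kwargs):
        raise NetworkBlocked("network access attempted during offline test")

    original = socket.socket.connect
    socket.socket.connect = guard
    return original
```

A test fixture would call `block_network()` before running the offline scenario and restore the original method afterwards.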

## Candidate Backend Categories To Evaluate

- Regex plus validators for structured PII and secrets.
- Lightweight NLP NER for person, organization, location, and address entities.
- Advanced local NER models for broader entity coverage and multilingual support.
- OCR text extraction engines for local images/PDF-derived images.
- Document understanding models only if they beat OCR plus text PII extraction
  enough to justify their footprint.
- User-provided backend hooks for teams that already have a preferred model.
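
The user-provided hook can be as thin as a structural protocol; `AnnotatorBackend` and the tuple shape below are assumptions for illustration, not a committed interface:

```python
import re
from typing import Iterable, Protocol


class AnnotatorBackend(Protocol):
    """Anything that maps text to (label, start, end, confidence) can plug in."""

    def annotate(self, text: str) -> Iterable[tuple[str, int, int, float]]: ...


class UppercaseNameBackend:
    """Toy backend: flags ALL-CAPS tokens as low-confidence name candidates."""

    def annotate(self, text: str) -> Iterable[tuple[str, int, int, float]]:
        for m in re.finditer(r"\b[A-Z]{2,}\b", text):
            yield ("PERSON_CANDIDATE", m.start(), m.end(), 0.3)
```

A structural (duck-typed) protocol means teams can register an existing model wrapper without importing anything from DataFog.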

## Recommended Selection Policy

- Default v5 behavior should remain regex/validator-first.
- Model-backed engines should be opt-in by engine, policy, or extra.
- DataFog should prefer smaller, reliable local models over maximum leaderboard
  scores if they improve install success and first-use latency.
- Model choices should be version-pinned in docs and CI once advertised.
- A model can be experimental in docs/examples before it becomes part of the
  supported contract.

## Open Questions

- Do we want one recommended advanced NER model, or a pluggable registry with a
  default?
- Should OCR stay Tesseract-first, or should v5 introduce a newer local OCR
  default after benchmarking?
- How much multilingual quality is required for v5.0.0 versus a later release?
- Should Python 3.13 optional-profile support be a v4.5 compatibility release,
  a v5 launch requirement, or both?
- What maximum model download size is acceptable for the default recommended
  advanced profile?