Annotated output

README · Docs · Guides · Packages

The --annotated flag emits JSONL in which each record contains the formatted text plus a byte-offset span for every generated value. It works with every output mode: structured (CSV, TSV, JSONL, SQL) and templates.

Usage

# Structured output
seedfaker name email phone:e164 -n 1000 --annotated --seed demo --format csv
seedfaker name email -n 5000 --annotated --format jsonl

# Template output
seedfaker run pii-leak -n 1000 --annotated --seed demo
seedfaker run nginx --annotated -n 5000

# With corruption
seedfaker run pii-leak -n 100000 --annotated --corrupt mid --seed train

# Pipe to NER trainer
seedfaker run pii-leak -n 100000 --annotated --corrupt mid | python train_ner.py

Also available as a config option: annotated: true.

Output format

Each output line is a single JSON object with text and spans. The examples below are pretty-printed for readability:

Structured (CSV)

seedfaker name email phone:e164 -n 1 --annotated --seed demo --format csv
{
  "text": "Paulina Laca,im.ivana@eunet.rs,+278458384682",
  "spans": [
    { "s": 0, "e": 12, "f": "name", "v": "Paulina Laca" },
    { "s": 13, "e": 30, "f": "email", "v": "im.ivana@eunet.rs" },
    { "s": 31, "e": 44, "f": "phone", "v": "+278458384682" }
  ]
}

Template (pii-leak)

seedfaker run pii-leak -n 1 --annotated --seed demo
{
  "text": "--- CRM Note | 2025-05-25T21:12:48Z | Agent: spyros_papageorghiou_vip ---\nContact: Spyros Papageorghiou <spyros@icloud.com>",
  "spans": [
    { "s": 15, "e": 35, "f": "timestamp", "v": "2025-05-25T21:12:48Z" },
    { "s": 45, "e": 69, "f": "username", "v": "spyros_papageorghiou_vip" },
    { "s": 83, "e": 103, "f": "name", "v": "Spyros Papageorghiou" },
    { "s": 105, "e": 122, "f": "email", "v": "spyros@icloud.com" }
  ]
}

With corruption

Corrupted spans include the pre-corruption value in o:

{ "s": 9, "e": 11, "f": "name", "v": "Sp", "o": "Spyros Papageorghiou" }

Span keys

| Key | Description |
| --- | --- |
| s | Start byte offset in text (inclusive) |
| e | End byte offset in text (exclusive). text[s..e] == v |
| f | Field type from registry (name, email, ssn, integer, serial, ...) |
| v | Generated value; always matches text[s..e] |
| o | Pre-corruption value (only present if corrupted) |
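
The text[s..e] == v invariant can be checked mechanically. The following is a minimal sketch, assuming the offsets index the UTF-8 encoding of text; check_spans is a hypothetical helper name, not part of seedfaker. It is run against the structured (CSV) record shown earlier in this document.

```python
import json


def check_spans(record: dict) -> list[str]:
    """Return a list of problems found in one annotated record (empty if none)."""
    # Assumption: s and e are offsets into the UTF-8 bytes of "text".
    raw = record["text"].encode("utf-8")
    problems = []
    for span in record["spans"]:
        actual = raw[span["s"]:span["e"]].decode("utf-8", errors="replace")
        if actual != span["v"]:
            problems.append(f"{span['f']}: expected {span['v']!r}, found {actual!r}")
    return problems


# The structured (CSV) example record from this document, as one JSONL line.
line = '{"text": "Paulina Laca,im.ivana@eunet.rs,+278458384682", "spans": [{"s": 0, "e": 12, "f": "name", "v": "Paulina Laca"}, {"s": 13, "e": 30, "f": "email", "v": "im.ivana@eunet.rs"}, {"s": 31, "e": 44, "f": "phone", "v": "+278458384682"}]}'

print(check_spans(json.loads(line)))  # prints []
```

Slicing the encoded bytes rather than the Python string matters only when text contains non-ASCII characters, where byte and character positions diverge.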

What produces spans

Structured output: every column value produces a span.

Template output: every {{...}} substitution produces a span, including declared columns, inline fields, enums, and serial. Template literal text does not produce spans.

With --corrupt, all generated values (including inline fields in templates) can be corrupted.
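
Because template literal text carries no span, it can be recovered as the gaps between spans, which is handy when a trainer needs explicit "outside" regions. A rough sketch, assuming UTF-8 byte offsets and non-overlapping spans; literal_segments is a hypothetical helper, and the record is the pii-leak example shown earlier in this document.

```python
def literal_segments(record: dict) -> list[tuple[int, int, str]]:
    """Return (start, end, text) for every stretch of text not covered by a span."""
    raw = record["text"].encode("utf-8")  # assumption: offsets are UTF-8 bytes
    segments = []
    cursor = 0
    for span in sorted(record["spans"], key=lambda sp: sp["s"]):
        if span["s"] > cursor:
            segments.append((cursor, span["s"], raw[cursor:span["s"]].decode("utf-8")))
        cursor = max(cursor, span["e"])
    if cursor < len(raw):
        segments.append((cursor, len(raw), raw[cursor:].decode("utf-8")))
    return segments


record = {
    "text": "--- CRM Note | 2025-05-25T21:12:48Z | Agent: spyros_papageorghiou_vip ---\nContact: Spyros Papageorghiou <spyros@icloud.com>",
    "spans": [
        {"s": 15, "e": 35, "f": "timestamp", "v": "2025-05-25T21:12:48Z"},
        {"s": 45, "e": 69, "f": "username", "v": "spyros_papageorghiou_vip"},
        {"s": 83, "e": 103, "f": "name", "v": "Spyros Papageorghiou"},
        {"s": 105, "e": 122, "f": "email", "v": "spyros@icloud.com"},
    ],
}

for start, end, literal in literal_segments(record):
    print(start, end, repr(literal))
# 0 15 '--- CRM Note | '
# 35 45 ' | Agent: '
# 69 83 ' ---\nContact: '
# 103 105 ' <'
# 122 123 '>'
```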

Use cases

PII scanner benchmarks. Generate text with known PII locations, run your scanner, compare output spans against ground truth.

seedfaker run pii-leak -n 10000 --annotated --seed bench | python eval_scanner.py
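
The contents of eval_scanner.py are not shown in this guide. One way such a comparison might look is an exact-match score over (s, e, f) triples. This sketch assumes the scanner reports spans with the same s/e/f keys and byte offsets as the ground truth; the predicted list is invented for illustration.

```python
def score(truth: list[dict], predicted: list[dict]) -> dict:
    """Exact-match precision and recall over (start, end, field) triples."""
    gold = {(sp["s"], sp["e"], sp["f"]) for sp in truth}
    found = {(sp["s"], sp["e"], sp["f"]) for sp in predicted}
    hits = len(gold & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(gold) if gold else 0.0
    return {"hits": hits, "precision": precision, "recall": recall}


# Ground truth: spans from the structured (CSV) example in this document.
truth = [
    {"s": 0, "e": 12, "f": "name", "v": "Paulina Laca"},
    {"s": 13, "e": 30, "f": "email", "v": "im.ivana@eunet.rs"},
    {"s": 31, "e": 44, "f": "phone", "v": "+278458384682"},
]

# Hypothetical scanner output: email and phone are exact, the name is cut short.
predicted = [
    {"s": 0, "e": 7, "f": "name"},
    {"s": 13, "e": 30, "f": "email"},
    {"s": 31, "e": 44, "f": "phone"},
]

result = score(truth, predicted)
print(result["hits"], result["precision"], result["recall"])
```

Exact matching is the strictest choice; a benchmark that should reward partial overlaps would need a different matching rule.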

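NER model training. The text + spans layout follows the same idea as Prodigy/doccano annotations. Note that s and e are byte offsets (see Span keys), so tools that expect character offsets need a conversion when the text contains non-ASCII characters.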

seedfaker run pii-leak -n 100000 --annotated --corrupt mid --seed train > train.jsonl
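
The span keys are documented as byte offsets, while annotation tools commonly index by character, and the two differ as soon as the text contains a multi-byte character. Below is a sketch of one possible conversion, assuming UTF-8 and offsets that fall on character boundaries. The start/end/label key names follow the span layout used by Prodigy and are an assumption about the consumer; to_char_spans is a hypothetical helper.

```python
def to_char_spans(record: dict) -> dict:
    """Convert byte-offset spans (s, e, f) to character-offset spans."""
    raw = record["text"].encode("utf-8")  # assumption: offsets are UTF-8 bytes

    def char_index(byte_index: int) -> int:
        # Number of characters encoded by the first byte_index bytes.
        return len(raw[:byte_index].decode("utf-8"))

    return {
        "text": record["text"],
        "spans": [
            {"start": char_index(sp["s"]), "end": char_index(sp["e"]), "label": sp["f"]}
            for sp in record["spans"]
        ],
    }


# Hand-written record (not seedfaker output) with one two-byte character.
record = {
    "text": "Zo\u00eb,zoe@example.com",
    "spans": [
        {"s": 0, "e": 4, "f": "name", "v": "Zo\u00eb"},
        {"s": 5, "e": 20, "f": "email", "v": "zoe@example.com"},
    ],
}

converted = to_char_spans(record)
for sp in converted["spans"]:
    print(sp["start"], sp["end"], sp["label"], converted["text"][sp["start"]:sp["end"]])
```

For pure-ASCII records the conversion is a no-op, so it is safe to apply unconditionally.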

Data quality analysis. Inspect what was generated and where corruption changed values.

seedfaker name email -n 100 --annotated --corrupt extreme --format csv --seed qa
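
Since only corrupted spans carry the o key, a per-field corruption tally needs little more than a key check. A minimal sketch; corruption_summary is a hypothetical helper, and the sample line is hand-written around the corrupted name span shown earlier in this document, not real seedfaker output.

```python
import json
from collections import Counter


def corruption_summary(lines) -> dict:
    """Map each field type to (corrupted_count, total_count)."""
    total, corrupted = Counter(), Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for span in json.loads(line)["spans"]:
            total[span["f"]] += 1
            if "o" in span:  # "o" is only present on corrupted spans
                corrupted[span["f"]] += 1
    return {field: (corrupted[field], total[field]) for field in total}


# Hand-written sample: one corrupted name span, one untouched email span.
sample_lines = [
    '{"text": "Contact: Sp <spyros@icloud.com>", "spans": ['
    '{"s": 9, "e": 11, "f": "name", "v": "Sp", "o": "Spyros Papageorghiou"}, '
    '{"s": 13, "e": 30, "f": "email", "v": "spyros@icloud.com"}]}'
]

print(corruption_summary(sample_lines))  # prints {'name': (1, 1), 'email': (0, 1)}
```

To run it on piped output, pass sys.stdin as lines.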


