Consider adding a grammar-based synthetic-dataset generator for end-to-end content-correctness testing #40

@mmcdermott

Description

Filing as a follow-up pointer, not requesting anything immediately — referenced by mmcdermott/MEDS_EIC_AR#106.

Context

PR mmcdermott/MEDS_EIC_AR#106 adds a grammar-based synthetic MEDS dataset for end-to-end content-correctness testing of an autoregressive-generation CLI. The grammar is a small FSM over tokenized "programs" separated by SEP, with a reverse-mapping layer that serializes each FSM-valid sequence as a MEDS-format subject timeline (one event per token, 1-hour spacing). The MEDSDataset class from this library is used to write the resulting dataset.
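To make the pattern concrete, here is a minimal sketch of such a grammar layer. The program names, token contents, and sampling policy below are all illustrative assumptions; only the sample_sequence(rng, max_len) signature comes from the PR description.

```python
import random

# Illustrative grammar: these names and token contents are assumptions,
# not the actual PROGRAMS from PR #106.
SEP = "SEP"
PROGRAMS = {
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1"],
}


def sample_sequence(rng: random.Random, max_len: int) -> list[str]:
    """Sample a grammar-valid sequence: complete programs joined by single
    SEP tokens, never exceeding max_len tokens."""
    tokens: list[str] = []
    while True:
        program = PROGRAMS[rng.choice(sorted(PROGRAMS))]
        extra = len(program) + (1 if tokens else 0)  # +1 for the SEP
        if len(tokens) + extra > max_len:
            return tokens
        if tokens:
            tokens.append(SEP)
        tokens.extend(program)
```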

The useful generalization: any autoregressive-generation test suite wants an end-to-end fixture where (a) the training distribution is fully specified and grammar-checkable, (b) the fixture round-trips cleanly through MEDS preprocessing, and (c) the generated output can be validated token-by-token against the FSM. Today every project writes its own variant of this from scratch.

What the current implementation looks like

Two-layer structure (in PR #106):

  1. Grammar layer (pure Python, no dependencies): a small FSM with named "programs," each a fixed-length token sequence, separated by a SEP token. sample_sequence(rng, max_len) produces a grammar-valid sequence.
  2. MEDS layer: a TOKEN_TO_CODE dict (e.g. SEP → "GRAMMAR//SEP", program_A[i] → "GRAMMAR//A//{i}"), plus build_grammar_meds_dataset(out_dir, task_labels_dir, *, n_train, n_tuning, n_held_out, seed, max_len) that (see the serialization sketch after this list):
    • Samples a grammar-valid sequence per subject
    • Emits events at fixed 1-hour spacing
    • Places a task label's prediction_time strictly between the SEP event and the next event (avoiding an off-by-one from side="right" search_sorted; see MEDS_EIC_AR#111)
    • Constructs a MEDSDataset with proper data_shards, code_metadata, subject_splits, and task_labels, and calls .write(out_dir)
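Continuing the illustrative grammar sketched above, the serialization step might look like the following. The epoch start, the 30-minute label offset, and the serialize_subject helper are assumptions; the event and label field names follow the MEDS schema (subject_id, time, code, numeric_value; prediction_time, boolean_value), and the surrounding MEDSDataset construction is elided because its exact constructor call isn't reproduced in this issue.

```python
from datetime import datetime, timedelta

# Reverse mapping from grammar tokens to MEDS codes, mirroring the
# TOKEN_TO_CODE idea described above (exact codes in the PR may differ).
TOKEN_TO_CODE = {SEP: "GRAMMAR//SEP"}
for name, toks in PROGRAMS.items():
    for i, tok in enumerate(toks):
        TOKEN_TO_CODE[tok] = f"GRAMMAR//{name}//{i}"

EPOCH = datetime(2000, 1, 1)  # arbitrary illustrative start time


def serialize_subject(subject_id: int, tokens: list[str]):
    """One MEDS event per token at fixed 1-hour spacing; each task label's
    prediction_time lands strictly between a SEP event and the next event
    (here 30 minutes after the SEP, i.e. mid-gap)."""
    events, labels = [], []
    for i, tok in enumerate(tokens):
        t = EPOCH + timedelta(hours=i)
        events.append(
            {"subject_id": subject_id, "time": t,
             "code": TOKEN_TO_CODE[tok], "numeric_value": None}
        )
        if tok == SEP:
            # One label per SEP boundary here; the PR may emit only one
            # per subject. The True value is a placeholder.
            labels.append(
                {"subject_id": subject_id,
                 "prediction_time": t + timedelta(minutes=30),
                 "boolean_value": True}
            )
    return events, labels
```

Placing the prediction time strictly inside the gap means no event timestamp ever ties with it, which is what sidesteps the side="right" search_sorted boundary ambiguity noted above.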

There is also an FSM walker, prompt_grammar_tokens_by_subject, which extracts per-subject prompt grammar tokens (for establishing FSM state before evaluating generated continuations).
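For validating sequences against the grammar, a walker can simply replay tokens through the FSM. The sketch below is a hypothetical whole-sequence validator over the same assumed grammar, not the prompt_grammar_tokens_by_subject code from the PR.

```python
def fsm_valid(tokens: list[str]) -> bool:
    """True iff tokens is a concatenation of complete programs joined by
    single SEP tokens (the language the FSM accepts). Assumes no program
    is a prefix of another, so the first full match is unambiguous."""
    i, expect_sep = 0, False
    while i < len(tokens):
        if expect_sep:
            if tokens[i] != SEP:
                return False
            i += 1
        # The next tokens must exactly match some program in full.
        program = next(
            (p for p in PROGRAMS.values() if tokens[i:i + len(p)] == p), None
        )
        if program is None:
            return False
        i += len(program)
        expect_sep = True
    return True
```

Any output of sample_sequence should satisfy fsm_valid, and generated continuations can be checked the same way once the prompt's FSM state is established.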

What would be nice upstream

  • A generic GrammarMEDSGenerator class (a possible shape is sketched after this list) parameterized over:
    • Program definitions (name → token sequence)
    • SEP token, vocab mapping
    • Number of subjects per split
    • Inter-event spacing (fixed 1-hour spacing by default, or an arbitrary timedelta function)
    • Task-label placement strategy (boundary-aligned, midpoint, fixed-offset, etc.)
  • Returning a MEDSDataset directly that users can .write() or compose with other testing helpers.
  • Shared FSM walker for validating generated token sequences post-generation.
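A hypothetical shape for that upstream API, with every name, default, and parameter below being an assumption rather than anything that exists in this library today:

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable


@dataclass
class GrammarMEDSGenerator:
    """Hypothetical upstream generator; the fields mirror the list above."""
    programs: dict[str, list[str]]                # name -> token sequence
    sep_token: str = "SEP"
    token_to_code: dict[str, str] | None = None   # vocab mapping; derived if None
    n_subjects: dict[str, int] = field(
        default_factory=lambda: {"train": 8, "tuning": 2, "held_out": 2}
    )
    # Inter-event spacing as a function of event index.
    spacing: Callable[[int], timedelta] = lambda i: timedelta(hours=1)
    label_placement: str = "midpoint"  # or "boundary", "fixed_offset", ...

    def build(self, *, seed: int, max_len: int):
        """Would return a MEDSDataset that callers can .write(out_dir)
        or compose with other testing helpers (not implemented here)."""
        raise NotImplementedError("sketch only")
```

Returning the MEDSDataset rather than writing it directly keeps the generator composable, matching the second bullet above.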

The PR #106 code is ~330 lines and mostly reusable; porting it would mainly mean lifting the specific PROGRAMS grammar into a constructor argument.

No action needed from me; just leaving this pointer so the pattern is discoverable when someone else runs into the same testing problem.
