Consider adding a grammar-based synthetic-dataset generator for end-to-end content-correctness testing #40

@mmcdermott

Description

Filing as a follow-up pointer, not requesting anything immediately — referenced by mmcdermott/MEDS_EIC_AR#106.

Context

PR mmcdermott/MEDS_EIC_AR#106 adds a grammar-based synthetic MEDS dataset for end-to-end content-correctness testing of an autoregressive-generation CLI. The grammar is a small FSM over tokenized "programs" separated by SEP, with a reverse-mapping layer that serializes each FSM-valid sequence as a MEDS-format subject timeline (one event per token, 1-hour spacing). The MEDSDataset class from this library is used to write the resulting dataset.
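To make the pattern concrete, here is a minimal sketch of such a grammar layer. The program names, token contents, and sampling policy below are all illustrative assumptions; only the sample_sequence(rng, max_len) signature comes from the PR description.

```python
import random

# Illustrative grammar: these names and token contents are assumptions,
# not the actual PROGRAMS from PR #106.
SEP = "SEP"
PROGRAMS = {
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1"],
}


def sample_sequence(rng: random.Random, max_len: int) -> list[str]:
    """Sample a grammar-valid sequence: complete programs joined by single
    SEP tokens, never exceeding max_len tokens."""
    tokens: list[str] = []
    while True:
        program = PROGRAMS[rng.choice(sorted(PROGRAMS))]
        extra = len(program) + (1 if tokens else 0)  # +1 for the SEP
        if len(tokens) + extra > max_len:
            return tokens
        if tokens:
            tokens.append(SEP)
        tokens.extend(program)
```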

The useful generalization: any autoregressive-generation test suite wants an end-to-end fixture where (a) the training distribution is fully specified and grammar-checkable, (b) the fixture round-trips cleanly through MEDS preprocessing, and (c) the generated output can be validated token-by-token against the FSM. Today every project writes its own variant of this from scratch.

What the current implementation looks like

Two-layer structure (in PR #106):

  1. Grammar layer (pure Python, no dependencies): a small FSM with named "programs," each a fixed-length token sequence, separated by a SEP token. sample_sequence(rng, max_len) produces a grammar-valid sequence.
  2. MEDS layer: a TOKEN_TO_CODE dict (e.g. SEP → "GRAMMAR//SEP", program_A[i] → "GRAMMAR//A//{i}"), plus build_grammar_meds_dataset(out_dir, task_labels_dir, *, n_train, n_tuning, n_held_out, seed, max_len) that (see the serialization sketch after this list):
    • Samples a grammar-valid sequence per subject
    • Emits events at fixed 1-hour spacing
    • Places a task label's prediction_time strictly between the SEP event and the next event (avoiding an off-by-one from side="right" search_sorted; see MEDS_EIC_AR#111)
    • Constructs a MEDSDataset with proper data_shards, code_metadata, subject_splits, and task_labels, and calls .write(out_dir)
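Continuing the illustrative grammar sketched above, the serialization step might look like the following. The epoch start, the 30-minute label offset, and the serialize_subject helper are assumptions; the event and label field names follow the MEDS schema (subject_id, time, code, numeric_value; prediction_time, boolean_value), and the surrounding MEDSDataset construction is elided because its exact constructor call isn't reproduced in this issue.

```python
from datetime import datetime, timedelta

# Reverse mapping from grammar tokens to MEDS codes, mirroring the
# TOKEN_TO_CODE idea described above (exact codes in the PR may differ).
TOKEN_TO_CODE = {SEP: "GRAMMAR//SEP"}
for name, toks in PROGRAMS.items():
    for i, tok in enumerate(toks):
        TOKEN_TO_CODE[tok] = f"GRAMMAR//{name}//{i}"

EPOCH = datetime(2000, 1, 1)  # arbitrary illustrative start time


def serialize_subject(subject_id: int, tokens: list[str]):
    """One MEDS event per token at fixed 1-hour spacing; each task label's
    prediction_time lands strictly between a SEP event and the next event
    (here 30 minutes after the SEP, i.e. mid-gap)."""
    events, labels = [], []
    for i, tok in enumerate(tokens):
        t = EPOCH + timedelta(hours=i)
        events.append(
            {"subject_id": subject_id, "time": t,
             "code": TOKEN_TO_CODE[tok], "numeric_value": None}
        )
        if tok == SEP:
            # One label per SEP boundary here; the PR may emit only one
            # per subject. The True value is a placeholder.
            labels.append(
                {"subject_id": subject_id,
                 "prediction_time": t + timedelta(minutes=30),
                 "boolean_value": True}
            )
    return events, labels
```

Placing the prediction time strictly inside the gap means no event timestamp ever ties with it, which is what sidesteps the side="right" search_sorted boundary ambiguity noted above.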

There is also an FSM walker, prompt_grammar_tokens_by_subject, which extracts per-subject prompt grammar tokens (for establishing FSM state before evaluating generated continuations).
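For validating sequences against the grammar, a walker can simply replay tokens through the FSM. The sketch below is a hypothetical whole-sequence validator over the same assumed grammar, not the prompt_grammar_tokens_by_subject code from the PR.

```python
def fsm_valid(tokens: list[str]) -> bool:
    """True iff tokens is a concatenation of complete programs joined by
    single SEP tokens (the language the FSM accepts). Assumes no program
    is a prefix of another, so the first full match is unambiguous."""
    i, expect_sep = 0, False
    while i < len(tokens):
        if expect_sep:
            if tokens[i] != SEP:
                return False
            i += 1
        # The next tokens must exactly match some program in full.
        program = next(
            (p for p in PROGRAMS.values() if tokens[i:i + len(p)] == p), None
        )
        if program is None:
            return False
        i += len(program)
        expect_sep = True
    return True
```

Any output of sample_sequence should satisfy fsm_valid, and generated continuations can be checked the same way once the prompt's FSM state is established.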

What would be nice upstream

  • A generic GrammarMEDSGenerator class (a possible shape is sketched after this list) parameterized over:
    • Program definitions (name → token sequence)
    • SEP token, vocab mapping
    • Number of subjects per split
    • Inter-event spacing (fixed 1-hour spacing by default, or an arbitrary timedelta function)
    • Task-label placement strategy (boundary-aligned, midpoint, fixed-offset, etc.)
  • Returning a MEDSDataset directly that users can .write() or compose with other testing helpers.
  • Shared FSM walker for validating generated token sequences post-generation.
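A hypothetical shape for that upstream API, with every name, default, and parameter below being an assumption rather than anything that exists in this library today:

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable


@dataclass
class GrammarMEDSGenerator:
    """Hypothetical upstream generator; the fields mirror the list above."""
    programs: dict[str, list[str]]                # name -> token sequence
    sep_token: str = "SEP"
    token_to_code: dict[str, str] | None = None   # vocab mapping; derived if None
    n_subjects: dict[str, int] = field(
        default_factory=lambda: {"train": 8, "tuning": 2, "held_out": 2}
    )
    # Inter-event spacing as a function of event index.
    spacing: Callable[[int], timedelta] = lambda i: timedelta(hours=1)
    label_placement: str = "midpoint"  # or "boundary", "fixed_offset", ...

    def build(self, *, seed: int, max_len: int):
        """Would return a MEDSDataset that callers can .write(out_dir)
        or compose with other testing helpers (not implemented here)."""
        raise NotImplementedError("sketch only")
```

Returning the MEDSDataset rather than writing it directly keeps the generator composable, matching the second bullet above.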

The PR #106 code is ~330 lines and mostly reusable; porting it would mainly mean lifting the specific PROGRAMS grammar into a constructor argument.

No action needed from me; just leaving this pointer so the pattern is discoverable when someone else runs into the same testing problem.
