Filing as a follow-up pointer, not requesting anything immediately — referenced by mmcdermott/MEDS_EIC_AR#106.
Context
PR mmcdermott/MEDS_EIC_AR#106 adds a grammar-based synthetic MEDS dataset for end-to-end content-correctness testing of an autoregressive-generation CLI. The grammar is a small FSM over tokenized "programs" separated by SEP, with a reverse-mapping layer that serializes each FSM-valid sequence as a MEDS-format subject timeline (one event per token, 1-hour spacing). The MEDSDataset class from this library is used to write the resulting dataset.
The useful generalization: any autoregressive-generation test suite wants an end-to-end fixture where (a) the training distribution is fully specified and grammar-checkable, (b) the fixture round-trips cleanly through MEDS preprocessing, and (c) the generated output can be validated token-by-token against the FSM. Today every project writes its own variant of this from scratch.
What the current implementation looks like
Two-layer structure (in PR #106):
- Grammar layer (pure Python, no dependencies): a small FSM with named "programs," each a fixed-length token sequence, separated by a SEP token. sample_sequence(rng, max_len) produces a grammar-valid sequence (a rough sketch follows this list).
- MEDS layer: a TOKEN_TO_CODE dict (e.g. SEP → "GRAMMAR//SEP", program_A[i] → "GRAMMAR//A//{i}"), plus build_grammar_meds_dataset(out_dir, task_labels_dir, *, n_train, n_tuning, n_held_out, seed, max_len) that:
- Samples a grammar-valid sequence per subject
- Emits events at fixed 1-hour spacing
- Places a task label's prediction_time strictly between the SEP event and the next event (avoids an off-by-one from side="right" search_sorted — see MEDS_EIC_AR#111)
- Constructs a MEDSDataset with proper data_shards, code_metadata, subject_splits, task_labels and calls .write(out_dir)
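For concreteness, here is a minimal sketch of the shape of those two layers. Only SEP, TOKEN_TO_CODE, sample_sequence, the "GRAMMAR//..." code scheme, and the 1-hour spacing come from the PR as described above; the concrete program contents and the to_meds_rows helper are illustrative, not the PR's code.

```python
# Illustrative sketch of the two-layer idea (not the PR #106 code verbatim).
import random
from datetime import datetime, timedelta

SEP = "SEP"
PROGRAMS = {  # name -> fixed-length token sequence (contents are made up here)
    "A": ["A0", "A1", "A2"],
    "B": ["B0", "B1"],
}

# Reverse-mapping layer: every grammar token gets a MEDS code.
TOKEN_TO_CODE = {SEP: "GRAMMAR//SEP"}
for name, tokens in PROGRAMS.items():
    for i, tok in enumerate(tokens):
        TOKEN_TO_CODE[tok] = f"GRAMMAR//{name}//{i}"


def sample_sequence(rng: random.Random, max_len: int) -> list[str]:
    """Sample a grammar-valid sequence: whole programs, each followed by SEP."""
    seq: list[str] = []
    while True:
        prog = PROGRAMS[rng.choice(sorted(PROGRAMS))]
        if len(seq) + len(prog) + 1 > max_len:
            return seq
        seq.extend(prog)
        seq.append(SEP)


def to_meds_rows(subject_id: int, seq: list[str], start: datetime) -> list[dict]:
    """Serialize one sequence as MEDS rows: one event per token, 1-hour spacing."""
    return [
        {"subject_id": subject_id, "time": start + timedelta(hours=i), "code": TOKEN_TO_CODE[tok]}
        for i, tok in enumerate(seq)
    ]
```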
There is also an FSM walker, prompt_grammar_tokens_by_subject, that extracts per-subject prompt grammar tokens (used to establish FSM state before evaluating generated continuations).
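A hedged sketch of the walker idea: the PR's prompt_grammar_tokens_by_subject yields the per-subject prompt tokens; something like the step/validate_continuation functions below (hypothetical names, assuming the PROGRAMS/SEP definitions from the previous sketch and distinct program start tokens) then replays the prompt to establish FSM state and checks a generated continuation token-by-token.

```python
State = tuple[str | None, int]  # (program in progress or None, next expected index)
START: State = (None, 0)


def step(state: State, token: str) -> State | None:
    """Advance the FSM by one token; return the new state, or None if the token is invalid."""
    prog, idx = state
    if prog is not None and idx < len(PROGRAMS[prog]):
        # Mid-program: the next token is fully determined.
        return (prog, idx + 1) if token == PROGRAMS[prog][idx] else None
    if prog is not None:
        # Program finished: only SEP is legal next.
        return START if token == SEP else None
    # At a program boundary: any program's first token is legal.
    for name, tokens in PROGRAMS.items():
        if token == tokens[0]:
            return (name, 1)
    return None


def validate_continuation(prompt_tokens: list[str], generated_tokens: list[str]) -> bool:
    """Walk the prompt to establish FSM state, then check the generated continuation."""
    state: State | None = START
    for tok in [*prompt_tokens, *generated_tokens]:
        state = step(state, tok)
        if state is None:
            return False
    return True
```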
What would be nice upstream
- A generic GrammarMEDSGenerator class parameterized over (a possible interface is sketched after this list):
- Program definitions (name → token sequence)
- SEP token, vocab mapping
- Number of subjects per split
- Events-per-hour spacing (or arbitrary timedelta function)
- Task-label placement strategy (boundary-aligned, midpoint, fixed-offset, etc.)
- Returning a MEDSDataset directly that users can .write() or compose with other testing helpers.
- Shared FSM walker for validating generated token sequences post-generation.
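A rough interface sketch, purely illustrative: the class name matches the suggestion above, but every parameter name, default value, and the elided generate() body are assumptions rather than a worked-out proposal. The MEDSDataset construction is omitted since its exact signature lives in this library.

```python
from collections.abc import Callable
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class GrammarMEDSGenerator:
    programs: dict[str, list[str]]                # name -> fixed-length token sequence
    sep_token: str = "SEP"
    token_to_code: dict[str, str] | None = None   # default: a "GRAMMAR//{name}//{i}"-style mapping
    n_subjects: dict[str, int] = field(
        default_factory=lambda: {"train": 8, "tuning": 2, "held_out": 2}
    )
    event_spacing: timedelta | Callable[[int], timedelta] = timedelta(hours=1)
    label_placement: str = "boundary_midpoint"    # or "fixed_offset", etc.
    seed: int = 0
    max_len: int = 64

    def generate(self):  # -> MEDSDataset
        """Sample one grammar-valid sequence per subject, serialize to MEDS events,
        place task labels per `label_placement`, and return a MEDSDataset that the
        caller can .write() or compose with other testing helpers."""
        raise NotImplementedError  # sketch only
```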
The PR #106 code is ~330 lines and mostly reusable; it could be ported with minor generalization to lift the specific PROGRAMS grammar into a constructor argument.
No action needed from me; just leaving this pointer so the pattern is discoverable when someone else runs into the same testing problem.