M4 supports any PhysioNet dataset. This guide shows how to add your own.
Create a JSON file in m4_data/datasets/:
Example: m4_data/datasets/mimic-iv-ed.json
{
  "name": "mimic-iv-ed",
  "description": "MIMIC-IV Emergency Department Module",
  "file_listing_url": "https://physionet.org/files/mimic-iv-ed/2.2/",
  "subdirectories_to_scan": ["ed"],
  "primary_verification_table": "mimiciv_ed.edstays",
  "requires_authentication": true,
  "bigquery_project_id": "physionet-data",
  "bigquery_dataset_ids": ["mimiciv_ed"],
  "modalities": ["TABULAR"],
  "schema_mapping": {"ed": "mimiciv_ed"},
  "bigquery_schema_mapping": {"mimiciv_ed": "mimiciv_ed"}
}

Then initialize:
m4 init mimic-iv-ed --src /path/to/your/csv/files

| Field | Required | Description |
|---|---|---|
| name | Yes | Unique identifier (used in m4 use <name>) |
| description | Yes | Human-readable description |
| file_listing_url | No | PhysioNet URL for auto-download (demo datasets only) |
| subdirectories_to_scan | No | Subdirs containing CSV files (e.g., ["hosp", "icu"]) |
| primary_verification_table | Yes | Table to verify initialization succeeded |
| requires_authentication | No | true if PhysioNet credentialing required |
| bigquery_project_id | No | GCP project for BigQuery access |
| bigquery_dataset_ids | No | BigQuery dataset IDs |
| modalities | No | Data types in this dataset (see below). Defaults to ["TABULAR"] |
| schema_mapping | No | Maps filesystem subdirectories to canonical schema names (see below) |
| bigquery_schema_mapping | No | Maps canonical schema names to BigQuery dataset IDs (see below) |
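A definition file can be sanity-checked against the Required column before running m4 init. The sketch below is a stand-alone helper and not part of M4: `validate_definition` and `REQUIRED_FIELDS` are hypothetical names, and the only rule it encodes is the set of three fields marked Yes in the table.

```python
import json

# Fields marked "Yes" in the Required column of the field table.
REQUIRED_FIELDS = ("name", "description", "primary_verification_table")


def validate_definition(text: str) -> list[str]:
    """Return a list of problems found in a dataset definition (hypothetical helper)."""
    try:
        definition = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(definition, dict):
        return ["top-level value must be a JSON object"]
    # Report every required field that is absent or empty.
    return [
        f"missing required field: {field}"
        for field in REQUIRED_FIELDS
        if not definition.get(field)
    ]


example = '{"name": "mimic-iv-ed", "description": "MIMIC-IV Emergency Department Module"}'
print(validate_definition(example))
# → ['missing required field: primary_verification_table']
```

An empty list means the three required fields are present; M4 itself may apply further checks that this sketch does not model.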
| Modality | Description | Available Tools |
|---|---|---|
| TABULAR | Structured tables (labs, demographics, vitals, etc.) | get_database_schema, get_table_info, execute_query |
| NOTES | Clinical notes and discharge summaries | search_notes, get_note, list_patient_notes |
Tools are filtered based on the dataset's declared modalities. If modalities is not specified, it defaults to ["TABULAR"].
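The filtering rule can be illustrated with a small lookup. This is a sketch of the behavior described here, not M4's internal code: `TOOLS_BY_MODALITY` and `available_tools` are hypothetical names, and the tool lists are copied from the modality table.

```python
# Tool names per modality, copied from the modality table.
TOOLS_BY_MODALITY = {
    "TABULAR": ["get_database_schema", "get_table_info", "execute_query"],
    "NOTES": ["search_notes", "get_note", "list_patient_notes"],
}


def available_tools(definition: dict) -> list[str]:
    """List the tools exposed for a dataset definition (illustrative only)."""
    # A definition without "modalities" falls back to ["TABULAR"].
    modalities = definition.get("modalities") or ["TABULAR"]
    tools = []
    for modality in modalities:
        tools.extend(TOOLS_BY_MODALITY.get(modality, []))
    return tools


print(available_tools({"name": "mimic-iv-ed"}))
# → ['get_database_schema', 'get_table_info', 'execute_query']
print(available_tools({"name": "notes-only", "modalities": ["NOTES"]}))
# → ['search_notes', 'get_note', 'list_patient_notes']
```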
M4 uses canonical schema.table names (e.g., mimiciv_hosp.patients) that work identically on both DuckDB and BigQuery backends. The schema_mapping and bigquery_schema_mapping fields control how these canonical names are constructed.
schema_mapping maps filesystem subdirectories to canonical schema names. When DuckDB creates views, files from each subdirectory are placed into the corresponding schema:
{
  "schema_mapping": {
    "hosp": "mimiciv_hosp",
    "icu": "mimiciv_icu"
  }
}

With this mapping, a file at hosp/patients.csv becomes queryable as mimiciv_hosp.patients.
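The path-to-name translation can be sketched as a small function. This is an illustration of the mapping rule only: `canonical_table_name` is a hypothetical helper, and returning a bare table name for an unmapped subdirectory is an assumption of this sketch.

```python
from pathlib import PurePosixPath


def canonical_table_name(relative_path: str, schema_mapping: dict) -> str:
    """Map a file path (relative to the dataset root) to schema.table (illustrative)."""
    path = PurePosixPath(relative_path)
    # Subdirectory relative to the dataset root; "" when the file sits in the root.
    subdir = "" if str(path.parent) == "." else str(path.parent)
    # Drop every extension: patients.csv.gz and patients.csv both give "patients".
    table = path.name.split(".", 1)[0]
    schema = schema_mapping.get(subdir)
    # Assumption: files in unmapped subdirectories keep a flat table name.
    return f"{schema}.{table}" if schema else table


mapping = {"hosp": "mimiciv_hosp", "icu": "mimiciv_icu"}
print(canonical_table_name("hosp/patients.csv", mapping))    # → mimiciv_hosp.patients
print(canonical_table_name("icu/icustays.csv.gz", mapping))  # → mimiciv_icu.icustays
```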
For datasets where all files are in the root directory (no subdirectories), use an empty string key:
{
  "schema_mapping": {
    "": "eicu_crd"
  }
}

bigquery_schema_mapping maps canonical schema names to BigQuery dataset IDs. This allows the BigQuery backend to translate canonical names to the actual GCP dataset names:
{
  "bigquery_schema_mapping": {
    "mimiciv_hosp": "mimiciv_hosp",
    "mimiciv_icu": "mimiciv_icu"
  }
}

With this, a query for mimiciv_hosp.patients is rewritten to physionet-data.mimiciv_hosp.patients on BigQuery.
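As a rough sketch, the translation of a single canonical table name looks like this. `to_bigquery_name` is a hypothetical helper; how M4 rewrites complete SQL queries is not shown in this guide, so the example only handles one schema.table reference.

```python
def to_bigquery_name(canonical: str, project_id: str, bigquery_schema_mapping: dict) -> str:
    """Translate a canonical schema.table name to project.dataset.table (illustrative)."""
    schema, table = canonical.split(".", 1)
    # Raises KeyError if the canonical schema has no BigQuery dataset mapped to it.
    dataset_id = bigquery_schema_mapping[schema]
    return f"{project_id}.{dataset_id}.{table}"


bq_mapping = {"mimiciv_hosp": "mimiciv_hosp", "mimiciv_icu": "mimiciv_icu"}
print(to_bigquery_name("mimiciv_hosp.patients", "physionet-data", bq_mapping))
# → physionet-data.mimiciv_hosp.patients
```

The project ID here plays the role of bigquery_project_id in the dataset definition.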
Custom datasets without schema_mapping still work: tables are created with flat names in the main schema (backward-compatible behavior).
When you run m4 init <dataset>:
- Download (if file_listing_url exists and files are missing)
- Convert CSV.gz files to Parquet format
- Create DuckDB views over the Parquet files
- Verify by querying primary_verification_table
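The convert, view, and verify steps correspond roughly to the DuckDB SQL below. This is a sketch under assumptions, not M4's actual implementation: `init_statements` is a hypothetical helper that only builds the statements as strings, using DuckDB's read_csv_auto and read_parquet functions and its COPY ... (FORMAT PARQUET) form. The download step is left out, and the row count stands in for whatever verification query M4 runs.

```python
def init_statements(csv_path: str, parquet_path: str, schema: str, table: str) -> list[str]:
    """Build DuckDB SQL for one table: convert, create view, verify (illustrative)."""
    return [
        # Convert: read the (optionally gzipped) CSV and write it out as Parquet.
        f"COPY (SELECT * FROM read_csv_auto('{csv_path}')) TO '{parquet_path}' (FORMAT PARQUET);",
        # Make sure the target schema exists before creating the view.
        f"CREATE SCHEMA IF NOT EXISTS {schema};",
        # View: expose the Parquet file under its table name.
        f"CREATE OR REPLACE VIEW {schema}.{table} AS SELECT * FROM read_parquet('{parquet_path}');",
        # Verify: a simple query against the table.
        f"SELECT COUNT(*) FROM {schema}.{table};",
    ]


for statement in init_statements(
    "m4_data/raw_files/my-dataset/patients.csv.gz",
    "m4_data/parquet/my-dataset/patients.parquet",
    "main",
    "patients",
):
    print(statement)
```

Paths are interpolated directly into the SQL for readability; real code should not do this with untrusted input.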
M4 organizes data like this:
m4_data/
├── datasets/          # Custom JSON definitions
│   └── my-dataset.json
├── raw_files/         # Downloaded CSV.gz files
│   └── my-dataset/
│       └── *.csv.gz
├── parquet/           # Converted Parquet files
│   └── my-dataset/
│       └── *.parquet
└── databases/         # DuckDB databases
    └── my_dataset.duckdb
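The layout above can be expressed as path arithmetic, which is handy when scripting around M4. `dataset_paths` is a hypothetical helper; the hyphen-to-underscore rule for the database filename is inferred from the single my-dataset / my_dataset.duckdb example in the tree and should be treated as an assumption.

```python
from pathlib import Path


def dataset_paths(name: str, root: Path = Path("m4_data")) -> dict:
    """Return the expected locations for a dataset under the M4 data root (illustrative)."""
    return {
        "definition": root / "datasets" / f"{name}.json",
        "raw_files": root / "raw_files" / name,
        "parquet": root / "parquet" / name,
        # Assumption: hyphens become underscores in the database filename.
        "database": root / "databases" / f"{name.replace('-', '_')}.duckdb",
    }


for label, path in dataset_paths("my-dataset").items():
    print(f"{label}: {path.as_posix()}")
```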
If you already have CSV files (either .csv or .csv.gz), point to them with --src:
m4 init my-dataset --src /path/to/csvs

M4 will:
- Convert CSV/CSV.gz files to Parquet format
- Create DuckDB views
- Set the dataset as active
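Before pointing --src at a directory, it can help to list which files would qualify. The sketch below is a stand-alone check, not M4 code: `find_csv_files` is a hypothetical helper, and searching subdirectories recursively is an assumption of this example.

```python
import tempfile
from pathlib import Path


def find_csv_files(src: Path) -> list[Path]:
    """Collect .csv and .csv.gz files under src, sorted for stable output (illustrative)."""
    return sorted(
        p
        for p in src.rglob("*")
        if p.is_file() and (p.name.endswith(".csv") or p.name.endswith(".csv.gz"))
    )


# Demo on a throwaway directory with one matching and one non-matching file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "hosp").mkdir()
    (root / "hosp" / "patients.csv.gz").touch()
    (root / "README.txt").touch()
    print([p.relative_to(root).as_posix() for p in find_csv_files(root)])
    # → ['hosp/patients.csv.gz']
```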
For datasets requiring PhysioNet credentials (most full datasets):
- Get credentialed access on PhysioNet
- Download manually using wget:
wget -r -N -c -np --user YOUR_USERNAME --ask-password \
  https://physionet.org/files/dataset-name/version/ \
  -P m4_data/raw_files/dataset-name
- Initialize:
m4 init dataset-name
For more control, register datasets in Python:
from m4.core.datasets import DatasetDefinition, DatasetRegistry, Modality

my_dataset = DatasetDefinition(
    name="my-custom-dataset",
    description="My custom clinical dataset",
    primary_verification_table="patients",
    modalities=frozenset({Modality.TABULAR}),
)
DatasetRegistry.register(my_dataset)

- Start with demo data: Test your setup with mimic-iv-demo first
- Check table names: Use the get_database_schema tool to see available tables
- Verify initialization: m4 status shows if Parquet and DuckDB are ready
- Force reinitialize: m4 init <dataset> --force recreates the database