dfci/matchminer-ai-training


Code for training the MatchMiner-AI pipeline. If you have access to a Linux machine with 8x H100 GPUs and about a week to spare, you can replicate training as follows:

1. Make sure your machine can compile CUDA code:

sudo apt update
sudo apt install build-essential python3-dev nvidia-cuda-toolkit
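Before continuing, it may be worth confirming that the CUDA compiler is actually on your PATH. This quick check is not part of the original instructions, just a suggested sanity test:

```shell
# Hypothetical sanity check: confirm nvcc is available.
# If it is missing, CUDA extensions will fail to build later.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found; install nvidia-cuda-toolkit first"
fi
```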

2. Install uv (https://docs.astral.sh/uv/getting-started/installation/)

3. Make and activate a venv:

uv venv mmai --python=3.12
source mmai/bin/activate

4. Pull this code and install dependencies:

git clone https://github.com/kenlkehl/matchminer-ai-training
cd matchminer-ai-training
uv pip install -r requirements.txt

5. Run the train_all.sh script:

bash train_all.sh

Note: Training makes heavy use of multiple vllm instances for efficient parallelized inference. This sometimes triggers race-condition errors at compile time. We try to mitigate this by pre-compiling at the beginning of the script, but errors may still occur during training and require restarting the script from the last completed step.
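The README does not specify how train_all.sh tracks completed steps, but a common shell pattern for making a multi-step script safe to restart is a per-step marker file. The sketch below is illustrative only; the step names and marker scheme are hypothetical, not taken from the actual script:

```shell
#!/usr/bin/env bash
# Sketch of a restartable multi-step driver (illustrative only;
# the real train_all.sh may track progress differently).
set -euo pipefail

run_step() {
    local name="$1"; shift
    local marker=".done_${name}"
    if [ -f "$marker" ]; then
        echo "Skipping ${name}: already completed."
    else
        "$@"              # run the step's command
        touch "$marker"   # record completion so a rerun skips it
    fi
}

# Hypothetical step names standing in for the real training stages.
run_step prepare_data echo "preparing data"
run_step train_model  echo "training model"
```

If the script crashes partway through, rerunning it skips every step whose marker file already exists and resumes at the first incomplete one.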

Also note: At the time of release (December 2025), gpt-oss-120b produced gibberish output when run with vllm across more than one RTX PRO 6000 GPU. Inference on a single RTX PRO 6000 seemed to work well, though.

Original framework and logic were implemented manually; parallelization was vibe-coded with Gemini 2.5 Pro and Claude 4.5 Sonnet.
