checkpoints reproduce at 0.032 mean AUPRC vs 0.437 published (13.7x gap) #29
Open
Description
I reproduced the evaluation using the exact protocol from reproduce/train.py:
TxData_obj = TxData(data_folder_path='./data')
TxData_obj.prepare_split(split=area, seed=seed, no_kg=False)
model = TxGNN(data=TxData_obj, weight_bias_track=False, proj_name='TxGNN_Baselines', exp_name=name, device='cuda:0')
model.load_pretrained(ckpt_path) # from checkpoints_all_seeds.zip
evaluator = TxEval(model=model)
result = evaluator.eval_disease_centric(disease_idxs='test_set', show_plot=False, verbose=True, save_result=True, return_raw=False, save_name=save_name)
Results across all 5 paper areas × 5 seeds:
┌────────────────────┬───────┬────────────┬───────────┬───────────────────────────────────┐
│ Area │ Seeds │ AUPRC Mean │ AUPRC Std │ Per-seed values │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ adrenal_gland │ 5 │ 0.054 │ 0.078 │ 0.210, 0.012, 0.017, 0.013, 0.018 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ anemia │ 5 │ 0.023 │ 0.019 │ 0.012, 0.010, 0.060, 0.026, 0.009 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ cardiovascular │ 5 │ 0.026 │ 0.007 │ 0.032, 0.025, 0.020, 0.036, 0.016 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ cell_proliferation │ 5 │ 0.031 │ 0.006 │ 0.035, 0.038, 0.024, 0.036, 0.023 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ mental_health │ 5 │ 0.024 │ 0.006 │ 0.036, 0.023, 0.019, 0.024, 0.021 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ Mean │ │ 0.032 │ │ │
└────────────────────┴───────┴────────────┴───────────┴───────────────────────────────────┘
Published paper mean: 0.437. The gap is not the 0.1–0.2 originally reported; it is 0.405 absolute (13.7x).
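For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the aggregate numbers from the per-seed values in the table. It uses only the rounded values shown above, so small rounding differences against the table are possible. The use of the population standard deviation (`pstdev`) is an assumption; it appears consistent with the Std column for the rows I checked by hand.

```python
from statistics import mean, pstdev

# Per-seed AUPRC values copied from the table (already rounded to 3 decimals).
per_seed = {
    "adrenal_gland": [0.210, 0.012, 0.017, 0.013, 0.018],
    "anemia": [0.012, 0.010, 0.060, 0.026, 0.009],
    "cardiovascular": [0.032, 0.025, 0.020, 0.036, 0.016],
    "cell_proliferation": [0.035, 0.038, 0.024, 0.036, 0.023],
    "mental_health": [0.036, 0.023, 0.019, 0.024, 0.021],
}
PUBLISHED_MEAN = 0.437  # mean AUPRC quoted from the paper

# Per-area mean and population standard deviation across seeds.
area_means = {area: mean(values) for area, values in per_seed.items()}
area_stds = {area: pstdev(values) for area, values in per_seed.items()}

# Overall mean is the unweighted mean of the per-area means.
overall = mean(area_means.values())
abs_gap = PUBLISHED_MEAN - overall
ratio = PUBLISHED_MEAN / overall

for area in per_seed:
    print(f"{area:20s} mean={area_means[area]:.3f} std={area_stds[area]:.3f}")
print(f"overall mean={overall:.3f} abs gap={abs_gap:.3f} ratio={ratio:.1f}x")
```

Note that adrenal_gland seed 1 (0.210) is the only run above 0.06, which is why that area has a much larger spread than the others.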
Setup
- TxGNN pip package, NVIDIA A5000 GPU
- checkpoints_all_seeds.zip from the reproduce directory
- config.pkl files are consistent across seeds (n_hid=100, proto=True, sim_measure='all_nodes_profile')
- Model weights are non-zero and vary across seeds
- load_pretrained() correctly calls model_initialize(**config) internally
The issue appears to lie in the checkpoint weights themselves, not in the loading/eval code. Could the authors confirm whether these are the final trained checkpoints? #23 also notes that save/reload gives different results than evaluating directly after training, which may be related.
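One way to test the save/reload hypothesis from #23 would be to snapshot the parameters right after training and again after reloading, then compare them by name. Below is a rough sketch of such a comparison in plain Python. The helper `diff_params` and the toy snapshots are hypothetical and not part of TxGNN. With PyTorch, a snapshot could be built from a module's `state_dict()` by flattening each tensor to a list, but how to reach the underlying module from the TxGNN wrapper is something I have not verified.

```python
import math


def diff_params(before, after, tol=1e-6):
    """Return sorted names of parameters that are missing on one side,
    differ in length, or differ elementwise by more than tol.

    before / after: dict mapping parameter name -> flat list of floats.
    (Hypothetical helper for illustration, not a TxGNN function.)
    """
    mismatched = []
    for name in sorted(set(before) | set(after)):
        a, b = before.get(name), after.get(name)
        if a is None or b is None or len(a) != len(b):
            mismatched.append(name)
        elif any(
            not math.isclose(x, y, rel_tol=0.0, abs_tol=tol)
            for x, y in zip(a, b)
        ):
            mismatched.append(name)
    return mismatched


# Toy snapshots standing in for "after training" and "after reload".
trained = {"layer.weight": [0.5, -0.25], "layer.bias": [0.1]}
reloaded = {"layer.weight": [0.5, -0.25], "layer.bias": [0.0]}

print(diff_params(trained, reloaded))
```

An empty result would suggest the weights survive the round trip, which would point away from serialization and toward the checkpoints or something else in the evaluation state.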