
checkpoints reproduce at 0.032 mean AUPRC vs 0.437 published (13.7x gap) #29

@monk1337

Description


I reproduced the evaluation using the exact protocol from reproduce/train.py:

TxData_obj = TxData(data_folder_path='./data')
TxData_obj.prepare_split(split=area, seed=seed, no_kg=False)
model = TxGNN(data=TxData_obj, weight_bias_track=False,
              proj_name='TxGNN_Baselines', exp_name=name, device='cuda:0')
model.load_pretrained(ckpt_path)  # from checkpoints_all_seeds.zip
evaluator = TxEval(model=model)
result = evaluator.eval_disease_centric(disease_idxs='test_set', show_plot=False,
                                        verbose=True, save_result=True,
                                        return_raw=False, save_name=save_name)
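For context on the metric: the numbers below are disease-centric AUPRC, i.e. average precision over the ranked candidate list per disease. A from-scratch sketch of that computation (not TxEval's actual code, just the standard definition, matching sklearn's `average_precision_score`):

```python
def average_precision(y_true, y_score):
    """AUPRC as average precision: mean of precision@k over the ranks k
    at which true positives appear, candidates ranked by score."""
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    hits, prec_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / hits if hits else 0.0

# Toy example: 2 relevant drugs among 4 candidates.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.8333...
```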

Results across all 5 paper areas × 5 seeds:

┌────────────────────┬───────┬────────────┬───────────┬───────────────────────────────────┐
│        Area        │ Seeds │ AUPRC Mean │ AUPRC Std │          Per-seed values          │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ adrenal_gland      │ 5     │ 0.054      │ 0.078     │ 0.210, 0.012, 0.017, 0.013, 0.018 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ anemia             │ 5     │ 0.023      │ 0.019     │ 0.012, 0.010, 0.060, 0.026, 0.009 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ cardiovascular     │ 5     │ 0.026      │ 0.007     │ 0.032, 0.025, 0.020, 0.036, 0.016 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ cell_proliferation │ 5     │ 0.031      │ 0.006     │ 0.035, 0.038, 0.024, 0.036, 0.023 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ mental_health      │ 5     │ 0.024      │ 0.006     │ 0.036, 0.023, 0.019, 0.024, 0.021 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ Mean               │       │ 0.032      │           │                                   │
└────────────────────┴───────┴────────────┴───────────┴───────────────────────────────────┘

Published paper mean: 0.437. The gap is not the 0.1–0.2 originally reported; it is 0.405 absolute, a 13.7x ratio.
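As a sanity check on my own aggregation (not part of the protocol), the table's summary statistics can be re-derived from the per-seed values:

```python
# Per-seed AUPRC values copied from the table above.
per_seed = {
    'adrenal_gland':      [0.210, 0.012, 0.017, 0.013, 0.018],
    'anemia':             [0.012, 0.010, 0.060, 0.026, 0.009],
    'cardiovascular':     [0.032, 0.025, 0.020, 0.036, 0.016],
    'cell_proliferation': [0.035, 0.038, 0.024, 0.036, 0.023],
    'mental_health':      [0.036, 0.023, 0.019, 0.024, 0.021],
}

area_means = {a: sum(v) / len(v) for a, v in per_seed.items()}
overall = sum(area_means.values()) / len(area_means)
print(round(overall, 3))          # 0.032
print(round(0.437 / overall, 1))  # 13.7
```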

Setup

  • TxGNN pip package, NVIDIA A5000 GPU
  • checkpoints_all_seeds.zip from the reproduce directory
  • config.pkl files are consistent across seeds (n_hid=100, proto=True, sim_measure='all_nodes_profile')
  • Model weights are non-zero and vary across seeds
  • load_pretrained() correctly calls model_initialize(**config) internally

The issue appears to be with the checkpoint weights themselves, not with the loading/eval code. Could the authors confirm whether these are the final trained checkpoints? #23 also notes that saving and reloading a model gives different results than evaluating directly after training; that may be related.
