
checkpoints reproduce at 0.032 mean AUPRC vs 0.437 published (13.7x gap) #29

@monk1337

Description


I reproduced the evaluation using the exact protocol from reproduce/train.py:

TxData_obj = TxData(data_folder_path='./data')
TxData_obj.prepare_split(split=area, seed=seed, no_kg=False)
model = TxGNN(data=TxData_obj, weight_bias_track=False,
              proj_name='TxGNN_Baselines', exp_name=name, device='cuda:0')
model.load_pretrained(ckpt_path)  # from checkpoints_all_seeds.zip
evaluator = TxEval(model=model)
result = evaluator.eval_disease_centric(disease_idxs='test_set', show_plot=False,
                                        verbose=True, save_result=True,
                                        return_raw=False, save_name=save_name)
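For context on the metric: the numbers below are disease-centric AUPRC, i.e. average precision over the ranked candidate list per disease. A from-scratch sketch of that computation (not TxEval's actual code, just the standard definition, matching sklearn's `average_precision_score`):

```python
def average_precision(y_true, y_score):
    """AUPRC as average precision: mean of precision@k over the ranks k
    at which true positives appear, candidates ranked by score."""
    order = sorted(range(len(y_true)), key=lambda i: -y_score[i])
    hits, prec_sum = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / hits if hits else 0.0

# Toy example: 2 relevant drugs among 4 candidates.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 0.8333...
```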

Results across all 5 paper areas × 5 seeds:

┌────────────────────┬───────┬────────────┬───────────┬───────────────────────────────────┐
│        Area        │ Seeds │ AUPRC Mean │ AUPRC Std │          Per-seed values          │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ adrenal_gland      │ 5     │ 0.054      │ 0.078     │ 0.210, 0.012, 0.017, 0.013, 0.018 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ anemia             │ 5     │ 0.023      │ 0.019     │ 0.012, 0.010, 0.060, 0.026, 0.009 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ cardiovascular     │ 5     │ 0.026      │ 0.007     │ 0.032, 0.025, 0.020, 0.036, 0.016 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ cell_proliferation │ 5     │ 0.031      │ 0.006     │ 0.035, 0.038, 0.024, 0.036, 0.023 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ mental_health      │ 5     │ 0.024      │ 0.006     │ 0.036, 0.023, 0.019, 0.024, 0.021 │
├────────────────────┼───────┼────────────┼───────────┼───────────────────────────────────┤
│ Mean               │       │ 0.032      │           │                                   │
└────────────────────┴───────┴────────────┴───────────┴───────────────────────────────────┘

Published paper mean: 0.437. The gap is not the 0.1–0.2 originally reported; it is 0.405 absolute, a 13.7x ratio.
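As a sanity check on my own aggregation (not part of the protocol), the table's summary statistics can be re-derived from the per-seed values:

```python
# Per-seed AUPRC values copied from the table above.
per_seed = {
    'adrenal_gland':      [0.210, 0.012, 0.017, 0.013, 0.018],
    'anemia':             [0.012, 0.010, 0.060, 0.026, 0.009],
    'cardiovascular':     [0.032, 0.025, 0.020, 0.036, 0.016],
    'cell_proliferation': [0.035, 0.038, 0.024, 0.036, 0.023],
    'mental_health':      [0.036, 0.023, 0.019, 0.024, 0.021],
}

area_means = {a: sum(v) / len(v) for a, v in per_seed.items()}
overall = sum(area_means.values()) / len(area_means)
print(round(overall, 3))          # 0.032
print(round(0.437 / overall, 1))  # 13.7
```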

Setup

  • TxGNN pip package, NVIDIA A5000 GPU
  • checkpoints_all_seeds.zip from the reproduce directory
  • config.pkl files are consistent across seeds (n_hid=100, proto=True, sim_measure='all_nodes_profile')
  • Model weights are non-zero and vary across seeds
  • load_pretrained() correctly calls model_initialize(**config) internally

The issue appears to be with the checkpoint weights themselves, not with the loading/eval code. Could the authors confirm whether these are the final trained checkpoints? #23 also notes that saving and reloading a model gives different results than evaluating directly after training; that may be related.
