Conversation
This agrees with the Rasmussen and Williams book, and it also separates this evaluation metric from the NLL used when estimating hyperparameters in the BLR.
- MACE is now updated to work as in the paper Zamanzadeh et al. (2026).
- MACE is now averaged across each unique batch effect combination.
- Updated MACE
- Updated conclusion in 12_federated_learning
- Introduce shared evaluator test helpers and new MACE tests, and update the MSLL test to reuse the fixture.
- Add test_mace.py: tests for the MACE metric
Looks good, but here are a few suggestions.
Good point, I agree that having both makes it more complete. But then, from a user's perspective, won't having two MACEs be too much? I suggest keeping the official published one from Mosi and noting in the tutorial I am adding to the website that it is a known downside that small sites introduce a disproportionate amount of noise. We already have quite a few evaluation metrics, and I am afraid people will not try to understand all of them. In the end they might choose the ones that perform better to put in their paper, e.g. in this case they would usually choose the non-weighted MACE. @amarquand, what do you think about that?
Good idea, I will look into it
Yes, I made an issue to track that: #422
Hi All, I think I agree that we should follow the paper for the MACE and acknowledge the limitations in the documentation. It is just a metric after all. Another option could be to give users a way to manually compute a weighted version (e.g. via a keyword arg), but that might muddy the waters further.
Currently the metrics are computed automatically after running fit_predict, so no keyword arg can be specified for them. So I think this would indeed go against the workflow we have currently chosen for the toolkit and confuse users. I will keep MACE as it is in the paper and acknowledge the limitations in the documentation.
I agree that it's better to stick with the original MACE definition. Indeed, small sites may add noise, but that's the price we usually pay for lack of data. Another solution could be adding a new metric that uses the median instead of the mean when summarizing across groups. We could call it MEACE (MEdian Absolute Centile Error). Can we have this patch approved and released as soon as possible?
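Just to make the suggestion concrete, here is a hypothetical sketch; the per-group error values and variable names are purely illustrative and not part of the toolkit:

```python
import numpy as np

# Hypothetical absolute centile errors per batch-effect combination,
# where one small site happens to be noisy
group_ace = np.array([0.02, 0.03, 0.04, 0.15])

mace = float(np.mean(group_ace))     # current MACE: mean across groups, pulled up by the noisy group
meace = float(np.median(group_ace))  # proposed MEACE: median across groups, robust to the outlier
```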
For now I would not add more metrics.
We released 3 weeks ago and I was planning to release again in the summer. However, I can discuss with @amarquand doing a release in the next month.
- Warn that MACE can pick up noise from small groups
- Note that the metric's name was updated from NLL to MLL
- Update the MACE test to use the new helper function
- Update blr.py and evaluator.py to use the new helper function
@AuguB since I made changes to already existing evaluation metrics, could you review this?
What I did:
Following Rasmussen & Williams book I renamed
NLLtoMLLto better reflect what the metric computes. This also clarifies the relationship between MLL and MSLL (mean standardised log-loss), where:MSLL=MLL_model−MLL_baselineAlso the previous name
NLLcan create confusion with theNLLused internally for hyperparameter estimation in BLR, which is a different quantity.MACE remained as it is, the only difference is that it is now averaged across each combination of batch effect, as defined in Zamanzadeh et al. (2026).
E.g. before it was average(all subjects), which weights each subject equally. Now we do average(site1_male + site1_female + site2_male + site2_female), which weights each group equally regardless of size. The two approaches would be the same only if all combinations have the same number of subjects (which is not true in general).
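To make the difference concrete, here is a minimal sketch of the two aggregation strategies. The function names, array shapes, and the `batch_effects` layout are illustrative assumptions, not the toolkit's actual API:

```python
import numpy as np
import pandas as pd

def mace_pooled(y, centile_preds, centile_levels):
    """Old behaviour: empirical coverage computed over all subjects at once."""
    # centile_preds: (n_subjects, n_centiles) predicted centile curves evaluated per subject
    empirical = (y[:, None] <= centile_preds).mean(axis=0)      # observed coverage per centile
    return float(np.mean(np.abs(empirical - centile_levels)))   # average absolute error over centiles

def mace_per_group(y, centile_preds, centile_levels, batch_effects):
    """New behaviour: MACE per batch-effect combination (e.g. site x sex), then averaged across groups."""
    be = pd.DataFrame(batch_effects).reset_index(drop=True)
    group_maces = []
    for _, idx in be.groupby(list(be.columns)).indices.items():
        empirical = (y[idx, None] <= centile_preds[idx]).mean(axis=0)
        group_maces.append(np.mean(np.abs(empirical - centile_levels)))
    return float(np.mean(group_maces))   # every group counts equally, regardless of its size

# Toy usage: both agree when all groups are the same size, and diverge otherwise
rng = np.random.default_rng(0)
y = rng.normal(size=300)
levels = np.array([0.05, 0.25, 0.5, 0.75, 0.95])
preds = np.tile(np.quantile(y, levels), (y.size, 1))
groups = {"site": rng.choice(["A", "B"], size=y.size), "sex": rng.choice(["M", "F"], size=y.size)}
print(mace_pooled(y, preds, levels), mace_per_group(y, preds, levels, groups))
```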
After that I tested it on the fcon1000 dataset: MACE tends to perform a bit worse than before, since averaging across combinations of batch effects gives smaller batch-effect groups (e.g. a site with only a few subjects) equal weight to larger groups (e.g. a site with a lot of subjects).
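For completeness, a minimal sketch of the MLL / MSLL relationship described above, assuming Gaussian predictive distributions; the toy data and variable names are illustrative only and not the toolkit's actual API:

```python
import numpy as np
from scipy.stats import norm

def mll(y, mu, sigma):
    """Mean log loss: average negative Gaussian log predictive density (previously reported as NLL)."""
    return float(np.mean(-norm.logpdf(y, loc=mu, scale=sigma)))

# Toy data standing in for real train/test responses and model predictions
rng = np.random.default_rng(0)
y_train = rng.normal(loc=0.0, scale=1.0, size=200)
y_test = rng.normal(loc=0.0, scale=1.0, size=100)
mu_pred = np.zeros_like(y_test)      # model's predictive means
sigma_pred = np.ones_like(y_test)    # model's predictive standard deviations

mll_model = mll(y_test, mu_pred, sigma_pred)
# Baseline: a trivial model that predicts the training mean and variance everywhere
mll_baseline = mll(y_test, np.full_like(y_test, y_train.mean()), np.full_like(y_test, y_train.std()))

msll = mll_model - mll_baseline      # MSLL = MLL_model - MLL_baseline; negative means better than baseline
```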