
Fix evaluation metrics: NLL and MACE #410

Merged
contsili merged 20 commits into dev from maint/fix_evaluation_metrics on Apr 27, 2026

Conversation

@contsili (Collaborator) commented Apr 8, 2026

@AuguB since I made changes to already existing evaluation metrics, could you review this?

What I did:

  1. Rename NLL (negative log-likelihood) → MLL (mean log-loss)

Following the Rasmussen & Williams book, I renamed NLL to MLL to better reflect what the metric computes. This also clarifies the relationship between MLL and MSLL (mean standardised log-loss), where:

MSLL = MLL_model − MLL_baseline

Also, the previous name NLL could create confusion with the NLL used internally for hyperparameter estimation in BLR, which is a different quantity.
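Just for context, a minimal NumPy sketch of the relationship (illustrative only, assuming Gaussian predictive distributions; not the toolkit's actual implementation):

```python
import numpy as np

def mean_log_loss(y, mu, sigma2):
    """Mean log-loss (MLL): the average negative log predictive density
    under a Gaussian predictive distribution N(mu, sigma2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2))

def mean_standardised_log_loss(y, mu, sigma2, y_train):
    """MSLL = MLL of the model minus MLL of a trivial baseline that
    predicts the training mean and variance for every subject."""
    mll_model = mean_log_loss(y, mu, sigma2)
    mll_baseline = mean_log_loss(y, np.mean(y_train), np.var(y_train))
    return mll_model - mll_baseline
```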

  2. Add MACE averaging

MACE itself remained as it was; the only difference is that it is now averaged across each combination of batch effects, as defined in Zamanzadeh et al. (2026).

E.g. before it was average(all subjects), which weights each subject equally. Now we do average(site1_male, site1_female, site2_male, site2_female), which weights each group equally regardless of size. The two approaches coincide only if all combinations have the same number of subjects (which is not true in general).

I tested it on the fcon1000 dataset: MACE tends to look a bit worse than before, since averaging across combinations of batch effects gives small groups (e.g. a site with only a few subjects) equal weight to large groups (e.g. a site with many subjects).
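To make the difference concrete, here is a rough pandas sketch (the `data` frame and column names like `site`, `sex` and `centile_*` are hypothetical, not the toolkit API):

```python
import numpy as np
import pandas as pd

def abs_centile_error(df, centiles=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Mean absolute difference between each nominal centile level and the
    empirical proportion of observations falling below the predicted centile."""
    return np.mean([abs(np.mean(df["y"] <= df[f"centile_{c}"]) - c) for c in centiles])

# Before: one pooled estimate, which weights every subject equally.
mace_pooled = abs_centile_error(data)

# Now: one estimate per batch effect combination (e.g. site x sex), then an
# unweighted mean across combinations, so small groups count as much as
# large ones (as in Zamanzadeh et al.).
mace_grouped = data.groupby(["site", "sex"]).apply(abs_centile_error).mean()
```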

  3. I added mathematical definitions for all evaluation metrics on the website (see 13_evaluation_metrics.rst)

contsili added 5 commits April 8, 2026 14:31
This agrees with the Rasmussen and Williams book and also separates this evaluation metric from the NLL used when estimating hyperpriors in the BLR
- MACE is now updated to work as in the paper Zamanzadeh et al. (2026).
- MACE is now averaged across each unique batch effect.
@contsili contsili changed the title Maint/fix evaluation metrics → Fixevaluation metrics: NLL and MACE on Apr 9, 2026
@contsili contsili requested a review from AuguB April 9, 2026 11:43
@contsili contsili changed the title Fixevaluation metrics: NLL and MACE → Fix evaluation metrics: NLL and MACE on Apr 9, 2026
contsili added 4 commits April 9, 2026 16:20
- Updated MACE
- Updated conclusion in 12_federated_learning
- Introduce shared evaluator test helpers and new MACE tests, and update the MSLL test to reuse the fixture.

- Add test_mace.py: tests for the MACE metric
@AuguB (Collaborator) commented Apr 16, 2026

Looks good, but here are a few suggestions.

  1. We could report both the weighted and the non-weighted MACE as metrics. I agree that the implementation should follow the paper, but a clear downside is that small sites introduce a disproportionate amount of noise.

  2. Something very similar to the product of unique batch effect level combinations happens here. Perhaps a helper function can be created inside the NormData class (rough sketch after this list).

  3. I believe one of today's merges contained another test data synthesization function, so with this one here and NormativeModel.synthesize, we have three. The NormativeModel.synthesize function is clearly distinct from the other two, as it samples from the learned distribution, but the other two seem susceptible to consolidation. Let's try to create a single proper definition (maybe as a classmethod of NormData) that replaces these two.
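For point 2, something along these lines is what I have in mind (just a sketch; the dict-of-arrays layout for the batch effects is an assumption, the real NormData internals may differ):

```python
from itertools import product

def unique_batch_effect_combinations(batch_effects):
    """Cartesian product of the unique levels of each batch effect dimension,
    e.g. {"site": [...], "sex": [...]} -> [("site1", "F"), ("site1", "M"), ...]."""
    levels = [sorted(set(values)) for values in batch_effects.values()]
    return list(product(*levels))
```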

@contsili (Collaborator, Author) commented Apr 20, 2026

> 1. We could report both the weighted and the non-weighted MACE as metrics. I agree that the implementation should follow the paper, but a clear downside is that small sites introduce a disproportionate amount of noise.

Good point, I agree that having both makes it more complete.

But then, from a user's perspective: wouldn't having two MACEs be too much? I suggest keeping the official published one from Mosi and noting in the website tutorial I am adding that it is a known downside that small sites introduce a disproportionate amount of noise. We already have quite a few evaluation metrics and I am afraid people will not try to understand all of them. In the end they might just pick the ones that look better to put in their paper, e.g. in this case they would usually choose the non-weighted MACE.

@amarquand what do you think about that?

> 2. Something very similar to the product of unique batch effect level combinations happens here. Perhaps a helper function can be created inside the NormData class.

Good idea, I will look into it

> 3. I believe one of today's merges contained another test data synthesization function, so with this one here and NormativeModel.synthesize, we have three. The NormativeModel.synthesize function is clearly distinct from the other two, as it samples from the learned distribution, but the other two seem susceptible to consolidation. Let's try to create a single proper definition (maybe as a classmethod of NormData) that replaces these two.

Yes, I made an issue to track that: #422

@amarquand (Collaborator) commented:

Hi All, I think I agree that we should follow the paper for the MACE and acknowledge the limitations in the documentation. It is just a metric after all.

Another option could be to give users a way to manually compute a weighted version (e.g. using a keyword arg). But that might muddy the waters further.

@contsili (Collaborator, Author) commented Apr 24, 2026

> Another option could be to give users a way to manually compute a weighted version (e.g. using a keyword arg). But that might muddy the waters further.

Currently the metrics are computed automatically after running fit_predict, so no keyword arg can be passed for them. So I think this would indeed go against the workflow we have currently chosen for the toolkit and confuse users.

I will keep the MACE as it is in the paper and acknowledge the limitations in the documentation.

@smkia (Collaborator) commented Apr 24, 2026

I agree that it's better to stick with the original MACE definition. Indeed, small sites may add noise, but that's the price we usually pay for lack of data. Another solution could be adding a new metric for which we use the median instead of the mean when summarizing across groups. We can call it MEACE (MEdian Absolute Centile Error).
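Just to illustrate the idea with toy numbers (hypothetical per-group absolute centile errors, one value per batch effect combination):

```python
import numpy as np

# Hypothetical per-group absolute centile errors; the last group is a small,
# noisy site that behaves like an outlier.
group_errors = np.array([0.02, 0.03, 0.02, 0.15])

mace = np.mean(group_errors)     # pulled up by the noisy small group
meace = np.median(group_errors)  # robust to the outlying group
```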

Can we have this patch approved and released as soon as possible?

@contsili (Collaborator, Author) commented:

> I agree that it's better to stick with the original MACE definition. Indeed, small sites may add noise, but that's the price we usually pay for lack of data. Another solution could be adding a new metric for which we use the median instead of the mean when summarizing across groups. We can call it MEACE (MEdian Absolute Centile Error).

For now I would not add more metrics.

> Can we have this patch approved and released as soon as possible?

We released 3 weeks ago and I was planning to release again in the summer. However, I can discuss with @amarquand doing a release in the next month.

@contsili contsili merged commit 78cf982 into dev Apr 27, 2026
1 check passed
@contsili contsili deleted the maint/fix_evaluation_metrics branch April 27, 2026 13:03