
Fix evaluation metrics: NLL and MACE #410

Merged
contsili merged 20 commits into dev from maint/fix_evaluation_metrics on Apr 27, 2026

Conversation

@contsili (Collaborator) commented Apr 8, 2026

@AuguB since I made changes to already existing evaluation metrics, could you review this?

What I did:

  1. Rename NLL (negative log-likelihood) → MLL (mean log-loss)

Following the Rasmussen & Williams book, I renamed NLL to MLL to better reflect what the metric computes. This also clarifies the relationship between MLL and MSLL (mean standardised log-loss), where:

MSLL = MLL_model − MLL_baseline

Also, the previous name NLL could create confusion with the NLL used internally for hyperparameter estimation in BLR, which is a different quantity.
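Just for context, a minimal NumPy sketch of the relationship (illustrative only, assuming Gaussian predictive distributions; not the toolkit's actual implementation):

```python
import numpy as np

def mean_log_loss(y, mu, sigma2):
    """Mean log-loss (MLL): the average negative log predictive density
    under a Gaussian predictive distribution N(mu, sigma2)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2))

def mean_standardised_log_loss(y, mu, sigma2, y_train):
    """MSLL = MLL of the model minus MLL of a trivial baseline that
    predicts the training mean and variance for every subject."""
    mll_model = mean_log_loss(y, mu, sigma2)
    mll_baseline = mean_log_loss(y, np.mean(y_train), np.var(y_train))
    return mll_model - mll_baseline
```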

  2. Add MACE averaging

MACE itself remained as it was; the only difference is that it is now averaged across each combination of batch effects, as defined in Zamanzadeh et al. (2026).

E.g. before it was average(all subjects), which weights each subject equally. Now we do average(site1_male, site1_female, site2_male, site2_female), which weights each group equally regardless of size. The two approaches coincide only if all combinations have the same number of subjects (which is not true in general).

I tested it on the fcon1000 dataset: MACE tends to look a bit worse than before, since averaging across combinations of batch effects gives small groups (e.g. a site with only a few subjects) equal weight to large groups (e.g. a site with many subjects).
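To make the difference concrete, here is a rough pandas sketch (the `data` frame and column names like `site`, `sex` and `centile_*` are hypothetical, not the toolkit API):

```python
import numpy as np
import pandas as pd

def abs_centile_error(df, centiles=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """Mean absolute difference between each nominal centile level and the
    empirical proportion of observations falling below the predicted centile."""
    return np.mean([abs(np.mean(df["y"] <= df[f"centile_{c}"]) - c) for c in centiles])

# Before: one pooled estimate, which weights every subject equally.
mace_pooled = abs_centile_error(data)

# Now: one estimate per batch effect combination (e.g. site x sex), then an
# unweighted mean across combinations, so small groups count as much as
# large ones (as in Zamanzadeh et al.).
mace_grouped = data.groupby(["site", "sex"]).apply(abs_centile_error).mean()
```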

  3. I added mathematical definitions for all evaluation metrics on the website (see 13_evaluation_metrics.rst)

contsili added 5 commits April 8, 2026 14:31
This agrees with the Rasmussen and Williams book and also separates this evaluation metric from the NLL used when estimating hyperpriors in the BLR
- MACE is now updated to work as in the paper Zamanzadeh et al. (2026).
- MACE is now averaged across each unique batch effect.
@contsili contsili changed the title Maint/fix evaluation metrics → Fixevaluation metrics: NLL and MACE on Apr 9, 2026
@contsili contsili requested a review from AuguB April 9, 2026 11:43
@contsili contsili changed the title Fixevaluation metrics: NLL and MACE → Fix evaluation metrics: NLL and MACE on Apr 9, 2026
contsili added 4 commits April 9, 2026 16:20
- Updated MACE
- Updated conclusion in 12_federated_learning
- Introduce shared evaluator test helpers and new MACE tests, and update the MSLL test to reuse the fixture.

- Add test_mace.py: tests for the MACE metric
@AuguB (Collaborator) commented Apr 16, 2026

Looks good, but here are a few suggestions.

  1. We could report both the weighted and the non-weighted MACE as metrics. I agree that the implementation should follow the paper, but a clear downside is that small sites introduce a disproportionate amount of noise.

  2. Something very similar to the product of unique batch effect level combinations happens here. Perhaps a helper function can be created inside the NormData class (rough sketch after this list).

  3. I believe one of today's merges contained another test data synthesization function, so with this one here and NormativeModel.synthesize, we have three. The NormativeModel.synthesize function is clearly distinct from the other two, as it samples from the learned distribution, but the other two seem susceptible to consolidation. Let's try to create a single proper definition (maybe as a classmethod of NormData) that replaces these two.
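For point 2, something along these lines is what I have in mind (just a sketch; the dict-of-arrays layout for the batch effects is an assumption, the real NormData internals may differ):

```python
from itertools import product

def unique_batch_effect_combinations(batch_effects):
    """Cartesian product of the unique levels of each batch effect dimension,
    e.g. {"site": [...], "sex": [...]} -> [("site1", "F"), ("site1", "M"), ...]."""
    levels = [sorted(set(values)) for values in batch_effects.values()]
    return list(product(*levels))
```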

@contsili (Collaborator, Author) commented Apr 20, 2026

> 1. We could report both the weighted and the non-weighted MACE as metrics. I agree that the implementation should follow the paper, but a clear downside is that small sites introduce a disproportionate amount of noise.

Good point, I agree that having both makes it more complete.

But then, from a user's perspective: wouldn't having two MACEs be too much? I suggest keeping the official published one from Mosi and noting in the website tutorial I am adding that it is a known downside that small sites introduce a disproportionate amount of noise. We already have quite a few evaluation metrics and I am afraid people will not try to understand all of them. In the end they might just pick the ones that look better to put in their paper, e.g. in this case they would usually choose the non-weighted MACE.

@amarquand what do you think about that?

> 2. Something very similar to the product of unique batch effect level combinations happens here. Perhaps a helper function can be created inside the NormData class.

Good idea, I will look into it

> 3. I believe one of today's merges contained another test data synthesization function, so with this one here and NormativeModel.synthesize, we have three. The NormativeModel.synthesize function is clearly distinct from the other two, as it samples from the learned distribution, but the other two seem susceptible to consolidation. Let's try to create a single proper definition (maybe as a classmethod of NormData) that replaces these two.

Yes, I made an issue to track that: #422

@amarquand (Collaborator) commented:

Hi All, I think I agree that we should follow the paper for the MACE and acknowledge the limitations in the documentation. It is just a metric after all.

Another option could be to give users a way to manually compute a weighted version (e.g. using a keyword arg). But that might muddy the waters further.

@contsili (Collaborator, Author) commented Apr 24, 2026

> Another option could be to give users a way to manually compute a weighted version (e.g. using a keyword arg). But that might muddy the waters further.

Currently the metrics are computed automatically after running fit_predict, so no keyword arg can be passed for them. So I think this would indeed go against the workflow we have currently chosen for the toolkit and confuse users.

I will keep the MACE as it is in the paper and acknowledge the limitations in the documentation.

@smkia (Collaborator) commented Apr 24, 2026

I agree that it's better to stick with the original MACE definition. Indeed, small sites may add noise, but that's the price we usually pay for lack of data. Another solution could be adding a new metric for which we use the median instead of the mean when summarizing across groups. We can call it MEACE (MEdian Absolute Centile Error).
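Just to illustrate the idea with toy numbers (hypothetical per-group absolute centile errors, one value per batch effect combination):

```python
import numpy as np

# Hypothetical per-group absolute centile errors; the last group is a small,
# noisy site that behaves like an outlier.
group_errors = np.array([0.02, 0.03, 0.02, 0.15])

mace = np.mean(group_errors)     # pulled up by the noisy small group
meace = np.median(group_errors)  # robust to the outlying group
```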

Can we have this patch approved and released as soon as possible?

@contsili (Collaborator, Author) commented:

> I agree that it's better to stick with the original MACE definition. Indeed, small sites may add noise, but that's the price we usually pay for lack of data. Another solution could be adding a new metric for which we use the median instead of the mean when summarizing across groups. We can call it MEACE (MEdian Absolute Centile Error).

For now I would not add more metrics.

> Can we have this patch approved and released as soon as possible?

We released 3 weeks ago and I was planning to release again in the summer. However, I can discuss with @amarquand doing a release in the next month.

@contsili contsili merged commit 78cf982 into dev Apr 27, 2026
1 check passed
@contsili contsili deleted the maint/fix_evaluation_metrics branch April 27, 2026 13:03