74 changes: 40 additions & 34 deletions doc/pages/tutorials/12_federated_learning.rst
@@ -764,8 +764,8 @@ QQ plots and evaluation metrics
<th>EXPV</th>
<th>MACE</th>
<th>MAPE</th>
<th>MLL</th>
<th>MSLL</th>
<th>NLL</th>
<th>R2</th>
<th>RMSE</th>
<th>Rho</th>
@@ -793,47 +793,47 @@ QQ plots and evaluation metrics
<tr>
<th>baseline</th>
<th>WM-hypointensities</th>
<td>0.360218</td>
<td>0.037037</td>
<td>0.341988</td>
<td>-0.320961</td>
<td>0.798308</td>
<td>0.357200</td>
<td>485.243446</td>
<td>0.490306</td>
<td>1.828886e-14</td>
<td>0.642800</td>
<td>0.967453</td>
<td>0.359381</td>
<td>0.124935</td>
<td>0.342213</td>
<td>0.798763</td>
<td>-0.320506</td>
<td>0.356309</td>
<td>485.579657</td>
<td>0.491113</td>
<td>1.633169e-14</td>
<td>0.643691</td>
<td>0.967511</td>
</tr>
<tr>
<th>Aggregated (extend)</th>
<th>WM-hypointensities</th>
<td>0.369571</td>
<td>0.038889</td>
<td>0.322612</td>
<td>-0.330280</td>
<td>0.854522</td>
<td>0.369446</td>
<td>480.599146</td>
<td>0.495714</td>
<td>8.515981e-15</td>
<td>0.630554</td>
<td>0.961769</td>
<td>0.351299</td>
<td>0.132170</td>
<td>0.329932</td>
<td>0.868492</td>
<td>-0.316311</td>
<td>0.351281</td>
<td>487.472645</td>
<td>0.462524</td>
<td>7.558404e-13</td>
<td>0.648719</td>
<td>0.958050</td>
</tr>
<tr>
<th>Aggregated (transfer)</th>
<th>WM-hypointensities</th>
<td>0.307336</td>
<td>0.049630</td>
<td>0.354095</td>
<td>-0.273420</td>
<td>0.911382</td>
<td>0.307070</td>
<td>503.809550</td>
<td>0.405179</td>
<td>6.102418e-10</td>
<td>0.692930</td>
<td>0.946122</td>
<td>0.309515</td>
<td>0.135989</td>
<td>0.356217</td>
<td>0.909280</td>
<td>-0.275522</td>
<td>0.308878</td>
<td>503.151987</td>
<td>0.405334</td>
<td>6.002990e-10</td>
<td>0.691122</td>
<td>0.945963</td>
</tr>
</tbody>
</table>
@@ -844,6 +844,12 @@
Conclusions
-----------

All the models perform very similarly. So the FL workflow, where the
data are at different locations, performs as well as the baseline
workflow, where all the data are in one location.

In more detail:

Centile plots
~~~~~~~~~~~~~

127 changes: 87 additions & 40 deletions doc/pages/tutorials/13_evaluation_metrics.rst
@@ -17,7 +17,7 @@ Two families of metrics
| | (median | EXPV, Rho |
| | prediction) | |
+------------------------+-----------------+---------------------------+
| **Probabilistic** | Full predicted | MACE, MSLL, NLL, ShapiroW |
| **Probabilistic** | Full predicted | MACE, MSLL, MLL, ShapiroW |
| | distribution | |
| | (``logp``, | |
| | centiles, | |
@@ -66,7 +66,7 @@ R² — Coefficient of determination
.. math:: R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

R² answers the question: *how much better is my model than simply always
predicting the mean?*
predicting the* **mean**\ *?*

Unlike EXPV, R² is penalized by systematic mean shifts.
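
As a quick illustration, the formula above can be written directly in
NumPy. This is a minimal sketch, not the toolkit's implementation;
``y`` and ``y_hat`` are hypothetical arrays of true and predicted
values:

.. code:: python

   import numpy as np

   def r2_score(y, y_hat):
       # residual sum of squares vs. total sum of squares around the mean
       ss_res = np.sum((y - y_hat) ** 2)
       ss_tot = np.sum((y - np.mean(y)) ** 2)
       return 1.0 - ss_res / ss_tot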

@@ -79,14 +79,17 @@
EXPV — Explained variance
~~~~~~~~~~~~~~~~~~~~~~~~~

.. math:: \text{EXPV} = 1 - \frac{\text{Var}(y - \hat{y})}{\text{Var}(y)}
.. math:: \text{EXPV} = 1 - \frac{\text{Var}(y - \hat{y} - \overline{(y - \hat{y})})}{\text{Var}(y)}

Similar to R², but it measures the variance of the *residuals* rather
than the sum of squared residuals. The key difference: EXPV is not
penalized by systematic mean shifts. If your model consistently over- or
under-predicts by a constant offset, R² will be lower than EXPV.
Similar to R², but it measures how much of the **variance** in the true
values is explained by the model, after removing any systematic mean
offset from the residuals.

- Range: up to 1 (negative values are possible for models worse than
  predicting the mean) — higher is better
- A score of 1 means the model perfectly explains the variance in the
data
- A score of 0 means the model explains no more variance than simply
predicting the mean
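
A minimal sketch of the formula, assuming ``y`` and ``y_hat`` are
hypothetical NumPy arrays of true and predicted values. Note that
``np.var`` already centres its input, which is exactly the removal of
the systematic mean offset described above:

.. code:: python

   import numpy as np

   def explained_variance(y, y_hat):
       # variance of the (mean-centred) residuals vs. variance of the data
       residuals = y - y_hat
       return 1.0 - np.var(residuals) / np.var(y)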

--------------

@@ -160,75 +163,115 @@ Probabilistic metrics

--------------

NLL — Negative log likelihood (also called mean log loss)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MLL — Mean log loss
~~~~~~~~~~~~~~~~~~~

.. math:: \text{NLL} = -\frac{1}{n} \sum_i \log p(y_i \mid \text{model})
.. math:: \text{MLL} = -\frac{1}{n} \sum_i \log p(y \mid \mathcal{D}, x_*)

Measures how “surprised” the model is by the actual data, on average.
Note: In earlier PCNtoolkit releases, this metric was called ``NLL``
(Negative Log Likelihood). It is now named ``MLL`` to match the
literature and avoid confusion with the different ``NLL`` used
internally for BLR hyperparameter estimation.

- Range: 0 to ∞
- lower is better

Where:

- :math:`y`: the test or training response variable. We typically
  select the test set here, to see how well the normative model fitted
  on training data generalises to the test set.
- :math:`\mathcal{D}`: the training dataset used to fit the model
- :math:`x_*`: the test covariate
- :math:`p(y_i \mid \mathcal{D}, x_*)`: the probability the model
  assigns to the true value given the test input

Measures how “surprised” the model is by the data y, on average.

**Implementation:**

.. code:: python

   import numpy as np

   # mean log loss: the average negative log-probability the model assigns
   # to the observed responses ('data' is assumed to hold the per-observation
   # log-densities in its 'logp' column; this metric was formerly named NLL)
   mll = -np.mean(data['logp'].values)

- Range: 0 to ∞
- lower is better

..

⚠️ **Important:** NLL is an **absolute** quantity that is
⚠️ **Important:** MLL is an **absolute** quantity that is
scale-dependent (it depends on the units and variance of the response
variable). This makes it difficult to interpret in isolation. To
compare models meaningfully, use **MSLL** instead, which normalizes
NLL against a baseline.
MLL against a baseline.

This metric is adapted from `Section 2.5 of the Gaussian Processes for
Machine Learning book by C. E. Rasmussen & C. K. I.
Williams <https://gaussianprocess.org/gpml/chapters/RW.pdf#page=27>`__.

--------------

MSLL — Mean standardized log loss
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math:: \text{MSLL} = \underbrace{-\frac{1}{n}\sum_i \log p(y_i \mid \text{model})}_{\text{MLL}_{\text{model}}} - \underbrace{\left(-\frac{1}{n}\sum_i \log p(y_i \mid \text{baseline})\right)}_{\text{MLL}_{\text{null}}}
.. math:: \text{MSLL} = \underbrace{-\frac{1}{n}\sum_i \log p(y \mid \mathcal{D}, x_*)}_{\text{MLL}_{\text{model}}} - \underbrace{\left(-\frac{1}{n}\sum_i \log \mathcal{N}\!\left(y \mid \bar{y},\, \hat{\sigma}^2\right)\right)}_{\text{MLL}_{\text{Gaussian baseline}}}

where the Gaussian baseline fits a single normal distribution to the
training responses:

- :math:`\bar{y} = \frac{1}{n}\sum_i y_i` — training sample mean
- :math:`\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2` — training
  sample variance

MSLL is a relative metric. It compares the model’s mean log loss against
a Gaussian baseline. The “standardized” in the name refers to this
subtraction.

======== ============================================
Value    Meaning
======== ============================================
MSLL < 0 Model beats the Gaussian baseline
MSLL = 0 Model is equivalent to the Gaussian baseline
MSLL > 0 Model is worse than the Gaussian baseline
======== ============================================

This metric is adapted from `Section 2.5 of the Gaussian Processes for
Machine Learning book by C. E. Rasmussen & C. K. I.
Williams <https://gaussianprocess.org/gpml/chapters/RW.pdf#page=27>`__.
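
A minimal sketch of the two terms, assuming hypothetical arrays
``logp_test`` (per-observation log-densities under the fitted model),
``y_test``, and ``y_train``; this is an illustration, not the
toolkit's exact implementation:

.. code:: python

   import numpy as np
   from scipy.stats import norm

   def msll(logp_test, y_test, y_train):
       # model term: mean log loss under the fitted normative model
       mll_model = -np.mean(logp_test)
       # baseline term: a single Gaussian fitted to the training responses
       mu = np.mean(y_train)
       sigma = np.std(y_train)  # matches the 1/n sample variance above
       mll_baseline = -np.mean(norm.logpdf(y_test, loc=mu, scale=sigma))
       return mll_model - mll_baseline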

--------------

MACE — Mean absolute centile error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math:: \text{MACE} = \frac{1}{|C|} \sum_{c \in C} \left| c - \hat{F}_c \right|

where :math:`\hat{F}_c` is the empirical fraction of subjects whose true
value falls below the predicted :math:`c`-th centile curve.

.. math:: \text{MACE} = \frac{1}{b} \sum_{k=1}^{b} \left( \frac{1}{m} \sum_{j=1}^{m} \left| q_j - \frac{\sum_{i=1}^{n} \mathbf{1}\{\hat{q}_{ij} \geq y_i\}}{n} \right| \right)

where:

- :math:`b` is the number of unique combinations of batch effects
- :math:`m` is the number of centiles used for calibration
- :math:`q_j` is the :math:`j`-th target centile level (e.g. 0.05,
  0.25, 0.50, 0.75, 0.95)
- :math:`\hat{q}_{ij}` is the predicted :math:`j`-th centile value
  for the :math:`i`-th subject
- :math:`y_i` is the true value for the :math:`i`-th subject
- :math:`n` is the number of subjects in the batch group
- :math:`\mathbf{1}\{\hat{q}_{ij} \geq y_i\}` is an indicator function
  that outputs 1 or 0, depending on whether :math:`y_i` lies below or
  above its predicted :math:`j`-th centile value, respectively. So,
  :math:`\frac{\sum_i \mathbf{1}\{\hat{q}_{ij} \geq y_i\}}{n}` is the
  empirical fraction of subjects below the :math:`j`-th centile curve

The maths above might seem complicated. Put simply, MACE checks,
for each predicted centile level (e.g. the 10th, 25th, 50th, 75th, 95th
centile curve), what fraction of subjects actually falls below it in the
data. A perfectly calibrated model has exactly 10% of subjects below its
10th centile, 25% below its 25th centile, and so on. MACE averages the
absolute deviation from this perfectly calibrated model across all
centile levels.

Important: MACE is averaged across unique combinations of batch effects
(e.g., site and sex combinations) and each combination contributes
equally. This means small groups have the same influence as large
groups, and hence they may add a disproportionate amount of noise to MACE.

- MACE values close to 0 indicate the predicted centile curves closely
match the empirical distribution of the data.
match the distribution of the data.
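
A minimal sketch of the inner term for a single batch group, assuming
a hypothetical array ``centile_preds`` of shape
``(n_subjects, n_levels)`` holding the predicted centile values and
``levels`` holding the target centile levels; as noted above, the full
metric additionally averages this over batch-effect groups:

.. code:: python

   import numpy as np

   def mace_single_group(y, centile_preds, levels):
       # empirical fraction of subjects falling below each centile curve
       empirical = (centile_preds >= y[:, None]).mean(axis=0)
       # average absolute miscalibration across the centile levels
       return np.mean(np.abs(np.asarray(levels) - empirical))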

**Connection to the QQ plot:** The QQ plot is the “uncompressed” version
of MACE. Each point on the QQ plot corresponds to MACE at a specific
quantile level. Systematic deviations from the diagonal (e.g. an S-curve
or U-curve) indicate where along the distribution calibration breaks
down - information that MACE collapses into a single number.

This metric is adapted from *equation 4* of this paper: Zamanzadeh,
M., Verduyn, Y., de Boer, A. et al. Normative modeling of MEG brain
oscillations across the human lifespan. Commun Biol (2026).
https://doi.org/10.1038/s42003-026-09825-2

--------------

ShapiroW — Shapiro–Wilk W statistic on Z-scores
@@ -272,6 +315,10 @@ Z-scores.
| | structure |
+------------------------+---------------------------------------------+

You can read more about the Shapiro–Wilk test on `its Wikipedia
page <https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test>`__.
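
In practice the W statistic can be obtained directly from SciPy. A
minimal sketch, where ``z`` stands in for a hypothetical array of
Z-scores produced by a normative model:

.. code:: python

   import numpy as np
   from scipy.stats import shapiro

   z = np.random.randn(500)  # stand-in for real model Z-scores
   w_statistic, p_value = shapiro(z)
   # W close to 1 indicates approximately Gaussian Z-scores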

Summary table
-------------

@@ -290,7 +337,7 @@
+------------+-----------------+---------------+--------------+-----------+
| MAPE | Point | Y, Yhat | Lower | ≥ 0 |
+------------+-----------------+---------------+--------------+-----------+
| NLL | Probabilistic | logp | Lower | ≥ 0 |
| MLL | Probabilistic | logp | Lower | ≥ 0 |
+------------+-----------------+---------------+--------------+-----------+
| MSLL | Probabilistic | logp, | Lower | unbounded |
| | | baseline_logp | (negative) | |
160 changes: 82 additions & 78 deletions examples/12_federated_learning.ipynb

Large diffs are not rendered by default.
