74 changes: 40 additions & 34 deletions doc/pages/tutorials/12_federated_learning.rst
@@ -764,8 +764,8 @@ QQ plots and evaluation metrics
<th>EXPV</th>
<th>MACE</th>
<th>MAPE</th>
<th>MLL</th>
<th>MSLL</th>
<th>NLL</th>
<th>R2</th>
<th>RMSE</th>
<th>Rho</th>
@@ -793,47 +793,47 @@ QQ plots and evaluation metrics
<tr>
<th>baseline</th>
<th>WM-hypointensities</th>
<td>0.360218</td>
<td>0.037037</td>
<td>0.341988</td>
<td>-0.320961</td>
<td>0.798308</td>
<td>0.357200</td>
<td>485.243446</td>
<td>0.490306</td>
<td>1.828886e-14</td>
<td>0.642800</td>
<td>0.967453</td>
<td>0.359381</td>
<td>0.124935</td>
<td>0.342213</td>
<td>0.798763</td>
<td>-0.320506</td>
<td>0.356309</td>
<td>485.579657</td>
<td>0.491113</td>
<td>1.633169e-14</td>
<td>0.643691</td>
<td>0.967511</td>
</tr>
<tr>
<th>Aggregated (extend)</th>
<th>WM-hypointensities</th>
<td>0.369571</td>
<td>0.038889</td>
<td>0.322612</td>
<td>-0.330280</td>
<td>0.854522</td>
<td>0.369446</td>
<td>480.599146</td>
<td>0.495714</td>
<td>8.515981e-15</td>
<td>0.630554</td>
<td>0.961769</td>
<td>0.351299</td>
<td>0.132170</td>
<td>0.329932</td>
<td>0.868492</td>
<td>-0.316311</td>
<td>0.351281</td>
<td>487.472645</td>
<td>0.462524</td>
<td>7.558404e-13</td>
<td>0.648719</td>
<td>0.958050</td>
</tr>
<tr>
<th>Aggregated (transfer)</th>
<th>WM-hypointensities</th>
<td>0.307336</td>
<td>0.049630</td>
<td>0.354095</td>
<td>-0.273420</td>
<td>0.911382</td>
<td>0.307070</td>
<td>503.809550</td>
<td>0.405179</td>
<td>6.102418e-10</td>
<td>0.692930</td>
<td>0.946122</td>
<td>0.309515</td>
<td>0.135989</td>
<td>0.356217</td>
<td>0.909280</td>
<td>-0.275522</td>
<td>0.308878</td>
<td>503.151987</td>
<td>0.405334</td>
<td>6.002990e-10</td>
<td>0.691122</td>
<td>0.945963</td>
</tr>
</tbody>
</table>
@@ -844,6 +844,12 @@
Conclusions
-----------

All the models perform very similarly. So the FL workflow, where the
data are at different locations, performs as well as the baseline
workflow, where all the data are in one location.

In more detail:

Centile plots
~~~~~~~~~~~~~

127 changes: 87 additions & 40 deletions doc/pages/tutorials/13_evaluation_metrics.rst
@@ -17,7 +17,7 @@ Two families of metrics
| | (median | EXPV, Rho |
| | prediction) | |
+------------------------+-----------------+---------------------------+
| **Probabilistic** | Full predicted | MACE, MSLL, NLL, ShapiroW |
| **Probabilistic** | Full predicted | MACE, MSLL, MLL, ShapiroW |
| | distribution | |
| | (``logp``, | |
| | centiles, | |
@@ -66,7 +66,7 @@ R² — Coefficient of determination
.. math:: R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

R² answers the question: *how much better is my model than simply always
predicting the mean?*
predicting the* **mean**\ *?*

Unlike EXPV, R² is penalized by systematic mean shifts.
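
As a quick illustration, the formula above can be written directly in
NumPy. This is a minimal sketch, not the toolkit's implementation;
``y`` and ``y_hat`` are hypothetical arrays of true and predicted
values:

.. code:: python

   import numpy as np

   def r2_score(y, y_hat):
       # residual sum of squares vs. total sum of squares around the mean
       ss_res = np.sum((y - y_hat) ** 2)
       ss_tot = np.sum((y - np.mean(y)) ** 2)
       return 1.0 - ss_res / ss_tot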

@@ -79,14 +79,17 @@
EXPV — Explained variance
~~~~~~~~~~~~~~~~~~~~~~~~~

.. math:: \text{EXPV} = 1 - \frac{\text{Var}(y - \hat{y})}{\text{Var}(y)}
.. math:: \text{EXPV} = 1 - \frac{\text{Var}(y - \hat{y} - \overline{(y - \hat{y})})}{\text{Var}(y)}

Similar to R², but it measures the variance of the *residuals* rather
than the sum of squared residuals. The key difference: EXPV is not
penalized by systematic mean shifts. If your model consistently over- or
under-predicts by a constant offset, R² will be lower than EXPV.
Similar to R², but it measures how much of the **variance** in the true
values is explained by the model, after removing any systematic mean
offset from the residuals.

- Range: up to 1 (negative values are possible for models worse than
  predicting the mean) — higher is better
- A score of 1 means the model perfectly explains the variance in the
data
- A score of 0 means the model explains no more variance than simply
predicting the mean
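
A minimal sketch of the formula, assuming ``y`` and ``y_hat`` are
hypothetical NumPy arrays of true and predicted values. Note that
``np.var`` already centres its input, which is exactly the removal of
the systematic mean offset described above:

.. code:: python

   import numpy as np

   def explained_variance(y, y_hat):
       # variance of the (mean-centred) residuals vs. variance of the data
       residuals = y - y_hat
       return 1.0 - np.var(residuals) / np.var(y)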

--------------

@@ -160,75 +163,115 @@ Probabilistic metrics

--------------

NLL — Negative log likelihood (also called mean log loss)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
MLL — Mean log loss
~~~~~~~~~~~~~~~~~~~

.. math:: \text{NLL} = -\frac{1}{n} \sum_i \log p(y_i \mid \text{model})
.. math:: \text{MLL} = -\frac{1}{n} \sum_i \log p(y \mid \mathcal{D}, x_*)

Measures how “surprised” the model is by the actual data, on average.
Note: In earlier PCNtoolkit releases, this metric was called ``NLL``
(Negative Log Likelihood). It is now named ``MLL`` to match the
literature and avoid confusion with the different ``NLL`` used
internally for BLR hyperparameter estimation.

- Range: 0 to ∞
- lower is better

Where:

- :math:`y`: the test or training response variable. We typically
  select the test set here, to see how well the normative model fitted
  on training data generalises to the test set.
- :math:`\mathcal{D}`: the training dataset used to fit the model
- :math:`x_*`: the test covariate
- :math:`p(y_i \mid \mathcal{D}, x_*)`: the probability the model
  assigns to the true value given the test input

Measures how “surprised” the model is by the data y, on average.

**Implementation:**

.. code:: python

   import numpy as np

   # mean log loss: the average negative log-probability the model assigns
   # to the observed responses ('data' is assumed to hold the per-observation
   # log-densities in its 'logp' column; this metric was formerly named NLL)
   mll = -np.mean(data['logp'].values)

- Range: 0 to ∞
- lower is better

..

⚠️ **Important:** NLL is an **absolute** quantity that is
⚠️ **Important:** MLL is an **absolute** quantity that is
scale-dependent (it depends on the units and variance of the response
variable). This makes it difficult to interpret in isolation. To
compare models meaningfully, use **MSLL** instead, which normalizes
NLL against a baseline.
MLL against a baseline.

This metric is adapted from `Section 2.5 of the Gaussian Processes for
Machine Learning book by C. E. Rasmussen & C. K. I.
Williams <https://gaussianprocess.org/gpml/chapters/RW.pdf#page=27>`__.

--------------

MSLL — Mean standardized log loss
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math:: \text{MSLL} = \underbrace{-\frac{1}{n}\sum_i \log p(y_i \mid \text{model})}_{\text{MLL}_{\text{model}}} - \underbrace{\left(-\frac{1}{n}\sum_i \log p(y_i \mid \text{baseline})\right)}_{\text{MLL}_{\text{null}}}
.. math:: \text{MSLL} = \underbrace{-\frac{1}{n}\sum_i \log p(y \mid \mathcal{D}, x_*)}_{\text{MLL}_{\text{model}}} - \underbrace{\left(-\frac{1}{n}\sum_i \log \mathcal{N}\!\left(y \mid \bar{y},\, \hat{\sigma}^2\right)\right)}_{\text{MLL}_{\text{Gaussian baseline}}}

where the Gaussian baseline fits a single normal distribution to the
training responses:

- :math:`\bar{y} = \frac{1}{n}\sum_i y_i` — training sample mean
- :math:`\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \bar{y})^2` — training
  sample variance

MSLL is a relative metric. It compares the model’s mean log loss against
a Gaussian baseline. The “standardized” in the name refers to this
subtraction.

======== ============================================
Value    Meaning
======== ============================================
MSLL < 0 Model beats the Gaussian baseline
MSLL = 0 Model is equivalent to the Gaussian baseline
MSLL > 0 Model is worse than the Gaussian baseline
======== ============================================

This metric is adapted from `Section 2.5 of the Gaussian Processes for
Machine Learning book by C. E. Rasmussen & C. K. I.
Williams <https://gaussianprocess.org/gpml/chapters/RW.pdf#page=27>`__.
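
A minimal sketch of the two terms, assuming hypothetical arrays
``logp_test`` (per-observation log-densities under the fitted model),
``y_test``, and ``y_train``; this is an illustration, not the
toolkit's exact implementation:

.. code:: python

   import numpy as np
   from scipy.stats import norm

   def msll(logp_test, y_test, y_train):
       # model term: mean log loss under the fitted normative model
       mll_model = -np.mean(logp_test)
       # baseline term: a single Gaussian fitted to the training responses
       mu = np.mean(y_train)
       sigma = np.std(y_train)  # matches the 1/n sample variance above
       mll_baseline = -np.mean(norm.logpdf(y_test, loc=mu, scale=sigma))
       return mll_model - mll_baseline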

--------------

MACE — Mean absolute centile error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. math:: \text{MACE} = \frac{1}{|C|} \sum_{c \in C} \left| c - \hat{F}_c \right|

where :math:`\hat{F}_c` is the empirical fraction of subjects whose true
value falls below the predicted :math:`c`-th centile curve.

.. math:: \text{MACE} = \frac{1}{b} \sum_{k=1}^{b} \left( \frac{1}{m} \sum_{j=1}^{m} \left| q_j - \frac{\sum_{i=1}^{n} \mathbf{1}\{\hat{q}_{ij} \geq y_i\}}{n} \right| \right)

where:

- :math:`b` is the number of unique combinations of batch effects
- :math:`m` is the number of centiles used for calibration
- :math:`q_j` is the :math:`j`-th target centile level (e.g. 0.05,
  0.25, 0.50, 0.75, 0.95)
- :math:`\hat{q}_{ij}` is the predicted :math:`j`-th centile value
  for the :math:`i`-th subject
- :math:`y_i` is the true value for the :math:`i`-th subject
- :math:`n` is the number of subjects in the batch group
- :math:`\mathbf{1}\{\hat{q}_{ij} \geq y_i\}` is an indicator function
  that outputs 1 or 0, depending on whether :math:`y_i` lies below or
  above its predicted :math:`j`-th centile value, respectively. So,
  :math:`\frac{\sum_i \mathbf{1}\{\hat{q}_{ij} \geq y_i\}}{n}` is the
  empirical fraction of subjects below the :math:`j`-th centile curve

The maths above might seem complicated. Put simply, MACE checks,
for each predicted centile level (e.g. the 10th, 25th, 50th, 75th, 95th
centile curve), what fraction of subjects actually falls below it in the
data. A perfectly calibrated model has exactly 10% of subjects below its
10th centile, 25% below its 25th centile, and so on. MACE averages the
absolute deviation from this perfectly calibrated model across all
centile levels.

Important: MACE is averaged across unique combinations of batch effects
(e.g., site and sex combinations) and each combination contributes
equally. This means small groups have the same influence as large
groups, and hence they may add a disproportionate amount of noise to MACE.

- MACE values close to 0 indicate the predicted centile curves closely
match the empirical distribution of the data.
match the distribution of the data.
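
A minimal sketch of the inner term for a single batch group, assuming
a hypothetical array ``centile_preds`` of shape
``(n_subjects, n_levels)`` holding the predicted centile values and
``levels`` holding the target centile levels; as noted above, the full
metric additionally averages this over batch-effect groups:

.. code:: python

   import numpy as np

   def mace_single_group(y, centile_preds, levels):
       # empirical fraction of subjects falling below each centile curve
       empirical = (centile_preds >= y[:, None]).mean(axis=0)
       # average absolute miscalibration across the centile levels
       return np.mean(np.abs(np.asarray(levels) - empirical))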

**Connection to the QQ plot:** The QQ plot is the “uncompressed” version
of MACE. Each point on the QQ plot corresponds to MACE at a specific
quantile level. Systematic deviations from the diagonal (e.g. an S-curve
or U-curve) indicate where along the distribution calibration breaks
down - information that MACE collapses into a single number.

This metric is adapted from *equation 4* of this paper: Zamanzadeh,
M., Verduyn, Y., de Boer, A. et al. Normative modeling of MEG brain
oscillations across the human lifespan. Commun Biol (2026).
https://doi.org/10.1038/s42003-026-09825-2

--------------

ShapiroW — Shapiro–Wilk W statistic on Z-scores
@@ -272,6 +315,10 @@ Z-scores.
| | structure |
+------------------------+---------------------------------------------+

You can read more about the Shapiro–Wilk test on `its Wikipedia
page <https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test>`__.
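
In practice the W statistic can be obtained directly from SciPy. A
minimal sketch, where ``z`` stands in for a hypothetical array of
Z-scores produced by a normative model:

.. code:: python

   import numpy as np
   from scipy.stats import shapiro

   z = np.random.randn(500)  # stand-in for real model Z-scores
   w_statistic, p_value = shapiro(z)
   # W close to 1 indicates approximately Gaussian Z-scores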

Summary table
-------------

@@ -290,7 +337,7 @@
+------------+-----------------+---------------+--------------+-----------+
| MAPE | Point | Y, Yhat | Lower | ≥ 0 |
+------------+-----------------+---------------+--------------+-----------+
| NLL | Probabilistic | logp | Lower | ≥ 0 |
| MLL | Probabilistic | logp | Lower | ≥ 0 |
+------------+-----------------+---------------+--------------+-----------+
| MSLL | Probabilistic | logp, | Lower | unbounded |
| | | baseline_logp | (negative) | |
160 changes: 82 additions & 78 deletions examples/12_federated_learning.ipynb

Large diffs are not rendered by default.
