Skip to content

ENH: Implement the exact p-value of the Cramér-von Mises two-sample test#108

Draft
fbourgey wants to merge 3 commits intoscipy:mainfrom
fbourgey:add_pval_cvm_2samp_exact
Draft

ENH: Implement the exact p-value of the Cramér-von Mises two-sample test#108
fbourgey wants to merge 3 commits intoscipy:mainfrom
fbourgey:add_pval_cvm_2samp_exact

Conversation

@fbourgey
Copy link
Copy Markdown
Member

@fbourgey fbourgey commented Mar 13, 2026

Reference issue

Toward #98

What does this implement/fix?

Implement the exact p-value of the Cramer-von Mises two-sample test for a given value of the test statistic from _pval_cvm_2samp_exact

https://en.wikipedia.org/wiki/Cram%C3%A9r%E2%80%93von_Mises_criterion

@fbourgey fbourgey changed the title ENH: Implement ENH: Implement the exact p-value of the Cramer-von Mises two-sample test Mar 13, 2026
Comment thread include/xsf/stats.h Outdated
Comment thread include/xsf/stats.h
Comment on lines +330 to +357
for (int64_t u = 0; u < n + 1; ++u) {
std::vector<FreqTable> next_gs;
FreqTable tmp;
for (int64_t v = 0; v < m + 1; ++v) {
// Calculate g recursively with eq. 11 in [1]. Even though it
// doesn't look like it, this also does 12/13 (all of Algorithm 1).
const FreqTable &g = gs[v];
// Merge tmp and g: for common keys sum frequencies,
// keep unique keys from both sides.
// (equivalent to np.intersect1d + concatenate logic)
FreqTable merged;
for (const auto &[key, freq] : tmp) {
merged[key] += freq;
}
for (const auto &[key, freq] : g) {
merged[key] += freq;
}
int64_t diff = a * v - b * u;
int64_t res = diff * diff;
// tmp[0] += res (shift all keys by res)
tmp.clear();
for (const auto &[key, freq] : merged) {
tmp[key + res] += freq;
}
next_gs.push_back(tmp);
}
gs = std::move(next_gs);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide additional information that helps the reader compare this to either the original code, which is pretty hard to follow, or the algorithm, which looks simple but has terms like $I[a^2u(u+1)(2u + 1)/6]$ and $I[b^2u(u+1)(2u + 1)/6]$ that don't obviously appear in either version? If it would be cleaner, feel free to express this in terms of Algorithm 1 from the paper rather than the Python code. There are a lot of aspects of the Python that don't look like a very efficient expression of the algorithm (lots of placeholding size-zero array, use of arrays instead of dicts, use of delete, etc.), so if it can't be translated directly, it is probably not worth trying to maintain a resemblance.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; I've found the Python code quite difficult to follow. I’ll provide more detail and include screenshots of the relevant sections of the paper in the coming days.

Comment thread include/xsf/stats.h
Comment on lines +315 to +318
using FreqTable = std::map<int64_t, int64_t>;
// the frequency table of g_{u, v}^+ defined in [1, p. 6]
// gs[0] = {0: 1}, gs[1..m] = empty
std::vector<FreqTable> gs(m + 1);
Copy link
Copy Markdown
Collaborator

@steppi steppi Mar 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdhaber, you want this to be able to work in CuPy right? std::map and std::vector aren't available in CUDA.

Also, rather than putting this whole thing into a scalar kernel, and making a ufunc out it, it seems like it may be better to take the idea of the original function and decompose it into the ufuncs that are needed to make it work. I think the key component here would be a gufunc to generate the frequency table. lcm and the binomial coefficients could be handled through delegation, and the rest looks like it could be made array-agnostic.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you want this to be able to work in CuPy right?

Yes. OK, good to know.

Comment thread include/xsf/stats.h
*/
inline double pval_cvm_2samp_exact(double s, int64_t m, int64_t n) {
// [1, p. 3]
int64_t lcm = std::lcm(m, n);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Page 3 in [1]: Definition of the variable lcm. This is $L$ in [1].

Image

Comment thread include/xsf/stats.h
Comment on lines +305 to +306
int64_t a = lcm / m;
int64_t b = lcm / n;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Page 4 in [1]: definitions of $a$ and $b$

Image

Comment thread include/xsf/stats.h
Comment on lines +307 to +311
// Combine Eq. 9 in [2] with Eq. 2 in [1] and solve for $\zeta$
// Hint: `s` is $U$ in [2], and $T_2$ in [1] is $T$ in [2]
int64_t mn = m * n;
// Uses double floor division since s is double
int64_t zeta = std::floor((lcm * lcm * (m + n) * (6.0 * s - mn * (4.0 * mn - 1))) / (6.0 * mn * mn));
Copy link
Copy Markdown
Member Author

@fbourgey fbourgey Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using consistent notations and following the Python comment:
$s$ is $U$ in [2] and $T_2$ in [1] same as $T$ in [2]

From [1, page 3]

Image

Rewriting in terms of $\zeta$, we have

$$ \zeta = \frac{(n+n)^2 L^2}{mn} T_2 $$

and from [2, eq. 9] (and using the same notations as in [1])

Image

$$ T_2 = \frac{s}{nm(n+m)} - \frac{4mn-1}{6(m+n)} $$

Simple algebra gives

$$ \zeta = \frac{(m+n)L^2}{6 m^2 n^2} (6 s - (4mn - 1) mn) $$

We then floor as $\zeta$ is an integer-valued statistic.

Comment thread include/xsf/stats.h
Comment on lines +318 to +319
std::vector<FreqTable> gs(m + 1);
gs[0][0] = 1;
Copy link
Copy Markdown
Member Author

@fbourgey fbourgey Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize the frequency tables of $\zeta$ for paths reaching $(u,v)$.

Each table is a $2 \times k$ array where $k =$ number of distinct values of $\zeta$ at $(u,v)$ ie

g = [[zeta_1, zeta_2, ..., zeta_k],
     [count_1, count_2, ..., count_k]]

row 0 = distinct values of $\zeta$ and row 1 = how many paths give each value

Initially: $g_{0,0}$ = {(0,1)}

@fbourgey fbourgey marked this pull request as draft March 31, 2026 20:07
@fbourgey fbourgey marked this pull request as draft March 31, 2026 20:07
Comment thread include/xsf/stats.h
Comment on lines +348 to +357
// (equivalent to return np.float64(np.sum(freq[value >= zeta]) / combinations))
const FreqTable &final_table = gs[m];
int64_t sum_freq = 0;
for (const auto &[value, freq] : final_table) {
if (value >= zeta) {
sum_freq += freq;
}
}
return static_cast<double>(sum_freq) / static_cast<double>(combinations);
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compute p-value

$$ \mathbb{P}(\zeta \geq \zeta_{obs}) = \frac{\text{number of paths with } \zeta \geq \zeta_{obs}}{\binom{m+n}{m}} $$

Comment thread include/xsf/stats.h
Comment on lines +337 to +343
int64_t diff = a * v - b * u;
int64_t res = diff * diff;
// tmp[0] += res (shift all keys by res)
tmp.clear();
for (const auto &[key, freq] : merged) {
tmp[key + res] += freq;
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apply shift $S_{d^2}$. From Eq. 11 of [1]

Image

Comment thread include/xsf/stats.h
Comment on lines +326 to +336
const FreqTable &g = gs[v];
// Merge tmp and g: for common keys sum frequencies,
// keep unique keys from both sides.
// (equivalent to np.intersect1d + concatenate logic)
FreqTable merged;
for (const auto &[key, freq] : tmp) {
merged[key] += freq;
}
for (const auto &[key, freq] : g) {
merged[key] += freq;
}
Copy link
Copy Markdown
Member Author

@fbourgey fbourgey Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Translation of the original Python code.

    vi, i0, i1 = np.intersect1d(tmp[0], g[0], return_indices=True)
    tmp = np.concatenate([
        np.stack([vi, tmp[1, i0] + g[1, i1]]),
        np.delete(tmp, i0, 1),
        np.delete(g, i1, 1)
    ], 1)
  • First line: find common $\zeta$ values between $g_{u,v-1}$ and $g_{u-1,v}$
  • Second line: this implements $g_{u,v-1} + g_{u-1,v}$

Comment thread include/xsf/stats.h
FreqTable tmp;
for (int64_t v = 0; v < m + 1; ++v) {
// Calculate g recursively with eq. 11 in [1]. Even though it
// doesn't look like it, this also does 12/13 (all of Algorithm 1).
Copy link
Copy Markdown
Member Author

@fbourgey fbourgey Mar 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that (12) is not needed, as this is produced automatically by how gs and tmp are initialized.

For $v=0$, we have $d^2 = (av-bu)^2 = b^2 u^2$. Hence,

$$ g_{u,0} = S_{b^2 u^2} (g_{u-1,0}). $$

Starting from $g_{0,0} = I[0]$, at step $u=1$, $g_{1,0} = I[b^2]$, ... and for a general,

$$ g_{u,0} = I\left[\sum_{j=1}^{u} (bj)^2\right] = b^2 \frac{u(u+1)(2u+1)}{6} $$

Here, we use that $S_d(I[c]) = I[c+d])$

@fbourgey fbourgey changed the title ENH: Implement the exact p-value of the Cramer-von Mises two-sample test ENH: Implement the exact p-value of the Cramér-von Mises two-sample test Apr 2, 2026
@fbourgey fbourgey added the enhancement New feature or request label Apr 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants