Improve hypothesis tests, make them consistent with confidence intervals
Inconsitency between tests and confidence intervals
In their current implementation, the bootstrap p-values obtained from get_p_value are not consistent with the boostrap confidence intervals obtained from get_confidence_interval. You can end up with a p-value that suggests significance while the confidence interval includes the null (or vice-versa).
Here's an example:
set.seed(21922)
library(tidymodels)
# Compute the sample mean:
point_estimate <- gss |>
specify(response = hours) |>
calculate(stat = "mean")
# Generate samples from the null distribution:
null_dist <- gss |>
specify(response = hours) |>
hypothesize(null = "point", mu = 40) |>
generate(reps = 999, type = "bootstrap") |>
calculate(stat = "mean")
null_dist |> get_p_value(obs_stat = point_estimate,
direction = "two_sided")
The p-value is 0.04 ($H_0: \mu=40$ is rejected), but 40 is contained in the confidence interval (38.7, 41.3). This can occur also if the "se" or "bias-corrected" methods are used.
A better option for computing p-values would be to use confidence interval inversion, using that the p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval. See Section 14.2 of Thulin (2024) or Section 3.12 in Hall (1992) for details.
Poor performance of hypothesis test
The hypothesis test performed by get_p_value (as shown in the documentation) is very naive, and has poor statistical performance. I ran simulations for some non-normal distributions to check its type I error rate compared to similar tests. For each distribution, I simulated 100,000 samples under the null hypothesis. For each of these, I then ran the tests at the 5% significance level, and checked for how many runs each test yielded a significant result. The closer this proportion is to 0.05, the better the test performs in terms of the type I error rate. Here are the results for different combinations of distributions and sample sizes:
| Sample size |
t-test |
tidymodels test |
boot.pval studentised test |
boot.pval percentile test |
| Bimodal, n=25 |
0.0517 |
0.0828 |
0.0394 |
0.0696 |
| Bimodal, n=50 |
0.0508 |
0.06934 |
0.0502 |
0.0640 |
| Right-skew, n = 25 |
0.0700 |
0.1012 |
0.0656 |
0.0885 |
| Right-skew, n = 50 |
0.0602 |
0.0812 |
0.0577 |
0.0721 |
| Left-skew, n = 25 |
0.0681 |
0.0998 |
0.0639 |
0.0869 |
| Left-skew, n = 50 |
0.0587 |
0.0793 |
0.0560 |
0.0718 |
| Very right-skew, n = 25 |
0.0850 |
0.1240 |
0.0639 |
0.0980 |
| Very right-skew, n = 50 |
0.0682 |
0.0936 |
0.0552 |
0.0784 |
None of the tests perform perfectly. But the get_p_value test actually always performs worse than a regular t-test here! This could be avoided e.g. by using a inverted bootstrap t confidence interval (see issue #570 ) to compute the p-value (which is what the studentised test from the boot.pval package does).
Improve hypothesis tests, make them consistent with confidence intervals
Inconsitency between tests and confidence intervals
In their current implementation, the bootstrap p-values obtained from
get_p_valueare not consistent with the boostrap confidence intervals obtained fromget_confidence_interval. You can end up with a p-value that suggests significance while the confidence interval includes the null (or vice-versa).Here's an example:
The p-value is 0.04 ($H_0: \mu=40$ is rejected), but 40 is contained in the confidence interval (38.7, 41.3). This can occur also if the
"se"or"bias-corrected"methods are used.A better option for computing p-values would be to use confidence interval inversion, using that the p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval. See Section 14.2 of Thulin (2024) or Section 3.12 in Hall (1992) for details.
Poor performance of hypothesis test
The hypothesis test performed by
get_p_value(as shown in the documentation) is very naive, and has poor statistical performance. I ran simulations for some non-normal distributions to check its type I error rate compared to similar tests. For each distribution, I simulated 100,000 samples under the null hypothesis. For each of these, I then ran the tests at the 5% significance level, and checked for how many runs each test yielded a significant result. The closer this proportion is to 0.05, the better the test performs in terms of the type I error rate. Here are the results for different combinations of distributions and sample sizes:None of the tests perform perfectly. But the
get_p_valuetest actually always performs worse than a regular t-test here! This could be avoided e.g. by using a inverted bootstrap t confidence interval (see issue #570 ) to compute the p-value (which is what the studentised test from theboot.pvalpackage does).