Improve hypothesis tests, make them consistent with confidence intervals.

## Improve hypothesis tests, make them consistent with confidence intervals

### Inconsitency between tests and confidence intervals
In their current implementation, the bootstrap p-values obtained from `get_p_value` are not consistent with the boostrap confidence intervals obtained from `get_confidence_interval`.  You can end up with a p-value that suggests significance while the confidence interval includes the null (or vice-versa).

Here's an example:

```
set.seed(21922) 
library(tidymodels)

# Compute the sample mean:
point_estimate <- gss |> 
    specify(response = hours) |>
    calculate(stat = "mean")

# Generate samples from the null distribution:
null_dist <- gss |>
    specify(response = hours) |>
    hypothesize(null = "point", mu = 40) |>
    generate(reps = 999, type = "bootstrap") |>
    calculate(stat = "mean")

null_dist |> get_p_value(obs_stat = point_estimate,
                         direction = "two_sided")

```
The p-value is 0.04 ($H_0: \mu=40$ is rejected), but 40 is contained in the confidence interval (38.7, 41.3). This can occur also if the `"se"` or `"bias-corrected"` methods are used.

A better option for computing p-values would be to use confidence interval inversion, using that the p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval. See Section 14.2 of [Thulin (2024)](https://www.modernstatisticswithr.com/mathschap.html#confintequal) or Section 3.12 in [Hall (1992)](https://link.springer.com/book/10.1007/978-1-4612-4384-7) for details.

### Poor performance of hypothesis test
The hypothesis test performed by `get_p_value` (as shown in the documentation) is very naive, and has poor statistical performance. I ran simulations for some non-normal distributions to check its type I error rate compared to similar tests. For each distribution, I simulated 100,000 samples under the null hypothesis. For each of these, I then ran the tests at the 5% significance level, and checked for how many runs each test yielded a significant result. The closer this proportion is to 0.05, the better the test performs in terms of the type I error rate. Here are the results for different combinations of distributions and sample sizes:

| **Sample size** | **t-test** | **tidymodels test** | **boot.pval studentised test** | **boot.pval percentile test** | 
| :--- | :--- | :--- | :--- | :--- | 
Bimodal, n=25 | 0.0517 | 0.0828 | 0.0394 |  0.0696 |
Bimodal, n=50 | 0.0508 | 0.06934 | 0.0502 | 0.0640  |
Right-skew, n = 25 | 0.0700 | 0.1012 |  0.0656 | 0.0885 |
Right-skew, n = 50 | 0.0602 | 0.0812 | 0.0577  | 0.0721 |
Left-skew, n = 25 | 0.0681 |  0.0998 |  0.0639 | 0.0869 |
Left-skew, n = 50 | 0.0587 | 0.0793  |  0.0560 | 0.0718 |
Very right-skew, n = 25 | 0.0850 | 0.1240 | 0.0639 | 0.0980 |
Very right-skew, n = 50 | 0.0682 | 0.0936 | 0.0552 | 0.0784 |

None of the tests perform perfectly. But the `get_p_value` test actually always performs worse than a regular t-test here! This could be avoided e.g. by using a inverted bootstrap t confidence interval (see issue #570 ) to compute the p-value (which is what the studentised test from the `boot.pval` package does).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve hypothesis tests, make them consistent with confidence intervals. #571

Improve hypothesis tests, make them consistent with confidence intervals

Inconsitency between tests and confidence intervals

Poor performance of hypothesis test

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Sample size	t-test	tidymodels test	boot.pval studentised test	boot.pval percentile test
Bimodal, n=25	0.0517	0.0828	0.0394	0.0696
Bimodal, n=50	0.0508	0.06934	0.0502	0.0640
Right-skew, n = 25	0.0700	0.1012	0.0656	0.0885
Right-skew, n = 50	0.0602	0.0812	0.0577	0.0721
Left-skew, n = 25	0.0681	0.0998	0.0639	0.0869
Left-skew, n = 50	0.0587	0.0793	0.0560	0.0718
Very right-skew, n = 25	0.0850	0.1240	0.0639	0.0980
Very right-skew, n = 50	0.0682	0.0936	0.0552	0.0784

Improve hypothesis tests, make them consistent with confidence intervals. #571

Description

Improve hypothesis tests, make them consistent with confidence intervals

Inconsitency between tests and confidence intervals

Poor performance of hypothesis test

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions