Skip to content

Improve hypothesis tests, make them consistent with confidence intervals. #571

@mthulin

Description

@mthulin

Improve hypothesis tests, make them consistent with confidence intervals

Inconsitency between tests and confidence intervals

In their current implementation, the bootstrap p-values obtained from get_p_value are not consistent with the boostrap confidence intervals obtained from get_confidence_interval. You can end up with a p-value that suggests significance while the confidence interval includes the null (or vice-versa).

Here's an example:

set.seed(21922) 
library(tidymodels)

# Compute the sample mean:
point_estimate <- gss |> 
    specify(response = hours) |>
    calculate(stat = "mean")

# Generate samples from the null distribution:
null_dist <- gss |>
    specify(response = hours) |>
    hypothesize(null = "point", mu = 40) |>
    generate(reps = 999, type = "bootstrap") |>
    calculate(stat = "mean")

null_dist |> get_p_value(obs_stat = point_estimate,
                         direction = "two_sided")

The p-value is 0.04 ($H_0: \mu=40$ is rejected), but 40 is contained in the confidence interval (38.7, 41.3). This can occur also if the "se" or "bias-corrected" methods are used.

A better option for computing p-values would be to use confidence interval inversion, using that the p-value of the two-sided test for the parameter theta is the smallest alpha such that theta is not contained in the corresponding 1-alpha confidence interval. See Section 14.2 of Thulin (2024) or Section 3.12 in Hall (1992) for details.

Poor performance of hypothesis test

The hypothesis test performed by get_p_value (as shown in the documentation) is very naive, and has poor statistical performance. I ran simulations for some non-normal distributions to check its type I error rate compared to similar tests. For each distribution, I simulated 100,000 samples under the null hypothesis. For each of these, I then ran the tests at the 5% significance level, and checked for how many runs each test yielded a significant result. The closer this proportion is to 0.05, the better the test performs in terms of the type I error rate. Here are the results for different combinations of distributions and sample sizes:

Sample size t-test tidymodels test boot.pval studentised test boot.pval percentile test
Bimodal, n=25 0.0517 0.0828 0.0394 0.0696
Bimodal, n=50 0.0508 0.06934 0.0502 0.0640
Right-skew, n = 25 0.0700 0.1012 0.0656 0.0885
Right-skew, n = 50 0.0602 0.0812 0.0577 0.0721
Left-skew, n = 25 0.0681 0.0998 0.0639 0.0869
Left-skew, n = 50 0.0587 0.0793 0.0560 0.0718
Very right-skew, n = 25 0.0850 0.1240 0.0639 0.0980
Very right-skew, n = 50 0.0682 0.0936 0.0552 0.0784

None of the tests perform perfectly. But the get_p_value test actually always performs worse than a regular t-test here! This could be avoided e.g. by using a inverted bootstrap t confidence interval (see issue #570 ) to compute the p-value (which is what the studentised test from the boot.pval package does).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions