Saturday, November 29, 2025

p-value vs. significance level: untangling the statistics that even researchers confuse

 

To understand the issue of the p-value and the significance level, we need a bit of statistical history. There are two schools of thought on hypothesis testing. The first was popularized by Ronald A. Fisher in the 1920s. Fisher saw the p-value not as part of a formal procedure for testing hypotheses, but as an informal measure of how surprising the data are under the hypothesis being tested. The p-value, combined with the researcher's knowledge of the subject and their research experience, is useful for interpreting new data.

Figure 1. The Logic of the p-value

Note: The p-value, in the figure, is the shaded area under the normal distribution curve, representing the probability of observing a test result as extreme as or more extreme than the observed value when the null hypothesis is true.
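To make this concrete, suppose the test statistic in Figure 1 is a z-score that follows a standard normal distribution when the null hypothesis is true. A minimal Python sketch (assuming a two-sided test; the z-value is made up for illustration) computes the shaded area directly:

# Minimal sketch: the p-value of a two-sided z-test.
# Assumes the test statistic is standard normal under the null hypothesis.
from scipy.stats import norm

def p_value_two_sided(z):
    """Probability, under H0, of a result at least as extreme as |z|."""
    return 2 * norm.sf(abs(z))  # sf(x) = 1 - cdf(x), the upper-tail area

print(p_value_two_sided(2.1))  # ~0.036: the total shaded area in both tails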

After Fisher's work was presented, Jerzy Neyman and Egon Pearson approached the question differently. It is important to remember that in science it is crucial to limit two types of errors: false positives (when you conclude that something is real when it is not) and false negatives (when you conclude that something real is not there).

As an example, consider a laboratory test for diagnosing a certain disease. The test can produce two types of errors: a false positive, when it says the patient is sick but they are not; and a false negative, when it says the patient is not sick but they are. In statistics, the convention is to call the false positive a Type I error and the false negative a Type II error, as shown in the scheme presented in Table 1.

·     Type I Error (false positive): concluding that a treatment has an effect when in fact it has none.

·     Type II Error (false negative): concluding that a treatment has no effect when in fact it has one.

Table 1. Type I Error and Type II Error

False positives and false negatives are errors, but it is impossible to eliminate them entirely. If you are eager to declare treatment effects, you will tend to produce more false positives (i.e., commit more Type I errors); if you are conservative about declaring treatment effects, you will tend to produce more false negatives (commit more Type II errors).

Neyman and Pearson reasoned that, although it is impossible to eliminate false positives and false negatives, it is possible to develop a decision-making process that guarantees false positives will occur with a pre-defined probability. They called this probability the significance level, denoted by the Greek letter α. Their proposal was that researchers would define α based on their experiences and expectations. Thus, someone willing to tolerate a 10% probability of false positives would set α = 0.1, but if they needed to be more conservative, they could set α at 0.01 or less.

Figure 2. The α vs. β Trade-off

Note: By decreasing α (making the test more rigorous), the area under the H1 distribution that falls outside the rejection region, i.e., β, increases, and vice versa.
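The trade-off in Figure 2 can be checked numerically. The sketch below assumes a one-sided z-test with H0: μ = 0 versus H1: μ = 0.5, a known standard deviation of 1, and a sample of 25 observations (all of these numbers are illustrative, not from the text):

# Sketch of the alpha-beta trade-off for a one-sided z-test.
# Assumed setup: H0: mu = 0, H1: mu = 0.5, sigma = 1, n = 25.
from scipy.stats import norm

mu1, sigma, n = 0.5, 1.0, 25
shift = mu1 * n**0.5 / sigma          # distance between H0 and H1 in z units

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)      # rejection threshold under H0
    beta = norm.cdf(z_crit - shift)   # area of H1 outside the rejection region
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.3f}")

As α shrinks from 0.10 to 0.01, β grows from roughly 0.11 to roughly 0.43 in this setup, which is exactly the trade-off the figure describes.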

How does this work in practice? In the Neyman-Pearson system, once the null and alternative hypotheses are defined, the significance level α is set. Then a statistical test is applied (usually with a computer program) to determine the probability of obtaining a result as extreme as or more extreme than the one found in the sample, when the null hypothesis is true; that is, to determine the p-value. The Neyman-Pearson procedure consists of rejecting the null hypothesis whenever the p-value is less than or equal to the significance level α.
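Written out as code, the procedure is only a few lines. The data and the choice of test below are made up for illustration (a two-sample t-test comparing a treatment group with a control group):

# Sketch of the Neyman-Pearson decision procedure.
# The data and alpha are illustrative, not taken from the text.
from scipy.stats import ttest_ind

alpha = 0.05                      # significance level, fixed before looking at the data
control   = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2]
treatment = [5.6, 5.4, 5.9, 5.7, 5.5, 5.8]

stat, p = ttest_ind(treatment, control)

# The decision is binary: reject H0 if p <= alpha, otherwise do not reject it.
if p <= alpha:
    print(f"p = {p:.4f} <= {alpha}: reject the null hypothesis")
else:
    print(f"p = {p:.4f} > {alpha}: do not reject the null hypothesis")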

Therefore, unlike Fisher's procedure, this method deliberately does not use the strength of the evidence obtained in a particular experiment; it simply decides to reject the hypothesis if p ≤ α. The size of the p-value is not used to compare experiments, nor to draw conclusions beyond "The null hypothesis should (or should not) be rejected."

Although the Neyman and Pearson approach is conceptually different from Fisher's, researchers merged the two. The Neyman-Pearson approach is where "statistical significance" comes from: a pre-chosen α that fixes the long-run probability of false positives. In practice, people use the Neyman-Pearson threshold (generally α = 0.05) to make a binary decision, but then interpret the calculated p-value (e.g., 0.032) in the spirit of Fisher, as a measure of the strength of the evidence. This fusion is convenient, but it is a source of much confusion.

For example, suppose you conduct an experiment and obtain a p-value of 0.032. If your threshold is the conventional α = 0.05, you have obtained a statistically significant result. It is tempting, though wrong, to say "the probability of a Type I error is 3.2%." This does not make sense, because a single experiment does not determine a Type I error probability; that probability is a property of the decision procedure and is fixed in advance by α. When comparing your experiment with others, the only false-positive guarantee you can cite is α.
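A small simulation (illustrative, not from the text) makes the distinction concrete: when the null hypothesis is true, the long-run proportion of rejections is governed by α alone, no matter what p-value any single experiment happens to produce.

# Illustrative simulation: under a true null hypothesis, the long-run
# false-positive rate is alpha, not the p-value of any single experiment.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha, n_experiments = 0.05, 20_000

p_values = []
for _ in range(n_experiments):
    # Both groups come from the same population, so H0 is true by construction.
    a = rng.normal(0.0, 1.0, 30)
    b = rng.normal(0.0, 1.0, 30)
    p_values.append(ttest_ind(a, b)[1])

p_values = np.array(p_values)
print((p_values <= alpha).mean())      # ~0.05: the Type I error rate equals alpha
print(p_values.min(), p_values.max())  # individual p-values range widely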

One more consideration is necessary: we said that decreasing one type of error increases the other, for the same problem solved with the same significance test. But some tests are more powerful than others. The power of a test is defined as the probability of rejecting the null hypothesis when it is false, i.e., 1 − β. Therefore, the best test is the one with the lowest probability of false negatives for a fixed value of α.
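As an illustration (the effect size, sample size, and choice of test below are assumptions, not from the text), the power of a test can be estimated by simulating experiments in which the alternative hypothesis is true and counting how often the null hypothesis is rejected:

# Sketch: estimating the power of a two-sample t-test by simulation.
# Power = probability of rejecting H0 when there really is an effect (here, 0.5 SD).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, effect, n, n_sims = 0.05, 0.5, 30, 5_000

rejections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(effect, 1.0, n)   # the alternative hypothesis is true
    if ttest_ind(treated, control)[1] <= alpha:
        rejections += 1

print(rejections / n_sims)   # estimated power, 1 - beta; the remainder is beta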

Reference
REINHART, A. Statistics Done Wrong. San Francisco: No Starch Press, 2015.

 
