To understand the issue of the p-value and the significance level, we need a bit of statistical history. There are two schools of thought on hypothesis testing. The first was popularized by Ronald A. Fisher in the 1920s. Fisher saw the p-value not as part of a formal procedure for testing hypotheses, but as an informal measure of how surprising a given dataset is. The p-value, combined with the researcher's knowledge of the subject and their research experience, is useful for interpreting new data.
Figure 1. The Logic of the p-value
Note: In the figure, the p-value is the shaded area under the normal distribution.
After Fisher's work was presented, Jerzy Neyman and Egon Pearson approached the question differently. It's important to remember that in science it is crucial to limit two types of errors: false positives (concluding that an effect is real when it is not) and false negatives (concluding that a real effect does not exist).
As an example, consider a laboratory test for diagnosing a certain disease. The test can produce two types of errors: a false positive, when it says the patient is sick but they are not; and a false negative, when it says the patient is not sick but they are. In statistics, the convention is to call the false positive a Type I error and the false negative a Type II error, as shown in the scheme presented in Table 1.
· Type I Error (false positive): when you say a treatment has an effect (affirm it) but that treatment has no effect.
· Type II Error (false negative): when you say a treatment has no effect (deny it) but that treatment does have an effect.
Table 1. Type I Error and Type II Error
                          The treatment has an effect       The treatment has no effect
You affirm an effect      Correct decision                  Type I error (false positive)
You deny an effect        Type II error (false negative)    Correct decision
False positives and false negatives are errors, but it is impossible to eliminate them entirely. If you rush to find treatment effects, you will be prone to find more false positives (i.e., commit more Type I errors); if you are conservative, not rushing to point out treatment effects, you will be prone to obtain more false negatives (commit more Type II errors).
Neyman and Pearson reasoned that, although it is impossible to eliminate false positives and false negatives, it is possible to develop a decision-making process that guarantees false positives will occur with a pre-defined probability. They called this probability the significance level, denoted by the Greek letter α. Their proposal was that researchers would define α based on their experience and expectations. Thus, someone willing to tolerate a 10% probability of false positives would set α = 0.1, but if they needed to be more conservative, they could set α at 0.01 or less.
Figure 2. The α vs. β Trade-off
Note: By decreasing α (making the test more rigorous), the area of H1 that is not in the rejection region, i.e., β, increases, and vice-versa.
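To put numbers on the trade-off in Figure 2, here is a minimal sketch in Python, assuming a one-sided z-test of H0: μ = 0 against the specific alternative μ = 0.5, with known σ = 1 and n = 9 observations; these values are arbitrary choices made only so that the effect is visible. For each α, the rejection threshold for the sample mean is computed under H0, and β is the probability, under the alternative, of falling below that threshold.

from scipy import stats

# Illustrative assumptions: sigma, n, and the alternative mean are not from the original text
sigma, n, mu_alt = 1.0, 9, 0.5
se = sigma / n ** 0.5  # standard error of the sample mean

for alpha in (0.10, 0.05, 0.01):
    # Rejection threshold for the sample mean under H0: mu = 0
    critical = stats.norm.ppf(1 - alpha, loc=0.0, scale=se)
    # beta = probability of NOT rejecting H0 when the alternative mu = 0.5 is true
    beta = stats.norm.cdf(critical, loc=mu_alt, scale=se)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.2f}")

As α shrinks, the rejection threshold moves to the right and β grows, which is exactly the trade-off the figure depicts.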
How does this work in practice? In the Neyman-Pearson system, once the null and alternative hypotheses are defined, the significance level α is set. Then a statistical test is applied, usually with a computer program, to determine the probability of obtaining a result at least as extreme as the one found in the sample when the null hypothesis is true; that is, to determine the p-value. The Neyman-Pearson procedure consists of rejecting the null hypothesis whenever the p-value is less than or equal to the significance level α.
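As a concrete illustration of this decision rule, here is a minimal sketch in Python using SciPy's two-sample t-test on simulated data; the sample sizes, group means, standard deviation, random seed, and α = 0.05 are assumptions made only for the example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=30)    # simulated control group
treatment = rng.normal(loc=11.0, scale=2.0, size=30)  # simulated treated group

alpha = 0.05  # significance level chosen before looking at the data

# Two-sided test of H0: the two population means are equal
t_stat, p_value = stats.ttest_ind(treatment, control)

if p_value <= alpha:
    print(f"p = {p_value:.3f} <= alpha = {alpha}: reject H0")
else:
    print(f"p = {p_value:.3f} > alpha = {alpha}: do not reject H0")

The decision depends only on whether p ≤ α; as the next paragraph notes, the particular size of the p-value plays no further role in this framework.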
Therefore, unlike Fisher's procedure, this method deliberately does not use the strength of the evidence obtained in a particular experiment; it simply decides to reject the hypothesis if p ≤ α. The size of the p-value is not used to compare experiments, nor to draw conclusions beyond "The null hypothesis should (or should not) be rejected."
Although the Neyman-Pearson approach is conceptually different from Fisher's, researchers merged the two. The Neyman-Pearson approach is where "statistical significance" comes from, with a pre-chosen α value that guarantees the long-term probability of false positives. In practice, people use the Neyman-Pearson threshold (generally α = 0.05) to make a binary decision, but then interpret the calculated p-value (e.g., 0.032) in the spirit of Fisher, as a measure of the strength of the evidence. This fusion is convenient but a source of much confusion.
For example, suppose you conduct an experiment and obtain a p-value of 0.032. If your threshold is the conventional α = 0.05, you have obtained a statistically significant result. It is tempting, though wrong, to say "the probability of a Type I error is 3.2%." This doesn't make sense because a single experiment does not determine a Type I error probability; the false-positive rate is a property of the decision procedure as a whole, so when comparing your experiment with others the only relevant quantity is α.
Another consideration is necessary: we said that when you decrease one type of error, the other increases, considering the same problem being solved by the same significance test. But some tests are more powerful than others. The power of a test is defined as the probability of rejecting the null hypothesis when it is false. Therefore, the best test is the one that has the lowest probability of false negatives, that is, the highest power, for a fixed value of α.
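As a rough illustration of comparing the power of two tests at the same α, here is a minimal Monte Carlo sketch in Python: it simulates many experiments in which the null hypothesis is false and counts how often a two-sample t-test and a Mann-Whitney U test each reject it. The effect size, sample size, number of simulations, and the choice of these two particular tests are assumptions made only for this example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 25, 2000
rejections_t = 0
rejections_u = 0

for _ in range(n_sims):
    control = rng.normal(0.0, 2.0, n)
    treatment = rng.normal(1.0, 2.0, n)   # H0 is false: the true difference is 1.0
    if stats.ttest_ind(treatment, control).pvalue <= alpha:
        rejections_t += 1
    if stats.mannwhitneyu(treatment, control, alternative="two-sided").pvalue <= alpha:
        rejections_u += 1

# The fraction of rejections estimates the power (1 - beta) of each test
print(f"Estimated power, t-test:       {rejections_t / n_sims:.2f}")
print(f"Estimated power, Mann-Whitney: {rejections_u / n_sims:.2f}")

Under these assumptions (normally distributed data), the t-test tends to show slightly higher estimated power, i.e., a lower β at the same α, which is the sense in which one test can be better than another.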
Reference
REINHART, A. Statistics Done Wrong. San Francisco: No Starch Press, 2015.