1. Introduction
An analysis of variance (ANOVA) can indicate whether significant differences
exist among group means, but it does not identify which specific groups differ
from each other. To determine this, we use post hoc tests, also known as
multiple comparison procedures.
Among the most well-known are Tukey’s test, the Student-Newman-Keuls (SNK) test, and
the Duncan Multiple Range Test (MRT). While all three compare means after a
significant ANOVA, they operate on different statistical principles. This
article explores what each test offers and the situations where each may be
most appropriate.
2. Approach to Statistical Significance
· Tukey’s HSD (Honestly Significant Difference): A conservative test that strictly
controls the family-wise error rate (FWER), which is the probability of making at
least one Type I error (falsely rejecting a true null hypothesis) across the whole
set of comparisons. In practice, this means Tukey’s test tends to flag fewer
differences as statistically significant than SNK or Duncan, but it offers
strong protection against false positives.
· Student-Newman-Keuls (SNK): A less conservative test than Tukey’s HSD. It employs a
sequential stepwise procedure that offers greater statistical power (the ability
to detect a true effect) than Tukey, but at the cost of a higher risk of Type I
errors across the set of comparisons. It applies the nominal α at every step, which
controls the family-wise error rate only when all group means are truly equal
(weak control), not when some means differ and others do not.
· Duncan’s MRT (Multiple Range Test): The most liberal of the three tests. It is
designed to maximize power and therefore tends to identify the largest number of
significant differences. However, this comes with a substantially increased risk of
false positives (Type I errors). For this reason, its use is often discouraged in
formal, confirmatory research contexts.
👉 In terms of statistical power:
· Tukey may fail to detect subtle but true differences (higher risk of Type II errors).
· SNK offers a middle ground, often revealing patterns that Tukey misses.
· Duncan is the most powerful but deliberately increases the Type I error rate to achieve this.
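To see why the family-wise error rate matters, it helps to compute how quickly it grows when every pairwise comparison is run at an unadjusted α. The sketch below is only an illustration: the function names are hypothetical, and it assumes the comparisons are independent, which pairwise comparisons sharing a group are not, so the figures are approximations rather than exact rates.

```python
def n_pairwise(k: int) -> int:
    """Number of distinct pairs among k group means."""
    return k * (k - 1) // 2


def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Chance of at least one Type I error across independent tests,
    each run at level alpha (independence is a simplifying assumption)."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


for k in (3, 4, 6):
    m = n_pairwise(k)
    fwer = familywise_error_rate(0.05, m)
    print(f"{k} groups, {m} comparisons: FWER is roughly {fwer:.3f}")
```

With four groups there are six pairwise comparisons, and under these assumptions the chance of at least one false positive is already about 0.26 rather than 0.05. This is the inflation that Tukey’s HSD is built to prevent.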
3. Key Conceptual Differences
· Tukey: Uses a single critical value from the studentized range distribution for
all pairwise comparisons, rigorously controlling the experiment-wise error rate.
· SNK: Uses a sequential (stepwise) procedure. It ranks the means and applies
different critical values based on how many ordered means each comparison spans, so
adjacent means face a smaller critical value than the two extremes. It does not
control the experiment-wise error rate when only some of the means truly differ.
· Duncan: Also uses a stepwise procedure but varies the significance level at each
step. A comparison spanning p ordered means is tested at 1 − (1 − α)^(p − 1), so the
level equals α for adjacent means and becomes more liberal (larger) as the means
being compared lie farther apart in the ranking. This approach requires specialized
tables, such as those found in Harter (1960).
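The level that Duncan’s procedure applies to a range of p ordered means is usually given as 1 − (1 − α)^(p − 1). A short sketch, with a hypothetical function name, shows how that level changes with the span, next to the fixed level SNK uses:

```python
def duncan_alpha(alpha: float, p: int) -> float:
    """Significance level Duncan's test applies to a range spanning
    p ordered means: 1 - (1 - alpha)^(p - 1)."""
    return 1.0 - (1.0 - alpha) ** (p - 1)


alpha = 0.05
for p in (2, 3, 4):
    # SNK keeps the nominal alpha at every span; Duncan's level grows with p.
    print(f"span p={p}: SNK level = {alpha:.4f}, Duncan level = {duncan_alpha(alpha, p):.4f}")
```

For adjacent means (p = 2) both procedures use the nominal α, which is why their critical values coincide at that step. For wider spans Duncan’s level exceeds α, so its critical ranges are smaller than those of SNK.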
👉 In summary:
· Tukey: Conservative; controls the overall family-wise error.
· SNK: Moderate; the same α at every step, with only weak overall error control.
· Duncan: Liberal; α grows with the number of means spanned, maximizing the chance of detecting differences.
4. General Procedure
The detailed computational procedures for each test are beyond the scope of
this text (and are typically handled by software). In brief:
· All three tests are based on the studentized range statistic; they differ in
which critical value is applied to which comparison.
· Tukey applies a single critical value to all pairwise differences.
· SNK and Duncan are stepwise tests. They begin by comparing the largest and
smallest means in the ordered set and then proceed to compare subsets of means,
stopping within any subset whose extremes do not differ significantly. Duncan
differs by adjusting its significance level based on the number of ordered means
the comparison spans.
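The stepwise logic shared by SNK and Duncan can be sketched in a few lines of Python. This is an illustrative sketch, not the routine any particular package uses. The function name stepdown_pairs is hypothetical. The error mean square (7.0) and group size (5) are computed from the raw data in Section 6, and the critical values are rounded entries read from published tables for α = 0.05 and 16 error degrees of freedom, so treat them as assumptions to check against your own tables or software.

```python
from itertools import combinations
from math import sqrt


def stepdown_pairs(means, crit_range):
    """Pairs declared different by a step-down range procedure.

    means: dict mapping group label -> group mean.
    crit_range: function p -> critical range for a span of p ordered means.
    A span is tested only if every wider span that contains it was significant.
    """
    ordered = sorted(means, key=means.get, reverse=True)
    k = len(ordered)
    significant = set()  # index spans (i, j) found significant
    pairs = []
    for p in range(k, 1, -1):  # widest span first
        for i in range(k - p + 1):
            j = i + p - 1
            # The two spans of width p + 1 that contain (i, j), if they exist.
            enclosing = [(i - 1, j), (i, j + 1)]
            if any(a >= 0 and b < k and (a, b) not in significant
                   for a, b in enclosing):
                continue  # inside a homogeneous subset: not tested
            if means[ordered[i]] - means[ordered[j]] > crit_range(p):
                significant.add((i, j))
                pairs.append((ordered[i], ordered[j]))
    return pairs


means = {"A": 23.0, "B": 27.0, "C": 26.0, "D": 31.0}
se = sqrt(7.0 / 5)  # sqrt(error mean square / group size)

# Rounded table values (assumed): alpha = 0.05, 16 error df, spans p = 2, 3, 4.
Q_RANGE = {2: 3.00, 3: 3.65, 4: 4.05}   # studentized range (Tukey, SNK)
Q_DUNCAN = {2: 3.00, 3: 3.15, 4: 3.23}  # Duncan's multiple range

# Tukey: one critical value (the k = 4 entry) for every pair.
ranked = sorted(means, key=means.get, reverse=True)
tukey = [(a, b) for a, b in combinations(ranked, 2)
         if means[a] - means[b] > Q_RANGE[4] * se]
snk = stepdown_pairs(means, lambda p: Q_RANGE[p] * se)
duncan = stepdown_pairs(means, lambda p: Q_DUNCAN[p] * se)

print("Tukey :", tukey)
print("SNK   :", snk)
print("Duncan:", duncan)
```

Under these assumptions Tukey separates D from C and A only, SNK additionally separates D from B, and Duncan further separates B from A. The only thing that changes between the three runs is the critical range passed in, which is the point of the comparison.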
5. When to Use Each Test?
The choice among these tests is not purely statistical; it also involves
consideration of the consequences of error.
· Tukey’s HSD (along with other conservative tests like Scheffé’s) should be used
when the cost of a false positive is high. Examples include pharmaceutical trials,
clinical research, or any setting where acting on a false discovery could be
dangerous or very costly.
· SNK is a good intermediate option. It offers more power than Tukey for detecting
true effects while being less prone to false positives than Duncan. It is useful in
exploratory research where a balance between risk and discovery is desired.
· Duncan’s MRT is best suited for highly exploratory analyses where the priority is
to avoid missing any potential effect (high power is paramount), and a higher number
of false positives is acceptable. Examples include preliminary screening of
agricultural varieties or product formulations where follow-up confirmation is planned.
⚠️ Warning: Many statisticians
discourage the use of Duncan’s test in formal confirmatory research due to its
liberal nature and lack of control over the family-wise error rate.
6. Practical Example (Using Provided Data)
Raw Data:
· Group A: 25, 26, 20, 23, 21 | Mean = 23.0
· Group B: 31, 25, 28, 27, 24 | Mean = 27.0
· Group C: 22, 26, 28, 25, 29 | Mean = 26.0
· Group D: 33, 29, 31, 34, 28 | Mean = 31.0
The ordered means are: D (31.0), B (27.0), C (26.0), A (23.0).
A one-way ANOVA yielded a significant result (F(3, 16) = 7.798, p = 0.002),
justifying the use of post hoc tests to compare groups.
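The F statistic can be reproduced from the raw data with nothing more than sums of squares. The sketch below uses only the Python standard library; the variable names are illustrative, and the p-value is left out because it requires the F distribution, which a statistics package would supply.

```python
from statistics import mean

groups = {
    "A": [25, 26, 20, 23, 21],
    "B": [31, 25, 28, 27, 24],
    "C": [22, 26, 28, 25, 29],
    "D": [33, 29, 31, 34, 28],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)

# Between-group and within-group sums of squares.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1               # k - 1
df_within = len(all_values) - len(groups)  # N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within          # error mean square used by the post hoc tests
f_stat = ms_between / ms_within

print(f"F({df_between},{df_within}) = {f_stat:.3f}, MS_within = {ms_within:.2f}")
# prints: F(3,16) = 7.798, MS_within = 7.00
```

The within-group mean square of 7.0 is the error variance that all three post hoc tests combine with the group size (n = 5) to form their critical ranges.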
The results from the three post hoc tests are summarized below. Within a column,
groups that share a letter are not significantly different.
Group | Mean | Tukey Grouping | SNK Grouping | Duncan Grouping
D | 31.0 | a | a | a
B | 27.0 | a b | b | b
C | 26.0 | b | b | b c
A | 23.0 | b | b | c
Interpretation of Results:
· Tukey’s HSD: Suggests that Group D is significantly different from Groups A and C,
but not from Group B. It finds no significant differences among Groups A, B, and C.
This is the most conservative interpretation.
· SNK Test: Provides a clearer separation, identifying two distinct tiers: 1) Group D
(highest) and 2) Groups B, C, and A (lower, and not distinguishable from one
another). It detects that D is significantly higher than B, a distinction
Tukey did not make.
· Duncan’s MRT: Offers the most granular separation. It agrees with SNK that Group D
is the highest, and it additionally finds Group A significantly lower than Group B,
a finding not supported by the other tests. Group C overlaps with both B and A.
This highlights its liberal nature.
7. Conclusion
This example demonstrates that the choice of post hoc test can directly
influence the conclusions drawn from an experiment. It is therefore crucial
to select the multiple comparison procedure a priori (before the data are
collected), based on the research context and the balance between
tolerance for Type I (false positive) and Type II (false negative) errors. The
choice should be guided by statistical philosophy and the consequences of
error, not by which test yields the most desirable outcome.