Monday, November 17, 2025

A Practical Guide to Post-Hoc Pairwise Comparisons: Choosing Between Liberal and Conservative Tests

 

Introduction

When comparing k populations using ANOVA, there are m = k(k-1)/2 possible pairwise comparisons between means. If these comparisons were not pre-planned (also known as unplanned or post-hoc comparisons) and were chosen after the researcher examined the sample means, it is more appropriate to use a test that controls the significance level for the entire experiment, not just for an individual comparison.
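As a quick illustration of how fast the number of comparisons grows, the sketch below enumerates every pair for six groups. The group labels are borrowed from the fictional example used later in this guide; the helper name is only illustrative.

```python
from itertools import combinations

def pairwise_comparisons(groups):
    """Return every unordered pair of group labels."""
    return list(combinations(groups, 2))

# Labels taken from the fictional example later in this guide
groups = ["A", "B", "C", "D", "E", "Control"]
pairs = pairwise_comparisons(groups)

k = len(groups)
m = k * (k - 1) // 2  # closed-form count of pairwise comparisons

print(f"k = {k}, m = {m}, enumerated pairs = {len(pairs)}")
```

With k = 6 groups there are already 15 pairwise comparisons, which is why the error rate for the whole set matters.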

Key Definitions

       ·        Comparisonwise Error Rate (CER): The probability of committing a Type I error when comparing a single pair of means from a set of k means.

      ·        Experimentwise Error Rate (EER) or Familywise Error Rate (FWER): The probability of committing at least one Type I error when performing all m pairwise comparisons from a set of k means.
Two specific types are distinguished:

  o   Complete Null (EERC): The experimentwise error rate when all population means are truly equal.

  o   Partial Null (EERP): The experimentwise error rate when some means are equal and others are not.
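To get a feel for how the experimentwise rate relates to the comparisonwise rate, here is a rough sketch. It assumes the m comparisons are independent, which pairwise comparisons sharing the same means are not, so the figures should be read as an approximation that tends to overstate the true rate. The function name is hypothetical.

```python
def approx_experimentwise_error(alpha, k):
    """Approximate P(at least one Type I error) across all pairs of k means.

    Assumption: the m = k(k-1)/2 comparisons are treated as independent,
    each run at comparisonwise level alpha.
    """
    m = k * (k - 1) // 2
    return 1 - (1 - alpha) ** m

for k in (2, 3, 4, 6, 10):
    rate = approx_experimentwise_error(0.05, k)
    print(f"k = {k:2d}: approximate experimentwise rate = {rate:.3f}")
```

Even under this simplification, the pattern is clear: a 5% comparisonwise rate does not mean a 5% rate for the experiment as a whole once k exceeds 2.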

The Trade-Off: Power vs. Protection

Tests that control the experimentwise error rate are conservative—they reject the null hypothesis of equal means less easily, resulting in lower statistical power. Conversely, tests that control only the comparisonwise error rate are liberal, as they find significance more easily and therefore have higher power.

A Spectrum of Tests: From Liberal to Conservative
According to the classic classification by Winer (1962), multiple comparison tests can be ordered from most liberal to most conservative as follows:

       1.     Duncan's Multiple Range Test (MRT)

      2.     Student-Newman-Keuls Test (SNK)

      3.     Fisher's Least Significant Difference (LSD)

      4.     Tukey's Honestly Significant Difference (HSD)

      5.     Scheffé's Test

This means that applying Duncan's test will likely yield more statistically significant differences between means than using Scheffé's test.

Illustrative Example with Fictional Data

Means and standard deviations of the blood pressure reduction are shown in Table 1, and the analysis of variance (ANOVA) in Table 2.

Table 1: Blood Pressure Reduction (mmHg) by Treatment Group

Treatment    N    Mean (mmHg)    Standard deviation (mmHg)
A            5    21             5.10
B            5    8              7.07
C            5    10             5.83
D            5    29             5.10
E            5    13             7.07
Control      5    2              5.48


Table 2: Analysis of Variance (ANOVA)
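The main quantities of a one-way ANOVA can be approximately recomputed from the summary statistics in Table 1. The sketch below is only a reconstruction under stated assumptions: it uses the rounded means and standard deviations from Table 1, so results may differ slightly from the SAS output, and the function and key names are illustrative.

```python
# Summary statistics from Table 1: label -> (n, mean, standard deviation)
summary = {
    "A": (5, 21, 5.10),
    "B": (5, 8, 7.07),
    "C": (5, 10, 5.83),
    "D": (5, 29, 5.10),
    "E": (5, 13, 7.07),
    "Control": (5, 2, 5.48),
}

def anova_from_summary(summary):
    """One-way ANOVA quantities computed from group sizes, means and SDs."""
    k = len(summary)
    n_total = sum(n for n, _, _ in summary.values())
    grand_mean = sum(n * mean for n, mean, _ in summary.values()) / n_total

    # Between-groups (treatment) and within-groups (error) sums of squares
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in summary.values())
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in summary.values())

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within  # error mean square (MSE)
    return {
        "df_between": df_between,
        "df_within": df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "F": ms_between / ms_within,
    }

result = anova_from_summary(summary)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the error mean square comes out close to 36 with 24 degrees of freedom, which is the quantity every critical difference in the following tables depends on.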


      ·        Duncan's MRT and SNK: Both tests use a different critical value for each comparison, depending on how many ordered means the range spans (p). The critical ranges in Table 3 show that Duncan's test is more liberal: for p > 2 its minimum significant differences are smaller than SNK's, so it declares significance more easily.

 

Table 3: Critical range for Duncan’s and Student Newman Keuls (SNK) tests

 

Test         Critical range, by number of means in the range (p)
             p = 2    p = 3    p = 4    p = 5    p = 6
Duncan's     7.83     8.23     8.48     8.66     8.79
SNK          7.83     9.75     10.47    11.18    11.73
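To show how range-dependent critical values are used, the sketch below applies a step-down procedure to the Table 1 means with the Table 3 critical ranges. It is a simplified illustration rather than the SAS implementation: the widest range is tested first, and any pair lying inside a range already found non-significant is not declared significant. The function name is hypothetical.

```python
# Treatment means from Table 1 and critical ranges from Table 3 (keyed by p)
means = {"A": 21, "B": 8, "C": 10, "D": 29, "E": 13, "Control": 2}
duncan = {2: 7.83, 3: 8.23, 4: 8.48, 5: 8.66, 6: 8.79}
snk = {2: 7.83, 3: 9.75, 4: 10.47, 5: 11.18, 6: 11.73}

def step_down(means, critical_range):
    """Return the pairs declared significantly different by a step-down range test."""
    ordered = sorted(means.items(), key=lambda item: item[1])  # ascending means
    k = len(ordered)
    not_significant = []  # index ranges (i, j) already declared homogeneous
    significant = []
    for p in range(k, 1, -1):          # widest range first
        for i in range(k - p + 1):
            j = i + p - 1
            # A pair inside a homogeneous range cannot be declared significant
            inside = any(a <= i and j <= b for a, b in not_significant)
            diff = ordered[j][1] - ordered[i][1]
            if not inside and diff > critical_range[p]:
                significant.append((ordered[i][0], ordered[j][0]))
            else:
                not_significant.append((i, j))
    return significant

duncan_pairs = step_down(means, duncan)
snk_pairs = step_down(means, snk)
print("Duncan:", len(duncan_pairs), "significant pairs")
print("SNK:   ", len(snk_pairs), "significant pairs")
```

With these particular means the two tests happen to flag the same pairs; the difference in critical ranges matters most when a difference falls between the Duncan and SNK values for the same p.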

 

          ·        Fisher's LSD, Tukey's HSD, and Scheffé's Test: A comparison of the critical differences clearly shows the spectrum: Fisher's LSD is the most liberal (smallest critical difference), followed by Tukey's HSD, with Scheffé's test being the most conservative (largest critical difference).

 

Table 4: Critical Differences for Pairwise Comparison Tests

Test            Critical difference
Fisher's LSD    7.83
Tukey's HSD     11.73
Scheffé's       13.74
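The three critical differences can be reproduced with the standard formulas for a balanced design. The sketch below rests on assumptions not stated in the text: an error mean square of 36 with 24 degrees of freedom (obtained by pooling the Table 1 standard deviations) and rounded critical values taken from standard statistical tables.

```python
from itertools import combinations
from math import sqrt

# Assumed inputs: MSE pooled from Table 1, n per group, k groups
mse, n, k = 36.0, 5, 6

# Rounded table values at the 5% level with 24 error degrees of freedom
t_crit = 2.064  # two-sided Student t
q_crit = 4.373  # studentized range for 6 means
f_crit = 2.621  # F with 5 and 24 degrees of freedom

lsd = t_crit * sqrt(2 * mse / n)                      # Fisher's LSD
hsd = q_crit * sqrt(mse / n)                          # Tukey's HSD
scheffe = sqrt((k - 1) * f_crit) * sqrt(2 * mse / n)  # Scheffé, pairwise contrasts

means = {"A": 21, "B": 8, "C": 10, "D": 29, "E": 13, "Control": 2}

def count_significant(means, critical_difference):
    """Count pairs whose absolute mean difference exceeds the critical difference."""
    return sum(
        abs(a - b) > critical_difference
        for a, b in combinations(means.values(), 2)
    )

for name, crit in (("Fisher's LSD", lsd), ("Tukey's HSD", hsd), ("Scheffé's", scheffe)):
    print(f"{name:13s} critical difference = {crit:5.2f}, "
          f"significant pairs = {count_significant(means, crit)} of 15")
```

Under these assumptions the computed values agree with Table 4 to two decimals, and the number of pairs declared different shrinks as the test becomes more conservative.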

Practical Recommendations (Based on SAS/STAT 9.2 Manual)

        1.     Use the unprotected LSD test if you are interested in several individual comparisons and are not concerned with multiple inferences.

        2.     For all pairwise comparisons, use Tukey's test.

        3.     For comparisons with a control group, use Dunnett's test.

Choosing the Right Test: A Decision Framework

Imagine an experiment with more than two groups analyzed by a one-way ANOVA at a 5% significance level. For unplanned comparisons, the researcher has several options:

         ·        To control the experimentwise error rate at 5%, use Tukey's HSD (for all pairs) or Dunnett's test (vs. a control). The trade-off is a comparisonwise error rate well below 5%, and therefore lower power for each individual comparison.

        ·        For higher power, use Fisher's LSD (unprotected), Duncan's MRT, or SNK. These maintain a ~5% comparisonwise error rate, but the experimentwise error rate will be much higher (depending on the number of treatments).
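How much higher the experimentwise rate gets can be estimated by simulation. The sketch below is a small Monte Carlo experiment under assumptions chosen to mirror the fictional example (6 groups of 5, normal data, all population means equal); it applies the unprotected LSD at a 5% comparisonwise level and counts how often at least one pair is falsely declared different. The function name, seed, and number of simulations are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, variance

def simulate_lsd_eer(k=6, n=5, t_crit=2.064, n_sim=2000, seed=1):
    """Estimate the experimentwise error rate of the unprotected LSD.

    Assumption: t_crit is the two-sided 5% Student t value for k*(n-1)
    error degrees of freedom (2.064 for the default 24); change it if
    k or n change.
    """
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_sim):
        # Complete null hypothesis: every group drawn from the same population
        groups = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]
        group_means = [mean(g) for g in groups]
        mse = mean([variance(g) for g in groups])  # pooled variance, equal n
        lsd = t_crit * sqrt(2 * mse / n)
        # At least one pair is significant exactly when the largest
        # difference between means exceeds the LSD
        if max(group_means) - min(group_means) > lsd:
            false_alarms += 1
    return false_alarms / n_sim

eer = simulate_lsd_eer()
print(f"Estimated experimentwise error rate: {eer:.3f}")
```

The estimate depends on the seed and the number of simulations, but it should land far above the nominal 5%, which is the cost of the extra power.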

Context Matters

        ·        Choose a Conservative Test (Tukey, Dunnett, planned LSD) when you need high confidence to reject H₀. This is crucial in fields like pharmacology, where recommending a new drug with unknown side effects requires strong evidence of its superiority over the standard treatment.

      ·        Choose a Liberal Test (unprotected LSD, Duncan) when you need high discriminatory power. This is common in product testing or agronomy, where the primary goal is to detect any potential difference, and a false positive is less consequential than missing a real difference. Alternatively, using a conservative test like Tukey at a 10% significance level also increases power.

Final Considerations

       ·        Scheffé's Test has excellent mathematical properties but is often considered excessively conservative for simple pairwise comparisons.

      ·        Bonferroni Correction is best suited for a small, pre-defined number of comparisons, as it becomes overly conservative with a large number of tests.

      ·        No Single Best Test: All procedures have advantages and disadvantages. While not exact, using a formal method for comparing means prevents conclusions from being entirely subjective. The researcher always has a margin of choice in both the selection of the test and the establishment of the significance level.
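The point about the Bonferroni correction can be made concrete with a short sketch: the per-comparison significance level is the overall level divided by the number of tests, so it shrinks quickly as comparisons are added. The function name is illustrative, and the comparison counts are examples rather than values from this guide.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / n_comparisons

# A few pre-defined comparisons versus all pairs among 6 or 10 groups
for n_comparisons in (3, 15, 45):
    per_test = bonferroni_alpha(0.05, n_comparisons)
    print(f"{n_comparisons:2d} comparisons: test each at alpha = {per_test:.4f}")
```

With three planned comparisons each test still runs at about 1.7%, whereas all 15 pairs among six groups would each be tested at roughly 0.33%.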

A Note on Software

The calculations for this guide were performed using SAS software. Results from other software packages or hand calculations may show slight differences due to rounding. Differences are typically more pronounced for the SNK test, as its critical values are less standardized across different sources.

 
