Wednesday, September 10, 2025

The Tukey-Kramer test

Summary

 When ANOVA indicates significant differences between groups, the next step is to identify which groups actually differ from each other. If the groups have unequal sizes (unbalanced samples), the Tukey-Kramer test is a reliable and robust choice. In this post, we explain how to apply this test step-by-step, with a real example and interpretation of the results.

Introduction

When a researcher obtains a significant result in ANOVA (Analysis of Variance) for an experiment involving three or more groups, there is a need to perform post-hoc tests to compare the means and identify which ones are statistically different.

Several tests are available for this purpose. This text covers the Tukey-Kramer test, which is recommended specifically for situations where the groups have unequal sizes. In these cases, it is necessary to adjust the procedure by replacing the common group size (n) with the individual sizes (ni and nj) of each pair being compared.

The Tukey-Kramer test

The Tukey-Kramer test assumes homoscedasticity, or homogeneous variances. Therefore, the mean square error (MSE) obtained from the analysis of variance (ANOVA) table is an estimate of the common variance of the variable.

The minimum significant difference (MSD) between the means of two groups of sizes ni and nj, denoted by dij, is calculated using the following formula:

dij = q(k, df, α) × √[ (MSE / 2) × (1/ni + 1/nj) ]

Where:

·  q(k, df, α) is the critical value of the studentized range distribution;

·  k is the number of groups;

·  df is the degrees of freedom of the residual (error) in the ANOVA;

·  MSE is the mean square error;

·  α is the significance level (e.g., 0.05 for 5%).
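To make the formula concrete, here is a minimal Python sketch of it. It assumes SciPy 1.7 or later, which provides scipy.stats.studentized_range for the critical value q; the function name and the commented example numbers are illustrative, not taken from the original article.

```python
# Minimal sketch of the Tukey-Kramer minimum significant difference.
# Requires SciPy >= 1.7 for scipy.stats.studentized_range.
from scipy.stats import studentized_range

def tukey_kramer_msd(mse, df, k, ni, nj, alpha=0.05):
    """Minimum significant difference d_ij for two groups of sizes ni and nj."""
    q_crit = studentized_range.ppf(1 - alpha, k, df)        # critical value q(k, df, alpha)
    return q_crit * ((mse / 2.0) * (1.0 / ni + 1.0 / nj)) ** 0.5

# Illustrative call (the MSE here is a placeholder, not the value from Table 2):
# tukey_kramer_msd(mse=0.05, df=20, k=4, ni=7, nj=5)
```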

Example

Table 1 presents data from an experiment with four groups (four brands of green tea). The means for each group are shown at the bottom of the table. The aim is to compare these means using the Tukey-Kramer test. First, an ANOVA must be performed, as shown in Table 2.

Next, pairwise comparisons of the group means are conducted. The test was applied using the value of q for a significance level of 5%, with k = 4 groups and df = n - k = 24 - 4 = 20 residual degrees of freedom.

Table 1: Folic acid (a B vitamin) content in green tea leaves randomly selected from four brands (1)


Table 2: Analysis of variance for the data in Table 1
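As a sketch of how Table 2 could be reproduced in software, the snippet below runs a one-way ANOVA with scipy.stats.f_oneway on randomly generated stand-in measurements (the folic-acid values themselves are not reproduced in this post), keeping the unequal group sizes 7, 5, 6 and 6.

```python
# Sketch of the preliminary one-way ANOVA, using placeholder data.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
brand1, brand2, brand3, brand4 = (rng.normal(8.0, 1.0, n) for n in (7, 5, 6, 6))

f_stat, p_value = f_oneway(brand1, brand2, brand3, brand4)
print(f_stat, p_value)    # a p-value below 0.05 motivates the post-hoc comparisons
```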

For example:

To compare the mean of Brand 1 to Brand 2 (α = 5%), calculate d12 = q(4, 20, 0.05) × √[ (MSE / 2) × (1/n1 + 1/n2) ], taking the group sizes from Table 1 and the MSE from Table 2.

To compare the mean of Brand 1 to Brand 3, use the same procedure, but with n₁ = 7 and n₃ = 6.
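Using the tukey_kramer_msd sketch from earlier, and assuming the group sizes 7, 5, 6 and 6 correspond to Brands 1 through 4 in that order, those two comparisons would look like this (the MSE below is a placeholder to be replaced by the value in Table 2):

```python
mse = 0.05                                               # placeholder; use the MSE from Table 2
d_12 = tukey_kramer_msd(mse, df=20, k=4, ni=7, nj=5)     # Brand 1 vs Brand 2
d_13 = tukey_kramer_msd(mse, df=20, k=4, ni=7, nj=6)     # Brand 1 vs Brand 3
```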

The same procedure is repeated for the remaining pairs. Table 3 shows the observed differences between the means, as well as the respective dij values. If the absolute difference between two means is greater than the corresponding dij, the null hypothesis of equality between those means (H₀: μi = μj) is rejected.

Table 3: Mean comparison using the Tukey–Kramer test                                                                                       

Interpretation

For example, the results in Table 3 indicate that Brand 1 has a significantly higher average folic acid content than Brand 4.
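In practice, the full set of comparisons behind Table 3 can be delegated to software. The sketch below uses pairwise_tukeyhsd from statsmodels, which applies the Tukey-Kramer adjustment when group sizes differ; the measurements are random stand-ins (the real Table 1 values are not reproduced here) and the brand-to-size assignment is assumed.

```python
# Sketch: Tukey-Kramer pairwise comparisons via statsmodels, on placeholder data.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
sizes = {"Brand 1": 7, "Brand 2": 5, "Brand 3": 6, "Brand 4": 6}   # assumed assignment
values = np.concatenate([rng.normal(8.0, 1.0, n) for n in sizes.values()])
groups = np.concatenate([[name] * n for name, n in sizes.items()])

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())    # mean differences, confidence intervals and reject decision per pair
```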

Approximation using the harmonic mean

Calculating all the minimum significant differences for the Tukey-Kramer test is laborious if done by hand, and statistical software automates the process. In the past, when group sizes were approximately equal, a common simplification was to use the standard Tukey HSD formula, replacing n with the harmonic mean (H) of the sample sizes. The formula becomes:

d = q(k, df, α) × √( MSE / H )

This approach is an approximation and may not provide exact control of the significance level, but it can be found in older literature.

Using the data in Table 1, where the group sizes are 7, 5, 6 and 6, the harmonic mean is:

H = 4 / (1/7 + 1/5 + 1/6 + 1/6) ≈ 5.92

Substituting H, together with the MSE, q and df from the ANOVA, yields a single d value for all comparisons. In this example, interpreting the results using this approximation remains consistent with the complete analysis.
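A quick sketch of this approximation in Python, with scipy.stats.hmean computing H and a placeholder MSE standing in for the value from Table 2:

```python
# Sketch of the harmonic-mean approximation: one common d for every pair.
from scipy.stats import hmean, studentized_range

H = hmean([7, 5, 6, 6])                                   # harmonic mean of the group sizes, ~5.92
mse = 0.05                                                # placeholder; use the MSE from Table 2
d = studentized_range.ppf(0.95, 4, 20) * (mse / H) ** 0.5
print(H, d)
```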

Bibliography

1. Chen TS, Lui CK, Smith CH. Journal of the American Dietetic Association 1983;82(6):627-632. Apud Devore JL. Probability and Statistics for Engineering and the Sciences. Brooks Cole, 2015. Online edition.

2. Table of the studentized range - David Lane. http://davidmlane.com/hyperstat/sr_table.html

3. Multiple Comparisons With Unequal Sample Sizes. https://www.uvm.edu/~dhowell/gradstat/.../labs/.../Multcomp.html

4. ANOVA & Tukey-Kramer test. https://www.youtube.com

Thursday, September 04, 2025

The Principle Behind the Data: Understanding Maximum Likelihood Estimation with Simple Examples

    

Have you ever wondered how statisticians can make statements about an entire population by studying only a small sample? The secret lies in methods such as Maximum Likelihood Estimation—a powerful technique that helps us make the “best guess” about unknown parameters.

What is Statistical Inference?

Statistical inference means obtaining information from a sample and, based on that, drawing conclusions about characteristics of the entire population from which the sample was taken. Among several methods that produce good estimators, today we will focus on the maximum likelihood method.

An Intuitive Example

Imagine a box with many blue and orange balls. You don't know which color is more frequent, but you know there are only two possibilities:

1.   Three blue balls for every orange ball → probability of blue: p = ¾

 


2.   One blue ball for every three orange balls → probability of blue: p = ¼

 

Now, you draw 3 balls with replacement and observe how many are blue. How do you decide what the true value of p is? See Table 1. 


Table 1. The Probabilities at Play


Nº of blue balls     p = ¾      p = ¼
0                    1/64       27/64
1                    9/64       27/64
2                    27/64      9/64
3                    27/64      1/64


Figure 1. The Probabilities at Play


 

The strategy is simple: we choose the value of p that makes our observation most likely.

    If 0 or 1 blue ball comes out → we estimate p = ¼

    If 2 or 3 blue balls come out → we estimate p = ¾

You just used the maximum likelihood estimator!
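For readers who like to check this in code, here is a tiny sketch of that decision rule, comparing the two binomial probabilities from Table 1 for each possible outcome (SciPy assumed available):

```python
# For each possible number of blue balls in 3 draws, pick whichever candidate
# value of p makes that outcome more likely.
from scipy.stats import binom

for x in range(4):                                    # 0, 1, 2 or 3 blue balls observed
    likelihood = {p: binom.pmf(x, 3, p) for p in (0.25, 0.75)}
    p_hat = max(likelihood, key=likelihood.get)       # maximum likelihood choice
    print(x, p_hat)                                   # 0 -> 0.25, 1 -> 0.25, 2 -> 0.75, 3 -> 0.75
```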


From the Specific Case to the General


In the real world, we rarely have only two options. In an experiment with “success” or “failure” outcomes, we might have a sample like: S, S, F, F, S, F (3 successes in 6 trials).

The intuitive approach would lead us to calculate:

                                                   p̂ = x / n = 3 / 6 = 1/2

This is not only a reasonable choice—it is the maximum likelihood estimate. For n = 6 trials, the value p = ½ makes the observation of x = 3 successes the most likely of all possibilities. See Table 2.

Table 2: Probabilities associated with the occurrence of x successes in samples of size n = 6

Number of successes (x)    0         1         2         3        4         5         6
p = 1/2                    0.01563   0.09375   0.23438   0.3125   0.23438   0.09375   0.01563


Figure 2: Probabilities associated with the occurrence of x successes in samples of size n = 6
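The same conclusion can be checked numerically by scanning a grid of candidate p values and seeing where the likelihood of x = 3 successes in n = 6 trials peaks; a short sketch (NumPy and SciPy assumed):

```python
# The binomial likelihood of 3 successes in 6 trials is largest at p = 0.5.
import numpy as np
from scipy.stats import binom

p_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(3, 6, p_grid)
print(p_grid[np.argmax(likelihood)])    # 0.5, the sample proportion x / n
```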


Why Does This Matter?


The maximum likelihood estimator is:

·  Intuitive: it chooses the parameter that maximizes the chance of observing what we actually observe.

·  Powerful: it can be applied to many statistical models that are much more complex than the simple binomial example.

·  Consistent: with large samples, it tends to converge to the true value of the population parameter.

·  Versatile: it forms the basis for a large number of modern statistical techniques used in data science, machine learning, and scientific research.

Practical Examples of Maximum Likelihood Estimation

Example 1: Screw Factory Quality Control

In a screw factory, quality control is performed by selecting a sample from each batch and checking how many are non-conforming. Consider that in a batch of 500 screws, 38 non-conforming ones were found.

·  What is the maximum likelihood estimator for the proportion of non-conforming screws?

·  What is the estimate obtained for this specific batch?

Solution:
The maximum likelihood estimator (MLE) for a proportion 
p in a binomial distribution is the sample proportion itself, given by the formula p̂ = x / n, where:

·  x is the number of "successes" (in this context, finding a non-conforming screw).

·  n is the sample size.

For this batch:

·  x = 38 (non-conforming screws)

·  n = 500 (total screws in the sample)

The estimate is therefore:
p̂ = 38 / 500 = 0.076 or 7.6%

Conclusion: The maximum likelihood estimate for the proportion of non-conforming screws in the batch is 7.6%.
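As a sanity check, the same estimate can be obtained by numerically maximizing the binomial likelihood, here by minimizing the negative log-likelihood with SciPy; a minimal sketch:

```python
# Numerically confirm that x / n maximizes the likelihood for x = 38, n = 500.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

neg_log_lik = lambda p: -binom.logpmf(38, 500, p)
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)    # ~0.076, the sample proportion
```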


Example 2: Election Poll

Election polls are an attempt to capture voters' intentions at a specific moment by conducting a limited number of interviews. It is, therefore, an effort to measure the whole from a part. Imagine a polling institute conducted a preliminary election poll for mayor in a specific municipality. There were two candidates, which we will call A and B.

500 voters were interviewed, yielding the following results:

·  220 votes would be for candidate A

·  180 votes would be for candidate B

·  The remaining voters were undecided.

a) What is the maximum likelihood estimate for the proportion of undecided voters in the population?
b) What is the maximum likelihood estimate for the proportion of votes for candidate A?

Solution:
The same principle applies. The MLE for a population proportion p is the sample proportion p̂ = x / n.

a) Proportion of Undecided Voters:

·  Number of undecided voters in sample (x): 500 - 220 - 180 = 100

·  Sample size (n): 500

·  Estimate: p̂_undecided = 100 / 500 = 0.20 or 20%

b) Proportion of Votes for Candidate A:

·  Number of votes for A in sample (x): 220

·  Sample size (n): 500

·  Estimate: p̂_A = 220 / 500 = 0.44 or 44%

Conclusion: Based on this sample, the maximum likelihood estimates are a 20% proportion of undecided voters and a 44% vote proportion for candidate A in the broader population.
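Treating the three answer categories (A, B, undecided) jointly, the same idea extends to the multinomial likelihood, which is maximized at the sample proportions; a brief sketch (SciPy assumed):

```python
# The multinomial likelihood of counts (220, 180, 100) out of 500 is maximized
# at the sample proportions (0.44, 0.36, 0.20).
import numpy as np
from scipy.stats import multinomial

counts = np.array([220, 180, 100])
p_hat = counts / counts.sum()
print(p_hat)
print(multinomial.logpmf(counts, n=500, p=p_hat))              # log-likelihood at the MLE
print(multinomial.logpmf(counts, n=500, p=[1/3, 1/3, 1/3]))    # lower for any other p
```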