Saturday, November 29, 2025

P-value vs. Significance Level: Untangling the Statistics That Even Researchers Confuse

 

To understand the issue of the p-value and the significance level, we need a bit of statistical history. There are two schools of thought on hypothesis testing. The first was popularized by Ronald A. Fisher in the 1920s. Fisher saw the p-value not as part of a formal procedure for testing hypotheses, but as an informal measure of how surprising the data are under a given hypothesis. The p-value, when combined with the researcher's knowledge of the subject and their research experience, is useful for interpreting new data.

Figure 1. The Logic of the p-value

Note: The p-value, in the figure, is the shaded area under the normal distribution curve, representing the probability of observing a test result as extreme as or more extreme than the observed value when the null hypothesis is true.
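As an illustration of this logic, here is a minimal Python sketch (using scipy; the test statistic z = 2.14 is hypothetical) that computes a two-sided p-value as the tail area under the standard normal curve:

```python
from scipy.stats import norm

z = 2.14  # hypothetical observed test statistic (standardized)

# Two-sided p-value: total area under the normal curve at least as
# extreme as |z| in either tail, computed assuming H0 is true.
p_value = 2 * norm.sf(abs(z))
print(f"p = {p_value:.3f}")  # approximately 0.032
```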

After Fisher's work was presented, Jerzy Neyman and Egon Pearson approached the question differently. It's important to remember that in science, it is crucial to limit two types of errors: false positives (when you think something is real that isn't) and false negatives (when you conclude that something real is not there).

As an example, consider a laboratory test for diagnosing a certain disease. The test can present two types of errors: a false positive, when it says the patient is sick but they are not; and a false negative, when it says the patient is not sick but they are. In statistics, by convention, the false positive is called a Type I error and the false negative a Type II error, as shown in the scheme presented in Table 1.

·     Type I Error (false positive): declaring that a treatment has an effect when in fact it has none.

·     Type II Error (false negative): declaring that a treatment has no effect when in fact it has one.

Table 1. Type I Error and Type II Error

Decision                 Treatment has no effect           Treatment has an effect
Effect declared          Type I error (false positive)     Correct decision
No effect declared       Correct decision                  Type II error (false negative)

False positives and false negatives are errors, but it is impossible to eliminate them entirely. If you are quick to declare treatment effects, you will tend to produce more false positives (i.e., commit more Type I errors); if you are conservative about declaring treatment effects, you will tend to produce more false negatives (commit more Type II errors).

Neyman and Pearson reasoned that, although it is impossible to eliminate false positives and false negatives, it is possible to develop a decision-making process that guarantees false positives will occur with a pre-defined probability. They called this probability the significance level, denoted by the Greek letter α. Their proposal was that researchers would define α based on their experiences and expectations. Thus, someone willing to tolerate a 10% probability of false positives would set α = 0.1, but if they needed to be more conservative, they could set α at 0.01 or less.

Figure 2. The α vs. β Trade-off

Note: By decreasing α (making the test more rigorous), the area of H1 that is not in the rejection region, i.e., β, increases, and vice versa.

How does this work in practice? In the Neyman-Pearson system, once the null and alternative hypotheses are defined, the significance level α is set. Then, using a computer program, a statistical test is applied to determine the probability of obtaining a result equal to or greater than the one found in the sample, when the null hypothesis is true; that is, to determine the p-value. The Neyman-Pearson procedure consists of rejecting the null hypothesis whenever the p-value is less than or equal to the significance level α.
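In code, the whole Neyman-Pearson procedure reduces to a single comparison. Here is a minimal sketch with simulated data (the two samples and their parameters are hypothetical), using scipy's two-sample t-test:

```python
import numpy as np
from scipy import stats

alpha = 0.05  # significance level, fixed before looking at the data

# Hypothetical samples from two treatment groups
rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=15)
group_b = rng.normal(loc=12.0, scale=2.0, size=15)

# p-value: probability of a result at least this extreme under H0
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# The Neyman-Pearson decision: reject H0 whenever p <= alpha
print(f"p = {p_value:.4f}; reject H0: {p_value <= alpha}")
```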

Therefore, unlike Fisher's procedure, this method deliberately does not use the strength of the evidence obtained in a particular experiment; it simply decides to reject the hypothesis if p ≤ α. The size of the p-value is not used to compare experiments, nor to draw conclusions beyond "The null hypothesis should (or should not) be rejected."

Although the Neyman and Pearson approach is conceptually different from Fisher's, researchers merged the two. The Neyman-Pearson approach is where we get "statistical significance" from, with a pre-chosen α value that guarantees the long-term probability of false positives. In practice, people use the Neyman-Pearson threshold (generally α = 0.05) to make a 'binary' decision, but then interpret the calculated p-value (e.g., 0.032) in the spirit of Fisher, as a measure of the strength of the evidence. This fusion is convenient but a source of much confusion.

For example, suppose you conduct an experiment and obtain a p-value = 0.032. If your threshold is the conventional α = 0.05, you have obtained a statistically significant result. It is tempting, though wrong, to say "The probability of a Type I error is 3.2%." This doesn't make sense because a single experiment does not determine a Type I error probability: the Type I error rate is a long-run property of the decision procedure, fixed in advance by α. To compare your experiment with others, use only the value of α.

Another consideration is necessary: we said that when you decrease one type of error, the other increases, considering the same problem being solved by the same significance test. But some tests are more powerful than others. The power of a test is defined as the probability of rejecting the null hypothesis when it is false. Therefore, the best test is the one that has the lowest probability of false negatives for a fixed value of α.
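To make the α vs. β trade-off concrete, here is a small sketch (scipy only; the standardized effect size d = 0.5 and sample size n = 20 are hypothetical) that computes β for a one-sided z-test at several significance levels:

```python
from scipy.stats import norm

def power(d, n, alpha):
    """Power of a one-sided z-test for standardized effect size d, sample size n."""
    z_crit = norm.ppf(1 - alpha)          # rejection threshold under H0
    return norm.sf(z_crit - d * n**0.5)   # area of the H1 distribution beyond it

# Decreasing alpha (fewer false positives) increases beta (more false negatives)
for alpha in (0.10, 0.05, 0.01):
    beta = 1 - power(d=0.5, n=20, alpha=alpha)
    print(f"alpha = {alpha:.2f}  ->  beta = {beta:.3f}")
```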

Reference
REINHART, A. Statistics Done Wrong. San Francisco: No Starch Press, 2015.

 

Tuesday, November 25, 2025

Analysis of Variance (ANOVA): Assumptions and Data Transformation

 

1.     Introduction

 

The assumptions required for an analysis of variance (ANOVA) are not always perfectly met by real-world data. Researchers who choose this procedure therefore need to be confident that, even when the assumptions (normality of residuals and homogeneity of variances) are not fully met, the analysis remains valid for their data.

 

It is known that minor deviations from normality do not seriously compromise the validity of the ANOVA, especially when group sizes are equal or similar. Similarly, minor violations of homogeneity of variances have little practical relevance, except in two critical situations:

 

         (1) when there is asymmetry in the residuals;

         (2) when there is positive kurtosis in the residuals.

 

In any case, the F-test remains the most powerful of the available tests provided that its assumptions are met. Otherwise, researchers should consider using non-parametric tests or resorting to data transformation. Transformations are particularly useful for stabilizing the variance, but they also generally help bring the distribution closer to normality.

 

2.     What does it mean to transform data? 

 

Transforming data involves applying a mathematical operation to each observation and conducting statistical analyses using the resulting values. The best-known transformations are listed below.

 

2.1.         Square Root 

 

In general, variables obtained by counting do not have a constant variance or a normal distribution. For count data (e.g. the number of insects or bacterial colonies, or the prevalence of lesions), it is recommended that the square root is applied to each observation before proceeding with ANOVA. This usually results in a more constant variance.

Practical note: If the observed values are small (less than 10) or there are many zeros, it is recommended, to avoid problems with the square root of zero, to use the Anscombe transformation:

Y' = 2√(Y + 3/8)

or a simplified, older correction that is also effective:

Y' = √(Y + 1/2)

before conducting the analysis.
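A minimal numpy sketch of section 2.1 (the counts are hypothetical):

```python
import numpy as np

counts = np.array([0, 3, 1, 8, 2, 0, 5])  # hypothetical small counts with zeros

sqrt_y     = np.sqrt(counts)               # plain square-root transform
anscombe_y = 2 * np.sqrt(counts + 3/8)     # Anscombe transform, safe at zero
classic_y  = np.sqrt(counts + 1/2)         # older +1/2 correction
```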

2.2.      Logarithm

Many biological variables (such as tree height, body weight and survival time) follow a lognormal distribution. In these cases, taking the logarithm (decimal or natural) of the variable helps stabilize the variance and bring the distribution closer to normality. One classic indication that this transformation is needed is when the standard deviation of the groups increases in proportion to the group means.
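A corresponding sketch for the logarithmic transform (the body weights are hypothetical; decimal and natural logs differ only by a constant factor, so either works for ANOVA):

```python
import numpy as np

weights = np.array([12.4, 30.1, 75.9, 180.2])  # hypothetical body weights (kg)

log10_y = np.log10(weights)  # decimal logarithm
ln_y    = np.log(weights)    # natural logarithm
```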

2.3.      Arc sine of the square root of the proportion

If the variable is a proportion or percentage (e.g. the percentage of seeds that germinate), ANOVA can only be applied directly if the proportions all lie between 0.3 and 0.7. If many values fall outside this range, it is recommended to apply the transformation

                                              Y = arcsin(√p).
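In numpy (the germination proportions are hypothetical; np.arcsin returns radians):

```python
import numpy as np

p = np.array([0.05, 0.40, 0.92, 0.71])  # hypothetical germination proportions
y = np.arcsin(np.sqrt(p))               # arcsine square-root transform (radians)
```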

 3. Final considerations

For those unfamiliar with statistics, transforming data may seem like suspicious 'manipulation'. It is not. It is a legitimate and widely accepted technique that is often necessary when alternatives are unavailable.

Although modern software offers alternative methods, such as Welch's test for one-way analysis of variance, transforming the original variable may be the only feasible and robust approach to satisfy the model assumptions for more complex analysis of variance models, such as split-plot designs or hierarchical models.

Researchers must always be able to justify their chosen transformation and, ideally, use the most common transformation in their field of study.

Important: even if the statistical analysis was performed using transformed data, the descriptive results (means, standard errors, graphs, etc.) must be presented on the original scale of the variable. To achieve this, the transformation must be 'undone' (back-transformed) using the inverse function of the original transformation.
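A minimal sketch of back-transformation for the arcsine case (hypothetical proportions; the same pattern, using the inverse function, applies to the square-root and log transforms):

```python
import numpy as np

p = np.array([0.05, 0.40, 0.92, 0.71])  # hypothetical proportions
y = np.arcsin(np.sqrt(p))               # scale on which the analysis is run

mean_y    = y.mean()                    # mean on the transformed scale
back_mean = np.sin(mean_y) ** 2         # back-transformed to the original scale

# Note: the back-transformed mean generally differs from the raw mean of p
print(back_mean, p.mean())
```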


Monday, November 24, 2025

Linear Regression Through the Origin: When to Force the Intercept to Zero

 

In regression analysis, the most common model includes an intercept term (constant). However, in specific situations, we are forced to make the regression line pass through the origin of the Cartesian plane (point (0,0)). This decision can be motivated by solid theoretical reasons or prior empirical evidence.

Why Use a Model Without an Intercept?

Two classic examples illustrate this need:

1.     Uniform Rectilinear Motion: In physics, for an object moving at constant velocity along a straight path, the distance traveled at the initial moment (time zero) is necessarily zero. A model that does not pass through the origin would make no physical sense.

2.     Young's Modulus: In Materials Engineering, Young's modulus, which measures the stiffness of a material, is defined by the slope of the stress-strain curve in the elastic regime. If no stress is applied, there is no strain. Therefore, the line modeling this behavior must pass through the origin. Figure 1 illustrates this relationship in the context of Young's modulus.

Figure 1. Stress-strain curve in the elastic regime; the slope of the line is Young's modulus.


Although useful, adopting a no-intercept model should be done cautiously. It is good practice to compare its performance with the model that includes an intercept. The final choice can be controversial and intrinsically depends on the problem's context.

The Mathematical Model

By forcing the line through the origin, our model simplifies to:

Y = bX + e

Where:

    ·   X is the independent variable.

    ·   Y is the dependent variable.

    ·   b is the parameter (slope) we want to estimate.

    ·   e is the random error term.


The estimate for the coefficient b is given by the formula:

b = ΣXY / ΣX²


The fitted regression line will therefore be:

Ŷ = bX


Evaluating the Model Fit

A crucial difference from the model with an intercept is that the sum of the residuals (Σei) is not necessarily zero. By forcing the line through (0,0), we lose the degree of freedom that "adjusted" the line's height to minimize the residuals.

To assess the quality of the fit, we use analysis of variance (ANOVA). The degrees of freedom are adjusted as follows:

·      Total SS: n degrees of freedom.

·      Regression SS: k degrees of freedom, where k is the number of parameters (here, k = 1).

·      Residual SS: n-k degrees of freedom.

Based on these calculations, we build the ANOVA table (Table 1).

Table 1. ANOVA for regression through the origin

Source        df       SS         MS                     F
Regression    k        SSReg      MSReg = SSReg/k        MSReg/MSRes
Residual      n - k    SSRes      MSRes = SSRes/(n-k)
Total         n        SSTotal


Practical Example

Consider the data in Table 2, where we want to fit a model that passes through the origin.

Table 2


With the data from Table 2, we calculate the coefficient b:

Thus, the equation of the regression line is:

It is also important to calculate quality metrics:

·      Standard Deviation (s)

·      Coefficient of Determination (R²)

·      t-value

For our data:


Figure 2 shows the scatter plot with the fitted regression line.

Figure 2

Checking the Result in Software

To validate our manual calculations, we can use statistical software; the Minitab output for this analysis should corroborate our results.
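Since neither the original data nor the Minitab output are reproduced here, the following minimal Python sketch with hypothetical (x, y) values illustrates the same calculations; the formulas, not the numbers, correspond to the example above:

```python
import numpy as np

# Hypothetical data standing in for Table 2
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

n, k = len(x), 1  # one parameter (the slope), no intercept

b = np.sum(x * y) / np.sum(x**2)     # slope estimate for the model Y = bX + e
y_hat = b * x
e = y - y_hat
print("sum of residuals:", e.sum())  # not necessarily zero without an intercept

ss_total = np.sum(y**2)       # uncorrected total SS: n degrees of freedom
ss_reg   = np.sum(y_hat**2)   # regression SS: k degrees of freedom
ss_res   = np.sum(e**2)       # residual SS: n - k degrees of freedom

s  = np.sqrt(ss_res / (n - k))          # standard deviation s
r2 = ss_reg / ss_total                  # R² based on the uncorrected total SS
t  = b / (s / np.sqrt(np.sum(x**2)))    # t-value for H0: b = 0

print(f"b = {b:.4f}  s = {s:.4f}  R2 = {r2:.4f}  t = {t:.2f}")
```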







Monday, November 17, 2025

A Practical Guide to Post-Hoc Pairwise Comparisons: Choosing Between Liberal and Conservative Tests

 

Introduction

When comparing k populations using ANOVA, there are m = k(k-1)/2 possible pairwise comparisons between means. If these comparisons were not pre-planned (also known as unplanned or post-hoc comparisons) and were chosen after the researcher examined the sample means, it is more appropriate to use a test that controls the significance level for the entire experiment, not just for an individual comparison.

Key Definitions

       ·        Comparisonwise Error Rate (CER): The probability of committing a Type I error when comparing a single pair of means from a set of k means.

      ·        Experimentwise Error Rate (EER) or Familywise Error Rate (FWER): The probability of committing at least one Type I error when performing all m pairwise comparisons from a set of k means.
Two specific types are distinguished:

  o   EERC (complete null): the experimentwise error rate when all population means are truly equal.

  o   EERP (partial null): the experimentwise error rate when some means are equal and others are not.

The Trade-Off: Power vs. Protection

Tests that control the experimentwise error rate are conservative—they reject the null hypothesis of equal means less easily, resulting in lower statistical power. Conversely, tests that control only the comparisonwise error rate are liberal, as they find significance more easily and therefore have higher power.
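To see why controlling only the comparisonwise rate is liberal, consider this small sketch: if each of the m = k(k-1)/2 comparisons were an independent test at level α (an idealization, since pairwise comparisons share data), the experimentwise error rate would grow as 1 - (1 - α)^m:

```python
alpha = 0.05

for k in (3, 4, 6, 10):
    m = k * (k - 1) // 2        # number of pairwise comparisons among k means
    eer = 1 - (1 - alpha) ** m  # P(at least one Type I error), if independent
    print(f"k = {k:2d}  m = {m:3d}  approx. EER = {eer:.2f}")
```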

A Spectrum of Tests: From Liberal to Conservative
According to the classic classification by Winer (1962), multiple comparison tests can be ordered from most liberal to most conservative as follows:

       1.     Duncan's Multiple Range Test (MRT)

      2.     Student-Newman-Keuls Test (SNK)

      3.     Fisher's Least Significant Difference (LSD)

      4.     Tukey's Honestly Significant Difference (HSD)

      5.     Scheffé's Test

This means that applying Duncan's test will likely yield more statistically significant differences between means than using Scheffé's test.

Illustrative Example with Fictional Data

Means and standard deviations of blood pressure reduction for six treatment groups are shown in Table 1, and the analysis of variance (ANOVA) in Table 2.

Table 1: Blood Pressure Reduction (mmHg) by Treatment Group

Treatment    N    Mean (mmHg)    Standard deviation
A            5        21               5.10
B            5         8               7.07
C            5        10               5.83
D            5        29               5.10
E            5        13               7.07
Control      5         2               5.48


Table 2: Analysis of Variance (ANOVA)

Source        df       SS         MS        F
Treatments     5    2354.17    470.83    13.08
Error         24     864.04     36.00
Total         29    3218.21
      ·        Duncan's MRT and SNK: Both tests use critical values that depend on the number of means in the range (p) spanned by the two means being compared in the ordered list. Comparing the critical ranges in Table 3 shows that Duncan's test is more liberal, declaring significance more easily (its minimum significant differences are smaller than SNK's).

 

Table 3: Critical ranges for Duncan's and Student-Newman-Keuls (SNK) tests

            Critical range for number of means in the range (p)
Test          p=2      p=3      p=4      p=5      p=6
Duncan's     7.83     8.23     8.48     8.66     8.79
SNK          7.83     9.75    10.47    11.18    11.73

 

          ·        Fisher's LSD, Tukey's HSD, and Scheffé's Test: A comparison of the critical differences clearly shows the spectrum: Fisher's LSD is the most liberal (smallest critical difference), followed by Tukey's HSD, with Scheffé's test being the most conservative (largest critical difference).

 

Table 4: Critical Differences for Pairwise Comparison Tests

Test             Critical difference
Fisher's LSD            7.83
Tukey's HSD            11.73
Scheffé's              13.74
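These critical differences can be reproduced with scipy, assuming an error mean square of MSE = 36 with 24 degrees of freedom and n = 5 per group (values consistent with the pooled within-group variances of Table 1):

```python
import numpy as np
from scipy import stats

mse, df_err, n, k = 36.0, 24, 5, 6   # assumed ANOVA quantities (see lead-in)
se_diff = np.sqrt(2 * mse / n)       # standard error of a difference of two means

lsd     = stats.t.ppf(0.975, df_err) * se_diff
tukey   = stats.studentized_range.ppf(0.95, k, df_err) * np.sqrt(mse / n)
scheffe = np.sqrt((k - 1) * stats.f.ppf(0.95, k - 1, df_err)) * se_diff

print(f"LSD = {lsd:.2f}  Tukey = {tukey:.2f}  Scheffe = {scheffe:.2f}")
# approximately 7.83, 11.73, 13.74, matching Table 4
```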

Practical Recommendations (Based on SAS/STAT 9.2 Manual)

        1.     Use the unprotected LSD test if you are interested in several individual comparisons and are not concerned with multiple inferences.

        2.     For all pairwise comparisons, use Tukey's test.

        3.     For comparisons with a control group, use Dunnett's test.
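For recommendation 2, here is a minimal statsmodels sketch; the observations are simulated to mimic the group means of Table 1 (n = 5 per group), since the original raw data are not reproduced here:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(42)
means = {"A": 21, "B": 8, "C": 10, "D": 29, "E": 13, "Control": 2}

values, groups = [], []
for name, mu in means.items():
    values.extend(rng.normal(loc=mu, scale=6.0, size=5))  # simulated, illustrative
    groups.extend([name] * 5)

# Tukey's HSD for all pairwise comparisons at alpha = 0.05
result = pairwise_tukeyhsd(np.array(values), np.array(groups), alpha=0.05)
print(result)
```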

Choosing the Right Test: A Decision Framework

Imagine an experiment with more than two groups analyzed by a one-way ANOVA at a 5% significance level. For unplanned comparisons, the researcher has several options:

         ·        To control the experimentwise error rate at 5%, use Tukey's HSD (for all pairs) or Dunnett's test (vs. a control). The trade-off is a lower comparisonwise error rate.

        ·        For higher power, use Fisher's LSD (unprotected), Duncan's MRT, or SNK. These maintain a ~5% comparisonwise error rate, but the experimentwise error rate will be much higher (depending on the number of treatments).

Context Matters

        ·        Choose a Conservative Test (Tukey, Dunnett, planned LSD) when you need high confidence to reject H₀. This is crucial in fields like pharmacology, where recommending a new drug with unknown side effects requires strong evidence of its superiority over the standard treatment.

      ·        Choose a Liberal Test (unprotected LSD, Duncan) when you need high discriminatory power. This is common in product testing or agronomy, where the primary goal is to detect any potential difference, and a false positive is less consequential than missing a real difference. Alternatively, using a conservative test like Tukey at a 10% significance level also increases power.

Final Considerations

       ·        Scheffé's Test has excellent mathematical properties but is often considered excessively conservative for simple pairwise comparisons.

      ·        Bonferroni Correction is best suited for a small, pre-defined number of comparisons, as it becomes overly conservative with a large number of tests.

      ·        No Single Best Test: All procedures have advantages and disadvantages. While not exact, using a formal method for comparing means prevents conclusions from being entirely subjective. The researcher always has a margin of choice in both the selection of the test and the establishment of the significance level.

A Note on Software

The calculations for this guide were performed using SAS software. Results from other software packages or hand calculations may show slight differences due to rounding. Differences are typically more pronounced for the SNK test, as its critical values are less standardized across different sources.