Thursday, December 04, 2025

Split-plot vs. hierarchical design: how to choose the correct model for your experiment

 

In biological, agricultural and clinical research, we often encounter experiments where the same experimental unit is measured multiple times, or different parts of the same unit receive distinct treatments. These structures violate the assumption of independence of observations, requiring specific statistical approaches.

When the problem arises

Consider a crossover clinical trial to evaluate three treatments for cardiac arrhythmia. Each participant receives all treatments in random sequence, with washout periods between them to avoid carry-over effects. Each participant serves as a block.

 

More complex situations involve two levels of treatment: groups of units receive main treatments, while each individual unit receives multiple secondary treatments over time.

Practical example: tomato plant study

Imagine 30 tomato plants (plots) randomized to 5 fertilizer formulas (main treatments), six plants per formula. Each plant receives two irrigation regimes (secondary treatments) in distinct periods, so every plant contributes one observation per regime, for 60 observations in total.

The split-plot model: when plots are heterogeneous

The split-plot design is appropriate when there is natural variability between experimental units (plots). This model explicitly considers two error levels:

·       Error (a): Variability between plots within the same main treatment

·       Error (b): Variability within plots (between subplots)

Statistical model:
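In one common notation (with i indexing main treatments, k indexing plots within main treatments, and j indexing secondary treatments), the split-plot model can be written as

Yijk = m + gi + dik + tj + (gt)ij + eijk

where m is the general mean, gi is the effect of the i-th main treatment, dik the error (a) term associated with the k-th plot within the i-th main treatment, tj the effect of the j-th secondary treatment, (gt)ij the interaction, and eijk the error (b) term.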

ANOVA Table - Split-Plot Design

For the tomato example (5 fertilizers, 6 plants per fertilizer, 2 irrigation regimes), the sources of variation and degrees of freedom are:

Source of variation           df    F-test denominator
Fertilizer (main)              4    Residual (a)
Residual (a)                  25
Irrigation (secondary)         1    Residual (b)
Fertilizer × Irrigation        4    Residual (b)
Residual (b)                  25
Total                         59

Note on F tests: In the split-plot, the test for main treatments uses Residual (a) as the denominator, while the tests for secondary treatments and the interaction use Residual (b). This distinction is essential for valid conclusions.
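As a rough numerical companion (simulated data, with hypothetical effect sizes and column names), the Python sketch below reproduces the two error strata for the tomato example: the main-treatment test is run on the plot means, so that Residual (a) is the denominator, and a modern mixed model with a random plot effect is fitted for comparison.

```python
# Sketch only: simulated split-plot data, 5 fertilizers x 6 plants (plots) x 2 irrigation regimes.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

df = pd.DataFrame(
    [(f, f"plot_{f}_{p}", irr)
     for f in range(5) for p in range(6) for irr in ("drip", "sprinkler")],
    columns=["fertilizer", "plot", "irrigation"],
)
plot_effect = {pl: rng.normal(0, 1.0) for pl in df["plot"].unique()}
df["y"] = (
    0.5 * df["fertilizer"]                      # main-treatment effect (hypothetical)
    + 0.8 * (df["irrigation"] == "drip")        # secondary-treatment effect (hypothetical)
    + df["plot"].map(plot_effect)               # error (a): plot-to-plot variability
    + rng.normal(0, 0.5, len(df))               # error (b): within-plot variability
)

# Main-treatment test with Residual (a) as the denominator: analyse the plot means.
plot_means = df.groupby(["plot", "fertilizer"], as_index=False)["y"].mean()
whole_plot = smf.ols("y ~ C(fertilizer)", data=plot_means).fit()
print(sm.stats.anova_lm(whole_plot))

# Equivalent mixed-model view: a random intercept for each plot.
mixed = smf.mixedlm("y ~ C(fertilizer) * C(irrigation)", data=df, groups="plot").fit()
print(mixed.summary())
```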

The hierarchical model: when homogeneity is assumed

In situations where plots can be considered perfectly homogeneous, the hierarchical (nested) model is more appropriate.

Practical example: coffee quality study

Evaluation of coffee quality from four different origins. From each origin, we sample four bags, and from each bag we perform three laboratory analyses, giving 4 × 4 × 3 = 48 observations.

Critical assumption: coffee within bags from the same origin is homogeneous.

Statistical model:

Yijk = m + gi + tij + eijk

where m is the general mean, gi are the effects of the main treatments (origins), tij the effects of the secondary treatments (bags) nested within the i-th main treatment, and eijk independent random errors, normally distributed with mean zero and variance s2.

ANOVA Table - Nested Design

For the coffee example (4 origins, 4 bags per origin, 3 analyses per bag):

Source of variation               df
Origin (main treatment)            3
Bags within origins               12
Residual (analyses within bags)   32
Total                             47

Comparative Table: Split-Plot vs. Hierarchical

Aspect                    Split-plot                                  Hierarchical (nested)
Assumption about plots    Naturally heterogeneous                     Homogeneous
Error structure           Two strata: Residual (a) and Residual (b)   A single residual
Typical units             Animals, people, plants, batches            Aliquots, subsamples of homogeneous material
Consequence of misuse     Generally conservative                      Underestimated variance, falsely significant tests

Practical Conclusions

1.   Choose split-plot when your plots are naturally variable biological or experimental units (animals, people, individual plants, production batches).

2.   Prefer the hierarchical model only when there is strong evidence or valid assumptions about plot homogeneity (e.g., aliquots of the same solution, subsamples of homogeneous material).

3.   Warning! Incorrect application of the hierarchical model to data with between-plot variability results in variance underestimation and falsely significant tests, as the sketch below illustrates.
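A toy simulation (hypothetical numbers, no real data) of point 3 above: the bags differ from one another, there is no true origin effect, and the analysis that pretends every laboratory analysis is an independent replicate produces an overly optimistic test.

```python
# Sketch only: 4 origins x 4 bags x 3 analyses, with real bag-to-bag variability and no origin effect.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

rows = [(o, f"bag_{o}_{b}") for o in range(4) for b in range(4) for _ in range(3)]
df = pd.DataFrame(rows, columns=["origin", "bag"])
bag_effect = {bag: rng.normal(0, 1.0) for bag in df["bag"].unique()}
df["y"] = df["bag"].map(bag_effect) + rng.normal(0, 0.3, len(df))

# Analysis that assumes homogeneous bags: origin is tested against the 48-observation residual.
naive = smf.ols("y ~ C(origin)", data=df).fit()
print(sm.stats.anova_lm(naive))       # p-values tend to be too small (false positives)

# Analysis that respects the bag stratum: origin is tested against the variation among bag means.
bag_means = df.groupby(["origin", "bag"], as_index=False)["y"].mean()
honest = smf.ols("y ~ C(origin)", data=bag_means).fit()
print(sm.stats.anova_lm(honest))      # Type I error stays near the nominal level
```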

Final Considerations

The choice between these models is not merely technical but conceptual. It reflects our understanding of the nature of the experimental material and the variation structure present in the data. When in doubt, the split-plot model is generally more conservative and appropriate, as it does not assume homogeneity where it may not exist.

Historical note: This discussion dates back to classical works in experimental statistics but remains surprisingly relevant in the era of mixed models and multilevel analyses.



Saturday, November 29, 2025

p-value vs. significance level: untangling the statistics that even researchers confuse

 

To understand the issue of the p-value and the significance level, we need a bit of statistical history. There are two schools of thought on hypothesis testing. The first was popularized by Ronald A. Fisher in the 1920s. Fisher saw the p-value not as part of a formal procedure for testing hypotheses, but as an informal way of judging how surprising the observed data are. The p-value, when combined with the researcher's knowledge of the subject and their research experience, is useful for interpreting new data.

Figure 1. The Logic of the p-value

Note: In the figure, the p-value is the shaded area under the normal distribution curve, representing the probability of observing a test result as extreme as or more extreme than the observed value when the null hypothesis is true.
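As a small numerical companion to the figure (all numbers here are illustrative), the p-value can be computed as a tail area of the null distribution:

```python
# Sketch: the p-value as the area beyond the observed statistic under the null distribution.
from scipy import stats

z_observed = 2.1                                  # hypothetical standardized test statistic
p_one_sided = stats.norm.sf(z_observed)           # area to the right of z_observed
p_two_sided = 2 * stats.norm.sf(abs(z_observed))  # both tails: "as extreme or more extreme"
print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```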

After Fisher's work was presented, Jerzy Neyman and Egon Pearson approached the question differently. It's important to remember that in science, it is crucial to limit two types of errors: false positives (when you think something is real that isn't) and false negatives (when you think something that occurs is not real).

As an example, consider a laboratory test for diagnosing a certain disease. The test can present two types of errors: a false positive, when it says the patient is sick but they are not; and a false negative, when it says the patient is not sick but they are. In statistics, it is convention to call the false positive a Type I error and the false negative a Type II error, as shown in the scheme presented in Table 1.

·     Type I Error (false positive): when you say a treatment has an effect (affirm it) and that treatment has no effect.

·     Type II Error (false negative): when you say a treatment has no effect (deny it) and that treatment has an effect.

Table 1. Type I Error and Type II Error

                        Treatment has no effect         Treatment has an effect
Effect is affirmed      Type I error (false positive)   Correct decision
Effect is denied        Correct decision                Type II error (false negative)

False positives and false negatives are errors, but it is impossible to eliminate them entirely. If you rush to declare treatment effects, you will tend to find more false positives (i.e., commit more Type I errors); if you are conservative, slow to declare treatment effects, you will tend to obtain more false negatives (commit more Type II errors).

Neyman and Pearson reasoned that, although it is impossible to eliminate false positives and false negatives, it is possible to develop a decision-making process that guarantees false positives will occur with a pre-defined probability. They called this probability the significance level, denoted by the Greek letter α. Their proposal was that researchers would define α based on their experiences and expectations. Thus, someone willing to tolerate a 10% probability of false positives would set α = 0.1, but if they needed to be more conservative, they could set α at 0.01 or less.

Figure 2. The α vs. β Trade-off

Note: By decreasing α (making the test more rigorous), the portion of the H1 distribution that falls outside the rejection region, i.e., β, increases, and vice versa.
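The same trade-off can be put in numbers, here for a one-sided z-test with a hypothetical standardized effect (everything in this sketch is illustrative):

```python
# Sketch: lowering alpha raises beta for a fixed alternative hypothesis.
from scipy import stats

effect = 2.5  # mean of the test statistic under H1, in standard-error units (hypothetical)
for alpha in (0.10, 0.05, 0.01):
    z_crit = stats.norm.isf(alpha)          # rejection threshold for this alpha
    beta = stats.norm.cdf(z_crit - effect)  # P(fail to reject | H1 true)
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}, power = {1 - beta:.3f}")
```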

How does this work in practice? In the Neyman-Pearson system, once the null and alternative hypotheses are defined, the significance level α is set. Then, using a computer program, a statistical test is applied to determine the probability of obtaining a result as extreme as or more extreme than the one found in the sample, when the null hypothesis is true; that is, to determine the p-value. The Neyman-Pearson procedure consists of rejecting the null hypothesis whenever the p-value is less than or equal to the significance level α.

Therefore, unlike Fisher's procedure, this method deliberately does not use the strength of the evidence obtained in a particular experiment; it simply decides to reject the hypothesis if p ≤ α. The size of the p-value is not used to compare experiments, nor to draw conclusions beyond "The null hypothesis should (or should not) be rejected."

Although the Neyman and Pearson approach is conceptually different from Fisher's, researchers merged the two. The Neyman-Pearson approach is where we get "statistical significance" from, with a pre-chosen α value that guarantees the long-term probability of false positives. In practice, people use the Neyman-Pearson threshold (generally α = 0.05) to make a 'binary' decision, but then interpret the calculated p-value (e.g., 0.032) in the spirit of Fisher, as a measure of the strength of the evidence. This fusion is convenient but a source of much confusion.

For example, suppose you conduct an experiment and obtain a p-value = 0.032. If your threshold is the conventional α = 0.05, you have obtained a statistically significant result. It is tempting – though wrong – to say "The probability of a Type I error is 3.2%." This doesn't make sense because a single experiment does not determine a Type I error probability. You should compare your experiment to others using only the value of α.

Another consideration is necessary: we said that when you decrease one type of error, the other increases, considering the same problem being solved by the same significance test. But some tests are more powerful than others. The power of a test is defined as the probability of rejecting the null hypothesis when it is false. Therefore, the best test is the one that has the lowest probability of false negatives for a fixed value of α.

Reference

REINHART, A. Statistics Done Wrong. San Francisco: No Starch Press, 2015.

 

Tuesday, November 25, 2025

Analysis of Variance (ANOVA): Assumptions and Data Transformation

 

1.     Introduction

 

The assumptions required for an analysis of variance (ANOVA) are not always perfectly met by real-world data. Researchers who choose this procedure therefore need assurance that their data remain suitable even when the required assumptions (normality of residuals and homogeneity of variances) are not fully met.

 

It is known that minor deviations from normality do not seriously compromise the validity of the ANOVA, especially when group sizes are equal or similar. Similarly, minor violations of homogeneity of variances have little practical relevance, except in two critical situations:

(1) when there is asymmetry in the residuals;

(2) when there is positive kurtosis in the residuals.

 

In any case, the F-test remains the most powerful of the available tests, provided that its assumptions are met. Otherwise, researchers should consider using non-parametric tests or resorting to data transformation. Transformations are particularly useful for stabilizing the variance, but they also generally help bring the distribution closer to normality.

 

2.     What does it mean to transform data? 

 

Transforming data involves applying a mathematical operation to each observation and conducting statistical analyses using the resulting values. The best-known transformations are listed below.

 

2.1.         Square Root 

 

In general, variables obtained by counting do not have a constant variance or a normal distribution. For count data (e.g. the number of insects or bacterial colonies, or the prevalence of lesions), it is recommended that the square root is applied to each observation before proceeding with ANOVA. This usually results in a more constant variance.

Practical note: If the observed values are small (less than 10) or there are many zeros, it is recommended, to avoid problems with the square root of zero, to apply the Anscombe transformation, usually written as

Y = √(X + 3/8),

or a simplified, older correction that is also effective,

Y = √(X + 1/2),

before conducting the analysis.

2.2.      Logarithm

Many biological variables (such as tree height, body weight and survival time) follow a lognormal distribution. In these cases, taking the logarithm (decimal or natural) of the variable helps stabilize the variance and approximate the distribution to normality. One classic indication that this transformation is needed is when the variance of the groups increases proportionally with the mean.

2.3.      Arc sine of the square root of the proportion

If the variable is a proportion or percentage (e.g. the percentage of seeds that germinate), ANOVA can only be applied directly if the proportions vary strictly between 0.3 and 0.7. If many values fall outside this range, it is recommended that the following transformation be applied:

Y = arcsin(√p).
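A minimal sketch (illustrative numbers only) of how the three transformations above can be applied with NumPy before running the ANOVA of your choice:

```python
# Sketch: square root, logarithm and arcsine-square-root transformations on toy data.
import numpy as np

counts = np.array([0, 2, 3, 7, 12, 25])            # count data (e.g., insects per plot)
weights = np.array([1.2, 3.5, 8.9, 22.4])          # lognormal-like measurements
proportions = np.array([0.10, 0.45, 0.80, 0.95])   # germination proportions

sqrt_counts = np.sqrt(counts + 3 / 8)              # square root with the correction for zeros
log_weights = np.log10(weights)                    # decimal logarithm (np.log for natural)
angular_props = np.arcsin(np.sqrt(proportions))    # arcsine of the square root, in radians

print(sqrt_counts, log_weights, angular_props, sep="\n")
```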

 3. Final considerations

For those unfamiliar with statistics, transforming data may seem like suspicious 'manipulation'. It is not. It is a legitimate and widely accepted technique that is often necessary when alternatives are unavailable.

Although modern software offers alternative methods, such as Welch's test for one-way analysis of variance, transforming the original variable may be the only feasible and robust approach to satisfy the model assumptions for more complex analysis of variance models, such as split-plot designs or hierarchical models.

Researchers must always be able to justify their chosen transformation and, ideally, use the most common transformation in their field of study.

Important: even if the statistical analysis was performed using transformed data, the descriptive results (means, standard errors, graphs, etc.) must be presented on the original scale of the variable. To achieve this, the transformation must be 'undone' (back-transformed) using the inverse function of the original transformation.
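As a small illustration (hypothetical numbers), back-transforming a mean computed on the decimal-log scale simply applies the inverse function, 10 raised to the transformed mean, which yields a geometric mean on the original scale:

```python
# Sketch: back-transforming a mean from the log10 scale to the original scale.
import numpy as np

weights = np.array([1.2, 3.5, 8.9, 22.4])
mean_log = np.log10(weights).mean()     # mean on the transformed (log10) scale
back_transformed = 10 ** mean_log       # reported on the original scale (geometric mean)
print(round(back_transformed, 2))
```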