A simple example with numbered balls helps us understand why we use "n - 1" in the sample variance formula and how the standard error of the mean emerges. Includes charts for easy visualization.
To explain the standard error of the mean and present the logic behind the degrees of freedom in sample variance, let's use an unrealistic, but incredibly useful, example.
Imagine an urn containing three numbered balls: 4, 10, and 16. A player draws a ball, notes the number, returns the ball to the urn, and makes a second draw. They note this number as well and then calculate the average of the two numbers, which is their score.
From a theoretical standpoint, we have an infinite population of numbered balls (since the ball is returned after each draw), and this game can be repeated indefinitely. The population mean (μ) is:

μ = Σ xᵢ pᵢ

As xᵢ can be 4, 10, or 16, each with a probability pᵢ = 1/3, we get:

μ = (4 + 10 + 16) / 3 = 10

We have the population mean μ, which is a parameter. Therefore, the population variance around this mean does not involve degrees of freedom. The dispersion of the variable around the mean μ is given by:

σ² = Σ pᵢ (xᵢ − μ)² = [(4 − 10)² + (10 − 10)² + (16 − 10)²] / 3 = 72 / 3 = 24
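These two population values can be checked with a short Python snippet (the code and variable names are mine, not from the original post):

```python
# Population of ball values; each ball is equally likely (p = 1/3).
balls = [4, 10, 16]

# Population mean: mu = sum(x_i * p_i).
mu = sum(balls) / len(balls)

# Population variance: divisor is N, not N - 1, because mu is a known
# parameter of the population, not an estimate from a sample.
sigma2 = sum((x - mu) ** 2 for x in balls) / len(balls)

print(mu, sigma2)  # 10.0 24.0
```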
Now, let's analyze the possible game outcomes. Since both the first and second draw can result in a 4, 10, or 16, there are 9 possible combinations. Table 1 shows all these possible samples of size 2, their means, and their variances.
Table 1: All possible samples of size two from the population {4, 10, 16}, with their respective means and variances.

| Sample (1st, 2nd draw) | Mean x̄ | Variance s² (divisor *n-1*) |
|---|---|---|
| (4, 4) | 4 | 0 |
| (4, 10) | 7 | 18 |
| (4, 16) | 10 | 72 |
| (10, 4) | 7 | 18 |
| (10, 10) | 10 | 0 |
| (10, 16) | 13 | 18 |
| (16, 4) | 10 | 72 |
| (16, 10) | 13 | 18 |
| (16, 16) | 16 | 0 |
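Table 1 can be regenerated programmatically. Here is a sketch in Python; note that `statistics.variance` already uses the divisor *n-1*:

```python
from itertools import product
from statistics import mean, variance  # variance() divides by n - 1

balls = [4, 10, 16]

# All 9 ordered draws of size 2, with replacement.
samples = list(product(balls, repeat=2))

# One row per sample: (sample, mean, variance with divisor n - 1).
rows = [(s, mean(s), variance(s)) for s in samples]

for s, m, v in rows:
    print(s, m, v)
```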
By observing Table 1, we note two crucial facts:

1. The average of all possible sample means is equal to the population mean.
2. The average of all possible sample variances is equal to the population variance, but only because the sample variances were calculated using the divisor *n-1* (i.e., the sample variance formula).
Therefore, we say that:

✅ The sample mean is an unbiased estimator of the population mean.

✅ The sample variance (with divisor *n-1*) is an unbiased estimator of the population variance.
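A quick Python check of both facts, contrasting the unbiased divisor *n-1* with the naive divisor *n* (snippet mine, not from the original post):

```python
from itertools import product
from statistics import mean, variance, pvariance

balls = [4, 10, 16]
mu, sigma2 = 10, 24  # population parameters computed earlier
samples = list(product(balls, repeat=2))  # all 9 equally likely samples

# Fact 1: average of all sample means equals the population mean.
avg_of_means = mean(mean(s) for s in samples)

# Fact 2: average of the sample variances (divisor n - 1) equals sigma².
avg_var_n_minus_1 = mean(variance(s) for s in samples)

# With divisor n (pvariance) the average is too small: biased.
avg_var_n = mean(pvariance(s) for s in samples)

print(avg_of_means, avg_var_n_minus_1, avg_var_n)  # 10.0 24.0 12.0
```

With the divisor *n*, the average across all samples is 12, half the true variance of 24, which is exactly the underestimation the *n-1* correction removes.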
🔔 Why "unbiased"? Because these estimates, on average across all possible samples, hit the true values of the population parameters.
In our example, the sample means have different probabilities:
🔸 Means of 4 and 16: each occurs once (Probability: 1/9)
🔸 Means of 7 and 13: each occurs twice (Probability: 2/9)
🔸 Mean of 10: occurs three times (Probability: 3/9)
The weighted average of the sample means, using their probabilities, is:

E(x̄) = 4(1/9) + 7(2/9) + 10(3/9) + 13(2/9) + 16(1/9) = 90/9 = 10 = μ
Since the sample means are distributed around the population mean, we can measure this dispersion: this is the variance of the sample mean, given by Var(x̄). Using the values from Table 1:

Var(x̄) = Σ pᵢ (x̄ᵢ − μ)² = (1/9)(4 − 10)² + (2/9)(7 − 10)² + (3/9)(10 − 10)² + (2/9)(13 − 10)² + (1/9)(16 − 10)² = 108/9 = 12

Note that 12 is exactly σ²/n = 24/2: the variance of the sample mean is the population variance divided by the sample size.
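The same number falls out of a short Python computation over the nine sample means (snippet mine):

```python
from itertools import product
from statistics import mean, pvariance

balls = [4, 10, 16]
samples = list(product(balls, repeat=2))
sample_means = [mean(s) for s in samples]

# Variance of the sampling distribution of the mean: all 9 samples are
# equally likely, so the population-style divisor (n = 9) is correct here.
var_xbar = pvariance(sample_means)

print(var_xbar)  # 12.0, which equals sigma² / n = 24 / 2
```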
🔔 In practice, however, we don't have access to all possible samples; a researcher usually only has one sample. Even so, it's possible to estimate the variance of the mean using the formula:

Var(x̄) = s² / n

where:

· s² is the sample variance (calculated with divisor *n-1*),

· *n* is the sample size.
This formula allows us to estimate the expected variability of the sample mean if the study were repeated many times. The standard deviation of the sample mean, known as the standard error of the mean (SEM), is then:
SEM = √(s² / n) = s / √n
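As a sketch of how a researcher would compute the SEM from a single sample in Python (the sample values below are one illustrative pair of draws, not data from the post):

```python
from math import sqrt
from statistics import stdev  # sample standard deviation, divisor n - 1

# A single observed sample, e.g. the player drew the balls 4 and 16.
sample = [4, 16]
n = len(sample)

s = stdev(sample)      # s = sqrt(72), the sample standard deviation
sem = s / sqrt(n)      # estimated standard error of the mean

print(sem)
```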
This value tells us how much the mean from a single sample is likely to vary from the true population mean. It is a fundamental concept for constructing confidence intervals and conducting hypothesis tests.
✅ Conclusion
The simple example we developed, with numbered balls and all possible sample combinations, shows how often-abstract statistical concepts, like standard error, sample variance, and degrees of freedom, have a clear and visual logical foundation. By calculating all possible means and variances, we can understand why the sample mean is a good estimate of the population mean and why variance must be divided by *n-1* to avoid underestimating the true variability.

More than memorizing formulas, it's crucial to understand the intuition behind them. And there's nothing better than a small, complete example, accompanied by charts, to turn theory into understanding.
Translation & Editing Note: This post was translated from Portuguese and edited for clarity with the assistance of an AI language model. The statistical methodology, calculations, and conclusions were rigorously verified by the author.