Why do overly simple models fail... and overly complex ones too?
Anyone who has tried to fit a model to experimental data knows the dilemma:
1. If the model is too simple, it fails to represent the data.
2. If the model is too complex, it overfits the data and loses
generalizability.
This dilemma is known as the bias-variance tradeoff.
📉 Bias: when the model misses the essentials
Bias arises when a model is too simple to describe a phenomenon that is actually
complex. The model underrepresents the phenomenon and ignores important patterns
in the data.
Bias is a measure of the difference between the average model prediction
(averaged over possible training sets) and the true value to be predicted.
Example: Fitting a straight line to a set of points that clearly follows a
curve.
Consequence: the model systematically errs, even with plenty of data. This is
called underfitting.
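The claim that bias persists "even with plenty of data" can be checked with a small simulation. The sketch below is only an illustration under assumed conditions: the true curve (x squared on the interval 0 to 1), the noise level (0.1), and the sample sizes are all invented for the demonstration. It fits a straight line to larger and larger samples and measures how far the fitted line stays from the true curve.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_curve(x):
    """Hypothetical 'real' relationship: a simple curve (assumption)."""
    return x ** 2


def rmse_against_truth(n, degree):
    """Fit a polynomial of the given degree to n noisy points and
    return its root-mean-square distance from the true curve."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_curve(x) + rng.normal(0.0, 0.1, size=n)  # assumed noise sd
    coefs = np.polyfit(x, y, deg=degree)
    grid = np.linspace(0.0, 1.0, 201)
    gap = np.polyval(coefs, grid) - true_curve(grid)
    return float(np.sqrt(np.mean(gap ** 2)))


# Straight line fitted to small, medium, and large samples
line_errors = {n: rmse_against_truth(n, degree=1) for n in (20, 200, 10_000)}
# A quadratic (the right shape) fitted to the large sample, for comparison
curve_error = rmse_against_truth(10_000, degree=2)

for n, err in line_errors.items():
    print(f"straight line, n={n:>6}: RMSE vs truth = {err:.3f}")
print(f"quadratic,     n= 10000: RMSE vs truth = {curve_error:.3f}")
```

With these settings the straight line's error should level off at roughly 0.07 no matter how many points are added, while the quadratic's error keeps shrinking toward zero. More data reduces variance, not bias.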
📈 Variance: when the model overreacts to the data
If small changes
in the training data lead to large changes in the fitted model, we say the
model has high variance.
Variance is a measure of the model’s sensitivity to the training data: how much
the model changes when the data changes.
Example: Fitting a degree-10 polynomial to a small data set. The curve hits
every point but swings wildly between them.
Consequence: the model performs well on training data but fails on new data.
This is called overfitting.
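One way to see variance directly is to refit the same model on many data sets drawn from the same process and watch how much its prediction at a fixed point moves. The sketch below does this under assumed conditions (a sine curve as the "truth", noise with standard deviation 0.1, and 11 equally spaced points); none of these values come from a real experiment. It uses `numpy.polynomial.Polynomial.fit`, which rescales x internally and keeps the degree-10 fit numerically stable.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable

x = np.linspace(0.0, 1.0, 11)   # 11 points: a degree-10 fit hits them all
noise_sd = 0.1                  # assumed noise level
x_new = 0.05                    # a point between the first two observations


def true_curve(x):
    """Hypothetical 'real' relationship (assumption)."""
    return np.sin(2 * np.pi * x)


def predictions_at_x_new(degree, n_datasets=500):
    """Refit the model on many simulated data sets and collect
    its prediction at x_new each time."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        y = true_curve(x) + rng.normal(0.0, noise_sd, size=x.size)
        model = Polynomial.fit(x, y, deg=degree)
        preds[i] = model(x_new)
    return preds


# Spread of the prediction across data sets = the model's variance at x_new
sd_line = float(predictions_at_x_new(degree=1).std())
sd_poly = float(predictions_at_x_new(degree=10).std())

print(f"noise sd in the data:          {noise_sd:.3f}")
print(f"sd of straight-line prediction: {sd_line:.3f}")
print(f"sd of degree-10 prediction:     {sd_poly:.3f}")
```

Under these assumptions the straight line's prediction should scatter less than the noise itself, while the degree-10 prediction should scatter roughly ten times more than the noise: the polynomial amplifies random fluctuations instead of averaging them out.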
🎯 Finding the optimal point
A model's expected prediction error can be broken down into three components:
1. Bias error (squared bias) → the model is too simple and inadequate;
2. Variance error → the model varies too much between samples;
3. Irreducible noise → natural variability in the data that no model can
eliminate.
The goal is to find a model that balances bias and variance, minimizing total
error. This is the bias-variance tradeoff: reducing one of the two usually
increases the other.
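For squared-error loss, the three components can be written out explicitly. The notation below is an assumption introduced here, not taken from the text: the data follow y = f(x) + ε with mean-zero noise of variance σ², f̂ is the model fitted to a random training set, and the expectation runs over training sets and over the noise in a new observation.

```latex
% Assumed setup: y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% \hat{f} fitted on a random training set independent of the new observation y.
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The first term shrinks as the model becomes more flexible, the second grows, and the third is unaffected by the choice of model.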
⚙️ Comparing models
Model | Bias | Variance
[model name missing] | Low | Medium
[model name missing] | Medium | Low
[model name missing] | Low | High
🌽 Example: corn yield under different phosphorus levels
Suppose:
- Five phosphorus levels: 20, 40, 60, 80, 100 kg/ha;
- Mean yields from ten replicates: 2.8, 3.2, 3.4, 3.3, and 1.8 tons/ha, respectively.
How should these data be modeled?
It depends on the modeling goal and the chosen method:
🔹 Approach 1: Traditional ANOVA (categorical model)
Advantages:
- Simple and interpretable;
- No need to assume a functional relationship between levels.
Disadvantages:
- Does not capture trends;
- Higher variance if the data are noisy;
- No predictive power beyond the observed levels.
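A minimal sketch of what the categorical model estimates, using the five means from the example. Only the means are given, so the replicate-level F test cannot be reproduced; the code just shows the structure of the model: one parameter per level and nothing in between. The helper `predict_categorical` is a hypothetical name introduced for the illustration.

```python
import numpy as np

levels = np.array([20, 40, 60, 80, 100])           # kg/ha of phosphorus
mean_yield = np.array([2.8, 3.2, 3.4, 3.3, 1.8])   # tons/ha

# One indicator column per level: the categorical (cell-means) design matrix.
design = (levels[:, None] == levels[None, :]).astype(float)
estimates, *_ = np.linalg.lstsq(design, mean_yield, rcond=None)

# The least-squares estimates are simply the observed level means.
level_means = dict(zip(levels.tolist(), estimates.tolist()))


def predict_categorical(level):
    """Return the estimated mean for an observed level, or None otherwise."""
    return level_means.get(level)


n_parameters = design.shape[1]
print("parameters:", n_parameters)
print("estimate at 60 kg/ha:", predict_categorical(60))
print("estimate at 50 kg/ha:", predict_categorical(50))
```

The model spends five parameters on five levels and has no answer at 50 kg/ha, because it assumes nothing about how neighboring levels relate to each other.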
🔹 Approach 2: Linear regression (numeric model)
Advantages:
- More parsimonious: just two parameters;
- Allows interpolation and extrapolation.
Disadvantages:
- High bias if the relationship is nonlinear;
- Can miss real differences between levels when the response is curved.
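As a rough check of these points, the sketch below fits a straight line to the five means from the example with `numpy.polyfit`. It works on the means only (the replicates are not given), and `predict_linear` is a hypothetical helper name.

```python
import numpy as np

levels = np.array([20.0, 40.0, 60.0, 80.0, 100.0])  # kg/ha of phosphorus
mean_yield = np.array([2.8, 3.2, 3.4, 3.3, 1.8])    # tons/ha

# Two parameters: slope and intercept of the least-squares line.
slope, intercept = np.polyfit(levels, mean_yield, deg=1)

fitted = intercept + slope * levels
residual_ss = float(np.sum((mean_yield - fitted) ** 2))


def predict_linear(level):
    """Predict yield at any phosphorus level, observed or not."""
    return float(intercept + slope * level)


print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
print(f"residual sum of squares on the means = {residual_ss:.3f}")
print(f"prediction at 50 kg/ha = {predict_linear(50):.3f}")
```

The fitted line is about 3.47 − 0.0095 × level. It can predict at 50 kg/ha, which the categorical model cannot, but it describes a steady decline: it gives 2.90 at 60 kg/ha where 3.4 was observed, and 2.52 at 100 kg/ha where 1.8 was observed. That systematic miss is the bias.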
🔹 Approach 3: Polynomials (more flexible model)
Advantages:
- Captures nonlinear effects;
- May better represent the real pattern.
Disadvantages:
- Higher variance;
- Risk of overfitting: the model may be too complex for the data set.
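To see how flexibility trades against the risk of overfitting, the sketch below fits polynomials of degree 1 to 4 to the five means from the example and reports how much of the variation each one leaves unexplained. It is an illustration on the means only and uses `numpy.polynomial.Polynomial.fit`.

```python
import numpy as np
from numpy.polynomial import Polynomial

levels = np.array([20.0, 40.0, 60.0, 80.0, 100.0])  # kg/ha of phosphorus
mean_yield = np.array([2.8, 3.2, 3.4, 3.3, 1.8])    # tons/ha

residual_ss = {}
for degree in (1, 2, 3, 4):
    model = Polynomial.fit(levels, mean_yield, deg=degree)
    residual_ss[degree] = float(np.sum((mean_yield - model(levels)) ** 2))
    # A polynomial of degree d has d + 1 coefficients.
    print(f"degree {degree} ({degree + 1} parameters): "
          f"residual SS = {residual_ss[degree]:.4f}")
```

The residual sum of squares on the means falls from about 1.36 for the line to about 0.16 for the quadratic, and reaches zero at degree 4, which has as many coefficients as there are means. A perfect fit at that point is guaranteed by arithmetic and says nothing about how well the curve would predict a new trial.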
🔚 Closing thought
Between the
simplicity that misses the phenomenon and the complexity that gets lost in
noise, there is a point of balance. The challenge of modeling lies in
recognizing it — with data, sound judgment, and theory.