1.1. Why Study Regression Today
Even in the era of machine learning — with complex algorithms like deep neural networks and gradient boosting dominating the conversation — linear regression remains indispensable. It is:
· A baseline model to compare against more sophisticated techniques.
· Interpretable, allowing us to understand how each variable influences the result.
· A conceptual foundation for modern methods such as penalized regression (Lasso, Ridge) and generalized linear models.
· An essential tool for communicating results clearly to non-specialists.
Data Science Note: In many projects, we test a simple linear regression as a baseline before moving on to complex models. If it already delivers good accuracy, more elaborate approaches may be unnecessary.
1.2. Presenting the Problem
A manager should know that strategic decisions need to be backed by data. But can they interpret a statistical analysis or judge whether a model makes sense?
You don’t have to perform long calculations by hand — tools like Excel, R, and Python handle that. What you do need is to understand the reasoning behind the numbers.
Basic Concepts:
· Dependent variable (target, response): what we want to understand or predict.
· Independent variables (features, explanatory): factors we believe influence the dependent variable.
Simple regression: uses a single independent variable.
Multiple regression: uses two or more independent variables.
Practical examples:
· Predicting weight of adults based only on height → simple regression.
· Predicting weight of children based on age and height → multiple regression.
· Studying the effect of sedentary lifestyle, smoking, and diet on the risk of heart disease → multiple regression.
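To make the distinction concrete, here is a small sketch fitting both a simple and a multiple regression by least squares. The data are invented for illustration only; they are not from the book:

```python
import numpy as np

# Hypothetical data: children's age (years), height (cm), and weight (kg).
age = np.array([4, 6, 8, 10, 12], dtype=float)
height = np.array([102, 115, 128, 139, 150], dtype=float)
weight = np.array([16, 21, 26, 32, 40], dtype=float)

# Simple regression: weight ~ height (one independent variable).
X_simple = np.column_stack([np.ones_like(height), height])
coef_simple, *_ = np.linalg.lstsq(X_simple, weight, rcond=None)

# Multiple regression: weight ~ age + height (two independent variables).
X_multi = np.column_stack([np.ones_like(age), age, height])
coef_multi, *_ = np.linalg.lstsq(X_multi, weight, rcond=None)

print(coef_simple)  # [intercept, slope for height]
print(coef_multi)   # [intercept, slope for age, slope for height]
```

The only structural difference is the number of columns in the design matrix: one extra column per additional explanatory variable.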
1.3. How Regression Is Used
Organizations apply regression to:
· Explain phenomena: “Why did customer service calls drop last month?”
· Predict the future: “What will sales be in the next quarter?”
· Support decisions: “Should we keep this marketing campaign?”
Regression answers key questions:
· Which variables have the greatest impact?
· Which variables can be ignored?
· How much of the variation in the dependent variable does the model explain?
1.4. Example: Predicting House Prices
Imagine you move to a new city and want to buy a house. A sample of nine houses for sale provides the following data:
1.5. Prediction Using the Mean
The mean house price is a natural first guess. It’s a starting point, but far from precise: the real price could vary widely.
Limitation of the Mean as Predictor: In our example, the 95% confidence interval for the mean price ranged from R$ 95,000 to R$ 586,000 — too broad for practical decision-making.
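A minimal sketch of that calculation — the book’s sample of nine prices is not reproduced here, so the numbers below are invented placeholders:

```python
import math
from statistics import mean, stdev

# Hypothetical prices in thousands of R$ for nine houses (illustrative only).
prices = [120, 180, 250, 290, 310, 360, 420, 480, 650]

n = len(prices)
m = mean(prices)
se = stdev(prices) / math.sqrt(n)  # standard error of the mean

t_crit = 2.306  # t value for 95% confidence with n - 1 = 8 degrees of freedom
ci = (m - t_crit * se, m + t_crit * se)
print(f"mean = {m:.1f}, 95% CI = [{ci[0]:.1f}, {ci[1]:.1f}]")
```

The interval width depends only on the spread of the prices and the sample size — with no explanatory variable, this is as precise as the prediction can get.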
1.6. Prediction Using an Independent Variable
To improve accuracy, we choose the variable most correlated with price: lot size (X2, in m²). The fitted equation is:
Ŷ = 145.42 + 0.9674 · X2
The slope indicates the average increase in price (in thousands of R$) for each additional square meter of lot size.
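The fitted line can be used directly as a one-line prediction function (coefficients are those reported above):

```python
def predict_price(lot_m2):
    """Predicted price in thousands of R$ for a lot of `lot_m2` square meters,
    using the fitted line Y = 145.42 + 0.9674 * X2."""
    return 145.42 + 0.9674 * lot_m2

# Each extra square meter adds 0.9674 (thousands of R$) to the prediction.
print(round(predict_price(200), 1))  # 338.9 -> about R$ 339,000
```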
1.8. Basic Assumptions
· Linearity: the average relationship between Y and X is linear.
· Homoscedasticity: residual variance is constant.
· Independence: observations don’t influence each other.
· Normality of residuals: important for significance testing.
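These assumptions can be screened with simple residual checks. A sketch on invented data (a real analysis would also inspect residual-vs-fitted and Q-Q plots):

```python
import numpy as np

# Hypothetical lot sizes (m²) and prices generated around a straight line.
x = np.array([60, 80, 100, 140, 180, 200, 240, 300, 360], dtype=float)
y = 145.42 + 0.9674 * x + np.random.default_rng(1).normal(0, 40, x.size)

coef = np.polyfit(x, y, 1)           # [slope, intercept]
resid = y - np.polyval(coef, x)      # residuals of the fitted line

# With an intercept, OLS residuals sum to zero by construction.
print(f"residual mean: {resid.mean():.2e}")
# Rough homoscedasticity check: compare spread at low vs high x.
print(f"spread low x: {resid[:4].std():.1f}, high x: {resid[4:].std():.1f}")
```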
Using the standard error of estimate and confidence intervals, we predict the price of a house with a 200 m² lot:
95% CI = [R$ 209,000, R$ 469,000].
This is far narrower than using the mean alone.
1.10. Links to Modern Methods
What we’ve learned here underpins many current models:
· Multiple regression: adds more explanatory variables.
· Regularized regression (Ridge, Lasso): controls for too many variables.
· Decision trees and neural networks: replace the straight line with more complex functions, but still adjust parameters to minimize error.
In machine learning, this process is called training the model.
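A minimal sketch of that idea: fitting the same straight line by gradient descent on invented data (the seed, learning rate, and iteration count are arbitrary choices for illustration):

```python
import numpy as np

# "Training" a line: repeatedly nudge the intercept a and slope b
# in the direction that reduces the squared prediction error.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 + 2.0 * x + rng.normal(0.0, 0.5, size=x.size)  # true line: a=3, b=2

a, b = 0.0, 0.0   # arbitrary starting guess
lr = 0.01         # learning rate (step size)
for _ in range(5000):
    resid = y - (a + b * x)        # current prediction errors
    a += lr * resid.mean()         # gradient step for the intercept
    b += lr * (resid * x).mean()   # gradient step for the slope

print(f"a = {a:.2f}, b = {b:.2f}")  # close to the true values 3 and 2
```

Least squares solves this minimization in closed form; iterative training generalizes it to models, such as neural networks, where no closed form exists.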
Summary
· The mean is a simple but often imprecise predictor.
· If Y depends on X, regression improves predictions.
· Simple linear regression fits a straight line to describe the relationship between two variables.
· We can estimate both point predictions and confidence intervals.
· This is the foundation for more advanced modeling techniques.