Sunday, August 10, 2025

— Simple Linear Regression: A Fundamental Tool for Data Analysis and Machine Learning

 

1.1. Why Study Regression Today


Even in the era of machine learning — with complex algorithms like deep neural networks and gradient boosting dominating the conversation — linear regression remains indispensable. It is:

·       A baseline model to compare against more sophisticated techniques.

·       Interpretable, allowing us to understand how each variable influences the result.

·       A conceptual foundation for modern methods such as penalized regression (Lasso, Ridge) and generalized linear models.

·       An essential tool for communicating results clearly to non-specialists.

Data Science Note: In many projects, we test a simple linear regression as a baseline before moving on to complex models. If it already delivers good accuracy, more elaborate approaches may be unnecessary.
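The baseline idea above can be sketched in a few lines. The data and variable names here are purely illustrative (not from any real project), just to show the workflow of fitting a simple linear baseline and checking how much variance it already explains:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (made up): lot size (m²) vs. price (thousands of R$)
X = np.array([[100], [150], [200], [250], [300]])
y = np.array([240, 290, 340, 390, 440])

# Fit the simple linear baseline
baseline = LinearRegression().fit(X, y)
r2 = baseline.score(X, y)  # R² of the baseline on the training data

# If the baseline already explains most of the variance,
# a more elaborate model may add little value.
print(f"Baseline R²: {r2:.3f}")
```

If this score is already high for the problem at hand, the note's advice applies: start simple before reaching for gradient boosting or neural networks.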

1.2. Presenting the Problem


A manager should know that strategic decisions need to be backed by data. But can they interpret a statistical analysis or judge whether a model makes sense?

You don’t have to perform long calculations by hand — tools like Excel, R, and Python handle that. What you do need is to understand the reasoning behind the numbers.

Basic Concepts:

·       Dependent variable (target, response): what we want to understand or predict.

·       Independent variables (features, explanatory): factors we believe influence the dependent variable.

Simple regression: uses a single independent variable.

Multiple regression: uses two or more independent variables.

Practical examples:

·       Predicting weight of adults based only on height → simple regression.

·       Predicting weight of children based on age and height → multiple regression.

·       Studying the effect of sedentary lifestyle, smoking, and diet on the risk of heart disease → multiple regression.

1.3. How Regression Is Used


Organizations apply regression to:

·       Explain phenomena: “Why did customer service calls drop last month?”

·       Predict the future: “What will sales be in the next quarter?”

·       Support decisions: “Should we keep this marketing campaign?”

Regression answers key questions:

·       Which variables have the greatest impact?

·       Which variables can be ignored?

·       How much of the variation in the dependent variable does the model explain?
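The last question is answered by the coefficient of determination, R². A minimal sketch with made-up data shows how it is computed from its definition, R² = 1 − SSE/SST:

```python
import numpy as np

# Illustrative data (made up): predictor x and response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a line y ≈ b0 + b1·x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R² = 1 - SSE/SST: the fraction of the variation in y that the model explains
sse = np.sum((y - y_hat) ** 2)   # unexplained variation
sst = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
r2 = 1 - sse / sst
```

An R² near 1 means the independent variable accounts for almost all the variation; an R² near 0 means it adds little beyond the mean.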

1.4. Example: Predicting House Prices


Imagine you move to a new city and want to buy a house. A sample of nine houses for sale provides the following data:

 

1.5. Prediction Using the Mean


The simplest predictor is the mean of the observed house prices. It's a starting point, but far from precise: the real price could vary widely.

Limitation of the Mean as Predictor: In our example, the 95% confidence interval for the mean price ranged from R$ 95,000 to R$ 586,000 — too broad for practical decision-making.
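The mean and its confidence interval can be computed directly. The nine prices below are illustrative stand-ins (the original sample is not reproduced here); the mechanics are the same:

```python
import numpy as np
from scipy import stats

# Illustrative prices in thousands of R$ (made up -- the original
# nine-house sample is not reproduced in this sketch)
prices = np.array([180.0, 220, 260, 300, 340, 380, 420, 460, 505])

n = len(prices)
mean = prices.mean()
sem = prices.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the mean, using the t distribution (n-1 df)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * sem, mean + t_crit * sem)
```

With only nine observations and wide price dispersion, the interval comes out broad, which is exactly the limitation noted above.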

1.6. Prediction Using an Independent Variable


To improve accuracy, we choose the independent variable most strongly correlated with price: in this example, lot size (X2).

1.7. Fitting the Regression Line

The fitted equation is: Ŷ = 145.42 + 0.9674 · X2.

The slope indicates the average increase in price (in thousands of R$) for each additional square meter of lot size.
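Using the fitted equation is just arithmetic. Plugging in a 200 m² lot gives the point prediction (in thousands of R$):

```python
# Coefficients of the fitted equation from the text: Ŷ = 145.42 + 0.9674 · X2
b0, b1 = 145.42, 0.9674

def predict_price(lot_m2: float) -> float:
    """Point prediction of house price (thousands of R$) from lot size (m²)."""
    return b0 + b1 * lot_m2

y_hat = predict_price(200)  # 145.42 + 0.9674 * 200 = 338.90
```

So a 200 m² lot predicts a price of about R$ 338,900, which is the center of the interval computed later.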

1.8. Basic Assumptions


·       Linearity: the average relationship between Y and X is linear.

·       Homoscedasticity: residual variance is constant.

·       Independence: observations don’t influence each other.

·       Normality of residuals: important for significance testing.
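Some of these assumptions can be checked numerically from the residuals. A small sketch with made-up data (any real analysis would also inspect residual plots):

```python
import numpy as np
from scipy import stats

# Illustrative data (made up) to show quick residual diagnostics
x = np.array([50, 80, 120, 160, 200, 240, 280, 320, 360], dtype=float)
y = 145.0 + 0.97 * x + np.array([5, -8, 3, -2, 7, -6, 4, -1, -2], dtype=float)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Least-squares residuals always average (essentially) zero
mean_resid = residuals.mean()

# Shapiro-Wilk test: a common check of residual normality
# (a high p-value means no evidence against normality)
_, p_normal = stats.shapiro(residuals)
```

Homoscedasticity and independence are usually judged from residual-versus-fitted plots and the study design, respectively, rather than from a single statistic.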


1.9. Prediction with Confidence Intervals


Using the standard error of estimate and confidence intervals, we predict the price of a house with a 200 m² lot: 95% CI = [R$ 209,000, R$ 469,000].

This is far narrower than using the mean alone.
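The textbook formula behind such an interval can be computed by hand. The sample below is illustrative (the original nine houses are not reproduced), but the steps — standard error of estimate, then the 95% prediction interval for a new observation at X = 200 — follow the standard simple-regression formulas:

```python
import numpy as np
from scipy import stats

# Illustrative sample (made up): lot size (m²) and price (thousands of R$)
x = np.array([60, 100, 140, 180, 220, 260, 300, 340, 380], dtype=float)
y = np.array([210, 250, 280, 320, 355, 400, 430, 475, 510], dtype=float)

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))  # standard error of estimate

x0 = 200.0
y0_hat = b0 + b1 * x0  # point prediction for a 200 m² lot

# 95% prediction interval for a single new observation at x0
sxx = np.sum((x - x.mean()) ** 2)
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```

Because the regression uses the information in X, this interval is much tighter than the one based on the mean alone.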

1.10. Links to Modern Methods


What we’ve learned here underpins many current models:

·       Multiple regression: adds more explanatory variables.

·       Regularized regression (Ridge, Lasso): penalizes coefficients to prevent overfitting when many variables are involved.

·       Decision trees and neural networks: replace the straight line with more complex functions, but still adjust parameters to minimize error.


In machine learning, this process is called training the model.
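That training process can be shown in miniature: instead of the closed-form least-squares solution, we can fit the same line by gradient descent on the mean squared error, which is exactly how more complex models are trained. The data is synthetic, chosen so the true line is known:

```python
import numpy as np

# Synthetic data: the true relationship is y = 3 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x

b0, b1 = 0.0, 0.0   # start from an arbitrary guess
lr = 0.02           # learning rate

for _ in range(5000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Gradients of the mean squared error w.r.t. intercept and slope
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()
```

After enough iterations the parameters converge to the least-squares line — the same answer the closed-form formulas give, reached by iterative error minimization.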

Summary


·       The mean is a simple but often imprecise predictor.

·       If Y depends on X, regression improves predictions.

·       Simple linear regression fits a straight line to describe the relationship between two variables.

·       We can estimate both point predictions and confidence intervals.

·       This is the foundation for more advanced modeling techniques.

