Sunday, August 10, 2025

— Simple Linear Regression: A Fundamental Tool for Data Analysis and Machine Learning

 

1.1. Why Study Regression Today


Even in the era of machine learning — with complex algorithms like deep neural networks and gradient boosting dominating the conversation — linear regression remains indispensable. It is:

·       A baseline model to compare against more sophisticated techniques.

·       Interpretable, allowing us to understand how each variable influences the result.

·       A conceptual foundation for modern methods such as penalized regression (Lasso, Ridge) and generalized linear models.

·       An essential tool for communicating results clearly to non-specialists.

Data Science Note: In many projects, we test a simple linear regression as a baseline before moving on to complex models. If it already delivers good accuracy, more elaborate approaches may be unnecessary.
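The baseline idea above can be sketched in a few lines. The data and variable names here are purely illustrative (not from any real project), just to show the workflow of fitting a simple linear baseline and checking how much variance it already explains:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data (made up): lot size (m²) vs. price (thousands of R$)
X = np.array([[100], [150], [200], [250], [300]])
y = np.array([240, 290, 340, 390, 440])

# Fit the simple linear baseline
baseline = LinearRegression().fit(X, y)
r2 = baseline.score(X, y)  # R² of the baseline on the training data

# If the baseline already explains most of the variance,
# a more elaborate model may add little value.
print(f"Baseline R²: {r2:.3f}")
```

If this score is already high for the problem at hand, the note's advice applies: start simple before reaching for gradient boosting or neural networks.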

1.2. Presenting the Problem


A manager should know that strategic decisions need to be backed by data. But can they interpret a statistical analysis or judge whether a model makes sense?

You don’t have to perform long calculations by hand — tools like Excel, R, and Python handle that. What you do need is to understand the reasoning behind the numbers.

Basic Concepts:

·       Dependent variable (target, response): what we want to understand or predict.

·       Independent variables (features, explanatory): factors we believe influence the dependent variable.

Simple regression: uses a single independent variable.

Multiple regression: uses two or more independent variables.

Practical examples:

·       Predicting weight of adults based only on height → simple regression.

·       Predicting weight of children based on age and height → multiple regression.

·       Studying the effect of sedentary lifestyle, smoking, and diet on the risk of heart disease → multiple regression.

1.3. How Regression Is Used


Organizations apply regression to:

·       Explain phenomena: “Why did customer service calls drop last month?”

·       Predict the future: “What will sales be in the next quarter?”

·       Support decisions: “Should we keep this marketing campaign?”

Regression answers key questions:

·       Which variables have the greatest impact?

·       Which variables can be ignored?

·       How much of the variation in the dependent variable does the model explain?
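The last question is answered by the coefficient of determination, R². A minimal sketch with made-up data shows how it is computed from its definition, R² = 1 − SSE/SST:

```python
import numpy as np

# Illustrative data (made up): predictor x and response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a line y ≈ b0 + b1·x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

# R² = 1 - SSE/SST: the fraction of the variation in y that the model explains
sse = np.sum((y - y_hat) ** 2)   # unexplained variation
sst = np.sum((y - np.mean(y)) ** 2)  # total variation around the mean
r2 = 1 - sse / sst
```

An R² near 1 means the independent variable accounts for almost all the variation; an R² near 0 means it adds little beyond the mean.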

1.4. Example: Predicting House Prices


Imagine you move to a new city and want to buy a house. A sample of nine houses for sale provides the following data:

 

1.5. Prediction Using the Mean


The simplest predictor is the mean of the observed house prices. It's a starting point, but far from precise: the real price could vary widely.

Limitation of the Mean as Predictor: In our example, the 95% confidence interval for the mean price ranged from R$ 95,000 to R$ 586,000 — too broad for practical decision-making.
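The mean and its confidence interval can be computed directly. The nine prices below are illustrative stand-ins (the original sample is not reproduced here); the mechanics are the same:

```python
import numpy as np
from scipy import stats

# Illustrative prices in thousands of R$ (made up -- the original
# nine-house sample is not reproduced in this sketch)
prices = np.array([180.0, 220, 260, 300, 340, 380, 420, 460, 505])

n = len(prices)
mean = prices.mean()
sem = prices.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the mean, using the t distribution (n-1 df)
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * sem, mean + t_crit * sem)
```

With only nine observations and wide price dispersion, the interval comes out broad, which is exactly the limitation noted above.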

1.6. Prediction Using an Independent Variable


To improve accuracy, we choose the independent variable most strongly correlated with price: in this example, lot size (X2).

1.7. Fitting the Regression Line

The fitted equation is: Ŷ = 145.42 + 0.9674 · X2.

The slope indicates the average increase in price (in thousands of R$) for each additional square meter of lot size.
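Using the fitted equation is just arithmetic. Plugging in a 200 m² lot gives the point prediction (in thousands of R$):

```python
# Coefficients of the fitted equation from the text: Ŷ = 145.42 + 0.9674 · X2
b0, b1 = 145.42, 0.9674

def predict_price(lot_m2: float) -> float:
    """Point prediction of house price (thousands of R$) from lot size (m²)."""
    return b0 + b1 * lot_m2

y_hat = predict_price(200)  # 145.42 + 0.9674 * 200 = 338.90
```

So a 200 m² lot predicts a price of about R$ 338,900, which is the center of the interval computed later.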

1.8. Basic Assumptions


·       Linearity: the average relationship between Y and X is linear.

·       Homoscedasticity: residual variance is constant.

·       Independence: observations don’t influence each other.

·       Normality of residuals: important for significance testing.
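Some of these assumptions can be checked numerically from the residuals. A small sketch with made-up data (any real analysis would also inspect residual plots):

```python
import numpy as np
from scipy import stats

# Illustrative data (made up) to show quick residual diagnostics
x = np.array([50, 80, 120, 160, 200, 240, 280, 320, 360], dtype=float)
y = 145.0 + 0.97 * x + np.array([5, -8, 3, -2, 7, -6, 4, -1, -2], dtype=float)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Least-squares residuals always average (essentially) zero
mean_resid = residuals.mean()

# Shapiro-Wilk test: a common check of residual normality
# (a high p-value means no evidence against normality)
_, p_normal = stats.shapiro(residuals)
```

Homoscedasticity and independence are usually judged from residual-versus-fitted plots and the study design, respectively, rather than from a single statistic.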


1.9. Prediction with Confidence Intervals


Using the standard error of estimate and confidence intervals, we predict the price of a house with a 200 m² lot: 95% CI = [R$ 209,000, R$ 469,000].

This is far narrower than using the mean alone.
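The textbook formula behind such an interval can be computed by hand. The sample below is illustrative (the original nine houses are not reproduced), but the steps — standard error of estimate, then the 95% prediction interval for a new observation at X = 200 — follow the standard simple-regression formulas:

```python
import numpy as np
from scipy import stats

# Illustrative sample (made up): lot size (m²) and price (thousands of R$)
x = np.array([60, 100, 140, 180, 220, 260, 300, 340, 380], dtype=float)
y = np.array([210, 250, 280, 320, 355, 400, 430, 475, 510], dtype=float)

n = len(x)
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))  # standard error of estimate

x0 = 200.0
y0_hat = b0 + b1 * x0  # point prediction for a 200 m² lot

# 95% prediction interval for a single new observation at x0
sxx = np.sum((x - x.mean()) ** 2)
se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
t_crit = stats.t.ppf(0.975, df=n - 2)
pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
```

Because the regression uses the information in X, this interval is much tighter than the one based on the mean alone.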

1.10. Links to Modern Methods


What we’ve learned here underpins many current models:

·       Multiple regression: adds more explanatory variables.

·       Regularized regression (Ridge, Lasso): penalizes coefficients to prevent overfitting when many variables are involved.

·       Decision trees and neural networks: replace the straight line with more complex functions, but still adjust parameters to minimize error.


In machine learning, this process is called training the model.
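That training process can be shown in miniature: instead of the closed-form least-squares solution, we can fit the same line by gradient descent on the mean squared error, which is exactly how more complex models are trained. The data is synthetic, chosen so the true line is known:

```python
import numpy as np

# Synthetic data: the true relationship is y = 3 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 + 2.0 * x

b0, b1 = 0.0, 0.0   # start from an arbitrary guess
lr = 0.02           # learning rate

for _ in range(5000):
    y_hat = b0 + b1 * x
    error = y_hat - y
    # Gradients of the mean squared error w.r.t. intercept and slope
    b0 -= lr * 2 * error.mean()
    b1 -= lr * 2 * (error * x).mean()
```

After enough iterations the parameters converge to the least-squares line — the same answer the closed-form formulas give, reached by iterative error minimization.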

Summary


·       The mean is a simple but often imprecise predictor.

·       If Y depends on X, regression improves predictions.

·       Simple linear regression fits a straight line to describe the relationship between two variables.

·       We can estimate both point predictions and confidence intervals.

·       This is the foundation for more advanced modeling techniques.

