Linear Regression is a fundamental algorithm in machine learning. Its goal is to model a linear relationship between a dependent variable (what you want to predict) and one or more independent variables (the features).
The core idea is to find the "best-fit" line (a hyperplane when there is more than one feature) that describes the data. This relationship is written as:
y = β_0 + β_1 x_1 + ⋯ + β_n x_n + ϵ
Where:
- y is the target value we want to predict.
- x_1,…,x_n are the feature values.
- β_1,…,β_n are the model coefficients (weights). Each coefficient tells you how much the prediction changes for a one-unit change in its feature, holding the other features fixed.
- β_0 is the intercept (the value of y when all features are zero).
- ϵ is the error term.
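To make the notation concrete, here is a minimal sketch that plugs made-up numbers into the equation for a model with two features (the coefficient and feature values are purely illustrative, not fitted):
Python
import numpy as np

# Hypothetical (not fitted) parameters for a two-feature model
beta_0 = 5.0                   # intercept
betas = np.array([2.0, -1.0])  # beta_1, beta_2
x = np.array([3.0, 4.0])       # one observation's feature values x_1, x_2

# y_hat = beta_0 + beta_1*x_1 + beta_2*x_2 (the error term is unobserved)
y_hat = beta_0 + np.dot(betas, x)
print(y_hat)  # 5 + 2*3 - 1*4 = 7.0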
Ordinary Least Squares (OLS)
How do we find the "best-fit" line? OLS is the most common method. It finds the coefficients (β values) that minimize the sum of the squared differences between the actual values and the predicted values. Each of these differences is called a residual.
In essence, OLS penalizes larger errors more heavily, trying to draw a line that is as close as possible to all data points simultaneously.
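To see what "minimizing the sum of squared residuals" looks like numerically, here is a minimal sketch using NumPy's least-squares solver on a tiny made-up dataset (the data values are illustrative only):
Python
import numpy as np

# Tiny illustrative dataset
x_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_toy = np.array([2.1, 4.0, 6.2, 7.9])

# Design matrix with a column of ones so the intercept beta_0 is estimated too
A = np.column_stack([np.ones_like(x_toy), x_toy])

# lstsq returns the coefficients that minimize the sum of squared residuals
beta, residual_ss, _, _ = np.linalg.lstsq(A, y_toy, rcond=None)
print("beta_0, beta_1:", beta)
print("sum of squared residuals:", residual_ss)
In practice you rarely solve this by hand; scikit-learn's LinearRegression does the fitting for you, as in the example below.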
Python
# Python code with scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# Sample Data: Predicting house price based on square footage
X = np.array([[1400], [1600], [1700], [1875], [1100], [1550]]) # Square footage
y = np.array([245000, 312000, 279000, 308000, 199000, 219000]) # Price
# Create and train the model
model = LinearRegression()
model.fit(X, y)
# Predict a new value
new_sqft = [[1750]]
predicted_price = model.predict(new_sqft)
print(f"Intercept (beta_0): {model.intercept_}")
print(f"Coefficient for sqft (beta_1): {model.coef_[0]}")
print(f"Predicted price for {new_sqft[0][0]} sqft: ${predicted_price[0]:,.2f}")
The Problem: Overfitting
Sometimes, a model learns the training data too well. It captures not only the underlying patterns but also the noise. This is called overfitting. An overfit model performs great on the data it was trained on but fails to generalize to new, unseen data. In linear regression, this often happens when you have many features, and the model assigns very large coefficients to some of them.
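A quick way to see overfitting in code is to give the model far more flexibility than the data supports, for example with high-degree polynomial features. This is a sketch on made-up one-dimensional data (the degree, sample size, and seeds are arbitrary), reusing the imports from the snippet above:
Python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_demo = np.sort(rng.uniform(0, 1, 12)).reshape(-1, 1)
y_demo = 3 * x_demo.ravel() + rng.normal(0, 0.2, 12)  # truly linear relationship plus noise

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.33, random_state=0)

# A degree-9 polynomial has enough flexibility to chase the noise in 8 training points
wiggly = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
wiggly.fit(x_tr, y_tr)

print("Train MSE:", mean_squared_error(y_tr, wiggly.predict(x_tr)))  # near zero
print("Test MSE:", mean_squared_error(y_te, wiggly.predict(x_te)))   # typically much larger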
The Solution: Regularization
Regularization is a technique used to prevent overfitting by adding a penalty term to the cost function. This penalty discourages the model from assigning overly large coefficients. The two most common types are:
- Ridge Regression (L2 Regularization): Adds a penalty proportional to the sum of the squared coefficients. It shrinks large coefficients towards zero but does not set them exactly to zero.
- Lasso Regression (L1 Regularization): Adds a penalty proportional to the absolute value of the magnitude of the coefficients. Lasso can shrink some coefficients all the way to zero, effectively performing feature selection.
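Schematically, both methods minimize the usual OLS loss plus a penalty on the coefficients (the exact scaling of each term varies between implementations):
- Ridge: Σ (y_i − ŷ_i)² + α · Σ β_j²
- Lasso: Σ (y_i − ŷ_i)² + α · Σ |β_j|
Here ŷ_i is the model's prediction for observation i, and α ≥ 0 controls the regularization strength: α = 0 recovers plain OLS, and larger α shrinks the coefficients more.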
Python
from sklearn.linear_model import Ridge, Lasso
# Ridge Regression
ridge_model = Ridge(alpha=1.0) # alpha is the regularization strength
ridge_model.fit(X, y)
print(f"\nRidge Coefficient: {ridge_model.coef_[0]}")
# Lasso Regression
lasso_model = Lasso(alpha=1.0) # alpha is the regularization strength
lasso_model.fit(X, y)
print(f"Lasso Coefficient: {lasso_model.coef_[0]}")