Fitting a linear regression model is easy. With one line of code, you can get coefficients, an R-squared value, and p-values. However, for those results to be trustworthy, your model must satisfy a set of assumptions. Violating them can lead to misleading or completely incorrect conclusions. Checking whether the assumptions hold is called regression diagnostics.
A Quick Recap: The Linear Regression Model
The goal of linear regression is to model the relationship between a dependent variable (Y) and one or more independent variables (X) with a straight line. The equation for a simple linear regression is:
Y = β_0 + β_1 X + ϵ
Where:
- β_0 is the intercept.
- β_1 is the slope (the effect of X on Y).
- ϵ (epsilon) is the error term. It represents the random, unexplained variation in Y.
The assumptions of linear regression are not about the variables themselves, but about this error term. Since we can't observe the true errors, we check the assumptions using the residuals, which are the differences between the observed values and the values predicted by our model (e = Y − Ŷ).
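To make the residual definition concrete, here is a minimal sketch that fits a line with NumPy and computes e = Y − Ŷ by hand. The five data points are invented for illustration and are not from any dataset used in this article.

```python
import numpy as np

# Hypothetical toy data: five (x, y) observations invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares estimates of the slope and intercept.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(x, y, deg=1)

y_hat = intercept + slope * x   # fitted values
residuals = y - y_hat           # e = Y - Y_hat

print("slope:", slope, "intercept:", intercept)
print("residuals:", residuals)
```

One consequence worth knowing: when the model includes an intercept, least-squares residuals sum to zero by construction, so a residual mean of zero is not evidence that any assumption holds.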
The Four Key Assumptions (LINE)
A useful mnemonic for the four main assumptions is LINE:
1. L - Linearity
Assumption: The underlying relationship between the independent variable(s) X and the dependent variable Y is linear.
Why it matters: If the true relationship is curved (e.g., quadratic), fitting a straight line will result in a poor model that makes systematic errors.
How to diagnose:
- Scatter Plot: Plot the independent variable (X) against the dependent variable (Y). The points should form a rough line, not a clear curve.
- Residuals vs. Fitted Plot: Plot the model's predicted values (Ŷ) on the x-axis and the residuals (e) on the y-axis. The points should be randomly scattered around the horizontal line at zero. A curved pattern (like a U-shape) indicates a violation.
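The U-shape described above can also be seen numerically. The sketch below is an illustration on simulated data (the quadratic relationship, noise level, and three-bin summary are all assumptions made for the example): it fits a straight line to curved data and averages the residuals in the low, middle, and high thirds of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data with a truly quadratic relationship.
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(scale=0.5, size=x.size)

# Fit a straight line anyway and compute its residuals.
slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Split the residuals into three equal-sized bins ordered by fitted value
# and average each bin. Under linearity all three means sit near zero.
order = np.argsort(fitted)
low, mid, high = np.array_split(residuals[order], 3)
print("low:", low.mean(), "middle:", mid.mean(), "high:", high.mean())
```

Here the outer bins average well above zero and the middle bin well below it, which is the numeric counterpart of the U-shape in the plot. A binned summary like this is a rough aid, not a replacement for looking at the plot itself.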
2. I - Independence of Errors
Assumption: The errors are independent of each other. The error for one observation should not provide any information about the error for another observation.
Why it matters: Correlated errors can cause us to be overly confident in our model's significance (i.e., p-values will be artificially small). The assumption is most often violated in time-series data, where an event at one point in time can affect subsequent points (a condition called autocorrelation).
How to diagnose:
- Study Design: This is the best check. If data was collected via random sampling, the assumption is likely met.
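- Durbin-Watson Test: A formal statistical test for first-order autocorrelation in the residuals, taken in the order the data was collected. The statistic ranges from 0 to 4. A value around 2 indicates no autocorrelation; values toward 0 indicate positive autocorrelation and values toward 4 indicate negative autocorrelation. As a rule of thumb, values below 1.5 or above 2.5 are cause for concern.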
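The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below implements that definition and applies it to two simulated residual series. The helper name `durbin_watson_stat` and the AR(1) coefficient of 0.8 are choices made for this illustration.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(42)
n = 2000

# Independent errors: the statistic should land close to 2.
independent = rng.normal(size=n)

# Positively autocorrelated errors from a hypothetical AR(1) process,
# e_t = 0.8 * e_{t-1} + noise: the statistic should fall well below 2.
noise = rng.normal(size=n)
autocorrelated = np.empty(n)
autocorrelated[0] = noise[0]
for t in range(1, n):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + noise[t]

dw_independent = durbin_watson_stat(independent)
dw_autocorrelated = durbin_watson_stat(autocorrelated)
print("independent:", dw_independent)
print("autocorrelated:", dw_autocorrelated)
```

The statistic is roughly 2(1 − r), where r is the lag-1 correlation of the residuals, which is why uncorrelated residuals give a value near 2. Because it compares neighbours, the result depends on the row order and only means something when that order is meaningful.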
3. N - Normality of Errors
Assumption: The errors of the model are normally distributed.
Why it matters: The validity of hypothesis tests (t-tests for coefficients, F-test for the model) and confidence intervals relies on this assumption. Note: Linear regression is fairly robust to violations of this assumption, especially with large sample sizes (thanks to the central limit theorem).
How to diagnose:
- Histogram of Residuals: Should look roughly like a symmetric bell curve.
- Q-Q (Quantile-Quantile) Plot: This is usually the most informative check. It plots the quantiles of your residuals against the theoretical quantiles of a normal distribution. If the residuals are normal, the points will fall closely along the reference line (a 45-degree line when the residuals are standardized).
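To show what a Q-Q plot actually compares, the following sketch builds the plotted points by hand: sorted, standardized residuals against normal quantiles. The helper `qq_points`, the plotting positions (i − 0.5)/n (one common convention among several), and both simulated residual series are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    e = np.sort(np.asarray(residuals, dtype=float))
    n = e.size
    # Standardize so the points can be compared with the 45-degree line.
    sample = (e - e.mean()) / e.std(ddof=1)
    # Plotting positions (i - 0.5) / n for i = 1..n.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(float(p)) for p in probs])
    return theoretical, sample

rng = np.random.default_rng(1)
normal_resid = rng.normal(size=500)
skewed_resid = rng.exponential(size=500)  # hypothetical right-skewed errors

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    theo, samp = qq_points(resid)
    # Largest vertical gap between the points and the line y = x.
    print(name, "max gap from 45-degree line:", np.max(np.abs(samp - theo)))
```

For the skewed series the points bend away from the line in both tails, so the gap is large. For the normal series the points stay close to the line, with some wobble at the extremes that is expected even when the assumption holds.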
4. E - Equal Variance of Errors (Homoscedasticity)
Assumption: The errors have constant variance at every level of the independent variable(s) X. This is called homoscedasticity. The opposite is heteroscedasticity, where the spread of the errors changes as X changes.
Why it matters: If the variance is not constant, the coefficient estimates themselves remain unbiased, but their standard errors will be biased, leading to incorrect conclusions from hypothesis tests.
How to diagnose:
- Residuals vs. Fitted Plot: This is the same plot used to check linearity. For homoscedasticity, the vertical spread of the points should be roughly the same across the entire plot. A funnel or fan shape, where the spread increases or decreases as the fitted values change, is a classic sign of heteroscedasticity.
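The funnel shape can be summarized with a rough number as well. The sketch below simulates data whose error spread grows with X (an assumption made for the example), fits a line, and compares the residual standard deviation in the upper and lower halves of the fitted values. This is an informal spread comparison for illustration, not a formal hypothesis test.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data whose error standard deviation grows with x.
x = np.linspace(1, 10, 400)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5 * x)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
residuals = y - fitted

# Compare the residual spread in the lower and upper halves
# of the fitted values. A ratio near 1 suggests constant variance.
order = np.argsort(fitted)
lower, upper = np.array_split(residuals[order], 2)
spread_ratio = upper.std(ddof=1) / lower.std(ddof=1)
print("upper/lower residual spread ratio:", spread_ratio)
```

A ratio well above 1, as here, matches a funnel that opens to the right; a ratio well below 1 would match one that opens to the left. A pattern where the spread is largest in the middle would be missed by a two-way split, which is one reason the plot remains the primary check.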
A Full Diagnostic Example in Python
We'll use the statsmodels library, which is better suited to statistical inference and diagnostics than scikit-learn. Here, scikit-learn is used only to load the example dataset.
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib.pyplot as plt
import seaborn as sns
# Set plotting style
sns.set_theme(style="whitegrid")
# --- 1. Load Data and Fit Model ---
# Load the California Housing dataset (a classic example)
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name=housing.target_names[0])
df = pd.concat([y, X], axis=1)
# We'll use MedInc (median income), HouseAge, and AveRooms to predict MedHouseVal
X_subset = df[['MedInc', 'HouseAge', 'AveRooms']]
y = df['MedHouseVal']
X_subset_const = sm.add_constant(X_subset) # Add an intercept
# Fit the Ordinary Least Squares (OLS) model
model = sm.OLS(y, X_subset_const).fit()
print(model.summary())
# --- 2. Create Residuals ---
fitted_vals = model.predict()
residuals = model.resid
# --- 3. Run Diagnostics ---
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Regression Diagnostics', fontsize=20)
# A. Linearity and Homoscedasticity (Residuals vs. Fitted)
sns.residplot(x=fitted_vals, y=residuals, lowess=True,
              line_kws={'color': 'red', 'lw': 2}, ax=ax1)
ax1.set_title('Residuals vs. Fitted Plot', fontsize=14)
ax1.set_xlabel('Fitted Values')
ax1.set_ylabel('Residuals')
# B. Normality of Errors (Q-Q Plot)
sm.qqplot(residuals, line='45', fit=True, ax=ax2)
ax2.set_title('Normal Q-Q Plot', fontsize=14)
# C. Scale-Location Plot (another homoscedasticity check)
# Standardize the residuals first so the y-axis matches its label
std_resid = model.get_influence().resid_studentized_internal
sqrt_abs_resid = np.sqrt(np.abs(std_resid))
sns.regplot(x=fitted_vals, y=sqrt_abs_resid, lowess=True,
            line_kws={'color': 'red', 'lw': 2}, ax=ax3, ci=None)
ax3.set_title('Scale-Location Plot', fontsize=14)
ax3.set_xlabel('Fitted Values')
ax3.set_ylabel('Square Root of |Standardized Residuals|')
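# D. Independence of Errors (Durbin-Watson)
# The summary() output already contains the Durbin-Watson statistic;
# we recompute it here so it can be shown in the figure.
# Note: the statistic depends on row order, so it is only meaningful
# when the rows have a natural ordering (e.g., time).
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(residuals)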
ax4.text(0.1, 0.5, f'Durbin-Watson Statistic: {dw_stat:.2f}\n(Values near 2 are good)', fontsize=14)
ax4.set_title('Independence of Errors', fontsize=14)
ax4.axis('off')
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()
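# E. Check for Multicollinearity (VIF)
# The constant stays in the design matrix so the predictors' VIFs are computed correctly;
# the VIF reported for 'const' itself is not meaningful and can be ignored.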
vif_data = pd.DataFrame()
vif_data["feature"] = X_subset_const.columns
vif_data["VIF"] = [variance_inflation_factor(X_subset_const.values, i) for i in range(X_subset_const.shape[1])]
print("\nVariance Inflation Factor (VIF):")
print(vif_data)
The output of this code is the model summary, a four-panel diagnostic figure, and a table of VIF values. You would examine the plots against the criteria described for each assumption, and the VIF values for signs of multicollinearity, to assess whether the assumptions are met. If they aren't (e.g., you see a clear curve in the residuals plot), you might need to transform your variables (e.g., take the log of Y) or consider a more complex model.
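As an illustration of the log-transform remedy, the sketch below simulates a response that grows exponentially in X with multiplicative noise (an assumption chosen so that the log transform is the right fix), then compares binned residual means for a line fitted to Y and a line fitted to log(Y). The helper `binned_residual_means` is a name invented for this example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: Y grows exponentially in X with multiplicative noise.
x = np.linspace(0, 4, 300)
y = np.exp(0.5 + 0.8 * x + rng.normal(scale=0.2, size=x.size))

def binned_residual_means(x, target):
    """Fit a line to target vs. x; return mean residuals in three x-bins."""
    slope, intercept = np.polyfit(x, target, deg=1)
    residuals = target - (intercept + slope * x)
    # x is already sorted, so splitting in order gives low/middle/high bins.
    return [float(chunk.mean()) for chunk in np.array_split(residuals, 3)]

raw_bins = binned_residual_means(x, y)
log_bins = binned_residual_means(x, np.log(y))
print("raw Y: ", raw_bins)
print("log(Y):", log_bins)
```

On the raw scale the middle bin averages well below zero, the signature of a curved residual pattern, while on the log scale all three bin means stay close to zero. On real data a transform is not guaranteed to help, so the diagnostics should be rerun on the refitted model.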