Building a model is only half the battle. How do you know if it's actually making good predictions? This is where evaluation metrics come in. Choosing the right metric depends on your problem type: regression or classification.

Metrics for Regression

Regression models predict continuous values (e.g., price, temperature).

  • Mean Absolute Error (MAE): The average of the absolute differences between the actual and predicted values. It's easy to understand because it's in the same units as the target variable.
  • MAE = (1/n) · Σ |yᵢ − ŷᵢ|
  • Mean Squared Error (MSE): The average of the squared differences. By squaring the error, it penalizes larger mistakes much more heavily than smaller ones.
  • MSE = (1/n) · Σ (yᵢ − ŷᵢ)²
  • R-squared (R²): The coefficient of determination. It tells you the proportion of the variance in the target variable that is predictable from the features. An R² of 0.8 means that 80% of the variation in the target can be explained by the model. It typically ranges from 0 to 1, with higher values being better, though it can be negative when a model fits worse than simply predicting the mean.
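All three metrics are available in scikit-learn. A quick sketch using hypothetical house prices (in thousands of dollars):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted house prices ($1000s)
y_true = np.array([200, 250, 300, 350, 400])
y_pred = np.array([210, 240, 310, 330, 420])

print(mean_absolute_error(y_true, y_pred))  # 14.0
print(mean_squared_error(y_true, y_pred))   # 220.0
print(r2_score(y_true, y_pred))             # ~0.956
```

Notice how MSE (220) dwarfs MAE (14) here: the two 20-unit errors contribute 400 each once squared, illustrating how MSE punishes larger mistakes.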

Metrics for Classification

Classification models predict discrete categories (e.g., Spam/Not Spam, Cat/Dog). Here, simple accuracy can be misleading, especially with imbalanced datasets.

For example, if you have a dataset where 99% of emails are not spam, a lazy model that predicts "not spam" every time will have 99% accuracy but is completely useless. To get a better picture, we use the Confusion Matrix.
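The accuracy illusion is easy to reproduce. A small sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 1,000 hypothetical emails: 990 not-spam (0), 10 spam (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "lazy" model that predicts not-spam for everything
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.99, yet it catches zero spam
```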

  • True Positive (TP): Actual is Positive, Predicted is Positive. (Correctly identified spam).
  • True Negative (TN): Actual is Negative, Predicted is Negative. (Correctly identified not-spam).
  • False Positive (FP): Actual is Negative, Predicted is Positive. (Type I Error. Not-spam email went to spam).
  • False Negative (FN): Actual is Positive, Predicted is Negative. (Type II Error. Spam email was missed).
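scikit-learn's confusion_matrix returns these four counts; for binary labels, calling .ravel() on the result unpacks them in the order TN, FP, FN, TP. A sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=4, FP=1, FN=1, TP=2
```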

From this, we derive key metrics:

  • Precision: Of all the positive predictions, how many were actually correct? It measures a model's exactness.
  • Precision = TP / (TP + FP)
  • Recall (Sensitivity): Of all the actual positives, how many did the model find? It measures a model's completeness.
  • Recall = TP / (TP + FN)
  • F1-Score: The harmonic mean of Precision and Recall. It provides a single score that balances both concerns. Use it when you care equally about Precision and Recall.
  • F1 = 2 · (Precision · Recall) / (Precision + Recall)
  • AUC-ROC Curve: The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate at various decision thresholds. The Area Under the Curve (AUC) represents the model's ability to distinguish between the positive and negative classes. An AUC of 1.0 is a perfect classifier, while 0.5 is no better than random guessing.

Python


from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probabilities for the positive class

# Print a full report
print("Classification Report:\n")
print(classification_report(y_test, y_pred))

# Calculate AUC
auc_score = roc_auc_score(y_test, y_pred_proba)
print(f"\nAUC-ROC Score: {auc_score:.4f}")