Most classification models don't just output a final class label (like "Spam" or "Not Spam"). They first predict a probability—a score between 0 and 1 representing the model's confidence. By default, we use a decision threshold of 0.5 to convert this probability into a class label. But is this always the best approach?

Model Calibration: Are the Probabilities Meaningful?

A model is considered well-calibrated if its predicted probabilities reflect the true likelihood of an event. For example, if you gather all the instances where your model predicted a probability between 0.7 and 0.8, you would expect the positive class to actually occur in about 75% of those instances.

Some models, like Logistic Regression, tend to be well-calibrated out-of-the-box. Others, like Support Vector Machines (SVMs) or Naive Bayes, can produce distorted probabilities.
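When a model's probabilities are distorted, scikit-learn's `CalibratedClassifierCV` can wrap it and learn a corrective mapping from scores to probabilities (Platt scaling with `method="sigmoid"`, or isotonic regression). A minimal sketch on synthetic data, using a `LinearSVC` (which has no `predict_proba` of its own); the parameter choices here are illustrative, not prescriptive:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Wrapping the SVM fits it on cross-validation folds and learns a sigmoid
# (Platt) mapping from decision scores to calibrated probabilities
svm = LinearSVC(dual=False)
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)[:, 1]
print(proba[:5])
```

The wrapped model now exposes `predict_proba`, so everything below (calibration curves, threshold tuning) applies to it unchanged.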

We can visualize this using a calibration curve (also called a reliability diagram). Predictions are grouped into probability bins, and for each bin the observed fraction of positives is plotted against the mean predicted probability. For a perfectly calibrated model, the points fall on the diagonal line y = x.
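scikit-learn's `calibration_curve` computes the points of this diagram directly. A brief sketch on synthetic data (the choice of 10 bins is arbitrary):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# prob_pred[i] is the mean predicted probability in bin i and prob_true[i]
# the observed fraction of positives in that bin; for a well-calibrated
# model the pairs lie near the diagonal
prob_true, prob_pred = calibration_curve(y_test, proba, n_bins=10)
for observed, predicted in zip(prob_true, prob_pred):
    print(f"predicted ~{predicted:.2f} -> observed {observed:.2f}")
```

Plotting `prob_pred` against `prob_true` (plus the diagonal for reference) gives the reliability diagram described above.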

Threshold Tuning: Aligning with Business Goals

The default decision threshold of 0.5 treats all types of errors equally. But in the real world, the cost of a False Positive is often very different from the cost of a False Negative.

Example: Credit Card Fraud Detection

  • False Negative: A fraudulent transaction is missed. Cost: The bank loses the full amount of the fraudulent transaction. (Very High Cost)
  • False Positive: A legitimate transaction is flagged as fraud. Cost: A customer is inconvenienced. (Low Cost)

In this case, we want to catch as much fraud as possible, even if it means flagging a few legitimate transactions by mistake. We need to prioritize high recall. We can achieve this by lowering the decision threshold from 0.5 to, say, 0.2. This makes the model more sensitive, classifying more transactions as fraudulent.

Conversely, in a scenario like email marketing, you might want to be very sure a user is interested before sending them an offer. This would mean prioritizing high precision, which could be achieved by raising the threshold.
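Applying a non-default threshold is just a comparison against the output of `predict_proba`. A sketch on synthetic imbalanced data (the cutoffs 0.2, 0.5, and 0.8 are illustrative); lowering the threshold can only keep or grow the set of flagged positives, so recall never decreases:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

for threshold in (0.2, 0.5, 0.8):
    # Classify as positive whenever the predicted probability clears the cutoff
    y_pred = (proba >= threshold).astype(int)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Printing the trade-off at a few cutoffs like this makes the precision/recall tension concrete before committing to a threshold.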

By plotting precision and recall curves against different thresholds, you can visually identify the optimal threshold that meets your specific business objective.

Python


from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Train a model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Plot Precision-Recall curve
display = PrecisionRecallDisplay.from_estimator(model, X_test, y_test, name="RandomForest")
_ = display.ax_.set_title("Precision-Recall curve")
plt.show()

# Find threshold that gives a specific recall, e.g., 80%
y_scores = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# recall is returned in decreasing order, so take the last (highest) threshold
# whose recall is still >= 0.80; recall[:-1] aligns index-for-index with thresholds
idx = max(i for i, r in enumerate(recall[:-1]) if r >= 0.80)
print(f"To achieve at least 80% recall, the threshold should be ~{thresholds[idx]:.2f}")