When you build a machine learning model, you need to estimate how well it will perform on new, unseen data. The standard approach is to split your data into a training set and a testing set. But what if, by pure chance, your test set contains only the "easy" examples? Or only the "hard" ones? Your performance estimate could be overly optimistic or pessimistic.

Cross-Validation (CV) is a robust technique that solves this problem by using every part of your data for both training and testing.

K-Fold Cross-Validation

This is the most common type of cross-validation.

The Process:

  1. Split: Shuffle your dataset and split it into 'k' equal-sized parts, or folds. A common choice for 'k' is 5 or 10.
  2. Iterate: For each fold, perform the following:
  • Use that single fold as the hold-out test set.
  • Use the remaining k-1 folds as the training set.
  • Train your model and evaluate its performance on the test set.
  3. Average: You will end up with 'k' different performance scores. The final performance of your model is the average of these scores.

This gives you a much more stable and reliable estimate of your model's performance, as it has been tested on all the data.
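The three steps above can be sketched as an explicit loop (the dataset and model here are illustrative placeholders):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Toy dataset for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Step 1: shuffle and split into k=5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()           # fresh model for each fold
    model.fit(X[train_idx], y[train_idx])  # Step 2: train on the k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))  # test on the held-out fold

# Step 3: average the k scores
print(f"Per-fold accuracy: {np.round(scores, 2)}")
print(f"Average accuracy: {np.mean(scores):.4f}")
```

In practice, `cross_val_score` (used in the full example below) wraps this loop for you; the explicit version just makes the mechanics visible.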

Stratified K-Fold

In a classification problem, especially with an imbalanced dataset, a random K-Fold split might result in some folds having very few (or even zero) samples of the minority class. This would make the evaluation in that fold meaningless.

Stratified K-Fold solves this by ensuring that each fold has the same percentage of samples for each target class as the complete dataset. For classification problems, this is almost always the preferred method over standard K-Fold.

Time Series Split

When working with time series data (e.g., stock prices, daily sales), you cannot use random splits. Doing so would cause data leakage: the model would be trained on data from the future to predict the past, a situation that can never occur in real-world deployment.

Time Series Cross-Validation respects the temporal order. The training set for each split always consists of data points that occurred before the data points in the test set.

Python


from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate sample imbalanced data
X, y = make_classification(n_samples=100, n_features=20, n_informative=10,
                           n_classes=2, weights=[0.9, 0.1], random_state=42)

# Create a model
model = RandomForestClassifier(random_state=42)

# --- Standard K-Fold ---
# Note: This is NOT ideal for this imbalanced data, shown for comparison
kf = KFold(n_splits=5, shuffle=True, random_state=42)
kfold_scores = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')
print(f"Standard K-Fold AUC scores: {kfold_scores.round(2)}")
print(f"Average K-Fold AUC: {kfold_scores.mean():.4f}")

# --- Stratified K-Fold ---
# This is the CORRECT way for this classification problem
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring='roc_auc')
print(f"\nStratified K-Fold AUC scores: {stratified_scores.round(2)}")
print(f"Average Stratified K-Fold AUC: {stratified_scores.mean():.4f}")
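The Time Series Split described earlier is not shown above; a minimal sketch using scikit-learn's `TimeSeriesSplit` on a toy ordered series looks like this:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy ordered series: 12 observations in temporal order (illustrative)
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Split {i}: train={train_idx.tolist()}, test={test_idx.tolist()}")
    # Every training index precedes every test index: no future leakage
    assert train_idx.max() < test_idx.min()
```

Note that the training window grows with each split while the test window always lies strictly after it, which is exactly the temporal-ordering guarantee the text describes.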