Ensemble learning is based on a simple but powerful idea: the wisdom of the crowd. Instead of relying on a single, complex model, we can combine the predictions from several "weaker" models to make a better final prediction. This often leads to models that are more accurate and less prone to overfitting.

Bagging: Bootstrap Aggregating

Bagging is one of the simplest and most effective ensemble techniques. It aims to reduce the variance of a model. Here’s how it works:

  1. Bootstrap: Create many random subsamples of the original training dataset. These samples are drawn with replacement, meaning the same data point can appear multiple times in a single sample. Each sample is the same size as the original dataset.
  2. Train: Fit a separate model (often a Decision Tree) on each of these bootstrap samples.
  3. Aggregate: To make a prediction for a new data point, collect a prediction from each model.
  • For classification, take a majority vote.
  • For regression, take the average of all predictions.

By training on different subsets of the data, each individual tree learns slightly different patterns. Averaging (or voting over) their outputs smooths out much of that tree-to-tree noise and reduces the overall variance, making the final model more stable. A hand-rolled version of this procedure is sketched below.
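To make the three steps concrete, here is a minimal hand-rolled sketch of Bagging: bootstrap samples drawn with NumPy, one scikit-learn Decision Tree per sample, and a majority vote at prediction time. The dataset, number of trees, and variable names are arbitrary choices for illustration; in practice, sklearn.ensemble.BaggingClassifier implements this loop for you.

# A hand-rolled Bagging sketch (illustration only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []
for _ in range(n_trees):
    # 1. Bootstrap: sample row indices with replacement, same size as the training set
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Train: fit one Decision Tree per bootstrap sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train[idx]))

# 3. Aggregate: majority vote across trees (labels here are 0/1)
all_preds = np.array([tree.predict(X_test) for tree in trees])
majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("Bagged accuracy:", (majority_vote == y_test).mean())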

[Image illustrating the Bagging process]

Random Forests: Bagging with a Twist

A Random Forest is an extension of Bagging that is specifically designed for Decision Trees. It adds one extra step to further increase the diversity of the models and reduce correlation between the trees:

At each split in a tree, the algorithm considers only a random subset of the available features.

So, a Random Forest combines two layers of randomness:

  1. Row Sampling (like Bagging): Each tree is built on a different bootstrap sample of the data.
  2. Feature Sampling: At each node, the tree can only choose the best split from a small, random subset of features (e.g., if you have 10 features, it might only consider 3 of them for a particular split).

This extra randomness forces the trees to differ from one another. If one feature is very predictive, plain Bagging tends to produce many similar trees that all split on that feature near the top. A Random Forest prevents this: at many splits the dominant feature simply isn't among the candidates, so the trees are forced to find other useful splits. The resulting trees are less correlated, which makes the ensemble more robust and powerful.
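A toy snippet can make the feature-sampling idea concrete. The code below is not scikit-learn's internal implementation, just an illustration of what "consider only a random subset of features at this split" means; the feature count and the sqrt heuristic are assumptions chosen for the example.

# Toy illustration of per-split feature sampling (not scikit-learn internals)
import numpy as np

rng = np.random.default_rng(0)
n_features = 10
m = max(1, int(np.sqrt(n_features)))  # common heuristic: sqrt(n_features), i.e. 3 of 10 here

# At this node, the tree may only pick its best split among these candidate features:
candidate_features = rng.choice(n_features, size=m, replace=False)
print("Features considered at this split:", np.sort(candidate_features))

The complete example below uses scikit-learn's RandomForestClassifier, which applies both row sampling and per-split feature sampling automatically.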

# Random Forest classification with scikit-learn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Generate some sample data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the Random Forest model
# n_estimators is the number of trees in the forest
# max_features='sqrt' is a common choice for the size of the feature subset
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42, n_jobs=-1)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Random Forest Accuracy: {accuracy:.4f}")

# You can also see which features were most important
import pandas as pd
feature_imp = pd.Series(model.feature_importances_, index=[f'feature_{i}' for i in range(X.shape[1])]).sort_values(ascending=False)
print("\nTop 5 Most Important Features:")
print(feature_imp.head(5))