A Decision Tree is a versatile and highly interpretable machine learning algorithm. It works by splitting the data into smaller and smaller subsets based on feature values, creating a tree-like structure of decisions.

Each part of the tree has a name:

  • Root Node: The top-most node, representing the entire dataset.
  • Decision Node: A node that splits into two or more sub-nodes. It represents a question or a test on a feature.
  • Leaf/Terminal Node: A node that does not split further. It represents the final decision or predicted outcome.

How Does a Tree Decide Where to Split?

The main goal is to make splits that create the "purest" possible child nodes. A pure node is one where all (or most) of the data points belong to the same class. To measure this purity, we use criteria like:

  1. Gini Impurity: Measures the probability of incorrectly classifying a randomly chosen element if it were labeled randomly according to the distribution of labels in the subset. A Gini score of 0 is perfect purity (all elements belong to one class).

     Gini = 1 − Σᵢ (pᵢ)²

     where pᵢ is the proportion of data points in the node that belong to class i, and the sum runs over all C classes.

  2. Entropy: A measure of information or disorder. Entropy is 0 for a pure node and maximal when the classes are perfectly mixed. The algorithm aims to make splits that result in the largest Information Gain (reduction in entropy). A small numeric sketch of both measures follows this list.

     Entropy = −Σᵢ pᵢ log₂(pᵢ)
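
As a quick check of how these two measures behave, here is a minimal sketch (the helper names gini and entropy are just illustrative) that computes both scores from the class counts in a node:

# Minimal sketch: both purity measures computed from class counts in a node
import numpy as np

def gini(counts):
    # Gini impurity: 1 - sum(p_i^2)
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    # Entropy: -sum(p_i * log2(p_i)); zero-probability classes contribute nothing
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([5, 5]), entropy([5, 5]))    # perfectly mixed node: 0.5 and 1.0
print(gini([10, 0]), entropy([10, 0]))  # pure node: 0.0 and 0.0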

For every candidate split on every feature, the algorithm computes the impurity of the resulting child nodes, weighted by how many samples land in each child, and chooses the split that reduces impurity the most, i.e. the one that produces the purest children. The sketch below scores one such candidate split.
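
As an illustrative sketch (not how scikit-learn is implemented internally), the snippet below scores a single candidate split by comparing the parent node's entropy with the weighted entropy of the two children it creates; the toy values are borrowed from the loan example later in this section:

# Sketch: information gain of one candidate split ("feature <= threshold")
import numpy as np

def entropy_of(labels):
    # Entropy of a label array: -sum(p_i * log2(p_i))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_values, labels, threshold):
    # Split the node on "feature <= threshold" and compare parent vs. children
    left = labels[feature_values <= threshold]
    right = labels[feature_values > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy_of(left) + (len(right) / n) * entropy_of(right)
    return entropy_of(labels) - weighted

# Toy node: credit scores and approval labels (same values as the loan example below)
scores = np.array([650, 720, 580, 750, 690, 610])
labels = np.array([0, 1, 0, 1, 1, 0])
print(information_gain(scores, labels, threshold=640))  # gain of the split "Credit Score <= 640"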

The Overfitting Problem and Pruning

If left to grow unchecked, a decision tree can become extremely deep and complex, effectively carving out a separate path for every single data point in the training set. The result is an overfit model: it scores near-perfectly on the training data but fails to generalize to new data.
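
One way to see this effect, sketched below on synthetic data rather than the loan example, is to compare a fully grown tree with a depth-limited one on a held-out test split (exact scores will vary with the random seed):

# Sketch: a fully grown tree typically scores ~100% on its training data
# but often generalizes worse than a depth-limited tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                  # no limits
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)  # pre-pruned

print("deep    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))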

To combat this, we use pruning:

  • Pre-Pruning (Early Stopping): We stop the tree from growing before it becomes too complex by setting constraints such as:
      • max_depth: the maximum number of levels in the tree.
      • min_samples_split: the minimum number of samples a node must contain before it can be split.
      • min_samples_leaf: the minimum number of samples required at a leaf node.
  • Post-Pruning: We first let the tree grow to its full depth and then remove branches that provide little predictive power, as in the cost-complexity pruning sketch after this list.
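
In scikit-learn, one way to post-prune is cost-complexity pruning: grow the full tree, compute the candidate ccp_alpha values along its pruning path, and refit with the alpha that holds up best on held-out data. The sketch below uses synthetic data and picks the alpha on the test split purely for brevity; in practice you would use a separate validation set or cross-validation:

# Sketch: post-pruning via scikit-learn's cost-complexity pruning (ccp_alpha)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow a full tree and collect the effective alphas along its pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit once per alpha and keep the tree that scores best on the held-out split
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.4f}, held-out accuracy={best_score:.3f}")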

Python Example

# Python code with scikit-learn
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
import numpy as np

# Sample Data: Predicting if a loan should be approved
# Features: [Credit Score (out of 850), Income (in 1000s)]
# Label: 0=Rejected, 1=Approved
X = np.array([[650, 50], [720, 80], [580, 30], [750, 120], [690, 90], [610, 40]])
y = np.array([0, 1, 0, 1, 1, 0])

# Create and train the model with pre-pruning
# We set max_depth to 2 to prevent overfitting
model = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=42)
model.fit(X, y)

# Let's visualize the tree
plt.figure(figsize=(10,7))
plot_tree(model, filled=True, feature_names=['Credit Score', 'Income'], class_names=['Rejected', 'Approved'])
plt.show()

# Predict for a new applicant
new_applicant = [[700, 60]]
prediction = model.predict(new_applicant)
print(f"The loan for applicant {new_applicant[0]} is: {'Approved' if prediction[0] == 1 else 'Rejected'}")

Quiz