The Naive Bayes classifier is a probabilistic machine learning algorithm based on the famous Bayes' Theorem. It's particularly useful for problems involving text data, like classifying emails as spam or not spam.
Bayes' Theorem
At its heart, the algorithm uses Bayes' Theorem to calculate the probability of a hypothesis given some evidence. The formula is:
P(A∣B) = P(B∣A) ⋅ P(A) / P(B)
In the context of classification, we can think of this as:
P(class∣features) = P(features∣class) ⋅ P(class) / P(features)
- P(class∣features): The probability of a certain class given the features of a data point (this is what we want to find).
- P(features∣class): The probability of seeing these features, given that the data point belongs to a certain class.
- P(class): The overall probability of this class (its frequency in the training data).
- P(features): The overall probability of seeing these features.
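To make these quantities concrete, here is a tiny worked example in Python. All the numbers (a 30% spam rate, and how often the word "free" appears in each class) are made up purely for illustration:

# Made-up numbers, purely for illustration
p_spam = 0.30                 # P(class): 30% of all emails are spam
p_free_given_spam = 0.60      # P(features | class): "free" appears in 60% of spam
p_free_given_ham = 0.05       # "free" appears in 5% of ham

# P(features): overall probability of seeing "free" (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' Theorem: P(class | features)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # ≈ 0.837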
The "Naive" Assumption
Here's where the "naive" part comes in. To calculate P(features∣class), we would normally have to consider the complex relationships between all the features. The Naive Bayes classifier simplifies this dramatically by making a strong assumption of conditional independence.
The Naive Assumption: All features are independent of each other, given the class.
For text classification, this means the algorithm assumes that the presence of the word "buy" in an email is completely independent of the presence of the word "now", given that the email is spam.
Of course, this is rarely true in reality ("buy" and "now" often appear together). However, the algorithm works surprisingly well in practice despite this unrealistic assumption, especially for text problems. This assumption makes the calculations extremely fast and efficient.
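Written out, the assumption turns a hard joint probability into a simple product: P(w1, w2, …, wn ∣ class) = P(w1∣class) ⋅ P(w2∣class) ⋯ P(wn∣class). Here is a minimal sketch of that factorization, using made-up per-word probabilities:

# Hypothetical per-word probabilities P(word | spam), for illustration only
word_probs_given_spam = {'buy': 0.20, 'now': 0.15}

# Under the naive assumption, P('buy', 'now' | spam) is just the product
p_features_given_spam = 1.0
for word in ('buy', 'now'):
    p_features_given_spam *= word_probs_given_spam[word]

print(p_features_given_spam)  # 0.20 * 0.15 = 0.03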
Example: Spam Filtering
Let's say we want to classify an email containing the words "Viagra" and "free". The algorithm would calculate:
- Probability it's Spam:
  - What's the probability of "Viagra" appearing in spam emails?
  - What's the probability of "free" appearing in spam emails?
  - What's the overall probability of any email being spam?
  - Multiply these probabilities together.
- Probability it's Not Spam (Ham):
  - What's the probability of "Viagra" appearing in ham emails?
  - What's the probability of "free" appearing in ham emails?
  - What's the overall probability of any email being ham?
  - Multiply these probabilities together.
- Compare: The email is assigned to the class with the higher resulting score. The denominator P(features) is identical for both classes, so it can be dropped from the comparison, as the sketch after this list shows.
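Here is a hand-rolled version of that comparison. All the priors and word frequencies below are hypothetical, chosen only to show the mechanics:

# All probabilities below are hypothetical, for illustration only
p_spam, p_ham = 0.4, 0.6                              # class priors P(class)

p_word_given_spam = {'viagra': 0.30, 'free': 0.50}    # P(word | spam)
p_word_given_ham = {'viagra': 0.001, 'free': 0.10}    # P(word | ham)

score_spam, score_ham = p_spam, p_ham
for word in ('viagra', 'free'):
    score_spam *= p_word_given_spam[word]
    score_ham *= p_word_given_ham[word]

# P(features) is the same for both classes, so the unnormalized
# scores can be compared directly
print(f"spam score: {score_spam:.5f}, ham score: {score_ham:.5f}")
print('Predicted:', 'spam' if score_spam > score_ham else 'ham')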
Python
# Python code with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Sample Data: Classifying messages as spam or not spam (ham)
corpus = [
    'Offer for you! Win a free prize now!',
    'Can we meet tomorrow for the project?',
    'Urgent: your account needs attention',
    'Hey, are you free for lunch tomorrow?',
    'Claim your free prize today only'
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam']
# Create a pipeline that first converts text to word counts (vectorizes)
# and then applies the Naive Bayes classifier.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
# Train the model
pipeline.fit(corpus, labels)
# Predict on new text
new_messages = [
    'Can you send me the project file?',
    'Win a prize by clicking now'
]
predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"'{msg}' -> Predicted: {pred}")
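One practical detail: if a word never appeared in a class during training, a raw product of probabilities would collapse to zero. MultinomialNB guards against this with additive (Laplace) smoothing, controlled by its alpha parameter (default 1.0). You can also inspect the model's confidence rather than just the hard label:

# Class probabilities instead of hard labels (classes_ gives the column order)
probabilities = pipeline.predict_proba(new_messages)
print(pipeline.classes_)
print(probabilities)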