The Naive Bayes classifier is a probabilistic machine learning algorithm based on the famous Bayes' Theorem. It's particularly useful for problems involving text data, like classifying emails as spam or not spam.
Bayes' Theorem
At its heart, the algorithm uses Bayes' Theorem to calculate the probability of a hypothesis given some evidence. The formula is:
P(A∣B) = P(B∣A) ⋅ P(A) / P(B)
In the context of classification, we can think of this as:
P(class∣features) = P(features∣class) ⋅ P(class) / P(features)
- P(class∣features): The probability of a certain class given the features of a data point (this is what we want to find).
- P(features∣class): The probability of seeing these features, given that the data point belongs to a certain class.
- P(class): The overall probability of this class (its frequency in the training data).
- P(features): The overall probability of seeing these features.
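To make these quantities concrete, here is a tiny worked example in Python. All the numbers (a 30% spam rate, and how often the word "free" appears in each class) are made up purely for illustration:

# Made-up numbers, purely for illustration
p_spam = 0.30                 # P(class): 30% of all emails are spam
p_free_given_spam = 0.60      # P(features | class): "free" appears in 60% of spam
p_free_given_ham = 0.05       # "free" appears in 5% of ham

# P(features): overall probability of seeing "free" (law of total probability)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' Theorem: P(class | features)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(f"P(spam | 'free') = {p_spam_given_free:.3f}")  # ≈ 0.837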
The "Naive" Assumption
Here's where the "naive" part comes in. To calculate P(features∣class), we would normally have to consider the complex relationships between all the features. The Naive Bayes classifier simplifies this dramatically by making a strong assumption of conditional independence.
The Naive Assumption: All features are independent of each other, given the class.
For text classification, this means the algorithm assumes that the presence of the word "buy" in an email is completely independent of the presence of the word "now", given that the email is spam.
Of course, this is rarely true in reality ("buy" and "now" often appear together). However, the algorithm works surprisingly well in practice despite this unrealistic assumption, especially for text problems. This assumption makes the calculations extremely fast and efficient.
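Written out, the assumption turns a hard joint probability into a simple product: P(w1, w2, …, wn ∣ class) = P(w1∣class) ⋅ P(w2∣class) ⋯ P(wn∣class). Here is a minimal sketch of that factorization, using made-up per-word probabilities:

# Hypothetical per-word probabilities P(word | spam), for illustration only
word_probs_given_spam = {'buy': 0.20, 'now': 0.15}

# Under the naive assumption, P('buy', 'now' | spam) is just the product
p_features_given_spam = 1.0
for word in ('buy', 'now'):
    p_features_given_spam *= word_probs_given_spam[word]

print(p_features_given_spam)  # 0.20 * 0.15 = 0.03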
Example: Spam Filtering
Let's say we want to classify an email containing the words "Viagra" and "free". The algorithm would calculate:
- Probability it's Spam:
  - What's the probability of "Viagra" appearing in spam emails?
  - What's the probability of "free" appearing in spam emails?
  - What's the overall probability of any email being spam?
  - Multiply these probabilities together.
- Probability it's Not Spam (Ham):
  - What's the probability of "Viagra" appearing in ham emails?
  - What's the probability of "free" appearing in ham emails?
  - What's the overall probability of any email being ham?
  - Multiply these probabilities together.
- Compare: The email is assigned to the class with the higher resulting score. The denominator P(features) is identical for both classes, so it can be dropped from the comparison, as the sketch after this list shows.
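Here is a hand-rolled version of that comparison. All the priors and word frequencies below are hypothetical, chosen only to show the mechanics:

# All probabilities below are hypothetical, for illustration only
p_spam, p_ham = 0.4, 0.6                              # class priors P(class)

p_word_given_spam = {'viagra': 0.30, 'free': 0.50}    # P(word | spam)
p_word_given_ham = {'viagra': 0.001, 'free': 0.10}    # P(word | ham)

score_spam, score_ham = p_spam, p_ham
for word in ('viagra', 'free'):
    score_spam *= p_word_given_spam[word]
    score_ham *= p_word_given_ham[word]

# P(features) is the same for both classes, so the unnormalized
# scores can be compared directly
print(f"spam score: {score_spam:.5f}, ham score: {score_ham:.5f}")
print('Predicted:', 'spam' if score_spam > score_ham else 'ham')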
Python
# Python code with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Sample Data: Classifying messages as spam or not spam (ham)
corpus = [
    'Offer for you! Win a free prize now!',
    'Can we meet tomorrow for the project?',
    'Urgent: your account needs attention',
    'Hey, are you free for lunch tomorrow?',
    'Claim your free prize today only'
]
labels = ['spam', 'ham', 'spam', 'ham', 'spam']
# Create a pipeline that first converts text to word counts (vectorizes)
# and then applies the Naive Bayes classifier.
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
# Train the model
pipeline.fit(corpus, labels)
# Predict on new text
new_messages = [
    'Can you send me the project file?',
    'Win a prize by clicking now'
]
predictions = pipeline.predict(new_messages)
for msg, pred in zip(new_messages, predictions):
    print(f"'{msg}' -> Predicted: {pred}")
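One practical detail: if a word never appeared in a class during training, a raw product of probabilities would collapse to zero. MultinomialNB guards against this with additive (Laplace) smoothing, controlled by its alpha parameter (default 1.0). You can also inspect the model's confidence rather than just the hard label:

# Class probabilities instead of hard labels (classes_ gives the column order)
probabilities = pipeline.predict_proba(new_messages)
print(pipeline.classes_)
print(probabilities)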