Before we can feed text data to a machine learning model, we need to clean and standardize it. This process, known as text preprocessing, transforms messy, raw text into a predictable format. Let's explore the fundamental steps.

1. Tokenization

Tokenization is the process of breaking down a piece of text into smaller units called tokens. Most commonly, these tokens are words, but they can also be characters or subwords. This is the foundation of text processing.

  • Sentence: "The quick brown fox jumps over the lazy dog."
  • Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
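To see why a dedicated tokenizer matters, compare it with Python's built-in whitespace splitting, which leaves punctuation glued to words (a minimal standard-library sketch; NLTK's word_tokenize, used in the full example later, handles this correctly):

```python
sentence = "The quick brown fox jumps over the lazy dog."

# Naive whitespace splitting keeps the final period attached to "dog".
naive_tokens = sentence.split()
print(naive_tokens)      # [..., 'lazy', 'dog.']
print(naive_tokens[-1])  # 'dog.'

# A word tokenizer would instead emit 'dog' and '.' as separate tokens.
```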

2. Stopword Removal

Stopwords are extremely common words that add little semantic value to a sentence. Words like "the," "a," "is," "in," and "on" are often removed to reduce noise and focus the model's attention on more important words.

  • Original Tokens: ['the', 'cat', 'is', 'on', 'the', 'mat']
  • After Stopword Removal: ['cat', 'mat']
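The filtering step above is just a membership test against a stopword set. Here is a minimal sketch using a tiny hand-picked set (NLTK's English stopword list, used in the full example later, is much larger):

```python
# A tiny illustrative stopword set -- real lists contain many more entries.
stop_words = {"the", "a", "is", "in", "on"}

tokens = ["the", "cat", "is", "on", "the", "mat"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'mat']
```

Using a set (rather than a list) makes each membership check constant-time, which matters when filtering large corpora.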

3. Normalization: Stemming vs. Lemmatization

Words can appear in different forms (e.g., "run," "running," "ran"). Normalization is the process of reducing these variations down to a common base or root form. This helps the model recognize that these words are semantically similar. There are two main approaches:

  • Stemming: A crude, rule-based process that chops suffixes off words to produce a base form, or "stem." The result may not be a real dictionary word. It is fast but less accurate.
      • "studies," "studying" -> "studi"
      • "connection," "connects" -> "connect"
  • Lemmatization: A more sophisticated process that uses a dictionary and morphological analysis to return the base form, or "lemma," of a word, which is always a real dictionary word. It is slower but more accurate.
      • "studies," "studying" -> "study"
      • "better" -> "good" (it can resolve irregular forms, given the right part-of-speech information)

Which to choose? If speed is critical and you can tolerate some imprecision, use stemming. For most applications where semantic accuracy is important, lemmatization is the better choice.
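To see concretely why stems need not be dictionary words, here is a deliberately crude suffix-stripping stemmer. This is a toy sketch for illustration only; the real Porter algorithm applies many more context-sensitive rules and produces different stems:

```python
def crude_stem(word: str) -> str:
    """Strip one common suffix if the remaining stem is long enough."""
    for suffix in ("ies", "ing", "ion", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) > 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("studies"))   # 'stud'    -- not a dictionary word
print(crude_stem("connects"))  # 'connect'
print(crude_stem("running"))   # 'runn'    -- the doubled consonant survives
```

Outputs like 'stud' and 'runn' show the trade-off: stemming is cheap string surgery, and a lemmatizer's dictionary lookup is what it takes to land on real words like "study" and "run".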

Code Example using NLTK

Let's put it all together with Python's popular NLTK (Natural Language Toolkit) library.


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# You may need to download these NLTK data packages first:
# nltk.download('punkt')      # tokenizer models ('punkt_tab' on newer NLTK versions)
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')    # may also be required by some NLTK versions for WordNet

raw_text = "The brilliant students are studying diligently for their upcoming examinations."

# 1. Lowercasing and Tokenization
lower_text = raw_text.lower()
tokens = word_tokenize(lower_text)
print(f"Tokens: {tokens}")

# 2. Stopword Removal (isalnum() also drops punctuation tokens like '.')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
print(f"Filtered Tokens: {filtered_tokens}")

# 3. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print(f"Stemmed Tokens: {stemmed_tokens}")

# 4. Lemmatization
# Note: lemmatize() treats every word as a noun by default; pass a part of
# speech for other forms, e.g. lemmatizer.lemmatize('studying', pos='v') -> 'study'
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(f"Lemmatized Tokens: {lemmatized_tokens}")