Before we can feed text data to a machine learning model, we need to clean and standardize it. This process, known as text preprocessing, transforms messy, raw text into a predictable format. Let's explore the fundamental steps.

1. Tokenization

Tokenization is the process of breaking down a piece of text into smaller units called tokens. Most commonly, these tokens are words, but they can also be characters or subwords. This is the foundation of text processing.

  • Sentence: "The quick brown fox jumps over the lazy dog."
  • Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
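To see why a dedicated tokenizer matters, compare it with Python's built-in whitespace splitting, which leaves punctuation glued to words (a minimal standard-library sketch; NLTK's word_tokenize, used in the full example later, handles this correctly):

```python
sentence = "The quick brown fox jumps over the lazy dog."

# Naive whitespace splitting keeps the final period attached to "dog".
naive_tokens = sentence.split()
print(naive_tokens)      # [..., 'lazy', 'dog.']
print(naive_tokens[-1])  # 'dog.'

# A word tokenizer would instead emit 'dog' and '.' as separate tokens.
```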

2. Stopword Removal

Stopwords are extremely common words that add little semantic value to a sentence. Words like "the," "a," "is," "in," and "on" are often removed to reduce noise and focus the model's attention on more important words.

  • Original Tokens: ['the', 'cat', 'is', 'on', 'the', 'mat']
  • After Stopword Removal: ['cat', 'mat']
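The filtering step above is just a membership test against a stopword set. Here is a minimal sketch using a tiny hand-picked set (NLTK's English stopword list, used in the full example later, is much larger):

```python
# A tiny illustrative stopword set -- real lists contain many more entries.
stop_words = {"the", "a", "is", "in", "on"}

tokens = ["the", "cat", "is", "on", "the", "mat"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'mat']
```

Using a set (rather than a list) makes each membership check constant-time, which matters when filtering large corpora.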

3. Normalization: Stemming vs. Lemmatization

Words can appear in different forms (e.g., "run," "running," "ran"). Normalization is the process of reducing these variations down to a common base or root form. This helps the model recognize that these words are semantically similar. There are two main approaches:

  • Stemming: A crude, rule-based process that chops suffixes off words to produce a base form, or "stem." The result may not be a real dictionary word. It is fast but less accurate.
      • "studies," "studying" -> "studi"
      • "connection," "connects" -> "connect"
  • Lemmatization: A more sophisticated process that uses a dictionary and morphological analysis to return the base form, or "lemma," of a word, which is always a real dictionary word. It is slower but more accurate.
      • "studies," "studying" -> "study"
      • "better" -> "good" (it can resolve irregular forms, given the right part-of-speech information)

Which to choose? If speed is critical and you can tolerate some imprecision, use stemming. For most applications where semantic accuracy is important, lemmatization is the better choice.
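To see concretely why stems need not be dictionary words, here is a deliberately crude suffix-stripping stemmer. This is a toy sketch for illustration only; the real Porter algorithm applies many more context-sensitive rules and produces different stems:

```python
def crude_stem(word: str) -> str:
    """Strip one common suffix if the remaining stem is long enough."""
    for suffix in ("ies", "ing", "ion", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) > 2:
            return word[: -len(suffix)]
    return word

print(crude_stem("studies"))   # 'stud'    -- not a dictionary word
print(crude_stem("connects"))  # 'connect'
print(crude_stem("running"))   # 'runn'    -- the doubled consonant survives
```

Outputs like 'stud' and 'runn' show the trade-off: stemming is cheap string surgery, and a lemmatizer's dictionary lookup is what it takes to land on real words like "study" and "run".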

Code Example using NLTK

Let's put it all together with Python's popular NLTK (Natural Language Toolkit) library.


import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# You may need to download these NLTK data packages first:
# nltk.download('punkt')      # tokenizer models ('punkt_tab' on newer NLTK versions)
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('omw-1.4')    # may also be required by some NLTK versions for WordNet

raw_text = "The brilliant students are studying diligently for their upcoming examinations."

# 1. Lowercasing and Tokenization
lower_text = raw_text.lower()
tokens = word_tokenize(lower_text)
print(f"Tokens: {tokens}")

# 2. Stopword Removal (isalnum() also drops punctuation tokens like '.')
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
print(f"Filtered Tokens: {filtered_tokens}")

# 3. Stemming
stemmer = PorterStemmer()
stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
print(f"Stemmed Tokens: {stemmed_tokens}")

# 4. Lemmatization
# Note: lemmatize() treats every word as a noun by default; pass a part of
# speech for other forms, e.g. lemmatizer.lemmatize('studying', pos='v') -> 'study'
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(f"Lemmatized Tokens: {lemmatized_tokens}")