Traditional methods like one-hot encoding represent words as sparse vectors with one dimension per vocabulary word: the vector for a given word has a single 1 at that word's index and 0s everywhere else. This approach has two major flaws: it is inefficient (dimensionality grows with vocabulary size, often into the hundreds of thousands), and it carries no notion of similarity (the vectors for "cat" and "dog" are exactly as different as those for "cat" and "car", since every pair of distinct one-hot vectors is orthogonal).
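A tiny sketch makes both flaws concrete (toy three-word vocabulary; real vocabularies are vastly larger):

```python
vocab = ["cat", "dog", "car"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """One-hot vector: a single 1 at the word's index, 0s elsewhere."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Any two distinct one-hot vectors are orthogonal, so "cat" is exactly
# as dissimilar from "dog" as it is from "car".
print(dot(one_hot("cat"), one_hot("dog")))  # 0.0
print(dot(one_hot("cat"), one_hot("car")))  # 0.0
```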
Word embeddings solve this. They are dense, low-dimensional vector representations of words where semantically similar words are close to each other in the vector space.
This allows us to perform amazing vector arithmetic, like the classic example: vector('King') - vector('Man') + vector('Woman') ≈ vector('Queen').
1. Word2Vec (Google)
Word2Vec is a predictive model that learns embeddings by analyzing the context in which words appear. It has two main architectures:
- Continuous Bag-of-Words (CBOW): The model learns to predict the current word based on its surrounding context words. It's fast and good for frequent words.
- Skip-gram: The model does the opposite; it learns to predict the surrounding context words based on the current word. It's slower but works better for rare words.
2. GloVe (Stanford)
GloVe, short for Global Vectors for Word Representation, takes a different approach. Instead of a predictive model over local context windows like Word2Vec, GloVe first builds a global word-word co-occurrence matrix: how often each pair of words appears near each other across the entire corpus. It then learns vectors whose dot products approximate the logarithms of those co-occurrence counts (a weighted least-squares factorization of the matrix). This lets it capture both local context and global corpus statistics effectively.
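The first step, counting co-occurrences, can be sketched in a few lines. This is only the counting stage, not the full GloVe objective, and the corpus and window size are illustrative assumptions (GloVe weights nearer neighbours more heavily, commonly by 1/distance):

```python
from collections import defaultdict

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
window = 2  # how many words to each side count as "context"

cooccur = defaultdict(float)
for sentence in corpus:
    for i, word in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # Weight a neighbour by the inverse of its distance.
                cooccur[(word, sentence[j])] += 1.0 / abs(i - j)

print(cooccur[("cat", "sat")])  # 1.0 (adjacent, seen once)
```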
3. fastText (Facebook/Meta)
A major limitation of Word2Vec and GloVe is their inability to handle out-of-vocabulary (OOV) words—words that weren't in the training data. fastText solves this by treating words not as single units but as a bag of character n-grams.
For example, the word "apple" with n=3 would be represented by <ap, app, ppl, ple, le> (plus the special word itself, <apple>). The final word embedding is the sum of these n-gram vectors. This allows fastText to construct a reasonable vector for a word it has never seen before, like "applesauce," by using the vectors of its constituent parts.
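The n-gram decomposition above is easy to reproduce. The helper below is a simplified sketch: real fastText uses a range of n values (typically 3 to 6) plus the full word token, and hashes the n-grams into a fixed-size table:

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word padded with fastText's '<' and '>'
    boundary markers (single n, no hashing; a simplification)."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>']
```

An unseen word like "applesauce" shares the n-grams `<ap`, `app`, `ppl`, and `ple` with "apple", which is why the sum of its n-gram vectors lands near related words.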
Code Example: Using Pre-trained Embeddings with Gensim
Training embeddings from scratch requires a massive text corpus. It's more common to use pre-trained models. Let's use gensim to load a pre-trained Word2Vec model and explore its capabilities.
Python
import gensim.downloader as api
# Load pre-trained Word2Vec vectors trained on the Google News corpus
# (300 dimensions, ~3 million words and phrases; the download is
# roughly 1.6 GB, so the first run takes a while).
word_vectors = api.load('word2vec-google-news-300')
# Find the most similar words to 'king'
print("Words most similar to 'king':")
print(word_vectors.most_similar('king'))
print("-" * 30)
# Perform the famous vector arithmetic
# king - man + woman = ?
print("king - man + woman = ?")
result = word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)
print("-" * 30)
# Calculate the similarity between two words
similarity = word_vectors.similarity('cat', 'dog')
print(f"Similarity between 'cat' and 'dog': {similarity:.4f}")