The magic of the Transformer model doesn't come from recurrence, but from a powerful and efficient mechanism called Scaled Dot-Product Attention. This is the component that allows the model to calculate the "relevance" between words in a sequence and create context-aware word representations.
1. The Ingredients: Queries, Keys, and Values
To understand attention, we first need to understand the three vectors that govern it. For every input word vector, the model learns to project it into three new vectors:
- Query (Q): This vector represents the current word's perspective. You can think of it as a "question" that the word is asking, like "Who in this sentence is relevant to me?"
- Key (K): This vector represents a word's "identity" or what it offers. It's the counterpart to the Query. It's like a label on a filing cabinet that the Query can match against.
- Value (V): This vector represents the actual meaning or content of the word. Once the relevance is determined, this is the vector that will be used to build the new representation.
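The three projections above can be sketched as plain matrix multiplications. In this minimal sketch, `W_q`, `W_k`, and `W_v` are hypothetical randomly initialized matrices standing in for the weights a real model would learn:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4  # illustrative sizes: input embedding dim and Q/K/V dim
x = rng.standard_normal((3, d_model))  # 3 input word vectors

# Stand-ins for the learned projection matrices
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

q = x @ W_q  # what each word is "asking"
k = x @ W_k  # what each word "offers"
v = x @ W_v  # each word's actual content

print(q.shape, k.shape, v.shape)  # (3, 4) (3, 4) (3, 4)
```

Note that all three vectors are derived from the same input word vector; only the projection matrices differ, and training shapes each matrix for its role.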
2. The Attention Formula
The entire process is captured in a single, elegant formula:
Attention(Q, K, V) = softmax(QK^T / √d_k) V
Let's break this down step-by-step.
- Calculate Similarity Scores: QK^T The first step is to score how relevant each word is to every other word. We do this by taking the dot product of the Query vector of a word with the Key vector of every word in the sequence (including itself). If two vectors are similar, their dot product will be large; if they are dissimilar, it will be small. This results in a matrix of raw attention scores.
- Scale: √d_k The dot products can grow very large in magnitude, which can push the softmax function (our next step) into regions where its gradients are tiny. This would slow down or stall the learning process. To counteract this, we scale the scores down by dividing by the square root of the dimension of the key vectors, √d_k. This is a simple but crucial trick for stabilizing training.
- Get Attention Weights: softmax(...) We then apply a softmax function to the scaled scores. The softmax function converts the scores into a set of positive numbers that all sum to 1. These numbers are our final attention weights. A word with a weight of 0.8 is considered highly relevant, while a word with a weight of 0.05 is not.
- Create the Output Vector: (...)V Finally, we take a weighted sum of all the Value vectors in the sequence, using our calculated attention weights. Words that were deemed highly relevant (high attention weight) will contribute heavily to the final vector, while irrelevant words will contribute very little.
The result is a new vector for each word that is a rich, contextual representation—a blend of itself and the other words in the sentence, weighted by their importance.
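The scaling step can be motivated empirically: for random vectors with unit-variance components, the standard deviation of the dot product grows like √d_k, while dividing by √d_k brings it back near 1. A minimal sketch (the dimensions and sample count here are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

for d_k in (4, 64, 512):
    # Many random query/key pairs with unit-variance components
    q = rng.standard_normal((10000, d_k))
    k = rng.standard_normal((10000, d_k))
    raw = np.sum(q * k, axis=-1)       # row-wise dot products
    scaled = raw / np.sqrt(d_k)        # the scaling from the formula
    # Raw std grows roughly like sqrt(d_k); scaled std stays near 1
    print(d_k, round(raw.std(), 1), round(scaled.std(), 2))
```

Without the division, a 512-dimensional model would feed scores dozens of times larger into the softmax, saturating it and flattening the gradients.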
Conceptual Code Example
This NumPy snippet demonstrates the math step by step:
Python
import numpy as np
def softmax(x):
    # Subtract the row max before exponentiating for numerical stability
    x = x - np.max(x, axis=-1, keepdims=True)
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)
# Example: 3 words, vector dimension of 4
q = np.random.randn(3, 4)
k = np.random.randn(3, 4)
v = np.random.randn(3, 4)
d_k = k.shape[-1]
# 1. Similarity Scores
scores = np.matmul(q, k.T)
print("Raw Scores:\n", scores)
# 2. Scale
scaled_scores = scores / np.sqrt(d_k)
print("\nScaled Scores:\n", scaled_scores)
# 3. Attention Weights
weights = softmax(scaled_scores)
print("\nAttention Weights (each row sums to 1):\n", weights)
# 4. Output Vector
output = np.matmul(weights, v)
print("\nFinal Output Vectors:\n", output)