Attention Mechanism
Self-attention, multi-head attention, how models learn to focus
Prerequisites
- rnns-lstms
Think about how you read this sentence. Your eyes don't give equal importance to every word — they automatically focus on the words that matter most for understanding. When you read "The bank by the river was steep," your brain instantly connects "bank" to "river" rather than to money. That's attention. And teaching machines to do the same thing is what transformed AI.
Before attention, neural networks for language had a brutal limitation. RNNs processed words one at a time, carrying a hidden state forward like a game of telephone. By the time you reached the end of a long sentence, the information from the beginning had degraded into mush. Imagine trying to translate a 50-word German sentence into English, but you could only look at the sentence through a tiny, foggy window that showed you one word at a time and a fading summary of everything before it.
Attention blew that window wide open. Instead of being forced to process sequentially, the model can now look at every word in the sentence at once and decide, for each word, which other words matter most. It's arguably the most important idea in modern AI: the mechanism that made transformers possible and that underlies GPT, BERT, Claude, and most large language models in use today.
Learning where to look
RNNs process sequentially — by the time you reach word 50, you've half-forgotten word 1. Attention flips this around completely. Instead of forcing information to flow through a chain of hidden states, attention lets the model look at all words at once and decide which ones matter for each prediction.
Here's the core idea: for every word in a sentence, the model asks "which other words should I pay attention to in order to understand this word?" When processing the word "it" in "The cat sat on the mat because it was tired," the model learns to attend strongly to "cat," because "it" refers to the cat. No sequential processing required. No information bottleneck. Just a direct lookup across the entire sequence.
The attention pipeline
This mechanism is called self-attention because every word in the sentence attends to every other word in the same sentence. It's what allows a model to build context-aware representations where the meaning of each word is informed by the entire sentence around it.
The mechanics of attention
Attention looks magical on the surface, but the mechanics are surprisingly elegant. It all comes down to three projections, a dot product, a softmax, and a weighted sum. Let's break it down step by step.
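In symbols, the whole mechanism is usually written in the scaled dot-product form from "Attention Is All You Need." The notation here is an assumption added for illustration, not something this page defines: Q, K, and V stack the query, key, and value vectors as rows, d_k is the dimension of each key vector, and the softmax is applied row by row so that each word's weights sum to 1.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

The division by the square root of d_k keeps the dot products from growing with the vector dimension, which would otherwise push the softmax toward near one-hot outputs.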
Step 1: Query, Key, Value
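Every word in the sentence gets transformed into three different vectors using learned weight matrices. These three representations each play a distinct role:
- Query: what the word is looking for
- Key: what the word advertises to other words
- Value: the information the word carries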
Think of it like a library. The Query is your search term, the Keys are book titles on the shelf, and the Values are the actual book contents. You compare your search term against every title to figure out which books are relevant, then read those books.
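The library analogy can be sketched in a few lines of NumPy. This is a minimal single-head sketch under stated assumptions, not a production implementation: the function names, the random weight matrices (which a real model would learn), and the sizes (10 tokens, 16-dimensional embeddings, 8-dimensional projections) are all made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over one sequence.

    X: (n_tokens, d_model) token embeddings.
    W_q, W_k, W_v: (d_model, d_k) projection matrices (hypothetical, random here).
    Returns (output, weights) with shapes (n_tokens, d_k) and (n_tokens, n_tokens).
    """
    Q = X @ W_q  # search terms: what each token is looking for
    K = X @ W_k  # book titles: what each token advertises
    V = X @ W_v  # book contents: the information each token carries
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # compare every query with every key
    weights = softmax(scores, axis=-1)  # each row becomes a distribution over tokens
    output = weights @ V                # weighted sum of values per token
    return output, weights

# Illustrative sizes and random inputs (assumptions, not from the lesson).
rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 10, 16, 8
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
```

Row i of `weights` is the attention distribution for token i: how much of each token's Value ends up in token i's new representation.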
Watch attention in action
Here's an interactive visualization of attention weights. Click any word to see how it "attends" to the rest of the sentence. The lines connecting words show attention strength — thicker lines mean higher weights. Toggle to Multi-Head mode to see how different attention heads capture different patterns simultaneously.
[Interactive widget: Attention Weight Visualizer. Click any word to select it as the "query"; thicker, brighter lines indicate higher attention weight.]
Try this: In "The cat sat on the mat because it was tired," click on "it" and notice the strong attention toward "cat." The weights show a pattern that looks like coreference resolution. Then switch to Multi-Head to see how different heads capture different patterns.
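Multi-head attention can be sketched as several scaled dot-product attentions running in parallel, each over its own slice of the projected vectors. This is a minimal NumPy sketch and not the code behind the visualizer: the names `multi_head_attention`, `W_o`, and `n_heads`, the random weights, and the sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """Multi-head self-attention over one sequence.

    X: (n_tokens, d_model); W_q, W_k, W_v, W_o: (d_model, d_model).
    Returns (output, weights) with shapes (n_tokens, d_model) and
    (n_heads, n_tokens, n_tokens).
    """
    n_tokens, d_model = X.shape
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads

    def split_heads(M):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return M.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    # Batched matmul: one (n_tokens, n_tokens) score matrix per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (n_heads, n_tokens, d_head)
    # Concatenate the heads back into (n_tokens, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

# Illustrative sizes and random weights (assumptions, not from the lesson).
rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 10, 16, 4
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mh_output, mh_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
```

Each slice `mh_weights[h]` is one head's attention pattern, which is what the Multi-Head view displays side by side. With random weights the patterns are meaningless; in a trained model they come from learned projections.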
Key Takeaways
- Attention lets a model look at every word in a sequence simultaneously and decide which words are most relevant to each other. This eliminates the RNN bottleneck where information had to pass through a chain of hidden states.
- Every word is projected into three vectors: Query (what it's looking for), Key (what it advertises), and Value (what information it carries). Attention scores are computed as dot products of Queries with Keys, scaled and normalized with a softmax, and then used as the weights in a weighted sum of Values.
- Multi-head attention runs multiple attention operations in parallel, each learning different types of relationships (syntax, coreference, semantics, position). This gives the model a rich, multi-faceted understanding of context.
- Standard self-attention is O(n²) in sequence length: doubling the sequence quadruples the number of query-key comparisons, which makes it expensive for very long sequences. This tradeoff (attending to everything at once gives quality but costs compute) drives much of modern LLM architecture research.
- The attention mechanism, introduced in 2014 for machine translation and generalized in 2017's "Attention Is All You Need," is the foundational building block of every modern transformer-based language model.
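The quadratic cost mentioned in the takeaways can be made concrete with a little arithmetic. This is a back-of-the-envelope sketch: the function names are hypothetical, and the 4 bytes per score (32-bit floats) and the single head over a single sequence are assumptions. Real systems often use other precisions and avoid materializing the full matrix.

```python
def attention_score_entries(n_tokens: int) -> int:
    """Number of query-key scores for one head over one sequence."""
    return n_tokens * n_tokens

def score_matrix_megabytes(n_tokens: int, bytes_per_score: int = 4) -> float:
    """Memory for one full score matrix, assuming 4-byte floats by default."""
    return attention_score_entries(n_tokens) * bytes_per_score / 1e6

# Doubling the sequence length quadruples both the entries and the memory.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_entries(n), score_matrix_megabytes(n))
```

Multiply these figures by the number of heads and layers to see why long contexts get expensive quickly.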
Common Misconceptions
- "Attention means the model understands meaning." In fact, attention learns statistical correlations between positions. When "it" attends to "cat," the model hasn't understood coreference in a human sense; it has learned that tokens in the position and context of "it" tend to correlate with tokens like "cat" in training data. The effect looks like understanding, but the mechanism is pattern matching.
- "More attention heads always means better performance." In fact, there are diminishing returns. Research has shown that many attention heads in trained models are redundant and can be pruned without significant quality loss. The optimal number of heads depends on the task, model size, and embedding dimension.