Attention Mechanism Explorer

Interactive demonstration of attention weights and patterns


Token Information

  • The — pos 0
  • cat — pos 1
  • sat — pos 2
  • on — pos 3
  • the — pos 4
  • mat — pos 5

Attention Heatmap - Head 1

6 × 6 matrix

How to read: Rows represent query tokens, columns represent key tokens. Darker colors indicate stronger attention weights.

Hover over cells to see which tokens this query token is attending to most strongly.

Attention Analysis & Insights

  • Tokens: 6
  • Attention Heads: 8
  • Max Weight: 0.993
  • Avg Weight: 0.861

Head 1 - Top Attention Patterns

  • cat → sat: 0.993
  • sat → cat: 0.993
  • the → mat: 0.992
  • mat → the: 0.992
  • sat → on: 0.992

💡 Key Insights

Multi-Head Benefits: Different attention heads can specialize in different types of relationships (syntactic, semantic, positional).
Attention Patterns: Self-attention allows each token to look at all other tokens, enabling long-range dependencies.
Computational Efficiency: Attention can be computed in parallel for all positions, unlike recurrent mechanisms.

Understanding Attention Mechanisms

✨ What is Attention?

Attention mechanisms allow neural networks to focus on the most relevant parts of the input when processing each element. Instead of treating all inputs equally, attention assigns weights based on relevance.

🔍 How It Works

Each token produces three vectors: a Query (what it's looking for), a Key (what it offers), and a Value (the information it carries). Attention weights are computed by taking the dot product of each query with every key, scaling, and normalizing with a softmax so that each query's weights sum to 1.
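A minimal NumPy sketch of this computation. The projection matrices here are random stand-ins for learned weights, and the 6-token input mirrors the demo sentence; names like `self_attention` are illustrative, not part of the demo's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model) token embeddings.
    Returns the attended output and the (seq_len, seq_len) weight matrix.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # project into query/key/value spaces
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # compare every query with every key
    weights = softmax(scores, axis=-1)     # each row is one query's distribution
    return weights @ V, weights            # weighted mix of values + the heatmap

rng = np.random.default_rng(0)
d_model = 8
X = rng.normal(size=(6, d_model))          # 6 tokens: "The cat sat on the mat"
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.shape)                       # the 6 × 6 matrix shown in the heatmap
```

The `weights` array is exactly what the heatmap visualizes: row *i* shows how strongly token *i* (as a query) attends to every token (as keys).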

🎯 Self vs Cross Attention

  • Self-Attention: Tokens attend to other tokens in the same sequence
  • Cross-Attention: Tokens attend to tokens from a different sequence

🧠 Multi-Head Attention

Multiple attention heads run in parallel, each learning to focus on different types of relationships. This allows the model to capture various linguistic patterns simultaneously.
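The standard multi-head formulation runs several smaller attention computations and concatenates their outputs. A hedged sketch, again with random matrices in place of learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Illustrative multi-head self-attention with random projections."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads            # each head works in a smaller subspace
    head_outputs = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)
        head_outputs.append(weights @ V)   # each head: (seq_len, d_head)
    concat = np.concatenate(head_outputs, axis=-1)  # back to (seq_len, d_model)
    Wo = rng.normal(size=(d_model, d_model))        # final output projection
    return concat @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 64))               # 6 tokens, d_model = 64
out = multi_head_attention(X, n_heads=8, rng=rng)
print(out.shape)                           # (6, 64)
```

Because each head has its own projections, each learns its own notion of "relevance" during training, which is why switching heads in the explorer shows different patterns for the same sentence.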

📚 Try These Experiments

1. Compare Attention Heads:

Switch between different heads to see how they focus on different relationships in the same sentence.

2. Sentence Length Effect:

Try different sentence lengths and observe how attention patterns change with more context.

3. Token Relationships:

Hover over tokens to see which words they attend to most strongly.

4. Mathematical Foundation:

Use the step-by-step mode to understand the mathematical computation behind attention.
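The step-by-step computation can also be traced by hand with tiny numbers. This sketch follows one query against three keys; the vectors are made up for illustration:

```python
import numpy as np

# One query compared against three keys, with hand-checkable numbers.
q = np.array([1.0, 0.0])                   # query vector (d_k = 2)
K = np.array([[1.0, 0.0],                  # key 0 — matches the query exactly
              [0.0, 1.0],                  # key 1 — orthogonal to the query
              [1.0, 1.0]])                 # key 2 — partial match
V = np.array([10.0, 20.0, 30.0])           # one scalar value per key

scores = K @ q                             # dot products: [1, 0, 1]
scaled = scores / np.sqrt(q.shape[0])      # divide by sqrt(d_k) = sqrt(2)
weights = np.exp(scaled) / np.exp(scaled).sum()  # softmax over the keys
output = weights @ V                       # weighted average of the values
print(weights.round(3))                    # keys 0 and 2 share most of the weight
print(round(output, 2))                    # 20.0
```

Keys 0 and 2 get equal weight (their scores tie), so the output lands at the weighted average of all three values, exactly 20.0 here; this is the same score → scale → softmax → weighted-sum pipeline the step-by-step mode walks through.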