Model Internals
RoPE, MoE, GQA — inside Llama, Mistral, Gemma
Prerequisites
- transformer-architecture
You've learned what a transformer is. But the transformers powering Llama, Mistral, and Gemma aren't the same transformer from the 2017 paper. They've been upgraded, optimized, and frankly, redesigned in clever ways. Understanding these internals is like knowing what's under the hood of a car — you don't need it to drive, but you need it to build, fix, or optimize.
The original “Attention Is All You Need” transformer used fixed sinusoidal position encodings, standard multi-head attention, ReLU activations, and LayerNorm. Modern LLMs have replaced every one of those components: RoPE instead of absolute position encodings, GQA instead of full multi-head attention, SwiGLU instead of ReLU, and RMSNorm instead of LayerNorm. Each swap seems small on its own, but together they compound into models that are faster, more memory-efficient, and better at handling long contexts.
This lesson takes you inside the three most important open-weight model families — Llama 3, Mistral/Mixtral, and Gemma — and shows you exactly what changed and why. By the end, you'll be able to read a model card or architecture diagram and understand what every component does.
Beyond the vanilla transformer
Modern LLMs keep the transformer skeleton but swap out nearly every component. Think of it like a car platform — the chassis and layout stay the same, but the engine, transmission, suspension, and brakes are all upgraded. The 2017 transformer was the original design. What ships in production today is a heavily modified version that's been battle-tested at scales the original authors never imagined.
From vanilla to modern transformer
The reason these changes matter is practical. RoPE makes it feasible to extend a model to 128K-token contexts with a relatively short long-context training phase instead of retraining from scratch. GQA cuts memory usage during inference so you can serve models on fewer GPUs. SwiGLU gets more quality out of a comparable parameter count. RMSNorm is cheaper to compute, and cheaper means less cost to train and serve. None of these innovations is Nobel Prize material on its own; they're engineering improvements. But stacked together, they can be the difference between a model that costs, say, $10M to train and one that costs $2M for the same quality (illustrative figures, not measurements).
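To make the GQA memory claim concrete, here is a back-of-the-envelope sketch of KV-cache size for one sequence. The shape values (32 layers, 32 query heads, head dimension 128, 8,192 tokens, 16-bit values) are assumptions chosen to resemble an 8B-class model, and `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2: each layer caches one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed shape, roughly an 8B-class model at an 8K context in 16-bit precision.
shape = dict(num_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(num_kv_heads=32, **shape)  # full multi-head: one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # grouped-query: 4 query heads share each KV head

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB, ratio: {mha // gqa}x")
# MHA: 4.0 GiB, GQA: 1.0 GiB, ratio: 4x
```

The saving scales with batch size and context length, which is why it matters most when serving many long conversations at once.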
The upgrades that matter
Let's walk through the four most important changes that define modern transformer architectures. Each one addresses a specific weakness of the original design.
RoPE (Rotary Position Embedding)
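The original transformer used fixed sinusoidal encodings (learned position vectors were a common alternative) to tell the model where each token sits in the sequence. Both are absolute schemes with a practical limit: learned vectors simply do not exist past the trained length, and sinusoidal encodings tend to degrade at sequence lengths the model has never seen during training. RoPE takes a different route. Instead of adding a position vector to the token embedding, it rotates each pair of dimensions in the query and key vectors by an angle proportional to the token's position.
The key insight is that RoPE encodes relative position rather than absolute position: because queries and keys are both rotated, their dot product depends only on the offset between them. Two tokens that are 3 positions apart always look the same to the attention mechanism, no matter where they appear in the sequence. This property is part of why the Llama 3 family can reach 128K-token contexts: with an adjusted rotation base and some long-context training, RoPE extends to sequences much longer than those seen in initial pretraining.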
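The relative-position property is easy to check numerically. The sketch below is a minimal NumPy version of the rotation, assuming the interleaved-pair layout and a base of 10,000 from the original RoPE formulation; real implementations often use a different dimension layout and a larger base, and the `rope` function here is illustrative only.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of dimensions of x by position-dependent angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    # One rotation frequency per pair of dimensions: base^(-2i/d).
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # standard 2D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same offset of 3, very different absolute positions.
score_near = rope(q, 5) @ rope(k, 2)
score_far = rope(q, 1005) @ rope(k, 1002)
print(np.isclose(score_near, score_far))  # True
```

Because each pair is only rotated, the vector's length is unchanged; position affects the attention score purely through the angle between query and key.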
Compare model architectures
[Interactive architecture comparison tool: Llama 3, Mistral/Mixtral, and Gemma 2 shown as vertical stacks of component blocks. Gray-bordered blocks are shared across models; colored blocks are unique innovations.]
Key pattern: All three models use the same skeleton — token embedding, position encoding, repeated (norm + attention + norm + FFN) blocks, and a final projection head. The innovations are in what goes inside each slot: RoPE for position, GQA or sliding window for attention, SwiGLU or GEGLU for the FFN, and RMSNorm everywhere instead of LayerNorm. These targeted upgrades compound into models that are dramatically better than the original 2017 transformer.
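The shared skeleton can be sketched as a block with four pluggable slots. This is a toy illustration assuming the pre-norm residual layout (normalize, transform, add back) used by Llama-style models; some models add further normalization steps, and every name here (`decoder_block`, `run_stack`, `toy_slots`) is hypothetical.

```python
import numpy as np

def decoder_block(x, norm1, attention, norm2, ffn):
    # Pre-norm residual layout: each sub-layer reads a normalized copy of x
    # and adds its output back onto the residual stream.
    x = x + attention(norm1(x))
    x = x + ffn(norm2(x))
    return x

def run_stack(x, blocks):
    # A model body is the same block repeated, each with its own slot contents.
    for slots in blocks:
        x = decoder_block(x, **slots)
    return x

# Toy stand-ins for the real components, just to show the wiring.
identity = lambda h: h
toy_slots = dict(norm1=identity, attention=lambda h: 0.5 * h,
                 norm2=identity, ffn=lambda h: np.tanh(h))

x = np.ones((3, 4))  # (tokens, model_dim)
out = run_stack(x, [toy_slots] * 2)
print(out.shape)
```

Swapping RoPE-based attention for sliding-window attention, or SwiGLU for GEGLU, changes what is passed into a slot, not the shape of the block.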
Key Takeaways
- Modern LLMs keep the transformer skeleton (embed, attend, feed-forward, project) but swap out nearly every internal component. The 2017 transformer is a starting point, not the final design.
- RoPE (Rotary Position Embedding) encodes relative position by rotating query and key vectors, which makes it much easier to extend models to sequence lengths beyond their original training length. It is a key ingredient in how the Llama 3 family reaches 128K-token contexts.
- GQA (Grouped Query Attention) shares key-value heads across multiple query heads, typically cutting KV-cache memory by 4-8x during inference (for example, 8 KV heads serving 32 or 64 query heads). This is primarily an inference optimization, not a training one -- it makes serving large models dramatically cheaper.
- SwiGLU replaces ReLU/GELU in the FFN with a gated activation that lets the network learn which information to pass through. It adds a third weight matrix but consistently improves quality across model scales.
- RMSNorm simplifies LayerNorm by dropping the mean subtraction, keeping only the magnitude normalization. It is faster to compute and works just as well, making it the default choice for modern LLMs.
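The last two takeaways fit in a few lines of NumPy. This is a minimal sketch, not any model's actual code: the weight names `w_gate`, `w_up`, and `w_down` and the `eps` value are assumptions, and the weights are random.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Divide by the root mean square over the last axis.
    # Unlike LayerNorm, there is no mean subtraction and no bias term.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def silu(x):
    # SiLU (also called Swish): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # Three weight matrices: the SiLU-activated gate multiplies the "up"
    # projection elementwise before projecting back down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
x = rng.standard_normal((4, d_model))  # (tokens, model_dim)

h = rms_norm(x, weight=np.ones(d_model))
y = swiglu_ffn(
    h,
    rng.standard_normal((d_model, d_hidden)),
    rng.standard_normal((d_model, d_hidden)),
    rng.standard_normal((d_hidden, d_model)),
)
print(y.shape)  # (4, 8)
```

The gate is what makes this "gated": where `silu(x @ w_gate)` is near zero, the corresponding hidden unit contributes almost nothing, whatever the up projection says.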
Common Misconceptions
- "These models are completely different architectures." Not really -- Llama, Mistral, and Gemma all follow the same high-level pattern: decoder-only transformer with repeated blocks of normalization, attention, normalization, and feed-forward. The differences are in which specific variant of each component they choose. In the model code, switching from RoPE to learned positions, or from GQA to MHA, would be a config change, not a rewrite (although the model would have to be retrained with the new setting).
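To illustrate the "config change" point, here is a hypothetical configuration object. `ModelConfig` and its field names are invented for this sketch and do not correspond to any specific library's config class; the default values are assumptions.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ModelConfig:  # hypothetical, for illustration only
    num_layers: int = 32
    num_heads: int = 32
    num_kv_heads: int = 8            # fewer KV heads than query heads -> GQA
    position_encoding: str = "rope"
    activation: str = "swiglu"
    norm: str = "rmsnorm"

    @property
    def attention_kind(self) -> str:
        # MHA, MQA, and GQA differ only in how many KV heads there are.
        if self.num_kv_heads == self.num_heads:
            return "mha"
        if self.num_kv_heads == 1:
            return "mqa"
        return "gqa"

modern = ModelConfig()
# Same skeleton, different choice in each slot.
vanilla_style = replace(modern, num_kv_heads=32, position_encoding="learned",
                        activation="relu", norm="layernorm")

print(modern.attention_kind, vanilla_style.attention_kind)  # gqa mha
```

The block structure never appears in the config at all, because it is the part every one of these models shares.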