Model Internals

You've learned what a transformer is. But the transformers powering Llama, Mistral, and Gemma aren't the same transformer from the 2017 paper. They've been upgraded, optimized, and frankly, redesigned in clever ways. Understanding these internals is like knowing what's under the hood of a car — you don't need it to drive, but you need it to build, fix, or optimize.

The original “Attention Is All You Need” transformer used fixed sinusoidal position encodings (with learned position embeddings as a close alternative), standard multi-head attention, ReLU activations, and LayerNorm. Modern LLMs have replaced every single one of those components. RoPE instead of sinusoidal or learned positions. GQA instead of full multi-head attention. SwiGLU instead of ReLU. RMSNorm instead of LayerNorm. Each swap seems small on its own, but together they compound into models that are faster, more memory-efficient, and better at handling long contexts.

This lesson takes you inside the three most important open-weight model families — Llama 3, Mistral/Mixtral, and Gemma — and shows you exactly what changed and why. By the end, you'll be able to read a model card or architecture diagram and understand what every component does.

Beyond the vanilla transformer

Modern LLMs keep the transformer skeleton but swap out nearly every component. Think of it like a car platform — the chassis and layout stay the same, but the engine, transmission, suspension, and brakes are all upgraded. The 2017 transformer was the original design. What ships in production today is a heavily modified version that's been battle-tested at scales the original authors never imagined.

From vanilla to modern transformer

Vanilla Transformer (sinusoidal or learned positions, MHA, ReLU, LayerNorm) → Modern Upgrades (RoPE + GQA + SwiGLU + RMSNorm) → Modern LLM (Llama, Mistral, Gemma)

The reason these changes matter is practical. RoPE encodes position in a way that can be stretched to 128K-token contexts with comparatively little additional long-context training. GQA slashes KV-cache memory during inference so you can serve models on fewer GPUs. SwiGLU squeezes more performance out of the same parameter count. RMSNorm is cheaper to compute, and cheaper per step means cheaper to train and serve. None of these innovations is Nobel Prize material on its own; they're engineering improvements. But stacked together, they can be the difference between, say, a model that costs $10M to train and one that costs $2M for the same quality (illustrative figures, not measurements).

The upgrades that matter

Let's walk through the four most important changes that define modern transformer architectures. Each one addresses a specific weakness of the original design.

RoPE (Rotary Position Embedding)

The original transformer used either fixed sinusoidal embeddings or learned position vectors to tell the model where each token sits in the sequence. Both approaches struggle at sequence lengths the model has never seen during training: learned vectors simply don't exist for unseen positions, and sinusoidal encodings tend to extrapolate poorly in practice.

How RoPE works:

1. Take the query and key vectors from attention

2. Rotate them by an angle proportional to their position

3. The dot product between Q and K now encodes relative distance

4. Positions 5 and 8 have the same relative encoding as 105 and 108

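The key insight is that RoPE encodes relative position rather than absolute position. Two tokens that are 3 positions apart always look the same to the attention mechanism, no matter where they appear in the sequence. This is a big part of why Llama 3 can handle 128K-token contexts: relative encoding extends to longer sequences far more gracefully than absolute position vectors do. It is not free, though. Long-context models are typically still trained on longer sequences, often with adjusted rotation frequencies, to reach those lengths.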
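
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the helper names `rope` and `dot`, the toy 4-dimensional vectors, the interleaved pairing of dimensions, and the base of 10000 are all assumptions chosen for readability (production models differ in layout and base frequency).

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `position`.

    Pair i (elements 2i and 2i+1) uses frequency base ** (-2i / d), so early
    pairs rotate quickly and later pairs rotate slowly.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at two different places in the sequence.
score_early = dot(rope(q, 8), rope(k, 5))
score_late = dot(rope(q, 108), rope(k, 105))

# A different offset (2 positions apart) for comparison.
score_other = dot(rope(q, 8), rope(k, 6))

print(abs(score_early - score_late) < 1e-9)   # same relative distance, same score
print(abs(score_early - score_other) > 1e-3)  # different distance, different score
```

Because each pair is rotated, the dot product between a rotated query and a rotated key depends only on the difference of their rotation angles, which is why positions 5 and 8 score the same as positions 105 and 108.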

Compare model architectures

The comparison below shows how Llama 3, Mistral/Mixtral, and Gemma 2 are built. Most components are shared across the three models; the key innovations listed for each one are what set it apart.

Architecture Comparison Tool

Llama 3

Parameters: 8B / 70B / 405B
Context Length: 128K tokens
Key Innovation: GQA + SwiGLU + RoPE with larger vocab (128K tokens)

Mistral / Mixtral

Parameters: 7B (Mistral) / 8x7B (Mixtral)
Context Length: 32K tokens
Key Innovation: Sliding Window Attention + Mixture of Experts (MoE)

Gemma 2

Parameters: 2B / 9B / 27B
Context Length: 8K tokens
Key Innovation: Alternating local/global attention + RMSNorm + logit soft-capping

Key pattern: All three models use the same skeleton — token embedding, position encoding, repeated (norm + attention + norm + FFN) blocks, and a final projection head. The innovations are in what goes inside each slot: RoPE for position, GQA or sliding window for attention, SwiGLU or GEGLU for the FFN, and RMSNorm everywhere instead of LayerNorm. These targeted upgrades compound into models that are dramatically better than the original 2017 transformer.
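
To make the shared skeleton concrete, here is a rough single-token sketch of one (norm + attention + norm + FFN) block using RMSNorm and a SwiGLU feed-forward layer. Everything here is illustrative: the function names, the tiny weight matrices, the residual additions around each sub-layer, and the identity function standing in for attention are assumptions, and real models add details this sketch omits (for example, batching, multiple heads, and any extra normalization a given family uses).

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square; no mean subtraction, no bias."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * scale for w, v in zip(weight, x)]

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: down(silu(gate(x)) * up(x)), which needs three weight matrices."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def decoder_block(x, attention, norm1_w, norm2_w, ffn_weights):
    """norm -> attention -> residual add, then norm -> FFN -> residual add."""
    h = [a + b for a, b in zip(x, attention(rms_norm(x, norm1_w)))]
    f = swiglu_ffn(rms_norm(h, norm2_w), *ffn_weights)
    return [a + b for a, b in zip(h, f)]

# Toy demo: model width 2, FFN hidden width 3, identity stand-in for attention.
x = [3.0, 4.0]
ones = [1.0, 1.0]
w_gate = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w_up = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
w_down = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]

normed = rms_norm(x, ones)
out = decoder_block(x, lambda v: v, ones, ones, (w_gate, w_up, w_down))
print(normed)  # the input rescaled so its root-mean-square is about 1
print(out)     # block output, same width as the input
```

Swapping a component means swapping what sits in one of these slots: a different `attention` callable, a GEGLU-style gate in place of `silu`, and so on, while the surrounding structure stays the same.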

Key Takeaways

Modern LLMs keep the transformer skeleton (embed, attend, feed-forward, project) but swap out nearly every internal component. The 2017 transformer is a starting point, not the final design.
RoPE (Rotary Position Embedding) encodes relative position by rotating query and key vectors, which makes it much easier to extend models to sequence lengths beyond their original training length. This is a key ingredient in how Llama 3 handles 128K-token contexts.
GQA (Grouped Query Attention) shares key-value heads across multiple query heads, cutting KV-cache memory by 4-8x during inference. This is an inference optimization, not a training one -- it makes serving large models dramatically cheaper.
SwiGLU replaces ReLU/GELU in the FFN with a gated activation that lets the network learn which information to pass through. It adds a third weight matrix but consistently improves quality across model scales.
RMSNorm simplifies LayerNorm by dropping the mean subtraction, keeping only the magnitude normalization. It is faster to compute and works just as well, making it the default choice for modern LLMs.
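
The GQA saving is easy to check with back-of-the-envelope arithmetic. The sketch below assumes a hypothetical model with 32 layers, 32 query heads of dimension 128, 16-bit cache values, and an 8,192-token sequence; the function name `kv_cache_bytes` and all of these numbers are illustrative choices, not figures taken from a specific model card.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    keys_and_values = 2
    return (keys_and_values * n_layers * n_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

GIB = 1024 ** 3

# Hypothetical configuration (all values are assumptions for illustration).
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 8192

# Full multi-head attention: every query head has its own KV head.
mha = kv_cache_bytes(n_layers, n_kv_heads=n_query_heads,
                     head_dim=head_dim, seq_len=seq_len)

# Grouped query attention: 8 KV heads, each shared by 4 query heads.
gqa = kv_cache_bytes(n_layers, n_kv_heads=8,
                     head_dim=head_dim, seq_len=seq_len)

print(mha / GIB)   # 4.0 GiB per sequence
print(gqa / GIB)   # 1.0 GiB per sequence
print(mha // gqa)  # 4x reduction
```

The reduction factor is simply the number of query heads divided by the number of KV heads, so sharing each KV head across 8 query heads instead of 4 would give an 8x saving under the same assumptions.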

Common Misconceptions

"These models are completely different architectures." Not really. Llama, Mistral, and Gemma all follow the same high-level pattern: decoder-only transformer with repeated blocks of normalization, attention, normalization, and feed-forward. The differences are in which specific variant of each component they choose. Switching from RoPE to learned positions, or from GQA to MHA, would be a config change in the model code, not a rewrite (although the weights would have to be retrained for the new variant).

Quick check

What does Grouped Query Attention (GQA) primarily save?
