Transformer Architecture

In 2017, a team at Google published a paper called "Attention Is All You Need." Bit of an understatement, honestly. That paper introduced the Transformer, the architecture behind GPT, BERT, Claude, Llama, and most of the language models making headlines today. If neural networks are the engine, the Transformer is the specific engine design that changed everything.

Before Transformers, the best language models were recurrent neural networks (RNNs), especially LSTMs. They worked, but they had a fundamental bottleneck: they processed words one at a time, left to right, carrying information forward in a hidden state. This meant two things. First, they were slow to train: each step depends on the previous hidden state, so the work for one sequence can't be spread across a GPU's parallel cores. Second, long-range dependencies still got lost, despite LSTM gates doing their best to preserve them.

The Transformer threw away recurrence entirely. No hidden states passed word-by-word. Instead, it used self-attention (which we covered in the last lesson) to let every word look at every other word simultaneously. The result was a model that could be trained on massive amounts of data, scale to billions of parameters, and process sequences in parallel. The era of large language models had begun.

The architecture that changed everything

So what makes the Transformer so special? Three things, really. First, parallel processing — unlike RNNs that process one token at a time, the Transformer processes the entire sequence at once. This maps beautifully to GPU hardware, where doing thousands of operations simultaneously is cheap. Second, long-range connections — every token can attend directly to every other token, no matter how far apart they are. The word at position 1 is just as accessible as the word at position 500. Third, scalability — you can make the model bigger (more layers, wider embeddings, more heads) and it keeps getting better, following remarkably predictable scaling laws.
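To make the "every token attends to every other token" point concrete, here is a minimal single-head self-attention sketch in NumPy. The sizes, the random weights, and the names `w_q`, `w_k`, and `w_v` are assumptions for illustration only, not values from any real model. Notice that the whole sequence is handled by a few matrix multiplications, with no loop over positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (seq_len, d_model)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len)
    weights = softmax(scores)                     # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8                           # toy sizes (assumed)
x = rng.normal(size=(seq_len, d_model))           # stand-in token vectors
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (6, 8)
print(weights.shape)  # (6, 6): one weight for every pair of positions
```

The `weights` matrix has one entry per pair of positions, so position 0 reaches position 5 in a single step, the same way it reaches position 1.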

The full Transformer architecture stacks a simple block over and over. Each block takes in a sequence of vectors, refines them through attention and feed-forward layers, and passes the result to the next block. Here's the high-level flow:

The Transformer pipeline

Input Tokens (from tokenizer) → Embeddings + Position (vectors with order) → Transformer Blocks x N (attention + FFN) → Output (next-token probabilities)

That's it at the top level. The magic is in what happens inside each Transformer block — and how stacking them creates increasingly sophisticated representations. Let's open one up.
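As a rough sketch of that top-level flow, the NumPy snippet below wires the four stages together. Everything in it is a toy assumption: the sizes, the random weights, the learned-style `position_embedding` table, and especially `placeholder_block`, which only stands in for a real Transformer block so the shapes can be followed end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, n_blocks, max_len = 50, 16, 3, 32   # toy sizes (assumed)

token_embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))
position_embedding = rng.normal(scale=0.02, size=(max_len, d_model))
output_projection = rng.normal(scale=0.02, size=(d_model, vocab_size))

def placeholder_block(x):
    """Hypothetical stand-in for attention + FFN; it only preserves the shape."""
    return x + np.tanh(x)

def next_token_probabilities(token_ids):
    seq_len = len(token_ids)
    # Embeddings + Position: (seq_len, d_model)
    x = token_embedding[token_ids] + position_embedding[:seq_len]
    # Transformer Blocks x N: shape stays (seq_len, d_model)
    for _ in range(n_blocks):
        x = placeholder_block(x)
    # Output: project the last position onto the vocabulary, then softmax
    logits = x[-1] @ output_projection
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

probs = next_token_probabilities(np.array([3, 17, 42, 8]))
print(probs.shape)  # (50,): one probability per vocabulary entry
```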

Inside the Transformer

A Transformer block is a carefully designed sequence of sublayers, each performing a specific job. Data flows bottom to top, getting refined at each stage. Let's walk through each component.

Step 1: Input Embeddings + Positional Encoding

First, each input token is mapped to a dense vector using an embedding table, the same kind of embedding we covered in the Word Embeddings lesson. But there's a problem: self-attention on its own has no built-in sense of order. It treats its input as an unordered set of tokens, so "Dog bites man" and "Man bites dog" would give each word exactly the same representation in both sentences.
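The order problem can be checked directly. The sketch below runs a toy single-head self-attention (random weights and made-up 4-dimensional word vectors, all assumed for illustration) over both sentences without any positional information, then compares the output vector each word receives.

```python
import numpy as np

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
d_model = 4
# Hypothetical embeddings: one random vector per word, no position added.
embed = {word: rng.normal(size=d_model) for word in ("dog", "bites", "man")}
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def encode(sentence):
    x = np.stack([embed[word] for word in sentence])
    return dict(zip(sentence, self_attention(x, w_q, w_k, w_v)))

a = encode(["dog", "bites", "man"])
b = encode(["man", "bites", "dog"])
for word in embed:
    print(word, np.allclose(a[word], b[word]))  # True for every word
```

Each word ends up with the same vector in both sentences, so nothing downstream could tell who bit whom.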

Token Embedding

Maps each token ID to a d_model-dimensional vector. "cat" → [0.23, -0.15, 0.82, ...]

Positional Encoding

A unique vector for each position, added to the embedding. Position 0 → [0.00, 1.00, 0.00, ...]

The positional encoding tells the model where each token sits in the sequence. The original Transformer used sinusoidal functions (sines and cosines at different frequencies), which its authors hypothesized would let the model extrapolate to sequences longer than those seen during training. Later models such as BERT and GPT-2 use learned position embeddings instead, and many recent models such as Llama use rotary position embeddings (RoPE).
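Here is a small sketch of the sinusoidal scheme, following the formula from the original paper: even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine. The vocabulary size, `d_model`, the random `embedding_table`, and the token IDs are toy assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table; d_model is assumed to be even."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

rng = np.random.default_rng(0)
vocab_size, d_model = 10, 8                            # toy sizes (assumed)
embedding_table = rng.normal(size=(vocab_size, d_model))
pe = sinusoidal_positional_encoding(max_len=16, d_model=d_model)

token_ids = np.array([5, 2, 9])                        # hypothetical token IDs
x = embedding_table[token_ids] + pe[: len(token_ids)]  # (3, 8)

print(pe[0, :4])  # [0. 1. 0. 1.]
print(x.shape)    # (3, 8)
```

Position 0 comes out as alternating zeros and ones because sin(0) = 0 and cos(0) = 1, which matches the example vector shown above.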


Encoder vs. Decoder

The original Transformer paper actually described two stacks: an encoder and a decoder. They share the same basic building blocks, but differ in important ways:

The encoder sees the entire input at once. Every token can attend to every other token, including tokens that come later in the sequence. This is called bidirectional attention. BERT is the most famous encoder-only model — it's great for understanding tasks like classification and question answering because it can see the full context.

The decoder can only look at past tokens. It uses a causal mask to prevent each token from attending to future tokens — after all, during generation, those future tokens don't exist yet. GPT, Claude, and Llama are all decoder-only models. In encoder-decoder models (like the original Transformer, T5, and BART), the decoder also has cross-attention layers that let it look at the encoder's output. This is how translation models connect source and target languages.
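The difference between bidirectional and causal attention comes down to one mask. The sketch below is a minimal illustration, assuming random query and key vectors and a helper name, `attention_weights`, chosen here for convenience: scores for future positions are set to negative infinity before the softmax, so their weights become exactly zero.

```python
import numpy as np

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention_weights(q, k, causal):
    scores = q @ k.T / np.sqrt(k.shape[-1])            # (seq_len, seq_len)
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal, i.e. wherever the key lies in the future.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores)

rng = np.random.default_rng(0)
seq_len, d_head = 4, 8                                 # toy sizes (assumed)
q = rng.normal(size=(seq_len, d_head))
k = rng.normal(size=(seq_len, d_head))

encoder_weights = attention_weights(q, k, causal=False)  # bidirectional
decoder_weights = attention_weights(q, k, causal=True)   # causal

print(np.round(encoder_weights, 2))  # every entry is non-zero
print(np.round(decoder_weights, 2))  # zeros above the diagonal
```

In the causal case the first token can only attend to itself, so its row is a single 1 followed by zeros.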

Explore the Transformer block

The layers of one Transformer block, with the tensor shape at each stage:

Input Tokens: [batch, seq_len]
Token Embedding + Positional Encoding: [batch, seq_len, d_model]
Multi-Head Self-Attention: [batch, seq_len, d_model]
Add & Layer Norm: [batch, seq_len, d_model]
Feed-Forward Network: [batch, seq_len, d_model]
Add & Layer Norm: [batch, seq_len, d_model]
Block Output: [batch, seq_len, d_model]

A residual (skip) connection starts before each attention and feed-forward sublayer and ends at the Add & Layer Norm step that follows it.

Compared with the encoder block, the decoder block uses masked self-attention (so it cannot peek at future tokens) and, in encoder-decoder models, adds cross-attention (to look at the encoder's output). GPT and Claude use decoder-only architectures; BERT uses encoder-only.
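The block can also be written out as a short NumPy sketch that follows the same order: multi-head self-attention, Add & Layer Norm, feed-forward network, Add & Layer Norm. This is an illustration under several assumptions, not a reference implementation: the weights are random, the sizes are toy values, the parameter names in `params` are made up here, and the layer norm omits the learnable scale and shift that real models include.

```python
import numpy as np

def softmax(x):
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: no learnable scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, p, n_heads, causal=False):
    batch, seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (batch, seq_len, d_model) -> (batch, n_heads, seq_len, d_head)
        return t.reshape(batch, seq_len, n_heads, d_head).transpose(0, 2, 1, 3)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    if causal:  # decoder-style masking of future positions
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    heads = softmax(scores) @ v
    merged = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)
    return merged @ p["w_o"]

def feed_forward(x, p):
    hidden = np.maximum(0.0, x @ p["w_1"] + p["b_1"])   # ReLU
    return hidden @ p["w_2"] + p["b_2"]

def transformer_block(x, p, n_heads, causal=False):
    x = layer_norm(x + multi_head_self_attention(x, p, n_heads, causal))  # Add & Norm
    x = layer_norm(x + feed_forward(x, p))                                # Add & Norm
    return x

rng = np.random.default_rng(0)
batch, seq_len, d_model, n_heads = 2, 5, 16, 4          # toy sizes (assumed)
d_ff = 4 * d_model
params = {
    "w_q": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_k": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_v": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_o": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "b_1": np.zeros(d_ff),
    "w_2": rng.normal(scale=0.1, size=(d_ff, d_model)),
    "b_2": np.zeros(d_model),
}

x = rng.normal(size=(batch, seq_len, d_model))
out = transformer_block(x, params, n_heads, causal=True)
print(out.shape)  # (2, 5, 16): same shape in, same shape out
```

Because the output shape equals the input shape, blocks like this can be stacked one after another without any glue in between.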

Key Takeaways

The Transformer processes entire sequences in parallel using self-attention, eliminating the sequential bottleneck of RNNs. This parallelism maps well to GPU hardware and is a key reason Transformers can scale to billions of parameters.
Each Transformer block follows a consistent pattern: Multi-Head Attention -> Add & Norm -> Feed-Forward Network -> Add & Norm. The block is stacked anywhere from 6 times (the original Transformer) to 96 or more in the largest models, with each layer refining the representation further.
Positional encoding is essential because self-attention has no built-in sense of word order. Without it, the model could not tell "dog bites man" from "man bites dog", because each word would get the same representation in both. Sinusoidal encodings and learned position embeddings are the two main approaches.
Residual connections and layer normalization are what make deep Transformers trainable. Skip connections provide gradient highways through dozens of layers, while layer norm keeps activations stable.
Encoder-only models (BERT) see all tokens bidirectionally and excel at understanding. Decoder-only models (GPT, Claude) use causal masking and excel at generation. Encoder-decoder models (T5) use cross-attention to connect understanding to generation.

Common Misconceptions

"Transformers don't process words sequentially — they see everything at once." This is true during training and encoding, but during generation (inference), decoder models do produce tokens one at a time. The key insight is that within each forward pass, all positions are processed in parallel. The sequential part is that each new token requires a new forward pass.

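The generation loop behind that point can be sketched in a few lines. Here `toy_forward` is a hypothetical stand-in for a decoder's forward pass (a causal running average rather than real attention), the weights are random, and greedy argmax decoding is an assumption made to keep the example short. What matters is the structure: one forward pass over all current positions, then one new token, repeated.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 20, 8                             # toy sizes (assumed)
embedding = rng.normal(size=(vocab_size, d_model))
unembedding = rng.normal(size=(d_model, vocab_size))

def toy_forward(token_ids):
    """Hypothetical decoder stand-in: logits for every position at once."""
    x = embedding[token_ids]                            # (seq_len, d_model)
    # Causal running mean: position i only sees positions 0..i.
    counts = np.arange(1, len(token_ids) + 1)[:, None]
    context = np.cumsum(x, axis=0) / counts
    return context @ unembedding                        # (seq_len, vocab_size)

def generate(prompt_ids, n_new_tokens):
    tokens = list(prompt_ids)
    n_forward_passes = 0
    for _ in range(n_new_tokens):
        logits = toy_forward(np.array(tokens))          # all positions in parallel
        n_forward_passes += 1
        tokens.append(int(logits[-1].argmax()))         # greedy pick of next token
    return tokens, n_forward_passes

tokens, n_forward_passes = generate([3, 7, 1], n_new_tokens=5)
print(len(tokens))        # 8: three prompt tokens plus five generated
print(n_forward_passes)   # 5: one forward pass per new token
```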
