Large Language Models
GPT, BERT, Llama, Claude — scaling laws and emergent abilities
Prerequisites
- transformer-architecture
GPT-3 has 175 billion parameters. Claude's parameter count has not been publicly disclosed. The Llama 3 family ranges from 8 billion to 405 billion. These numbers are staggering, but what do they actually mean? And why did making models bigger suddenly make them so much more capable? The story of LLMs is really a story about scale, and about the surprises that came with it.
For decades, language models existed. They could autocomplete a search query or suggest the next word on your phone keyboard. Useful, but not exactly mind-blowing. Then researchers started scaling up Transformers — throwing more data, more parameters, and more compute at them — and something unexpected happened. These models didn't just get incrementally better at predicting the next word. They started writing essays, solving math problems, and writing code. Capabilities that nobody explicitly programmed seemed to emerge out of raw scale.
That's the phenomenon we're going to unpack in this lesson: what makes a language model "large," how the major model families differ, what scaling laws tell us about throwing money at the problem, and why a model trained to do nothing more than predict the next token can end up doing things that look remarkably like reasoning.
What makes a language model 'large'?
At its core, a large language model is just a Transformer trained on a lot of text to predict the next word. That's it. The same architecture we covered in the last lesson — multi-head attention, feed-forward layers, residual connections, layer norm — scaled up dramatically. More layers, wider embeddings, more attention heads, and training data measured in trillions of tokens scraped from books, websites, code repositories, and more.
The "large" in LLM refers to the parameter count. A parameter is a single number the model learned during training — a weight in a matrix, a bias term. GPT-2 had 1.5 billion of them. GPT-3 jumped to 175 billion. Modern frontier models have hundreds of billions or more. Each parameter is a tiny knob the model has tuned to get better at its one job: predicting what comes next in a sequence.
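To make those parameter counts concrete, here is a back-of-the-envelope sketch. It assumes the common approximation of about 12 × d_model² weights per Transformer layer (attention plus a feed-forward block four times as wide as the embedding) and ignores biases, layer norm, and positional embeddings. The function name is illustrative, and the layer counts and widths are the published GPT-2 XL and GPT-3 configurations rather than figures from this lesson.

```python
def approx_transformer_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count for a decoder-only Transformer.

    Per layer: ~4 * d_model^2 for attention (Q, K, V, and output projections)
    plus ~8 * d_model^2 for a feed-forward block with hidden size 4 * d_model.
    Biases, layer norm, and positional embeddings are ignored.
    """
    per_layer = 12 * d_model * d_model
    embeddings = vocab_size * d_model  # token embedding matrix
    return n_layers * per_layer + embeddings


# Assumed configurations: GPT-2 XL (48 layers, d_model=1600) and
# GPT-3 (96 layers, d_model=12288), both with a 50,257-token vocabulary.
gpt2_xl = approx_transformer_params(48, 1600, 50257)
gpt3 = approx_transformer_params(96, 12288, 50257)

print(f"GPT-2 XL: ~{gpt2_xl / 1e9:.2f}B parameters")
print(f"GPT-3:    ~{gpt3 / 1e9:.1f}B parameters")
```

Both estimates land close to the 1.5 billion and 175 billion figures quoted above, which suggests that almost all of a model's parameters sit in the attention and feed-forward weight matrices.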
But here's the key insight: better next-word prediction turns out to require understanding grammar, facts, logic, sentiment, coding patterns, and much more. To predict the next word in a physics textbook, you need to "understand" physics. To complete a Python function, you need to "understand" programming. The training objective is simple, but mastering it requires immense general knowledge.
The LLM pipeline
The flow is deceptively simple. You feed in a prompt, the model produces a probability distribution over its vocabulary for the next token, you sample from that distribution, append the chosen token, and repeat. Every response you've ever gotten from ChatGPT, Claude, or Llama was generated this way — one token at a time, each one chosen based on everything that came before it.
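The loop described above can be sketched in a few lines. This is a toy illustration, not how any production model is implemented: TOY_MODEL is a hypothetical hand-written lookup table that only looks at the previous token, whereas a real LLM computes the distribution from the entire context with a Transformer.

```python
import random

# Hypothetical toy "model": maps the previous token to next-token probabilities.
TOY_MODEL = {
    "the": {"cat": 0.5, "mat": 0.3, "dog": 0.2},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.4, "ran": 0.6},
    "sat": {"on": 0.9, "<eos>": 0.1},
    "ran": {"<eos>": 1.0},
    "on": {"the": 1.0},
    "mat": {"<eos>": 1.0},
}


def next_token_distribution(context):
    """Return {token: probability} for the next token (context must be non-empty)."""
    return TOY_MODEL.get(context[-1], {"<eos>": 1.0})


def generate(prompt, max_new_tokens=10, seed=0):
    """Autoregressive generation: predict, sample, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice == "<eos>":  # end-of-sequence marker stops generation
            break
        tokens.append(choice)
    return tokens


print(" ".join(generate(["the", "cat"])))
```

Changing the seed changes which continuation is sampled, which is why the same prompt can produce different responses from one run to the next.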
The LLM family tree
Not all LLMs are built the same way. The Transformer architecture gave rise to several distinct families, each with different strengths. Understanding these families is essential for knowing which model to use — and why.
Step 1: Encoder Models (BERT)
Encoder models read text bidirectionally: every token can attend to every other token, including ones that come after it. BERT (Bidirectional Encoder Representations from Transformers) is the most famous example. It was trained with a "masked language model" objective: randomly hide 15% of the tokens and predict them from the surrounding context.
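A minimal sketch of the masking step might look like the following. It makes two simplifying assumptions: whitespace splitting stands in for a real subword tokenizer, and every selected token is replaced with [MASK]. The original BERT recipe is slightly more involved (some selected tokens are swapped for a random token or left unchanged). The function name mask_tokens is illustrative.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and record what the model must predict.

    Returns (masked_tokens, labels). labels[i] holds the original token at a
    masked position and None everywhere else.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
masked, labels = mask_tokens(sentence.split())
print(" ".join(masked))
print(labels)
```

Because the model sees the words on both sides of each [MASK], it can use left and right context together. That is what "bidirectional" means here, and it is also why this objective does not directly yield a left-to-right text generator.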
BERT-base has 110M parameters, BERT-large has 340M. Tiny by today's standards, but it revolutionized NLP in 2018. Variants like RoBERTa, ALBERT, and DeBERTa refined the recipe. Encoder models remain widely used in search engines, content moderation, and enterprise NLP pipelines to this day.
See next-token prediction in action
This is a simplified version of what every LLM does. Pick a prompt, and the model shows you the probability distribution over the next word. Adjust the temperature to control randomness and top-k to limit which candidates are considered. Click "Generate" to sample a word and build text one token at a time — just like GPT, Claude, and Llama do it.
Temperature controls randomness. Low temperature (<1.0) sharpens the distribution, concentrating probability on the top choices. High temperature (>1.0) flattens the distribution, making unlikely words more probable.
Top-K limits how many words the model considers. With K=1, it always picks the most likely word (greedy decoding). With K=8, it samples from the 8 most likely words, enabling more diverse outputs.
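Both controls are easy to see in code. The sketch below assumes the usual formulation: temperature divides the model's raw scores (logits) before the softmax, and top-k zeroes out everything except the k most likely candidates and renormalizes. The vocabulary and logits are made up for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely entries, zero out the rest, and renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def sample_next(vocab, logits, temperature=1.0, k=None, rng=random):
    """Sample one token after applying temperature and optional top-k."""
    probs = softmax_with_temperature(logits, temperature)
    if k is not None:
        probs = top_k_filter(probs, k)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Made-up candidates and logits for the prompt "The cat sat on the".
vocab = ["mat", "floor", "sofa", "roof", "moon"]
logits = [3.0, 2.0, 1.5, 0.5, -1.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])

print(sample_next(vocab, logits, k=1))  # with k=1 only "mat" can be chosen
```

Running it shows the probability of "mat" shrinking as the temperature rises, while k=1 removes the randomness entirely.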
Key Takeaways
- An LLM is a Transformer trained on massive text data to predict the next token. The "large" refers to parameter count — from billions to hundreds of billions of learned weights. More parameters means more capacity to store knowledge, but also requires proportionally more data and compute.
- Decoder-only models (GPT, Claude, Llama) generate text left-to-right using causal masking and dominate modern AI. Encoder models (BERT) excel at understanding tasks. Encoder-decoder models (T5, BART) connect understanding to generation and are still used for translation and summarization.
- Scaling laws show that LLM performance improves predictably as you increase model size, data, and compute. The Chinchilla paper showed that, for a fixed compute budget, you get the best results by scaling data and parameters together; many earlier models were too large for the amount of data they were trained on.
- Emergent abilities — capabilities like chain-of-thought reasoning, arithmetic, and code generation — appear unexpectedly once models cross certain size thresholds. Nobody designed these abilities; they arise from the pressure to predict the next token better and better.
- Temperature and top-k control how the model samples from its probability distribution. Low temperature produces focused, predictable text. High temperature produces diverse, creative (but potentially incoherent) text. This is the same mechanism used in every chatbot you have ever used.
Common Misconceptions
- "LLMs don't truly understand — they predict statistically likely continuations." This is technically accurate but can be misleading. Yes, the training objective is next-token prediction. But to achieve state-of-the-art prediction, a model must build internal representations that capture grammar, facts, logic, and reasoning patterns. Whether this constitutes "understanding" is an open philosophical question, but dismissing it as "just statistics" undersells what these models have learned to do.
- "Bigger is always better." The Chinchilla paper showed this is wrong. Chinchilla, a 70B model trained on about 1.4 trillion tokens, outperformed Gopher, a 280B model trained on about 300 billion tokens, at a similar training compute budget. Architecture innovations (Mixture of Experts, better attention mechanisms), data quality, and post-training alignment also matter, not just raw parameter count.
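The Chinchilla comparison can be checked with rough arithmetic. The sketch below assumes the widely used approximation that training costs about 6 FLOPs per parameter per token. The model sizes and token counts are the ones reported for Gopher (280B parameters, about 300 billion tokens) and Chinchilla (70B parameters, about 1.4 trillion tokens). Treat the results as order-of-magnitude estimates, not exact accounting.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """How many training tokens the model saw for each parameter."""
    return n_tokens / n_params


# (parameters, training tokens), as reported in the Chinchilla paper.
models = {
    "Gopher (280B)": (280e9, 300e9),
    "Chinchilla (70B)": (70e9, 1.4e12),
}

for name, (n, d) in models.items():
    flops = training_flops(n, d)
    ratio = tokens_per_param(n, d)
    print(f"{name}: ~{flops:.2e} FLOPs, ~{ratio:.1f} tokens per parameter")
```

The two models cost roughly the same to train, yet the smaller one saw about 20 tokens per parameter versus about 1 for the larger one. That ratio of roughly 20 is often quoted as a rule of thumb for compute-optimal training.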