Pre-training & Fine-tuning

How LLMs learn from the internet, SFT, transfer learning

Advanced · 25 min

Prerequisites

  • large-language-models

Imagine you hire someone who's read every book in the library. They know facts about everything, but they've never actually had a job. They're brilliant but useless — they'll ramble about random topics when you ask them a simple question. That's a pre-trained model. Fine-tuning is job training: teaching that brilliant generalist to actually be helpful.

This two-phase process is how every modern AI assistant is built. Phase one (pre-training) takes weeks to months and costs millions of dollars: you feed the model trillions of words from the internet and let it absorb the patterns of human language. Phase two (fine-tuning) takes hours or days and costs a fraction of that: you show the model examples of good conversations and it learns to be an assistant instead of a text autocomplete engine.

Understanding these two phases is essential because it explains so much about how LLMs behave. Why they know obscure facts but sometimes give weirdly unhelpful answers. Why you can take a single base model and turn it into a medical assistant, a coding tutor, or a creative writing partner. Why companies fine-tune existing models instead of building from scratch. The answer to all of these is the pre-training / fine-tuning split.

Two phases of training

Think of it as two completely different jobs. Pre-training is self-supervised (often loosely called unsupervised): the model reads billions of web pages, books, code repositories, and Wikipedia articles, trying to predict the next word. The text itself supplies the correct answer, so no human labels are needed. Nobody tells the model what's important or how to respond to questions. It's just pattern-matching at an enormous scale. The result is a base model that understands language deeply but has no idea how to be useful.
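To make "predict the next word" concrete, here is a minimal sketch of how raw text becomes training examples. It assumes a whitespace split as a stand-in tokenizer (real models use subword tokenizers), and `next_token_pairs` is a hypothetical helper name, not part of any library.

```python
def next_token_pairs(text: str) -> list[tuple[list[str], str]]:
    """Turn raw text into (context, target) pairs for next-token prediction."""
    # Assumption: whitespace splitting stands in for a real subword tokenizer.
    tokens = text.split()
    # Every prefix of the text is a context; the token right after it is the target.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("the cat sat on the mat")
for context, target in pairs:
    print(f"{' '.join(context)!r} -> {target!r}")
```

Notice that nobody had to label anything: a six-word sentence yields five training examples for free, which is why this objective scales to the whole internet.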

Fine-tuning flips the script. Now you take that base model and show it thousands of carefully curated examples: "When a human asks this, respond like this." These are instruction/response pairs written by human annotators. The model learns to follow instructions, structure its answers, refuse dangerous requests, and actually help people. This is supervised fine-tuning (SFT) — supervised because each training example has a clear right answer.
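A single SFT training example might look something like the sketch below. The `User:` / `Assistant:` markers and the `build_sft_example` helper are illustrative assumptions, and computing the loss only on the response tokens is a common convention rather than something every pipeline does.

```python
def build_sft_example(instruction: str, response: str) -> tuple[list[str], list[int]]:
    """Build one instruction/response example plus a mask marking response tokens."""
    # Assumption: whitespace tokens and plain-text role markers, for illustration only.
    prompt_tokens = ["User:"] + instruction.split() + ["Assistant:"]
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    # 0 = ignore this position in the loss (prompt), 1 = train on it (response).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
for token, flag in zip(tokens, loss_mask):
    print(f"{flag}  {token}")
```

The training objective is still next-token prediction. What changes is the data: the "right answer" is now the annotator's response, and the model is graded only on producing it.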

The two-phase training pipeline

Raw Text (terabytes of internet) → Pre-training (next-token prediction) → Base Model (knows a lot, not helpful)
Base Model + Task Data (instruction pairs, MB) → Fine-tuning (supervised learning) → Useful Model (helpful assistant)

The key insight is that most of the "intelligence" comes from pre-training. The base model already knows grammar, facts, reasoning patterns, coding conventions, and world knowledge. Fine-tuning doesn't add much new knowledge — instead, it teaches the model how to use what it already knows. It's the difference between someone who's read every medical textbook and a doctor who's actually practiced medicine. Same knowledge, radically different usefulness.

From raw knowledge to useful assistant

Let's walk through each phase in detail, including the economics that explain why the entire AI industry is structured the way it is.

Step 1: Pre-training — Reading the Internet

Pre-training is the expensive, foundational phase. The model is trained on a massive, diverse corpus — web pages, books, academic papers, code, Wikipedia, forums, news articles. The only objective: predict the next token given all previous tokens.

  • What it learns: grammar, facts, reasoning, code patterns, common sense, world knowledge, mathematical relationships.
  • The hardware: thousands of accelerators (GPUs such as H100s, or TPUs) running for weeks or months. Llama 3 405B used 16,384 H100 GPUs.
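The next-token objective above is usually scored with cross-entropy: the loss at each position is the negative log of the probability the model gave to the token that actually came next. Here is a minimal sketch with made-up probabilities (the numbers and the `next_token_loss` name are assumptions for illustration).

```python
import math


def next_token_loss(probs: dict[str, float], target: str) -> float:
    """Cross-entropy for one position: -log(probability of the true next token)."""
    return -math.log(probs[target])


# Hypothetical model output for the context "the cat sat on the".
predicted = {"mat": 0.6, "sofa": 0.3, "moon": 0.1}

loss_good = next_token_loss(predicted, "mat")   # the model expected this token
loss_bad = next_token_loss(predicted, "moon")   # the model was surprised
print(f"{loss_good:.3f} {loss_bad:.3f}")
```

A confident correct guess costs little and a surprise costs a lot. Pre-training averages this number over trillions of positions and nudges the weights to bring it down.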

The result is a base model (sometimes called a foundation model). It's incredibly capable in a raw sense — it can complete text fluently in dozens of languages, write code, and even solve math problems. But ask it "What's the capital of France?" and it might respond with "What's the capital of Germany? What's the capital of Italy?" because it learned to predict what comes next in quiz-style text, not how to answer questions.
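The quiz example can be reproduced with a deliberately tiny stand-in for a base model: a lookup table that only knows which line tends to follow which. This is a toy sketch, nothing like a real transformer, and the corpus and `continue_text` helper are invented for illustration.

```python
from collections import Counter, defaultdict

# Assumption: a toy "internet" made of quiz-style text with no answers in it.
corpus = [
    "What's the capital of France?",
    "What's the capital of Germany?",
    "What's the capital of Italy?",
]

# "Training": count which line follows which.
follows: dict[str, Counter] = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1


def continue_text(prompt: str) -> str:
    """Return the most frequently seen continuation, or '' if the prompt is unseen."""
    candidates = follows.get(prompt)
    if not candidates:
        return ""
    return candidates.most_common(1)[0][0]


print(continue_text("What's the capital of France?"))
```

The toy model does exactly what it was trained to do: it continues the text. It never saw a question followed by an answer, so answering is not among the patterns it can reproduce.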


See pre-training vs fine-tuning in action

This simulation lets you watch both training phases side by side. Notice how pre-training processes diverse text (news, code, Wikipedia) while fine-tuning uses structured instruction/response pairs. The loss curves show how pre-training starts from a high loss and drops slowly, while fine-tuning starts from the pre-trained model's loss and drops much faster. Toggle the "Before vs After" examples to see how dramatically fine-tuning changes the model's output quality.

Pre-training vs Fine-tuning Visualizer

Watch both training phases side by side. Pre-training processes diverse internet text over many steps. Fine-tuning starts from the pre-trained model and adapts it with task-specific instruction data.

Phase 1: Pre-training

Data stream (diverse text), for example Wikipedia: "The mitochondria is the powerhouse of the cell, responsible for..."

  • Data size: ~1-5 TB
  • Training time: weeks to months
  • Cost: $2M-$100M+
  • Objective: next-token prediction

[Chart: pre-training loss vs. training steps; value shown: 4.80]

Phase 2: Fine-tuning (SFT)

Data stream (instruction pairs), for example:
User: Explain quantum computing in simple terms.
Assistant: Think of a regular computer as flipping coins...

  • Data size: ~10-100 MB
  • Training time: hours to days
  • Cost: $100-$10K
  • Objective: follow instructions

[Chart: fine-tuning loss vs. training steps; value shown: 2.60]
Data size comparison (log scale): pre-training ~1-5 TB, fine-tuning ~10-100 MB.

Going by these ranges, fine-tuning uses roughly 10,000 to 500,000x less data than pre-training.
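The size gap can be checked with quick arithmetic on the ranges shown in the visualizer (~1-5 TB for pre-training, ~10-100 MB for fine-tuning). This sketch assumes decimal units (1 TB = 10^12 bytes); the ratio depends on which end of each range you compare.

```python
TB = 10**12  # assumption: decimal terabytes
MB = 10**6   # assumption: decimal megabytes

pretrain_bytes = (1 * TB, 5 * TB)     # low and high end from the visualizer
finetune_bytes = (10 * MB, 100 * MB)  # low and high end from the visualizer

# Smallest gap: smallest pre-training corpus vs. largest fine-tuning set.
smallest_ratio = pretrain_bytes[0] // finetune_bytes[1]
# Largest gap: largest pre-training corpus vs. smallest fine-tuning set.
largest_ratio = pretrain_bytes[1] // finetune_bytes[0]

print(f"{smallest_ratio:,}x to {largest_ratio:,}x")  # 10,000x to 500,000x
```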

Output Quality: Before vs After Fine-tuning

AI connection: Every model you interact with — ChatGPT, Claude, Llama — went through both phases. The pre-trained base model has vast knowledge but can't hold a conversation. Fine-tuning is what transforms it from a text prediction engine into an assistant that follows instructions, stays on topic, and formats responses helpfully.

Key Takeaways

  • Pre-training is the expensive phase: the model reads trillions of tokens from the internet using next-token prediction. It learns grammar, facts, reasoning, and code patterns. This takes weeks to months on thousands of GPUs and costs millions of dollars.
  • Fine-tuning (SFT) is the practical phase: the model trains on curated instruction/response pairs to learn how to follow instructions and be helpful. This takes hours or days, uses thousands of examples (not trillions of tokens), and costs hundreds to thousands of dollars.
  • Transfer learning is the key insight: knowledge learned during pre-training transfers to downstream tasks. You can take a single pre-trained base model and fine-tune it into a medical assistant, a coding tutor, a legal analyzer, or any other specialist.
  • The economics explain the industry structure: pre-training is so expensive that only a handful of labs do it (OpenAI, Anthropic, Google, Meta). Fine-tuning is cheap enough that thousands of companies customize these models for their specific needs.
  • Fine-tuning adds little new knowledge to the model. It mainly changes how the model uses what it already knows. A base model has the information but does not know how to present it helpfully. Fine-tuning mostly teaches format and style, not facts.
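Transfer learning can be illustrated in miniature with a frozen "pre-trained" component and a small task-specific part trained on top. The sketch below is a toy under loud assumptions: the 2-D word embeddings are invented, and training only a small head on frozen features is one simple form of transfer learning, not how SFT on an LLM is usually carried out.

```python
import math

# Assumption: hypothetical frozen embeddings standing in for pre-trained knowledge.
PRETRAINED_EMBEDDINGS = {
    "good": (1.0, 0.2),
    "great": (0.9, 0.1),
    "excellent": (0.95, 0.0),
    "bad": (-1.0, 0.3),
    "awful": (-0.8, -0.1),
    "terrible": (-0.9, 0.2),
}


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(word: str, w: list[float], b: float) -> float:
    """Probability that the word is positive, using frozen embeddings plus the head."""
    x = PRETRAINED_EMBEDDINGS[word]
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


def train_head(examples, epochs: int = 200, lr: float = 0.5):
    """Fit a logistic-regression head by gradient descent; embeddings stay untouched."""
    w, b = [0.0, 0.0], 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for word, label in examples:
            x = PRETRAINED_EMBEDDINGS[word]
            error = predict(word, w, b) - label
            grad_w[0] += error * x[0]
            grad_w[1] += error * x[1]
            grad_b += error
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


# Four labeled examples are enough because the embeddings already encode sentiment.
task_data = [("good", 1), ("great", 1), ("bad", 0), ("awful", 0)]
w, b = train_head(task_data)
print(predict("excellent", w, b) > 0.5, predict("terrible", w, b) < 0.5)
```

The head generalizes to words it never saw during training because the expensive part, the representation, was already learned. That is the same economic logic that lets thousands of companies build on a handful of base models.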

Common Misconceptions

  • "Fine-tuning adds new knowledge to the model." -- Not really. Fine-tuning changes how the model uses existing knowledge, not what it knows. If the base model never saw information about your company during pre-training, fine-tuning on your company's FAQ won't reliably teach it new facts. For injecting new knowledge, retrieval-augmented generation (RAG) is usually more effective.
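For a sense of what RAG means in practice, here is a minimal sketch: find the most relevant document, then paste it into the prompt so the model can read the fact instead of having to remember it. The keyword-overlap scoring, the FAQ text, and the helper names are all assumptions for illustration; real systems typically use embedding-based search.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the question."""
    q_words = words(question)
    ranked = sorted(documents, key=lambda d: len(q_words & words(d)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, documents: list[str]) -> str:
    """Put the retrieved text in front of the question before calling a model."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical company FAQ that the base model never saw during pre-training.
faq = [
    "Refunds are available within 30 days of purchase.",
    "Support is open Monday through Friday.",
]
prompt = build_prompt("When are refunds available?", faq)
print(prompt)
```

The model's weights never change here. The new fact arrives through the prompt, which is why RAG tends to be more reliable than fine-tuning for facts that change or that the base model never saw.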

Quick check

Why is fine-tuning much cheaper than pre-training?