RLHF & Alignment
Teaching AI human preferences — reward models, DPO, Constitutional AI
Prerequisites
- pretraining-finetuning
A model that can predict the next token perfectly is not the same as a model that's helpful. A perfectly fluent model might also be perfectly toxic, perfectly misleading, or perfectly useless at following instructions. RLHF is how we bridge that gap — by teaching models what humans actually want, not just what's statistically likely.
After pre-training and supervised fine-tuning, a language model can follow instructions and format responses nicely. But “can follow instructions” is not the same as “consistently gives great answers.” The model might give a technically correct but unhelpful response, or a confident but wrong one, or a thorough one when you wanted something brief. RLHF (Reinforcement Learning from Human Feedback) is the third phase of training that teaches the model to distinguish between a response that's merely acceptable and one that's genuinely good.
This is also where the concept of alignment comes in. An aligned model is one whose behavior matches human intentions and values. Getting this right is one of the hardest open problems in AI — and arguably the most important one.
The alignment problem
Pre-trained models are capable but unaligned. They've absorbed a huge slice of the internet, the brilliant and the terrible alike, and they'll happily reproduce any of it. Ask a base model a question and it might answer helpfully, or it might generate toxic content, give wrong answers with total confidence, or just ignore what you actually asked. The model doesn't have a concept of “this response is helpful” versus “this response is harmful.” It only knows “this text is likely given the previous text.”
Supervised fine-tuning (the previous lesson) helps a lot — it teaches the model the format of being an assistant. But SFT has a fundamental limitation: it teaches the model to imitate good responses, not to understand what makes a response good. The model learns “responses in this style appeared in training data,” not “this response is better than that one because it's more helpful.” RLHF adds that missing piece — a signal for relative quality.
From raw model to aligned assistant
The core insight is that it's much easier for humans to compare two responses than to write a perfect one from scratch. You might not be able to write the ideal answer to a complex medical question, but you can absolutely tell which of two answers is more helpful, more accurate, and more responsible. RLHF exploits this asymmetry: collect comparisons, learn a scoring function, then optimize the model toward higher scores.
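To make "learn a scoring function" concrete, here is a minimal sketch of how one comparison can become a training signal. It assumes the Bradley-Terry preference model, a common choice for reward models that this lesson does not itself name. The scores passed in are hypothetical numbers standing in for the output of a real neural network.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred over response B.

    Assumption: a Bradley-Terry model, where only the score gap matters.
    """
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the annotator's choice under that model."""
    return -math.log(preference_probability(score_chosen, score_rejected))


# Hypothetical scores for two answers to the same prompt.
loss_when_scorer_agrees = pairwise_loss(score_chosen=2.0, score_rejected=0.5)
loss_when_scorer_disagrees = pairwise_loss(score_chosen=0.5, score_rejected=2.0)
print(loss_when_scorer_agrees < loss_when_scorer_disagrees)
```

The loss is small when the scorer already ranks the human-preferred answer higher and large when it does not, so minimizing it over many comparisons pushes the scoring function toward human judgment.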
Teaching AI what 'good' means
The RLHF pipeline has distinct stages, each solving a different piece of the alignment puzzle. Some recent approaches like DPO simplify this pipeline, but understanding the full version helps you see why the shortcuts work.
Step 1: Collect Human Preferences
The process starts with humans. You show the model a prompt, generate multiple responses, and ask human annotators to rank them from best to worst. This is expensive but crucial — it encodes human judgment into data.
Annotation teams produce these comparisons at large scale, often tens of thousands or more. The quality of this data directly determines how well the final model behaves. Anthropic, OpenAI, and others invest heavily in annotator training, guidelines, and quality control. This is the most labor-intensive part of the entire pipeline.
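The ranking step described above is often stored as pairwise comparisons. The sketch below shows one way a single best-to-worst ranking could be expanded into (chosen, rejected) pairs. The function name `ranking_to_pairs` and the dictionary keys are hypothetical, chosen for illustration rather than taken from any particular dataset format.

```python
from itertools import combinations
from typing import Dict, List


def ranking_to_pairs(prompt: str, ranked_responses: List[str]) -> List[Dict[str, str]]:
    """Expand one ranking (best first) into pairwise preference records.

    combinations() keeps input order, so the first item of each pair
    is always the response the annotator ranked higher.
    """
    pairs = []
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical annotation: three responses ranked from best to worst.
example_pairs = ranking_to_pairs(
    "Explain what a reward model is.",
    ["clear and concise answer", "correct but rambling answer", "off-topic answer"],
)
print(len(example_pairs))  # 3 responses yield 3 pairs
```

Ranking K responses yields K(K-1)/2 pairs, which is one reason a single ranking task gives more training signal than a single two-way comparison.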
Train a reward model yourself
In this simulation, you play the role of a human annotator. Read the prompt, compare the three model responses, and rank them from best to worst. Your rankings become training data for a reward model. Then watch how the model would be updated to reinforce the style you preferred and discourage the style you disliked. Try all four scenarios to build intuition for what RLHF actually does.
AI connection: This is, in outline, how Claude, ChatGPT, and other assistants are trained. Human annotators rank many thousands of responses. Their preferences train a reward model, which then guides the language model toward responses that are helpful, honest, and harmless, not just fluent.
Key Takeaways
- RLHF is the third phase of training (after pre-training and SFT) that teaches models to distinguish between acceptable responses and genuinely good ones. It uses human preference comparisons rather than demonstrations.
- The reward model is the key innovation: a neural network that learns to predict human preferences, turning subjective quality judgments into a scalar score the RL algorithm can optimize.
- PPO (Proximal Policy Optimization) updates the language model to generate responses that score high on the reward model, with a KL divergence constraint to prevent the model from drifting too far from its SFT foundation.
- DPO (Direct Preference Optimization) simplifies the pipeline by skipping the separately trained reward model and the RL loop, training the language model directly on preference pairs. It has become a popular alternative to PPO-based RLHF for many teams.
- Constitutional AI reduces dependence on human annotators by having the model critique and revise its own outputs against a set of principles, creating synthetic preference data at scale.
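Two of the takeaways above can be written as short formulas. The sketch below is illustrative only: it works on plain numbers for a single response or pair, where a real system would compute log-probabilities from the language model and a frozen reference (SFT) model. The function names and the default `beta=0.1` are assumptions, not values from this lesson.

```python
import math


def kl_shaped_reward(reward: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the reference model.

    (logp_policy - logp_reference) is a single-sample estimate of the KL term
    commonly used in PPO-style RLHF.
    """
    return reward - beta * (logp_policy - logp_reference)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Hypothetical log-probabilities: the policy has shifted toward the chosen response.
untrained = dpo_loss(-3.0, -3.0, -3.0, -3.0)
improved = dpo_loss(-2.0, -4.0, -3.0, -3.0)
print(improved < untrained)
```

In the first function, a high reward-model score is worth less if the policy had to move far from the reference model to earn it. In the second, no reward model appears at all: the loss falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference.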
Common Misconceptions
- "RLHF makes models smarter." For the most part, it does not. RLHF adds little new knowledge or capability; it redirects existing capabilities toward helpfulness, honesty, and harmlessness. The model could usually already produce the good answer. RLHF teaches it to consistently choose that answer over the rambling, misleading, or harmful alternatives.