Reasoning Models & Extended Thinking

o1/o3-style reasoning, test-time compute, self-verification

Advanced · 25 min

Prerequisites

  • agents-reasoning

What if, instead of answering instantly, an AI could stop and think? Really think: work through a problem, check its reasoning, backtrack when it's wrong, try a different approach? That's what reasoning models do. And the results are striking: on hard problems that standard LLMs get wrong most of the time, reasoning models often reach the correct answer.

Traditional language models generate an answer token by token, with one forward pass per token. They're fast, but they can't stop to reconsider. They commit to each token as it's generated, and if they start down the wrong reasoning path, they just keep going. Reasoning models break this limitation. They use what's called test-time compute scaling: spending more computation during inference, not just during training. Instead of thinking about bigger models, think about longer thinking.

OpenAI's o1 and o3 models pioneered this approach. They can spend minutes on a single problem, exploring different solution paths, verifying their work, and catching mistakes that would slip past standard models. Claude has an extended thinking mode, which lets you trade response time for deeper reasoning on complex problems. This isn't just prompt engineering or chain-of-thought prompting. It's a fundamental shift in how models solve hard problems, and it's the next frontier in AI capabilities.

Trading speed for accuracy

Standard LLMs are optimized for speed. They generate each token based on the previous tokens, making one forward pass through the network per token. This is efficient, but it means the model commits to each decision immediately. There's no opportunity to backtrack, reconsider, or explore alternative approaches.

Reasoning models flip this trade-off. Instead of answering instantly, they spend time thinking. They might generate multiple candidate solutions, evaluate each one, and pick the best. They might check their work and catch errors before committing to an answer. They might explore a reasoning tree — trying different approaches in parallel and pruning the paths that don't pan out. The key insight: you can improve performance by letting the model think longer, not just making it bigger.
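The "generate several candidates, evaluate each, pick the best" idea can be sketched as a best-of-N loop. This is a toy illustration, not how any production reasoning model is implemented: `noisy_solver` is a hypothetical stand-in for one sampled model answer, `verify` is a hypothetical checker, and the 40% error rate is an assumed number.

```python
import random


def noisy_solver(a: int, b: int, rng: random.Random, error_rate: float = 0.4) -> int:
    """Hypothetical stand-in for one sampled answer to a * b (sometimes wrong)."""
    correct = a * b
    if rng.random() < error_rate:
        # Simulate a plausible arithmetic slip.
        return correct + rng.choice([-10, -1, 1, 10])
    return correct


def verify(a: int, b: int, answer: int) -> bool:
    """Independent check: divide the answer back out and compare."""
    return b != 0 and answer % b == 0 and answer // b == a


def best_of_n(a: int, b: int, n: int, seed: int = 0) -> int:
    """Sample n candidate answers and return the first one that verifies."""
    rng = random.Random(seed)
    candidates = [noisy_solver(a, b, rng) for _ in range(n)]
    for candidate in candidates:
        if verify(a, b, candidate):
            return candidate
    # Nothing verified: fall back to the first sample, as a one-shot model would.
    return candidates[0]
```

With `n=1` this behaves like a single quick answer. Raising `n` spends more inference compute and makes it increasingly likely that at least one candidate passes the check.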

[Figure: Standard LLM vs Reasoning Model]

Standard LLM: Problem (user query) → Generate (one forward pass) → Answer (fast, sometimes wrong)

Reasoning model: Problem (user query) → Think (explore solutions) → Check (self-verify) → Verified Answer (slower, more accurate)

This matters most for hard problems. On simple queries (“What's the capital of France?”), standard models are fine. But on complex reasoning tasks such as math competitions, code debugging, and multi-step planning, the ability to think deeply makes a huge difference. In OpenAI's reported results, o1 scores in the 89th percentile on competitive programming (Codeforces), compared to GPT-4o's 11th percentile. The improvement is credited to longer thinking rather than to sheer model size, though OpenAI has not published o1's size.

How reasoning models think

Reasoning models combine several techniques to achieve deeper thinking. The exact implementation varies (OpenAI hasn't revealed all the details of o1), but the core principles are clear: test-time compute, chain-of-thought at scale, self-verification, and explicit thinking budgets.

Test-Time Compute Scaling

Traditional scaling in AI is about training bigger models on more data with more compute. That's pre-training compute. Test-time compute scaling is different: you spend more computation during inference — when the model is actually solving a problem.

Pre-training Compute

Train a 100B parameter model on 10T tokens. One-time cost, paid during training. The model gets smarter, but the compute spent on each query stays fixed no matter how hard the problem is.

Traditional approach: bigger models

Test-Time Compute

Use a smaller model, but let it think longer on hard problems. Generate multiple solutions, explore different paths, verify answers. Cost per query, paid during inference.

New approach: longer thinking
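To make the two cost models concrete, here is a back-of-the-envelope comparison. It assumes the common rule of thumb of roughly 2 FLOPs per parameter per generated token and ignores attention over long contexts. The 10B model size and the token counts are hypothetical numbers chosen for illustration; only the 100B figure comes from the text above.

```python
def inference_flops(params: float, tokens: int) -> float:
    """Rough forward-pass cost: about 2 FLOPs per parameter per generated token."""
    return 2.0 * params * tokens


# Large model, short direct answer (assumed 200 output tokens).
big_quick = inference_flops(params=100e9, tokens=200)

# Smaller model, long reasoning trace (assumed 2,000 thinking + answer tokens).
small_long = inference_flops(params=10e9, tokens=2_000)

print(f"100B model, 200 tokens:  {big_quick:.1e} FLOPs")
print(f"10B model, 2000 tokens: {small_long:.1e} FLOPs")
```

Under these assumptions both queries cost the same compute, so the question becomes which way of spending that budget gives better answers on the task at hand.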

The key insight from OpenAI's o1 work and related research (like Let's Verify Step by Step): for many hard problems, spending more test-time compute can be more efficient than just scaling model size. A smaller model that thinks for 30 seconds can outperform a bigger model that answers instantly. This is a paradigm shift: you don't always need a bigger model, you need a model that can think longer when it matters.
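One simple way to see why extra inference compute can help is majority voting over repeated samples. The sketch below computes the probability that a majority of `n` samples is correct, under the strong assumptions that answers are binary (right or wrong) and that samples are independent with the same per-sample accuracy `p`. Real model samples are correlated, and this is not a description of how o1 works internally.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """P(majority of n independent samples is correct), for odd n."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    need = n // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(need, n + 1)
    )


for n in (1, 3, 11, 51):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

When `p` is above 0.5, accuracy climbs as `n` grows. When `p` is below 0.5, voting makes things worse, which is one reason extra thinking does not rescue a model that is usually wrong about a problem.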


See reasoning in action

This example contrasts standard and extended thinking modes. In the interactive version of the lesson you pick a problem and toggle between the two modes to watch extended thinking explore the problem space, check its work, and arrive at verified answers on problems that standard mode often gets wrong. The sample below shows the quick-answer mode.

Problem: A train leaves at 2pm traveling 60mph. Another train leaves the same station at 3pm going the same direction at 90mph. When does the second train catch up?

Reasoning mode: Quick Answer

Quick calculation: the second train is 30mph faster.

Answer: 2 hours after the second train departs (5pm)

Warning: quick answers are often incomplete or incorrect because there is no self-checking. The 5pm answer here is correct, but nothing in this mode confirms it.

How it works: Standard mode produces a quick answer in a single pass with no revision, which is fast but prone to errors. Extended Thinking mode spends more compute at inference time to explore the problem, check its work, and self-correct. A larger thinking budget allows more reasoning steps, which tends to improve accuracy on complex problems.
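The check that extended thinking would run on the train problem can be written out explicitly. Here t is an assumed symbol for the number of hours elapsed after 3pm, so the first train has been moving for t + 1 hours when the second has been moving for t hours.

```latex
\begin{align*}
\text{distance of first train} &= 60\,(t + 1) \\
\text{distance of second train} &= 90\,t \\
60\,(t + 1) &= 90\,t \\
60 &= 30\,t \\
t &= 2
\end{align*}
```

Verification step: at t = 2 (5pm) the first train has traveled 3 × 60 = 180 miles and the second has traveled 2 × 90 = 180 miles, so the positions match and the 5pm answer is confirmed.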

Key Takeaways

  • Reasoning models trade speed for accuracy by spending more compute during inference (test-time compute) rather than just making bigger models during training (pre-training compute).
  • Test-time compute scaling is the key insight: on hard problems, letting a smaller model think longer often outperforms a bigger model answering instantly. This shifts the paradigm from "train bigger models" to "think longer when it matters."
  • Chain-of-thought at scale means exploring a reasoning tree (multiple approaches in parallel), not just a linear chain. The model tries different strategies, backtracks when stuck, and learns which approaches work for which problems.
  • Self-verification is critical. Reasoning models check their own work, catch errors before they become hallucinations, and revise their answers when verification fails. This metacognitive loop (think → check → revise) makes them more reliable.
  • Extended thinking (Claude) and o1-style reasoning make test-time compute explicit. You can give the model a thinking budget, and it adaptively uses more time on hard problems, less on easy ones. This is more efficient than always thinking deeply or always answering instantly.
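The takeaways about reasoning trees, self-verification, and thinking budgets can be combined in one toy sketch: a depth-first search that tries a step, verifies whether the goal is reached, backtracks from dead ends, and stops when a step budget runs out. The puzzle (reach a target number from a start value using "+3" and "*2") and all names here are hypothetical. Real reasoning models do this implicitly in generated text rather than through an explicit search routine.

```python
from typing import Callable, Optional

# Each operation stands in for one possible "reasoning step".
OPS: dict[str, Callable[[int], int]] = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
}


def solve(start: int, target: int, budget: int) -> tuple[Optional[list[str]], int]:
    """Return (steps that reach target, or None) and the number of steps used.

    Assumes start >= 1 so every operation strictly increases the value.
    """
    steps_used = 0

    def explore(value: int, path: list[str]) -> Optional[list[str]]:
        nonlocal steps_used
        if value == target:      # verify: does this chain reach the goal?
            return list(path)
        if value > target:       # prune: this branch can never come back down
            return None
        for name, op in OPS.items():
            if steps_used >= budget:
                return None      # thinking budget exhausted
            steps_used += 1
            path.append(name)
            found = explore(op(value), path)
            if found is not None:
                return found
            path.pop()           # backtrack and try a different approach
        return None

    return explore(start, []), steps_used
```

For example, `solve(1, 4, budget=100)` finds an answer on its first step, while `solve(1, 8, budget=100)` has to abandon two dead ends before it succeeds, so the harder problem consumes more of the budget. With `budget=3` the same search gives up before finding the answer.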

Common Misconceptions

  • "Reasoning models are just better at prompting." They are not. While prompting helps ("think step by step"), reasoning models are trained with RL to explore reasoning trees, verify answers, and allocate compute dynamically. The difference is in training, not just in prompting.
  • "Reasoning models never make mistakes." They still make errors, just less often on hard problems. In OpenAI's reported results, o1 solved 83% of AIME 2024 math problems when taking the consensus of 64 samples, and 74% with a single sample per problem. That is impressive, but not perfect. Self-verification helps, but the model can still convince itself that wrong answers are right.
  • "Test-time compute always beats bigger models." It depends on the task. For simple queries (factual lookup, basic summarization), a standard model is faster and cheaper. Test-time compute shines on hard reasoning problems where exploration and verification add real value. The art is knowing when to use which approach.

Quick check

What does 'test-time compute scaling' mean in the context of reasoning models?