Evaluation & Benchmarks
MMLU, HumanEval, BLEU, ROUGE: measuring whether AI works
Prerequisites
- large-language-models
How do you know if one AI model is "better" than another? You can't just ask it a few questions and call it a day. Models are good at different things — one might ace math but flunk creative writing, another might crush code generation but struggle with common sense reasoning.
Benchmarks give us a shared yardstick. They're standardized tests that let us compare models on equal footing. When OpenAI says GPT-4 scores 86.4% on MMLU, or Anthropic says Claude 3 Opus gets 84.9% on HumanEval, those numbers mean something: in principle, anyone can rerun the same test, compare the result to other models, and track progress over time.
But here's the catch: benchmarks are far from perfect. They're proxies for real-world performance, not guarantees. A model with a 90% score might still fail catastrophically on your specific task. Understanding what benchmarks measure (and what they don't) is crucial for picking the right model and knowing when to trust the numbers.
Measuring intelligence is hard
There is no single "IQ test" for AI models. Intelligence is multidimensional — knowledge, reasoning, creativity, code, math, common sense, language understanding. A model can be brilliant at one and mediocre at another. That's why we need multiple benchmarks testing different capabilities.
From model to nuanced understanding
The result is a profile, not a number. GPT-4 might score 86% on knowledge, 67% on code, and 95% on common sense. Claude 3 Opus might look different: 87% on knowledge, 85% on code, and 95% on common sense. The two are nearly tied on knowledge and common sense but far apart on code, so neither is "better" in the abstract. The right model depends on what you're trying to do.
Beyond scores, we care about cost, speed, and reliability. A model that scores 5 percentage points higher but costs 10x more isn't always the right choice. Benchmarks give us performance data, but picking a model requires weighing performance against cost, latency, and your specific use case. Raw scores are just the starting point.
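One way to make that trade-off concrete is to weight each benchmark by how much it matters for your task, then divide by price. The sketch below is only an illustration: `model_a` and `model_b` are hypothetical models whose scores mirror the example profile figures above, and the weights, the relative costs, and the helper names `weighted_score` and `score_per_cost` are all assumptions made up for this example, not published data.

```python
# Two hypothetical models. The scores mirror the illustrative profile
# figures in the text; the cost numbers are invented for this sketch.
profiles = {
    "model_a": {"knowledge": 86, "code": 67, "common_sense": 95},
    "model_b": {"knowledge": 87, "code": 85, "common_sense": 95},
}
relative_cost = {"model_a": 1.0, "model_b": 1.5}  # arbitrary units, not real prices


def weighted_score(profile, weights):
    """Weighted average of one model's benchmark scores."""
    total_weight = sum(weights.values())
    return sum(profile[name] * w for name, w in weights.items()) / total_weight


def score_per_cost(profile, weights, cost):
    """Weighted score divided by relative cost: higher means better value."""
    return weighted_score(profile, weights) / cost


# Hypothetical task: a coding assistant, so code counts three times as much.
coding_weights = {"knowledge": 1, "code": 3, "common_sense": 1}

for name, profile in profiles.items():
    quality = weighted_score(profile, coding_weights)
    value = score_per_cost(profile, coding_weights, relative_cost[name])
    print(f"{name}: weighted score {quality:.1f}, score per unit cost {value:.1f}")
```

With these made-up numbers, model_b has the higher weighted score (87.4 vs 76.4), but model_a delivers more score per unit of cost. Change the weights or the costs and the ranking can flip, which is exactly why raw scores are only the starting point.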
The benchmarks that matter
There are dozens of AI benchmarks, but most fall into a few categories. Some test knowledge breadth, others test reasoning or code generation or language quality. Here are the four most important categories and the benchmarks that define them.
Knowledge & Reasoning (MMLU)
MMLU (Massive Multitask Language Understanding) is the gold standard for measuring breadth of knowledge. It covers 57 subjects across STEM, humanities, social sciences, and professional domains — everything from elementary math to US foreign policy to college biology.
Why it matters: MMLU tests whether a model has broad, factual knowledge. A high MMLU score means the model absorbed a lot of information during training and can recall it accurately. But it doesn't test creativity, reasoning depth, or the ability to apply knowledge to novel problems. It's a knowledge breadth test, not an intelligence test.
What to watch for: Scores above 85% are frontier-level (GPT-4, Claude 3 Opus). Scores around 60-70% are typical for smaller open-source models (7B-13B parameters). Human expert performance on MMLU is estimated at around 89%, so frontier models are within a few points of expert-level knowledge breadth on this benchmark.
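An MMLU score is just accuracy on multiple-choice questions, but how you average matters. The sketch below computes per-subject accuracy, then both a per-question (micro) and a per-subject (macro) average. The records are toy data invented for illustration, the function names are hypothetical, and real evaluation harnesses may aggregate differently, so treat this as a sketch of the arithmetic and not any particular harness's method.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """records: iterable of (subject, predicted_letter, correct_letter)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, answer in records:
        total[subject] += 1
        correct[subject] += int(predicted == answer)
    return {subject: correct[subject] / total[subject] for subject in total}


def micro_average(records):
    """Accuracy over all questions, so large subjects count for more."""
    records = list(records)
    return sum(predicted == answer for _, predicted, answer in records) / len(records)


def macro_average(per_subject):
    """Mean of per-subject accuracies, so every subject counts equally."""
    return sum(per_subject.values()) / len(per_subject)


# Toy predictions (made up): 4 biology questions, 2 foreign policy questions.
records = [
    ("college_biology", "A", "A"),
    ("college_biology", "C", "B"),
    ("college_biology", "D", "D"),
    ("college_biology", "B", "B"),
    ("us_foreign_policy", "B", "B"),
    ("us_foreign_policy", "A", "C"),
]

per_subject = accuracy_by_subject(records)
print(per_subject)
print(f"micro: {micro_average(records):.3f}, macro: {macro_average(per_subject):.3f}")
```

Here the micro average is 4/6 (about 0.667) and the macro average is 0.625. The gap comes from the smaller subject pulling the macro figure down, so when two sources report different MMLU numbers for the same model, the averaging method is one thing worth checking.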
Now try it yourself
Time to compare models across benchmarks. The dashboard below shows approximate published scores for GPT-4, Claude 3 Opus, Llama 3 70B, Mistral 7B, and Gemma 7B across five key benchmarks. Toggle between raw scores and cost efficiency to see how smaller models punch above their weight when you factor in price. Click any benchmark name to see what it tests and an example question.
Key Takeaways
- No single benchmark captures model quality. You need a profile across multiple tests — knowledge (MMLU), code (HumanEval), reasoning (HellaSwag), math (GSM8K), text quality (BLEU/ROUGE) — to understand a model's strengths and weaknesses.
- MMLU tests breadth of knowledge across 57 subjects. High scores (85%+) mean the model absorbed lots of factual information, but it doesn't test reasoning depth, creativity, or application to novel problems.
- HumanEval measures code generation ability by having models write Python functions from docstrings. Scores above 80% indicate strong programming skills, but it's limited to self-contained functions, not real-world debugging or large codebases.
- HellaSwag tests commonsense reasoning about everyday scenarios. Frontier models approach human performance (about 95%), while smaller models score lower (60-80%), which suggests that common sense takes more than surface-level pattern matching.
- Benchmark scores don't predict performance on YOUR task. A model with 90% on MMLU might fail at your domain-specific use case. Always evaluate on your own data — public benchmarks are a starting point, custom evals are the answer.
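To make the HumanEval takeaway concrete: a generated function counts as correct only if it passes unit tests, and results are commonly reported as pass@k, the chance that at least one of k sampled solutions passes. The sketch below uses the standard combinatorial estimator 1 - C(n-c, k) / C(n, k) for n samples of which c pass. The toy `add` problem, the sample solutions, and the helper name `passes_tests` are hypothetical and not taken from HumanEval, and a real harness would run candidates in a sandbox, not a bare `exec`.

```python
import math


def passes_tests(candidate_source, test_source):
    """Return True if the candidate defines code that passes every assertion.

    Illustration only: real harnesses sandbox this step, because model
    output is untrusted code.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        return False
    return True


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Four hypothetical model samples for a toy "add two numbers" problem.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",  # wrong operator
    "def add(a, b):\n    return b + a\n",
    "def add(a, b)\n    return a + b\n",   # syntax error
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(source, tests) for source in samples]
n, c = len(results), sum(results)
print(results)
print(f"pass@1 = {pass_at_k(n, c, 1):.3f}, pass@2 = {pass_at_k(n, c, 2):.3f}")
```

Two of the four samples pass, so pass@1 is 0.5 and pass@2 is about 0.833. The same check generalizes to your own tasks: swap in your prompts and your tests, and you have the beginnings of a custom eval.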
Common Misconceptions
- "Higher benchmark scores always mean a better model." Not quite. GPT-4 and Claude 3 Opus might land within a point or two of each other on MMLU while sitting much further apart on code. The "best" model depends on your use case, not a single number.
- "Benchmarks measure real intelligence." No. They measure performance on specific tasks. A model can ace MMLU by memorizing facts without truly understanding them. Benchmarks are proxies, not truth.
- "Models can't cheat on benchmarks." Wrong. Data contamination is a huge problem. If a model saw MMLU questions during training (even indirectly via web scrapes), it's not being tested, it's recalling. This inflates scores and makes comparisons unreliable.
- "If a model scores 90%, it will work 90% of the time on my task." Nope. Benchmark scores are accuracy on test questions, not reliability on your specific use case. A medical chatbot needs evaluation on medical data, not generic MMLU questions.
- "Small models are always worse because their benchmark scores are lower." Context matters. A 7B model scoring 60% on MMLU might cost a few hundred times less to run than GPT-4. If your task doesn't need frontier performance, the smaller model is often the better choice.
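The contamination point above can be checked, at least roughly, by looking for long word sequences that a benchmark question shares with training text. The sketch below measures what fraction of a question's 8-word sequences also appear in a document. The function names, the window size of 8, and the example texts are all assumptions chosen for illustration. Real contamination audits use more careful normalization and far larger corpora.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, document, n=8):
    """Fraction of the question's n-grams that also occur in the document."""
    question_ngrams = ngrams(question, n)
    if not question_ngrams:
        return 0.0  # question shorter than n words: nothing to compare
    shared = question_ngrams & ngrams(document, n)
    return len(shared) / len(question_ngrams)


# Made-up benchmark-style question and two made-up web pages.
question = (
    "Which of the following best describes the function "
    "of the mitochondria in a cell"
)
leaked_page = "quiz dump: " + question + " answer b"
clean_page = "Mitochondria produce most of the ATP that a cell uses for energy"

print(overlap_fraction(question, leaked_page))  # question copied verbatim
print(overlap_fraction(question, clean_page))   # same topic, different wording
```

The leaked page scores 1.0 because it contains the question verbatim, while the clean page scores 0.0 even though it covers the same topic. A high overlap doesn't prove the model memorized the answer, but it's a signal that the score on that question deserves less trust.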