Evaluation & Benchmarks

MMLU, HumanEval, BLEU, ROUGE — measuring if AI works

Advanced · 25 min

Prerequisites

  • large-language-models

How do you know if one AI model is "better" than another? You can't just ask it a few questions and call it a day. Models are good at different things — one might ace math but flunk creative writing, another might crush code generation but struggle with common sense reasoning.

Benchmarks give us a shared yardstick. They're standardized tests that let us compare models on the same footing. When OpenAI says GPT-4 scores 86.4% on MMLU, or Anthropic says Claude 3 Opus gets 84.9% on HumanEval, those numbers mean something: given the same prompts and settings, you can reproduce them, compare them with other models, and track progress over time.

But here's the catch: benchmarks are far from perfect. They're proxies for real-world performance, not guarantees. A model with a 90% score might still fail catastrophically on your specific task. Understanding what benchmarks measure (and what they don't) is crucial for picking the right model and knowing when to trust the numbers.

Measuring intelligence is hard

There is no single "IQ test" for AI models. Intelligence is multidimensional — knowledge, reasoning, creativity, code, math, common sense, language understanding. A model can be brilliant at one and mediocre at another. That's why we need multiple benchmarks testing different capabilities.

From model to nuanced understanding

Diagram: AI Model → Knowledge Test (MMLU), Code Test (HumanEval), Reasoning Test (HellaSwag), Math Test (GSM8K) → Score Profile → Nuanced Understanding

The result is a profile, not a number. GPT-4 might score 86% on knowledge, 67% on code, and 95% on common sense. Claude 3 Opus might show a different shape: 87% on knowledge, 85% on code, and 95% on common sense. Here the two are nearly tied on knowledge and common sense but far apart on code, so which one is "better" depends on what you're trying to do.
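One way to make the "profile, not a number" idea concrete is to weight each benchmark by how much it matters for a given task. The sketch below is illustrative only: it reuses the rounded figures quoted above, while the `weighted_score` and `rank_models` helpers and the task weights are assumptions made for this example, not a standard scoring method.

```python
# Score profiles using the rounded figures quoted above (percent accuracy).
PROFILES = {
    "GPT-4": {"knowledge": 86, "code": 67, "common_sense": 95},
    "Claude 3 Opus": {"knowledge": 87, "code": 85, "common_sense": 95},
}


def weighted_score(profile, weights):
    """Weighted average of benchmark scores; weights need not sum to 1."""
    total_weight = sum(weights.values())
    return sum(profile[name] * w for name, w in weights.items()) / total_weight


def rank_models(profiles, weights):
    """Return model names sorted from best to worst for the given weights."""
    return sorted(
        profiles,
        key=lambda model: weighted_score(profiles[model], weights),
        reverse=True,
    )


# Hypothetical weights for a coding-assistant use case.
coding_weights = {"knowledge": 0.2, "code": 0.7, "common_sense": 0.1}

for model in rank_models(PROFILES, coding_weights):
    print(f"{model}: {weighted_score(PROFILES[model], coding_weights):.1f}")
```

Changing the weights changes the size of the gap: with most of the weight on code the two models are far apart, while a knowledge-heavy weighting leaves them about a point apart.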

Beyond scores, we care about cost, speed, and reliability. A model that scores 5% higher but costs 10x more isn't always the right choice. Benchmarks give us performance data, but picking a model requires weighing performance against cost, latency, and your specific use case. Raw scores are just the starting point.
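The cost trade-off can be sketched in a few lines. The example below mirrors the scenario in the paragraph above (a model that scores 5 points higher at 10x the cost); the model names, scores, and relative costs are hypothetical placeholders, and "cheapest model that clears a quality bar" is just one possible selection rule.

```python
def score_per_cost(score, relative_cost):
    """Benchmark points obtained per unit of relative cost."""
    if relative_cost <= 0:
        raise ValueError("relative_cost must be positive")
    return score / relative_cost


def cheapest_meeting_bar(candidates, min_score):
    """Return the cheapest candidate whose score meets min_score, or None."""
    eligible = [c for c in candidates if c["score"] >= min_score]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["relative_cost"])


# Hypothetical models: the larger one scores 5 points higher at 10x the cost.
candidates = [
    {"name": "small-model", "score": 80.0, "relative_cost": 1.0},
    {"name": "large-model", "score": 85.0, "relative_cost": 10.0},
]

for c in candidates:
    print(c["name"], score_per_cost(c["score"], c["relative_cost"]))
```

If the task only needs a score of 75, `cheapest_meeting_bar(candidates, 75)` picks the small model; raise the bar to 83 and only the large model qualifies.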

The benchmarks that matter

There are dozens of AI benchmarks, but most fall into a few categories. Some test knowledge breadth, others test reasoning or code generation or language quality. Here are the four most important categories and the benchmarks that define them.

Knowledge & Reasoning (MMLU)

MMLU (Massive Multitask Language Understanding) is the gold standard for measuring breadth of knowledge. It covers 57 subjects across STEM, humanities, social sciences, and professional domains — everything from elementary math to US foreign policy to college biology.

Format: Multiple-choice questions with 4 options.
Scoring: Accuracy (0-100%). Random guessing gets 25%.
Example subjects: Abstract algebra, anatomy, astronomy, business ethics, clinical knowledge, college chemistry, computer security, econometrics, elementary mathematics, high school US history, international law, machine learning, medical genetics, nutrition, philosophy, professional accounting, public relations, sociology, US foreign policy, virology, world religions.
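Scoring a multiple-choice benchmark comes down to comparing predicted letters against an answer key. The sketch below shows that calculation along with a simulated random guesser, which should land near the 25% baseline. The answer key is made up for illustration and contains no real MMLU items.

```python
import random


def accuracy(predictions, answers):
    """Percentage of predictions that exactly match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("answer key is empty")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def random_baseline(answers, choices="ABCD", seed=0):
    """Accuracy of a guesser that picks uniformly among the choices."""
    rng = random.Random(seed)
    guesses = [rng.choice(choices) for _ in answers]
    return accuracy(guesses, answers)


# Hypothetical answer key (not real MMLU data).
answer_key = ["A", "C", "B", "D"] * 2500

print(accuracy(["A", "C", "B", "A"], ["A", "C", "B", "D"]))  # 3 of 4 correct
print(random_baseline(answer_key))  # close to 25; exact value depends on the seed
```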

Why it matters: MMLU tests whether a model has broad, factual knowledge. A high MMLU score means the model absorbed a lot of information during training and can recall it accurately. But it doesn't test creativity, reasoning depth, or the ability to apply knowledge to novel problems. It's a knowledge breadth test, not an intelligence test.

What to watch for: Scores above 85% are frontier-level (GPT-4, Claude 3 Opus). Scores around 60-70% are typical for smaller open-source models (7B-13B parameters). Human expert performance on MMLU is estimated around 89%, so we're approaching human-level knowledge breadth on this benchmark.

Now try it yourself

Time to compare models across benchmarks. The dashboard below shows approximate reported scores for GPT-4, Claude 3 Opus, Llama 3 70B, Mistral 7B, and Gemma 7B across five key benchmarks. Toggle between raw scores and cost efficiency to see how smaller models punch above their weight when you factor in price. Click any benchmark name to see what it tests and an example question.

Benchmark Comparison Dashboard (approximate scores, 0-100%)

Model            MMLU    HumanEval   HellaSwag   Benchmark 4*   GSM8K
GPT-4            86.4    67.0        95.3        59.0           92.0
Claude 3 Opus    86.8    84.9        95.4        62.0           95.0
Llama 3 70B      82.0    81.7        85.3        52.0           83.0
Mistral 7B       62.5    40.2        83.3        42.0           52.2
Gemma 7B         64.3    32.3        81.2        44.0           50.9

* Column labels are matched to the scores cited elsewhere in this lesson; the name of the fourth benchmark is not shown.
Key Insights
Top performers: GPT-4 and Claude 3 Opus lead on most benchmarks, with Claude excelling at code (HumanEval) and math (GSM8K).
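Strong open models: Llama 3 70B is competitive with frontier models on many tasks, especially code (81.7% on HumanEval vs 84.9% for Claude 3 Opus) and knowledge (82.0% on MMLU), though it trails by about 10 points on commonsense reasoning.
Small-model trade-off: Mistral 7B and Gemma 7B score lower but can be on the order of 100x cheaper to run. They are good enough for many real-world tasks.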
Important: Benchmark scores don't tell you how good a model will be at YOUR specific task. A model with 90% on MMLU might fail at your domain-specific use case. Always evaluate on your own data. Benchmarks are a starting point, not the final answer.
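Evaluating on your own data can start small. The following is a minimal sketch of a custom eval loop, assuming exact-match scoring after light normalization. `model_fn` is a hypothetical callable standing in for whatever model API you use, and the two examples and the stub model are invented for illustration.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def evaluate(model_fn, examples):
    """Run model_fn on each example; return (accuracy_percent, failures)."""
    if not examples:
        raise ValueError("examples must not be empty")
    failures = []
    for ex in examples:
        output = model_fn(ex["prompt"])
        if normalize(output) != normalize(ex["expected"]):
            failures.append(
                {"prompt": ex["prompt"], "expected": ex["expected"], "got": output}
            )
    acc = 100.0 * (len(examples) - len(failures)) / len(examples)
    return acc, failures


# Hypothetical domain examples (illustration only).
examples = [
    {"prompt": "Normal adult resting heart rate range (bpm)?", "expected": "60-100"},
    {"prompt": "Prescription abbreviation for 'twice daily'?", "expected": "BID"},
]


def stub_model(prompt):
    """Stand-in for a real model call; gets the second question wrong."""
    return "60-100" if "heart rate" in prompt else "tid"


acc, failures = evaluate(stub_model, examples)
print(f"accuracy: {acc:.1f}%")
for f in failures:
    print("missed:", f["prompt"], "| expected", f["expected"], "| got", f["got"])
```

Keeping the list of failures alongside the score is a deliberate choice: reading the misses usually tells you more about fitness for your task than the percentage does. Exact match is only suitable for short, closed-form answers; free-text outputs need a different scoring rule.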

Key Takeaways

  • No single benchmark captures model quality. You need a profile across multiple tests — knowledge (MMLU), code (HumanEval), reasoning (HellaSwag), math (GSM8K), text quality (BLEU/ROUGE) — to understand a model's strengths and weaknesses.
  • MMLU tests breadth of knowledge across 57 subjects. High scores (85%+) mean the model absorbed lots of factual information, but it doesn't test reasoning depth, creativity, or application to novel problems.
  • HumanEval measures code generation ability by having models write Python functions from docstrings. Scores above 80% indicate strong programming skills, but it's limited to self-contained functions, not real-world debugging or large codebases.
  • HellaSwag tests commonsense reasoning about everyday scenarios. Frontier models approach human performance (about 95%), while the 7B models in the dashboard land in the low 80s, a gap that suggests common sense takes more than shallow pattern matching.
  • Benchmark scores don't predict performance on YOUR task. A model with 90% on MMLU might fail at your domain-specific use case. Always evaluate on your own data — public benchmarks are a starting point, custom evals are the answer.
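To illustrate the HumanEval idea of writing a function from a docstring and checking it by running tests, here is a toy sketch. The problem, the two "model outputs", and the `passes_tests` helper are all hypothetical and not taken from the HumanEval dataset or its official harness. It also uses `exec`, which should only ever run model-generated code inside a sandbox.

```python
def passes_tests(candidate_source, entry_point, test_cases):
    """Return True if the generated function passes every (args, expected) case."""
    namespace = {}
    try:
        # WARNING: exec runs arbitrary code; sandbox untrusted model output.
        exec(candidate_source, namespace)
        func = namespace[entry_point]
        return all(func(*args) == expected for args, expected in test_cases)
    except Exception:
        # Syntax errors, a missing function, or runtime errors count as failures.
        return False


# Toy prompt: a signature plus docstring that the model must complete.
PROMPT = '''def running_max(xs):
    """Return a list where element i is the max of xs[:i+1]."""
'''

TESTS = [(([1, 3, 2, 5],), [1, 3, 3, 5]), (([],), [])]

# Two hypothetical completions: one correct, one plausible but wrong.
GOOD_BODY = '''    out, best = [], None
    for x in xs:
        best = x if best is None else max(best, x)
        out.append(best)
    return out
'''
BAD_BODY = "    return sorted(xs)\n"

samples = [PROMPT + GOOD_BODY, PROMPT + BAD_BODY]
passed = sum(passes_tests(s, "running_max", TESTS) for s in samples)
pass_rate = 100.0 * passed / len(samples)
print(f"pass rate: {pass_rate:.1f}%")
```

The check is all-or-nothing per sample: a completion that looks reasonable but fails any test case counts as a failure, which is why this style of benchmark rewards functional correctness rather than surface similarity.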

Common Misconceptions

  • "Higher benchmark scores always mean a better model." Not quite. GPT-4 and Claude 3 Opus sit within half a point of each other on MMLU (86.4% vs 86.8%), yet Claude leads by nearly 18 points on HumanEval (84.9% vs 67.0%). The "best" model depends on your use case, not a single number.
  • "Benchmarks measure real intelligence." — No. They measure performance on specific tasks. A model can ace MMLU by memorizing facts without truly understanding them. Benchmarks are proxies, not truth.
  • "Models can't cheat on benchmarks." — Wrong. Data contamination is a huge problem. If a model saw MMLU questions during training (even indirectly via web scrapes), it's not being tested — it's recalling. This inflates scores and makes comparisons unreliable.
  • "If a model scores 90%, it will work 90% of the time on my task." — Nope. Benchmark scores are accuracy on test questions, not reliability on your specific use case. A medical chatbot needs evaluation on medical data, not generic MMLU questions.
  • "Small models are always worse because their benchmark scores are lower." — Context matters. A 7B model scoring 60% on MMLU might cost 300x less than GPT-4. If your task doesn't need frontier performance, the smaller model is often the better choice.
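The contamination point above can be made concrete with a simple overlap check. The sketch below uses word-level n-gram overlap between a benchmark item and a chunk of training text, which is one common heuristic and an assumption here rather than a method described in this lesson. The strings are invented, and the choice of `n=8` is arbitrary.

```python
def ngrams(text, n):
    """Set of word-level n-grams in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item, corpus_text, n=8):
    """Fraction of the item's n-grams that also appear in the corpus text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical training snippets.
item = "Which planet in the solar system has the largest number of known moons"
leaked = "quiz dump: " + item + " answer: saturn"
clean = "The solar system has eight planets and many moons"

print(overlap_fraction(item, leaked))  # item appears verbatim in the snippet
print(overlap_fraction(item, clean))   # no shared 8-grams
```

A high overlap flags an item for manual review. A low overlap does not prove the item is clean, since paraphrased or translated copies would slip past an exact n-gram match.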

Quick check

Why is data contamination a problem for AI benchmarks?