Tokenization

Breaking text into pieces — BPE, WordPiece, SentencePiece

Beginner · 25 min

Prerequisites

  • text-understanding

Imagine you've got a massive pile of LEGO bricks dumped on the floor. Before you can build anything, you need to sort them — figure out which pieces you have, group similar ones together, maybe break apart some that are stuck. That's basically what tokenization does with text.

Here's the thing most people miss: computers can't read. Not in any meaningful sense. They work with numbers. So before a language model can do anything with your prompt — answer a question, write code, summarize an article — it first has to chop that text into smaller pieces and convert each piece into a number. Those pieces are called tokens.

Tokenization is one of those foundational steps that sounds boring until you realize it affects everything. How many tokens your text uses determines how much context the model can see. Whether it handles rare words or typos gracefully? That depends on the tokenizer. Even the cost of an API call is measured in tokens. So yeah — worth understanding.

Breaking text into pieces

At its core, tokenization is just splitting text into chunks that the model can work with. Sometimes a token is a full word. Sometimes it's part of a word. Sometimes it's a single character or even a punctuation mark. The method you choose has real consequences.

Tokenization in action

"Hello, how are you?"
Tokenizer
["Hello", ",", "how", "are", "you", "?"]
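The split shown above, plus the "convert each piece into a number" step, can be sketched in a few lines of Python. This is a minimal illustration rather than how any production tokenizer works: the function names `word_tokenize` and `build_vocab` are made up for this example, and IDs are simply handed out in order of first appearance.

```python
import re

def word_tokenize(text):
    # \w+ grabs a run of letters/digits; [^\w\s] grabs one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    # Hypothetical ID scheme: each new token gets the next free integer.
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = word_tokenize("Hello, how are you?")
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['Hello', ',', 'how', 'are', 'you', '?']
print(ids)     # [0, 1, 2, 3, 4, 5]
```

The list of integers is what a model would actually receive. Real tokenizers use a fixed vocabulary built ahead of time, not one built from the input itself.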

Why does this matter so much? Three big reasons. First, vocabulary size — the tokenizer decides how many unique tokens the model needs to know. A word-level tokenizer for English might need 500,000+ entries. That's a huge lookup table, and it gets worse with every new language.

Second, context window usage. Models have a fixed number of tokens they can process at once (GPT-4 Turbo, for instance, has a 128k-token context window). If your tokenizer is inefficient and turns every word into five tokens, you're burning through that window fast.
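To put rough numbers on that, here is a back-of-the-envelope sketch. The 1.3 tokens-per-word figure is an illustrative assumption for a reasonably efficient English tokenizer, not a measured value, and `words_that_fit` is a hypothetical helper.

```python
def words_that_fit(context_tokens, tokens_per_word):
    # How many words fit in the window at a given average tokens-per-word rate.
    return int(context_tokens / tokens_per_word)

CONTEXT_TOKENS = 128_000  # window size used as the example above

print(words_that_fit(CONTEXT_TOKENS, 1.3))  # 98461 (assumed efficient tokenizer)
print(words_that_fit(CONTEXT_TOKENS, 5.0))  # 25600 (five tokens per word)
```

Under these assumptions, the inefficient tokenizer fits roughly a quarter of the text into the same window.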

Third, rare word handling. What happens when the model encounters "supercalifragilisticexpialidocious" or a misspelling like "definately"? A word-level tokenizer just shrugs — it's never seen those before. A smarter tokenizer breaks them into familiar pieces and handles them gracefully.
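One simple way to see "breaking into familiar pieces" is greedy longest-match segmentation, sketched below. Note that this is closer in spirit to how WordPiece segments words than to BPE, and the tiny vocabulary is invented for the example; real vocabularies are learned from data.

```python
def greedy_subword_split(word, vocab):
    # Repeatedly take the longest prefix of the remaining text that is in vocab.
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character.
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces

# Hypothetical subword vocabulary, chosen by hand for illustration.
toy_vocab = {"defin", "ate", "ly", "ite"}

print(greedy_subword_split("definately", toy_vocab))  # ['defin', 'ate', 'ly']
print(greedy_subword_split("definitely", toy_vocab))  # ['defin', 'ite', 'ly']
```

The misspelling and the correct spelling share two of their three pieces, so the model still gets a strong hint about what was meant.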

Four ways to tokenize

There's more than one way to slice text into tokens, and each approach makes different trade-offs. Let's walk through the four main strategies.

Word Tokenization

The simplest approach: split on spaces and punctuation. Each word becomes its own token.

Input
"I love preprocessing"
Tokens
["I", "love", "preprocessing"]

Pros: dead simple, intuitive, easy to implement.

Cons: vocabulary explodes fast. Every verb conjugation ("run", "running", "runs", "ran") is a separate entry. Add multiple languages and you're looking at millions of tokens. New words or typos? Completely unknown.
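Both drawbacks show up even in a toy setting. The sketch below builds a word-level vocabulary from a four-sentence corpus made up for this example; the `<unk>` token name and the helper functions are illustrative conventions, not a specific library's API.

```python
import re

UNK = "<unk>"  # placeholder ID for any word not seen when the vocab was built

def split_words(text):
    return re.findall(r"\w+|[^\w\s]", text)

def build_word_vocab(corpus):
    vocab = {UNK: 0}
    for sentence in corpus:
        for word in split_words(sentence):
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    # Unknown words all collapse to the same <unk> ID.
    return [vocab.get(w, vocab[UNK]) for w in split_words(text)]

corpus = ["I run", "she runs", "we ran", "running is fun"]
vocab = build_word_vocab(corpus)

print(len(vocab))                        # 10
print(encode("I love running", vocab))   # [1, 0, 7]
```

Four forms of "run" take four separate entries, and "love", which never appeared in the corpus, becomes ID 0 with no information about what the word was.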


So why did BPE (byte-pair encoding) win? BPE is a subword method: it starts from individual characters and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. Its success boils down to a practical trade-off. Word-level tokenization gives you short sequences but an unmanageable vocabulary. Character-level gives you a tiny vocabulary but painfully long sequences. BPE sits right in the middle, with a vocabulary of around 50,000 tokens that can represent virtually any text efficiently. Common words stay as single tokens (fast to process), and rare or unknown words get broken into familiar subword pieces (graceful degradation instead of a brick wall).
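The core training loop of BPE is short enough to sketch. The version below is a simplified illustration under stated assumptions: it works on a hand-made word-frequency table, skips details real implementations add (end-of-word markers, byte-level fallback, pre-tokenization rules), and the function names are chosen for this example.

```python
from collections import Counter

def get_pair_counts(words):
    # Count every adjacent symbol pair, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace each occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start with every word split into single characters.
    words = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        words = merge_pair(words, best)
        merges.append(best)
    return merges, words

# Toy corpus statistics, invented for illustration.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, words = train_bpe(word_freqs, 3)

print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
print(words)
```

After three merges, the frequent suffix "est" has become a single symbol shared by "newest" and "widest". Running more merges keeps growing the vocabulary, which is how a target size such as ~50k is reached.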

Now try it yourself

Time to get hands-on. Type anything into the box below and watch how different tokenization methods break it apart. Try long compound words like "internationalization" to see how subword methods handle them. Paste some code and see how differently it tokenizes compared to prose. Throw in some emojis. The more you experiment, the more intuition you'll build about how tokenizers actually see your text.

Tokenization Explorer

Example input (9 tokens): "The quick brown fox jumps over the lazy dog."

Key Takeaways

  • Tokenization is the very first step in any NLP pipeline. It converts raw text into numbered tokens that the model can actually process.
  • Word-level tokenization is simple but creates massive vocabularies and can't handle unknown words. Character-level is the opposite extreme — tiny vocabulary, but very long sequences.
  • Subword tokenization (BPE) is the industry standard because it balances vocabulary size (~50k) with sequence length, and it handles rare words by breaking them into known pieces.
  • Token count directly affects cost and context window usage. More efficient tokenization means you can fit more content into each API call.
  • Tokenization happens BEFORE the model sees anything. It's a fixed preprocessing step — the model never sees raw text, only token IDs.
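The cost point from the takeaways can be made concrete with a little arithmetic. The price used below is a hypothetical placeholder, not any provider's actual rate, and `estimate_cost` is an illustrative helper.

```python
def estimate_cost(num_tokens, price_per_million_tokens):
    # Cost scales linearly with token count.
    return num_tokens / 1_000_000 * price_per_million_tokens

HYPOTHETICAL_PRICE = 10.0  # assumed dollars per million tokens

# Same 1,000-word text under two assumed tokenization rates.
efficient = estimate_cost(1_300, HYPOTHETICAL_PRICE)    # ~1.3 tokens per word
inefficient = estimate_cost(5_000, HYPOTHETICAL_PRICE)  # 5 tokens per word

print(f"{efficient:.4f}")    # 0.0130
print(f"{inefficient:.4f}")  # 0.0500
```

Under these assumptions the same text costs almost four times as much with the less efficient tokenizer, before the model has done any work.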

Common Misconceptions

  • "Tokens are always words." — They're not. A token might be a word, part of a word, a single character, or even a punctuation mark. In BPE, "playing" might be two tokens: "play" and "ing".
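  • "Tokenization is a minor implementation detail." It fundamentally shapes what the model can learn. Languages that tokenize poorly (many tokens per word, so each token carries only a few characters) use up more of the context window for the same text and tend to get worse model performance.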
  • "Tokenization happens inside the model." — It happens before the model. The tokenizer is a separate, deterministic preprocessing step with its own fixed vocabulary.

Quick check

Why is subword tokenization (BPE) preferred over word-level tokenization for modern LLMs?