How Computers Understand Text
NLP basics, preprocessing, bag-of-words, TF-IDF
Try explaining the color blue to a computer. Not as a wavelength — as the concept. The feeling of a clear sky. You can't, right? That's the fundamental problem of NLP. Computers don't understand language. They understand numbers. So our job is to figure out clever ways to turn words into numbers without losing too much meaning along the way.
This lesson is about the oldest, simplest approaches to that problem. They're not perfect — modern AI has moved way beyond them — but they're the foundation everything else builds on. And honestly, they're still used in production all over the place, because sometimes simple works.
We'll look at how to preprocess text so computers can work with it, then explore two classic ways of turning words into numbers: Bag of Words and TF-IDF. By the end, you'll understand why "counting words" is both surprisingly useful and deeply flawed.
The language gap
Language is absurdly hard for computers. Think about everything your brain does when you read a sentence. You resolve ambiguity ("bank" — river bank or savings bank?). You infer context. You pick up on sarcasm, tone, and implied meaning. You know that "hot dog" isn't a warm puppy.
Computers get none of that for free. To them, text is just a sequence of bytes. The word "cat" is no more related to "kitten" than it is to "catapult". So before any analysis can happen, we need to transform raw text into something mathematical — a numerical representation the computer can actually work with.
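To make the "sequence of bytes" point concrete, here is a minimal Python sketch. The helper name `shared_prefix_len` is made up for this illustration. It shows that, going by raw bytes alone, "cat" looks closer to "catapult" than to "kitten", which is the opposite of what the meanings suggest.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two byte strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cat = "cat".encode("utf-8")
kitten = "kitten".encode("utf-8")
catapult = "catapult".encode("utf-8")

print(list(cat))                         # [99, 97, 116]
print(shared_prefix_len(cat, catapult))  # 3
print(shared_prefix_len(cat, kitten))    # 0
```

Nothing in those numbers says that a kitten is a young cat. Any sense of meaning has to come from how we choose to represent the text.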
From text to math
The challenge is that every step in the pipeline from raw text to numbers throws away some information. Preprocessing removes punctuation and casing. Count-based numerical representations lose word order. The art is in keeping enough meaning to be useful while making the data simple enough for math to work on.
From words to numbers
Let's walk through the classic pipeline. These steps were developed decades ago, and even though modern deep learning approaches skip some of them, understanding this pipeline gives you the vocabulary to talk about NLP at any level.
Text Preprocessing
Before you can count anything, you need to clean the text. This usually means three things: lowercasing everything, stripping punctuation, and splitting the text into individual words (called tokenizing).
Why lowercase? Because to a computer, "Cat" and "cat" are completely different strings. Lowercasing ensures they get counted as the same word. Removing punctuation stops "mat" and "mat!" from being treated as different words.
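The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production tokenizer: the function name `preprocess` is made up here, and it assumes English text with ASCII punctuation and whitespace between words.

```python
import string

def preprocess(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    # str.maketrans with a third argument maps those characters to None (deletes them)
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

print(preprocess("The Cat sat on the mat!"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Notice the trade-off already showing up: this simple version turns "don't" into "dont" and won't touch non-ASCII punctuation such as curly quotes. Real tokenizers handle those cases more carefully.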
Now try it yourself
Type a sentence into the box below and watch how Bag of Words and TF-IDF represent it differently. Notice how the "Bag of Words" tab treats every word the same — common words like "the" dominate the chart. Then switch to "TF-IDF" and see how those common words get pushed down while the interesting words rise to the top. Try toggling "Remove stop words" to see the difference that simple filter makes.
Text Understanding Explorer
Bag of Words counts how many times each word appears. Every word is treated equally — "the" counts just as much as "cat".
Notice: In Bag of Words, common words like "the" dominate the chart. They tell you nothing about what the text is actually about. Switch to TF-IDF to see the difference.
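If you want to reproduce the Bag of Words view offline, a rough Python sketch might look like this. The names `tokenize`, `bag_of_words`, and `STOP_WORDS` are made up for the example, and the stop-word list is a tiny hand-picked one, not a standard list.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard one)
STOP_WORDS = {"the", "a", "an", "on", "in", "of", "and", "is", "to"}

def tokenize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def bag_of_words(text: str, remove_stop_words: bool = False) -> Counter:
    """Count how many times each token appears in the text."""
    tokens = tokenize(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)

text = "The cat sat on the mat. The cat slept."
print(bag_of_words(text).most_common(2))
# [('the', 3), ('cat', 2)]
print(bag_of_words(text, remove_stop_words=True).most_common(1))
# [('cat', 2)]
```

Without the filter, "the" tops the list even though it says nothing about the topic. With it, "cat" rises to the top.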
Key Takeaways
- Computers don't understand text; they need it converted to numbers first. In the classic pipeline, text preprocessing (lowercasing, removing punctuation, splitting into words) is the first step.
- Bag of Words is the simplest approach: count word occurrences and represent each document as a vector of counts. Simple, interpretable, and surprisingly useful for many tasks.
- TF-IDF improves on Bag of Words by weighting words by importance: words that are frequent in one document but rare across the whole collection score highest.
- Both methods throw away word order entirely. After preprocessing, "Dog bites man" and "man bites dog" produce identical representations. This is their fundamental limitation.
- Despite their limitations, these methods are still widely used in search engines, spam filters, and document classifiers because they're fast, interpretable, and often good enough.
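Two of these takeaways are easy to check in code. The sketch below uses one common textbook form of TF-IDF, which is an assumption here: term frequency is the word's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is how many of them contain the word. Libraries usually apply smoothed variants, so their numbers will differ. The function name `tf_idf` and the three toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each word in each tokenized document with an unsmoothed TF-IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
scores = tf_idf(docs)

# Words of the first document, highest score first
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:>4}  {score:.3f}")

# Word order is gone: both sentences yield the same counts
same = Counter("dog bites man".split()) == Counter("man bites dog".split())
print(same)  # True
```

Because "the" appears in every document, its IDF is log(1) = 0 and it drops to the bottom, while words unique to the first document rise. In a corpus this small, a word like "on" also scores high simply because it is rare here, which is a reminder that TF-IDF measures distinctiveness, not meaning.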
Common Misconceptions
- "Computers can read text." — They can't. They do math on number representations of text. Every NLP system starts by converting words to numbers, and every conversion loses some information.
- "Bag of Words is obsolete." — It's limited, but it's still used in production systems everywhere. For tasks like keyword search or simple document classification, it often outperforms more complex approaches in speed and interpretability.
- "TF-IDF understands meaning." — It doesn't. It's a statistical trick that surfaces distinctive words. It can't tell you that "happy" and "joyful" mean the same thing — that requires word embeddings or neural approaches.
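The last misconception can be checked directly. In this sketch, cosine similarity between Bag of Words counts stands in for "how similar do these texts look to the computer". The helper `cosine_similarity` and the example sentences are made up for the illustration.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

happy = Counter("she felt happy".split())
joyful = Counter("he was joyful".split())
hungry = Counter("she felt hungry".split())

# Similar meaning, no shared words: similarity is zero
print(cosine_similarity(happy, joyful))  # 0.0
# Different meaning, two shared words: similarity is about 0.67
print(round(cosine_similarity(happy, hungry), 2))  # 0.67
```

Similarity here is driven purely by overlapping words. Two sentences that mean nearly the same thing score zero, while two that merely share words score high.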