Word Embeddings & Vector Spaces
Turning words into numbers — Word2Vec, GloVe, similarity
Prerequisites
- math-intuition
What if I told you that a computer can figure out that "king" is to "queen" as "man" is to "woman" — without anyone teaching it what royalty or gender means? Just by reading a lot of text. That's the magic of word embeddings, and it's one of the most beautiful ideas in AI.
In the last few lessons, we saw how Bag of Words and TF-IDF turn text into numbers. They work, but they have a fatal flaw: they treat every word as a completely independent thing. To those methods, "happy" and "joyful" are no more related than "happy" and "refrigerator." That's a problem if you want AI to actually understand meaning.
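To see that flaw in numbers, here is a minimal sketch. It assumes the simplest count-style representation, where each word gets its own slot in a vector (often called a one-hot vector). The three-word vocabulary and the helper names `one_hot` and `dot` are invented for illustration.

```python
def one_hot(word, vocab):
    """Return a vector with 1.0 in the word's own slot and 0.0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]

def dot(u, v):
    """Dot product: a basic measure of how much two vectors overlap."""
    return sum(a * b for a, b in zip(u, v))

# Tiny illustrative vocabulary (an assumption, not from a real corpus).
vocab = ["happy", "joyful", "refrigerator"]
happy = one_hot("happy", vocab)
joyful = one_hot("joyful", vocab)
fridge = one_hot("refrigerator", vocab)

print(dot(happy, joyful))  # 0.0: no overlap at all
print(dot(happy, fridge))  # 0.0: exactly as unrelated as "joyful"
print(dot(happy, happy))   # 1.0: a word only ever matches itself
```

Every pair of different words scores zero, so this representation has no way to say that "joyful" is closer to "happy" than "refrigerator" is.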
Word embeddings fix this. Instead of giving each word a single number or a sparse count, they give each word a rich, dense vector — a list of numbers that encodes what the word means. Words with similar meanings end up with similar vectors. And the math that falls out of this is genuinely stunning.
Words as points in space
Here's the core idea: what if every word in the English language were a point in space? Not a boring 2D space, but a space with something like 300 dimensions. You can't visualize it, but distances and angles are computed exactly the same way as in 2D or 3D. In this space, words that appear in similar contexts naturally end up near each other.
Think about the word "cat." Where does it show up in text? Near words like "pet," "fur," "purr," and "kitten." Now think about "dog." It shows up near "pet," "bark," "leash," and "puppy." Because "cat" and "dog" share so much context — both are pets, both are furry animals — they end up close together in the embedding space.
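A rough way to check this intuition is to count each word's neighbors in a tiny corpus and compare the counts. The sketch below is only an illustration: the three sentences are made up, "context" is assumed to mean every other word in the same sentence, and the helpers `context_counts` and `cosine` are hypothetical names.

```python
import math
from collections import Counter

# Made-up mini corpus, for illustration only.
corpus = [
    "the cat is a furry pet that purrs",
    "the dog is a furry pet that barks",
    "the car is a fast machine that rusts",
]

def context_counts(target, sentences):
    """Count every other word that shares a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

def cosine(c1, c2):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(c1[w] * c2[w] for w in c1)  # a Counter returns 0 for missing words
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

cat = context_counts("cat", corpus)
dog = context_counts("dog", corpus)
car = context_counts("car", corpus)

sim_cat_dog = cosine(cat, dog)  # shares "furry" and "pet" plus the filler words
sim_cat_car = cosine(cat, car)  # shares only filler words such as "the" and "is"
print(round(sim_cat_dog, 3), round(sim_cat_car, 3))
```

Even in this toy setup, "cat" scores higher with "dog" than with "car". Note that "cat" and "car" still get a nonzero score because they share filler words like "the", which is one reason real systems need far more text and smarter weighting than raw counts.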
The linguist J.R. Firth put it best back in 1957: "You shall know a word by the company it keeps." That one sentence is basically the entire theory behind word embeddings.
From word to vector
Each number in a word's vector captures some aspect of its meaning. It is tempting to imagine one dimension answering "is this an animal?" and another answering "is this something you eat?" In practice, individual dimensions rarely map to clean human concepts; they encode messy statistical patterns. What matters is the overall effect: similar words get similar vectors, and that's incredibly powerful.
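Here is what "similar words get similar vectors" can look like in code. This is a sketch with invented 4-dimensional vectors, not values from any trained model (real embeddings have hundreds of dimensions), and `nearest` is a hypothetical helper.

```python
import math

# Hypothetical embeddings, hand-picked for illustration only.
toy_embeddings = {
    "cat":          [0.8, 0.6, 0.1, 0.0],
    "dog":          [0.7, 0.7, 0.2, 0.0],
    "kitten":       [0.9, 0.5, 0.0, 0.1],
    "refrigerator": [0.0, 0.1, 0.9, 0.5],
    "oven":         [0.1, 0.0, 0.8, 0.6],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, embeddings, k=3):
    """Return the k most similar other words as (word, similarity) pairs."""
    scores = [
        (other, cosine(embeddings[word], vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

neighbors = nearest("cat", toy_embeddings, k=2)
for other, score in neighbors:
    print(other, round(score, 3))
```

With these made-up numbers, "dog" and "kitten" come out as the nearest neighbors of "cat", while the kitchen appliances score far lower.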
How machines learn meaning
Word embeddings aren't hand-crafted. Nobody sat down and decided that "cat" should be [0.2, −0.5, 0.8, ...]. Instead, a neural network learns these vectors by reading vast amounts of text. Here's the process, step by step.
The Distributional Hypothesis
The whole approach rests on a single insight: words that appear in similar contexts have similar meanings. It's a fancy way of saying that if two words are often surrounded by the same other words, they probably mean something related.
Picture a sentence with one word blanked out, for example "I took my ___ to the vet." What words fit? Cat, dog, hamster, rabbit. They all work because they appear in similar contexts. That shared context is what gives them similar embeddings. You don't need to tell the model that these are all animals; it figures that out from the patterns.
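One common way to turn this insight into training data is the skip-gram setup used by Word2Vec: slide a small window over the text and pair each word with its neighbors. The sketch below only builds those (center, context) pairs. The function name `skipgram_pairs` and the window size are assumptions for illustration, and a real implementation adds more on top, such as subsampling frequent words, negative sampling, and the network that actually learns the vectors.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A model trained to predict the right-hand word from the left-hand one is pushed to give similar vectors to words that keep showing up with the same neighbors.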
Now try it yourself
Below is a simplified 2D embedding space with about 20 words plotted by meaning. Click any word to see its three nearest neighbors and their similarity scores. Try the search box to see where a new word would land. And don't miss the "King − Man + Woman = ?" demo — toggle it on to see vector arithmetic in action, with arrows showing each step.
[Interactive demo: Word Embedding Space. Each dot is a word, plotted by meaning; words close together have similar meanings. Includes a toggle for vector arithmetic: King − Man + Woman = ?]
AI connection: Real word embeddings work in hundreds of dimensions, not just two. But the same idea applies: words with similar meanings cluster together, and vector arithmetic captures relationships such as gender and verb tense, which is what makes analogies work. Embeddings like these are a building block of modern search engines, recommendation systems, and large language models.
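The "King − Man + Woman" arithmetic can be sketched in a few lines. Everything here is an assumption for illustration: the 3-dimensional vectors are hand-made (loosely "royalty", "gender", "youth") with a little noise added, and `analogy` is a hypothetical helper. Like most real implementations, it skips the three input words when looking for the answer.

```python
import math

# Hand-made toy vectors, not taken from any trained model.
toy = {
    "king":     [0.95,  0.80, 0.10],
    "queen":    [0.90, -0.85, 0.12],
    "man":      [0.10,  0.85, 0.15],
    "woman":    [0.12, -0.80, 0.14],
    "prince":   [0.80,  0.75, 0.80],
    "princess": [0.78, -0.70, 0.82],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, skipping the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

answer = analogy("king", "man", "woman", toy)
print(answer)  # queen
```

The computed point does not land exactly on the "queen" vector; "queen" is simply the closest remaining word, which is how the analogy behaves with real embeddings too.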
Key Takeaways
- Word embeddings represent each word as a dense vector of numbers (typically 100-300 dimensions). Words with similar meanings get similar vectors, because they appear in similar contexts.
- Word2Vec learns embeddings by training a neural network on a prediction task, such as guessing the words that appear around a given word. The prediction task is only a means to an end: the vectors are the weights the network learns along the way, and they are the part we keep.
- Vector arithmetic works on word embeddings: "king - man + woman ≈ queen" works because concepts like gender and royalty are encoded, approximately, as directions in the vector space.
- Similarity between word embeddings is measured by cosine similarity or Euclidean distance. Words close together in the space are semantically related.
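- Modern models like BERT produce contextual embeddings, giving different vectors for the same word depending on context. This addresses an ambiguity that a single fixed vector per word cannot: "bank" in "river bank" and "bank" in "bank account" get different vectors.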
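For reference, the two similarity measures named above can be written out as follows. This uses standard notation, assumed here rather than taken from the lesson: u and v are two word vectors with n dimensions each.

```latex
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}
\qquad
d(\mathbf{u}, \mathbf{v})
  = \lVert \mathbf{u} - \mathbf{v} \rVert
  = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}

% For vectors scaled to length 1, the two measures carry the same information:
\lVert \mathbf{u} - \mathbf{v} \rVert^2 = 2 - 2\cos(\mathbf{u}, \mathbf{v})
```

Cosine similarity looks only at the angle between two vectors and ignores their lengths, so higher means more similar. Euclidean distance measures the straight-line gap, so lower means more similar. Once vectors are normalized to length 1, ranking by one gives the same order as ranking by the other.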
Common Misconceptions
- "Embeddings understand meaning the way humans do." -- They don't. They capture statistical patterns of word co-occurrence. A word embedding doesn't know what a cat looks like or sounds like -- it just knows that "cat" appears in similar contexts to "dog" and "pet."
- "Similar spelling means similar embeddings." -- Nope. "Cat" and "catapult" have very different embeddings despite sharing letters. What matters is context, not spelling. "Sofa" and "couch" have very similar embeddings despite looking completely different.
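- "King - Man + Woman = Queen works perfectly every time." -- It's a beautiful demo, but in practice vector arithmetic is noisy. The resulting vector usually lands near "queen" rather than exactly on it, and most implementations exclude the three input words from the candidates, because otherwise the nearest word is often "king" itself. The trick works best for relationships that are well represented in the training data.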