Multimodal AI

For decades, AI was a specialist. Vision models couldn't read text. Language models couldn't see images. They lived in separate worlds. Then something interesting happened — researchers figured out how to teach models to understand both at once. Now you can show GPT-4 a photo of your fridge and ask “what can I cook with this?” That's multimodal AI.

The idea is deceptively simple: instead of building separate systems for text, images, and audio, build one system that understands all of them. The breakthrough that made this possible? Finding a way to represent different modalities — words, pixels, sound waves — in the same mathematical space. Once everything lives in the same vector space, you can compare, search, and translate across modalities as easily as comparing two sentences.

This lesson covers the three main approaches to multimodal AI: contrastive models like CLIP that connect text and images, vision-language models like GPT-4V that reason about images, and diffusion models like Stable Diffusion that generate images from text. Each takes a different angle on the same fundamental challenge — bridging the gap between how humans experience the world and how machines process it.

One brain, many senses

Your brain does something remarkable: when you hear the word “dog,” see a photo of a dog, or hear a bark, the same concept lights up. Different inputs, same understanding. Multimodal AI tries to replicate this. The key insight is to map different modalities — text, images, audio — into the same vector space. When that works, a photo of a golden retriever and the text “a happy dog playing fetch” end up as nearby points in that shared space, even though one started as pixels and the other as words.

Mapping modalities to a shared space

Figure: an image (raw pixels) passes through a vision encoder (ViT, ResNet) and text (words and sentences) passes through a text encoder (Transformer); both map into a shared vector space where similar concepts cluster together.

This is the same idea behind word embeddings from earlier in the course, but extended across modalities. Word2Vec put similar words near each other. CLIP puts similar concepts near each other, regardless of whether that concept came from text or an image. The same principle extends further: models like CLAP and ImageBind apply it to audio, and emerging models handle video, 3D objects, and even tactile data.

Why does this matter practically? Because once you have a shared space, many cross-modal tasks reduce to simple vector operations. Want to search a million photos using a text description? Encode the text, then find the nearest image vectors. Want to caption an image? Encode it and have a language decoder turn that encoding into text. Want to generate an image from text? Encode the prompt and use that encoding to guide an image generator. One representation supports all of these capabilities.
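To make the search case concrete, here is a minimal sketch of text-to-image retrieval over a shared space. The file names and the 3-dimensional vectors are made up for illustration; in a real system both would come from trained image and text encoders, and the vectors would have hundreds of dimensions.

```python
import numpy as np

def normalize(v):
    """Scale each vector to unit length so a dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical image embeddings (hand-written, not produced by a real encoder).
image_names = ["dog_in_park.jpg", "city_skyline.jpg", "cat_on_sofa.jpg"]
image_vecs = normalize([
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.1, 0.9, 0.1],
])

# Stand-in for the text encoder's output for "a happy dog playing fetch".
query_vec = normalize([0.8, 0.2, 0.1])

def search(query, vectors, names, k=2):
    """Return the k images whose embeddings are most similar to the query."""
    scores = vectors @ query            # cosine similarity per image
    order = np.argsort(-scores)[:k]     # indices of the highest scores first
    return [(names[i], float(scores[i])) for i in order]

results = search(query_vec, image_vecs, image_names)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the dog photo ranks first. At the scale of millions of images, the exhaustive dot product would typically be replaced by an approximate nearest-neighbor index, but the idea is the same.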

Three flavors of multimodal

There is no single “multimodal architecture.” The field has converged on three major approaches, each designed for different tasks. Understanding all three gives you a complete picture of how AI handles multiple modalities today.

Contrastive Learning (CLIP)

OpenAI's CLIP was trained on 400 million image-text pairs scraped from the internet. The training objective is elegantly simple: given a batch of images and captions, learn to match each image with its correct caption.

The contrastive training loop:

1. Take a batch of (image, text) pairs

2. Encode each image with a vision encoder

3. Encode each text with a text encoder

4. Pull matching pairs close in vector space

5. Push non-matching pairs apart
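The loop above can be sketched as a symmetric cross-entropy loss over a batch similarity matrix. This is a simplified NumPy illustration, not CLIP's actual training code: the embeddings are random stand-ins for encoder outputs, the function names are hypothetical, and the temperature of 0.07 is an illustrative choice.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    # logits[i, j] = scaled cosine similarity between image i and text j
    logits = image_emb @ text_emb.T / temperature
    diag = np.arange(len(logits))
    # Image-to-text: each image should pick its own caption from the batch.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each caption should pick its own image from the batch.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))  # stand-in for text encoder outputs

# Images that nearly coincide with their captions give a low loss...
aligned = contrastive_loss(text_emb + 0.01 * rng.normal(size=(4, 8)), text_emb)
# ...while images paired with the wrong captions give a high loss.
shuffled = contrastive_loss(text_emb[::-1], text_emb)
print(f"aligned: {aligned:.4f}  shuffled: {shuffled:.4f}")
```

Minimizing this loss does both things at once: raising the diagonal entries pulls matching pairs together, and the softmax normalization pushes every non-matching pair in the batch apart.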

The result is a model that can do zero-shot image classification: it can classify images into categories it was never explicitly trained on. Just encode the category names as text, encode the image, and pick the text vector with the highest cosine similarity to the image vector. CLIP can also power image search, content moderation, and image-text matching, and its encoders are reused as components in many multimodal systems.
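A minimal sketch of that zero-shot recipe, assuming the embeddings are already available. The label prompts and vectors below are invented for illustration, and `zero_shot_classify` is a hypothetical helper rather than part of any library.

```python
import numpy as np

def zero_shot_classify(image_vec, label_vecs, labels, temperature=0.07):
    """Pick the label whose text embedding is most similar to the image embedding."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    label_vecs = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    logits = label_vecs @ image_vec / temperature   # scaled cosine similarities
    probs = np.exp(logits - logits.max())           # softmax over the labels
    probs /= probs.sum()
    return labels[int(np.argmax(probs))], probs

# Category names written as captions, then "encoded" (toy vectors, not real outputs).
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
label_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
])
# Stand-in for the vision encoder's output for a cat photo.
image_vec = np.array([0.2, 0.8, 0.1])

prediction, probs = zero_shot_classify(image_vec, label_vecs, labels)
print(prediction)
```

Adding a new category only requires encoding one more label string; no retraining is involved, which is what makes the classification zero-shot.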

Explore cross-modal search

In this simulation, you can experience how a CLIP-like model works. In “Text to Image” mode, type a description and see how the model ranks images by semantic similarity — the text and each image are encoded into the same vector space, and the closest matches rise to the top. Switch to “Image to Text” mode to see how a vision-language model generates captions for an image, ranked by confidence. Try different queries to build intuition for how cross-modal understanding actually works.

Cross-Modal Similarity Explorer

See how CLIP-like models map text and images into the same vector space. Similar concepts end up close together, regardless of modality.

The key insight: one shared vector space

Figure: the text “a sleeping cat” passes through a text encoder (CLIP / BERT) and a photo of a cat passes through a vision encoder (ViT / ResNet); both land in a shared vector space where similar concepts cluster together.

AI connection: The same idea underlies visual search tools like Google Lens, the way text-to-image systems like DALL-E interpret prompts, and the way GPT-4V describes images. The core of it is the shared embedding space: once you can convert anything (text, images, audio) into the same kind of vector, much of cross-modal understanding becomes a similarity search.

Key Takeaways

Multimodal AI maps different modalities (text, images, audio) into a shared vector space so that similar concepts cluster together regardless of whether they started as pixels, words, or sound waves.
CLIP uses contrastive learning on millions of image-text pairs to create a shared embedding space. It can search images with text, do zero-shot classification, and power many downstream multimodal systems.
Vision-language models like GPT-4V and LLaVA combine a vision encoder with a language model, enabling the model to reason about images, answer visual questions, and describe what it sees.
Diffusion models like Stable Diffusion generate images from text by starting with random noise and iteratively denoising it, guided by an encoding of the text prompt. They run in the opposite direction from image understanding: from a description to pixels rather than from pixels to a description.
The shared vector space is the unifying idea. Once you can encode anything into the same space, cross-modal search, translation, and generation all become variants of the same operation: find or generate the nearest point in the target modality.

Common Misconceptions

Multimodal models don't truly "see": they process learned visual features. A vision encoder extracts statistical patterns from pixels, and the language model reasons about those patterns. The model has never experienced the physical world; it works entirely from learned correlations between visual features and text descriptions.

Quick check

How does CLIP connect images and text?
