CNNs — How AI Sees Images
Convolution, filters, pooling, feature maps, transfer learning
Prerequisites
- neural-networks
Close your eyes and think of a cat. You didn't picture a list of pixel values, did you? You pictured ears, whiskers, fur — features. Your brain automatically breaks down images into meaningful parts. CNNs do something surprisingly similar. They learn to detect edges, then shapes, then objects — building up from simple to complex, layer by layer.
A regular neural network sees an image as one flat list of numbers. Every pixel connects to every neuron. That works for small images, but for a 1080p photo? At 1920x1080, that is about 2 million pixels, or roughly 6 million input values once you count the three color channels. Connect those to even a thousand neurons and you need billions of connections in the first layer alone. Worse, flattening throws away all spatial structure: the network has no built-in notion that neighboring pixels are related.
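To get a feel for the scale, here is a rough back-of-the-envelope calculation. It assumes a 1920x1080 RGB frame and a hypothetical first layer of 1,000 neurons; the layer width is an illustrative choice, not a figure from this lesson.

```python
# Rough count of first-layer connections in a fully connected network.
# Assumptions (illustrative only): 1920x1080 RGB input, 1,000 hidden neurons.
width, height, channels = 1920, 1080, 3
hidden_neurons = 1_000  # hypothetical layer width

pixels = width * height                       # pixel locations in the frame
input_values = pixels * channels              # numbers fed to the network
connections = input_values * hidden_neurons   # one weight per (input, neuron) pair

print(f"pixels:       {pixels:,}")        # 2,073,600
print(f"input values: {input_values:,}")  # 6,220,800
print(f"connections:  {connections:,}")   # 6,220,800,000
```

Even with this modest layer width, the first layer alone would need over six billion weights.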
Convolutional Neural Networks (CNNs) solve this by doing what your visual cortex does: look at small patches of the image at a time, detect local patterns, and gradually assemble them into bigger concepts. A pixel alone means nothing. A patch of pixels might be an edge. A group of edges might be a circle. A circle in the right place might be an eye. An eye plus other features? That's a face. CNNs learn this hierarchy automatically.
Filters that learn to see
The core idea of a CNN is the convolution operation. Instead of connecting every pixel to every neuron, you slide a small filter (typically 3x3 or 5x5) across the image, computing a dot product at each position. The filter is just a tiny grid of weights — numbers that the network learns during training.
Think of the filter as a pattern detector. A filter with values like [-1, 0, 1] across columns will fire strongly wherever it finds a vertical edge — a sharp transition from dark to light. A different filter might detect horizontal edges, diagonal lines, or specific textures. Early in training, filters start random and meaningless. As the network trains, they evolve into useful pattern detectors.
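The edge-detector idea can be tried on a single row of pixels. The sketch below is a simplified one-dimensional version, assuming brightness values of 0 (dark) and 1 (light); the row of pixels and the helper name `slide_filter` are made up for illustration.

```python
# Slide the filter [-1, 0, 1] along one row of pixel brightness values.
# 0 = dark, 1 = light. The row is invented for illustration.
row = [0, 0, 0, 1, 1, 1, 0, 0]
edge_filter = [-1, 0, 1]

def slide_filter(values, weights):
    """Return the dot product of `weights` with every window of `values`."""
    k = len(weights)
    return [
        sum(w * v for w, v in zip(weights, values[i:i + k]))
        for i in range(len(values) - k + 1)
    ]

responses = slide_filter(row, edge_filter)
print(responses)  # [0, 1, 1, 0, -1, -1]
```

The filter responds with positive values around the dark-to-light transition, negative values around the light-to-dark transition, and zero where the brightness is flat.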
The output of sliding one filter across the entire image is called a feature map — a new, smaller image where each pixel tells you "how strongly did this pattern appear at this location?" Multiple filters run in parallel, each producing its own feature map, each looking for a different pattern. A typical first layer might have 32 or 64 filters running simultaneously.
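How much smaller the feature map gets depends on the filter size. Below is a small sketch of the usual shape arithmetic. The `stride` and `padding` parameters and the 28x28 input are illustrative additions that this lesson does not cover; with the defaults, the filter moves one pixel at a time over an unpadded image.

```python
def feature_map_shape(height, width, kernel, stride=1, padding=0):
    """Output (height, width) produced by one square filter on one image."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w

image_h, image_w = 28, 28   # hypothetical input size
num_filters = 32            # one feature map per filter

out_h, out_w = feature_map_shape(image_h, image_w, kernel=3)
print((out_h, out_w))                # (26, 26)
print((num_filters, out_h, out_w))   # (32, 26, 26): a stack of 32 feature maps

print(feature_map_shape(28, 28, kernel=5))             # (24, 24)
print(feature_map_shape(28, 28, kernel=3, padding=1))  # (28, 28)
```

A 3x3 filter trims one pixel from each border, while adding one pixel of padding keeps the output the same size as the input.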
The CNN pipeline
One key property makes this work: weight sharing. The same filter slides across the entire image, so a vertical edge detector works whether the edge is in the top-left corner or the bottom-right. This dramatically reduces the number of parameters compared to a fully connected network. It also makes convolution translation-equivariant: shift the input and the feature map shifts with it. Combined with pooling, this makes CNNs largely insensitive to where an object appears, so they can recognize a cat whether it's centered or off to the side.
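Weight sharing can be seen directly in a small experiment. The sketch below applies one 3x3 vertical edge filter to two made-up images that differ only in where a light bar sits. The helper names `conv2d` and `image_with_bar` are hypothetical, and the convolution assumes a one-pixel step with no padding.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (step of 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def image_with_bar(bar_col, height=5, width=8):
    """Dark image (0s) with a light bar (1s), two pixels wide, at bar_col."""
    return [[1 if bar_col <= c < bar_col + 2 else 0 for c in range(width)]
            for _ in range(height)]

vertical_edge = [[-1, 0, 1]] * 3  # the same 9 weights are reused everywhere

left = conv2d(image_with_bar(2), vertical_edge)
right = conv2d(image_with_bar(5), vertical_edge)

print(left[0])   # [3, 3, -3, -3, 0, 0]
print(right[0])  # [0, 0, 0, 3, 3, -3]
```

Moving the bar three columns to the right moves the response pattern three columns to the right (cut off at the border), and the filter still has only nine weights no matter how large the image is.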
Building up from edges to objects
A CNN is a stack of specialized layers. Each layer transforms its input, extracting increasingly abstract features. Walk through the four main operations that make a CNN work.
Step 1: Convolution
A small 3x3 filter slides across the image one pixel at a time. At each position, it computes a dot product — multiply each filter weight by the corresponding pixel value, then sum everything up. The result goes into the output feature map.
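The multiply-and-sum step can be written out with plain loops. This is a minimal sketch, assuming a one-pixel step and no padding; the 4x4 image, the filter values, and the function name `convolve2d` are invented for illustration.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image`; each output cell is one dot product."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):            # top-left corner of the current patch
        for c in range(out_w):
            total = 0
            for i in range(kh):       # multiply each weight by its pixel...
                for j in range(kw):
                    total += kernel[i][j] * image[r + i][c + j]
            feature_map[r][c] = total  # ...and store the sum
    return feature_map

image = [
    [1, 2, 0, 1],
    [0, 1, 3, 1],
    [2, 1, 0, 0],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2], [-1, -1]]
```

For the top-left position, the sum is (1 - 0) + (0 - 3) + (2 - 0) = 0, which is the first entry of the feature map.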
Each filter detects exactly one type of pattern. Strong positive output means "this pattern is here." Near-zero means "nothing interesting." Negative means "the opposite pattern is here."
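The three cases (positive, near-zero, negative) can be checked on three tiny patches. The patches and the helper name `response` below are illustrative, and the filter is a simple vertical edge detector rather than one learned by a network.

```python
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

def response(patch, kernel):
    """Dot product of one 3x3 patch with one 3x3 filter."""
    return sum(k * p
               for kernel_row, patch_row in zip(kernel, patch)
               for k, p in zip(kernel_row, patch_row))

dark_to_light = [[0, 0, 1]] * 3   # the pattern the filter looks for
flat = [[1, 1, 1]] * 3            # uniform brightness, no edge
light_to_dark = [[1, 0, 0]] * 3   # the opposite pattern

print(response(dark_to_light, vertical_edge))  # 3
print(response(flat, vertical_edge))           # 0
print(response(light_to_dark, vertical_edge))  # -3
```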
That's the full pipeline: slide filters across the image (convolution), keep the strongest signals (pooling), repeat to build higher-level features, then classify. The same training loop from the neural networks lesson — forward pass, loss, backpropagation, weight update — teaches the filters what patterns to look for. Nobody hand-designs the filters. The network discovers the right edge detectors, texture detectors, and shape detectors entirely from data.
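The "keep the strongest signals" step can be sketched as max pooling over small blocks. The version below assumes non-overlapping 2x2 blocks, a common choice but not one this lesson specifies; the feature map values and the function name `max_pool2d` are made up.

```python
def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

feature_map = [
    [1, 3, 0, 2],
    [4, 2, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 1],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 7]]
```

The 4x4 map shrinks to 2x2, and each output cell records only that a strong response occurred somewhere in its block, not exactly where.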
See convolution in action
Here's a convolution filter you can play with. Draw a pattern on the input grid (click cells to toggle them), pick a filter, and watch how the convolution produces a feature map. Use "Next Step" to advance the sliding window one position at a time, or "Auto Play" to watch it scan the whole image. Notice how different filters highlight different features of the same input — edges in different directions, sharpened features, or blurred averages.
Convolution Filter Visualizer
Try this: Load the "Box" pattern and switch between horizontal and vertical edge filters. Notice how each filter only fires where it finds its matching edge direction.
Key Takeaways
- CNNs use small, learnable filters that slide across images to detect local patterns. This is the convolution operation — a dot product between a filter and an image patch at every position.
- Feature maps are the output of convolution — they show where and how strongly each pattern was detected. Many filters run in parallel, each looking for something different.
- Deeper layers detect increasingly abstract features. Early layers find edges and textures. Middle layers combine those into shapes. Deep layers recognize whole objects. This hierarchy emerges automatically from training.
- Pooling (usually max pooling) shrinks feature maps, reducing computation and making the network less sensitive to exact pixel positions. It keeps the strongest signals while discarding spatial precision the network does not need.
- Transfer learning lets you reuse a pre-trained CNN's learned features for new tasks with far less data. The early-layer features (edges, textures, simple shapes) tend to transfer well across image domains.
Common Misconceptions
- "CNNs only work on images." — While designed for images, CNNs work on any grid-structured data. They're used for audio spectrograms, time series, DNA sequences, and even text classification. The key requirement is that local spatial patterns matter.
- "More filters and more layers always improve accuracy." — Adding capacity without enough data leads to overfitting. A ResNet-152 trained on 100 images will memorize them rather than learn useful features. Architecture, data quantity, and augmentation need to be balanced.
- "CNNs understand what objects are." — A CNN that classifies cats has no concept of "cat." It has learned statistical correlations between pixel patterns and labels. Research has shown CNNs can be fooled by imperceptible perturbations (adversarial examples) that change the prediction completely while looking identical to humans.