Training Deep Networks
Loss functions, optimizers, regularization, overfitting
Prerequisites
- neural-networks
Building a neural network is easy. Training one? That's where things get interesting. It's like tuning a guitar with a million strings — except you can't hear most of them, and turning one affects all the others. The good news: we've figured out some really clever tricks to make this work.
In the last lesson, you saw the basic training loop: forward pass, loss, backpropagation, weight update, repeat. But that's just the skeleton. The real craft of deep learning lives in the details: which loss function to use, how to update the weights, how fast to move, and how to stop the network from memorizing your training data instead of actually learning.
This lesson is about those details — the practical toolkit that makes the difference between a model that works and one that doesn't.
The training loop
Every neural network learns the same way. You show it data, it makes a prediction, you measure how wrong it was, and you nudge the weights to do better next time. Here's the loop:
[Figure: The training loop]
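To make the loop concrete, here is a minimal sketch in plain Python. It is not a neural network: it fits a single line, y = w*x + b, so that each of the four steps stays visible. The toy data, the learning rate of 0.05, and the 500 epochs are assumptions chosen for illustration.

```python
# Minimal training loop: fit y = w*x + b with full-batch gradient descent.
# The data, learning rate, and epoch count are illustrative assumptions.

def train(xs, ys, lr=0.05, epochs=500):
    w, b = 0.0, 0.0  # start from arbitrary weights
    n = len(xs)
    loss = float("inf")
    for _ in range(epochs):
        # 1. Forward pass: make predictions with the current weights.
        preds = [w * x + b for x in xs]
        # 2. Loss: mean squared error between predictions and targets.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # 3. Backward pass: gradients of the loss with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # 4. Weight update: take a small step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b, final_loss = train(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, loss={final_loss:.6f}")
```

A real network swaps the one-line forward pass for many layers and lets a framework compute the gradients, but the four steps are the same.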
The loss function is the heart of this loop. It's a single number that tells you how wrong your network is. Everything else — the optimizer, the learning rate, the regularization — is in service of making that number smaller. If you pick the wrong loss function, it doesn't matter how good your optimizer is. You'll be solving the wrong problem.
Think of the loss as a landscape — a terrain of mountains and valleys. Training is the process of finding the lowest valley. The loss function defines the shape of the terrain. The optimizer decides how you walk through it. The learning rate controls how big your steps are. And regularization makes sure you don't just memorize the map instead of learning to navigate.
Making training actually work
The training loop is simple in theory. Making it work in practice requires four pieces, each with its own set of choices and tradeoffs: the loss function, the optimizer, the learning rate, and regularization.
Step 1: Loss Functions — Measuring wrongness
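The loss function translates "how wrong is the network?" into a number. Different tasks need different measures of wrongness.
Cross-entropy (classification): "Is this a cat or a dog?" Penalizes confident wrong answers heavily. If the network says 99% cat when it's a dog, the loss is about 4.6 (using the natural log), compared with about 0.01 when that same 99% confidence is correct.
Mean squared error, MSE (regression): "What's the house price?" The average of squared differences between predicted and actual values. Big errors get punished quadratically: an error twice as large costs four times as much.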
The goal is always the same: make this number smaller. A loss of 0 means perfect predictions (which you'll never actually achieve on real data, and shouldn't want to — more on that later).
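Both measures of wrongness fit in a few lines. The sketch below assumes the standard definitions: cross-entropy for a single example is the negative natural log of the probability given to the correct class, and mean squared error is the average squared difference. The house prices are made-up numbers.

```python
import math

def cross_entropy(p_correct):
    """Loss for one example, given the probability assigned to the correct class."""
    return -math.log(p_correct)

def mean_squared_error(predicted, actual):
    """Average of squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Classification: the true label is "dog".
confident_right = cross_entropy(0.99)  # network says 99% dog
unsure = cross_entropy(0.50)           # network says 50/50
confident_wrong = cross_entropy(0.01)  # network says 99% cat, so only 1% dog

# Regression: hypothetical house prices, in thousands.
actual = [300.0, 450.0, 200.0]
three_small_misses = [310.0, 440.0, 210.0]  # each prediction off by 10
one_big_miss = [300.0, 450.0, 230.0]        # one prediction off by 30

mse_small = mean_squared_error(three_small_misses, actual)
mse_big = mean_squared_error(one_big_miss, actual)
print(confident_right, unsure, confident_wrong)
print(mse_small, mse_big)
```

The two regression cases have the same total absolute error (30), yet the single big miss produces three times the MSE. That is the quadratic punishment at work.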
These four pieces — loss function, optimizer, learning rate, and regularization — are the toolkit of every deep learning practitioner. Getting them right is more art than science, which is why the community has converged on sensible defaults: Adam optimizer, learning rate around 0.001 with scheduling, dropout of 0.1-0.3, and batch normalization after each layer. Start there, and adjust based on what you see.
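To see what the Adam default actually does, here is a sketch of one update for a single scalar weight, following the standard published update rule. The function name `adam_step` and the toy loss (w - 3)^2 are assumptions for illustration, and the toy run uses a learning rate of 0.01 rather than 0.001 so that it finishes quickly.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Adaptive step size: the first update has magnitude close to lr
# whether the gradient is huge or tiny.
w_big, _, _ = adam_step(0.0, grad=1000.0, m=0.0, v=0.0, t=1)
w_tiny, _, _ = adam_step(0.0, grad=0.001, m=0.0, v=0.0, t=1)

# Toy problem (an assumption): minimize loss(w) = (w - 3)^2, gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w_big, w_tiny, w)
```

Plain SGD would have moved the weight by lr times the gradient, so a step of 1.0 in the first case and 0.000001 in the second. Adam's normalization is what makes it less sensitive to the scale of the gradients.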
Now try it yourself
Here's a 2D loss landscape. The dark blue regions have low loss (good) and the warm regions have high loss (bad). The white ball represents your model's weights during training. Try different learning rates and optimizers to see how training behavior changes. Can you find the global minimum? Watch what happens when the learning rate is too high — the ball overshoots and bounces around instead of converging.
[Interactive demo: Loss Landscape Explorer. Watch gradient descent navigate a loss landscape while you adjust the learning rate and optimizer.]
Try this: Set the learning rate to 0.8 with SGD and watch the ball overshoot. Then switch to Adam and see how momentum and adaptive step sizes smooth the path.
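You can reproduce the overshoot without the explorer. The sketch below runs plain gradient descent on a one-dimensional toy loss, loss(w) = 2 * w**2. This loss is an assumption for illustration and is not the landscape used in the demo. With this curvature, each step multiplies w by (1 - 4 * lr).

```python
def gradient_descent(lr, w=1.0, steps=20):
    """Plain gradient descent on loss(w) = 2 * w**2, whose gradient is 4 * w."""
    for _ in range(steps):
        w -= lr * 4 * w
    return w

w_small_lr = gradient_descent(lr=0.1)  # each step multiplies w by 0.6: converges
w_large_lr = gradient_descent(lr=0.8)  # each step multiplies w by -2.2: diverges
print(w_small_lr, w_large_lr)
```

With the large learning rate the weight flips sign on every step and grows each time. That is the bouncing ball: every update jumps past the minimum and lands farther away than it started.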
Key Takeaways
- The loss function measures how wrong your model is. Cross-entropy for classification, MSE for regression. Everything else in training is about making this number smaller.
- Optimizers decide how to update weights. SGD follows the gradient directly; Adam adds momentum and adaptive step sizes, which is why it's the default choice for most tasks.
- The learning rate is often the most important hyperparameter to tune. Too high and training explodes; too low and it stalls. Learning rate scheduling (start high, reduce over time) gives you the best of both worlds.
- Regularization helps prevent overfitting. Dropout randomly disables neurons during training. Batch normalization normalizes layer outputs, which mainly stabilizes training and also has a mild regularizing effect. Both help the network learn general patterns instead of memorizing training data.
- The gap between training loss and validation loss is your main diagnostic. If training loss is low but validation loss is high, you're overfitting. If both are high, you're underfitting. Monitor both.
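Dropout, from the regularization takeaway above, is short enough to sketch directly. This version assumes the common "inverted dropout" formulation, where surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and the list-of-floats representation are assumptions for illustration.

```python
import random

def dropout(activations, rate, training, rng=random):
    """Inverted dropout: zero each activation with probability `rate` during
    training and scale the survivors by 1 / (1 - rate). No-op at inference."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded so the run is repeatable
layer_output = [1.0] * 10000
train_output = dropout(layer_output, rate=0.2, training=True, rng=rng)
eval_output = dropout(layer_output, rate=0.2, training=False, rng=rng)

fraction_dropped = train_output.count(0.0) / len(train_output)
print(fraction_dropped, sum(train_output) / len(train_output))
```

Roughly 20% of the activations are zeroed on each training pass, and the scaling keeps the average activation close to what the next layer sees at inference.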
Common Misconceptions
- "A lower learning rate is always safer." -- Not quite. Too low and training can stall on plateaus or in poor local minima, or take impractically long. A learning rate that is too low can produce a worse model than a moderately high one. The usual strategy is to start relatively high and reduce over time (learning rate scheduling).
- "More training always means a better model." -- After a point, continued training makes the model memorize training data (overfitting) rather than learning generalizable patterns. This is why we use validation sets and early stopping. The best model is often not the one trained the longest.
- "Adam is always better than SGD." -- Adam usually converges faster and is less sensitive to hyperparameters, but SGD with momentum can sometimes find flatter minima that generalize better. In practice, Adam is the safe default, but SGD remains competitive for certain tasks, such as training vision models on very large datasets.
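Early stopping, mentioned in the second misconception, can be sketched as a small function over a history of validation losses. The patience of 3 epochs and the loss values are assumptions for illustration. In a real run you would check after each epoch and keep a copy of the weights from the best epoch.

```python
def early_stopping(val_losses, patience=3):
    """Return (best_epoch, stopped_at). Training stops once validation loss
    has gone `patience` epochs without improving on its best value."""
    best_loss = float("inf")
    best_epoch = -1
    stopped_at = len(val_losses) - 1  # ran to the end unless we break early
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            stopped_at = epoch
            break
    return best_epoch, stopped_at

# Hypothetical validation losses: improving at first, then overfitting sets in.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
best_epoch, stopped_at = early_stopping(val_losses)
print(best_epoch, stopped_at)
```

Here the best model is the one from epoch 3 (counting from 0), and training halts at epoch 6 instead of running to the end.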