The Training Loop
You've now met all the pieces. In Phase 4 you built a model that turns inputs into predictions. In Phase 5 you got a loss function that measures how wrong those predictions are, and an optimizer that knows how to adjust the model's weights. This phase is where those pieces start moving. The training loop is the ritual that actually trains a model, and its shape never really changes, from a three-line toy to a giant language model.
So let's get the mental model dead clear before a single line of code, because this is THE thing to internalize about PyTorch.
1. The mental model: a loop you write yourself
📝 Training is one short cycle, repeated thousands of times:
- Show the model some data → it makes predictions (the forward pass).
- Measure how wrong it was → the loss, one number.
- Compute which way to nudge each weight to make the loss smaller → the backward pass.
- Take a small step in that direction → the optimizer updates the weights.
Then do it again. And again. Each pass, the model is a little less wrong than the last. That slow, patient nudging is learning — it's the gradient descent from How a Model Learns, now made concrete in code.
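To make the four-step cycle tangible before an optimizer enters the picture, here is a minimal sketch that does the update by hand on a single weight. The data values, the starting weight, and the step size `lr` are assumptions chosen for illustration; in the rest of this phase the optimizer takes over the last two lines of the loop.

```python
import torch

# Toy data for y = 2x (values chosen for illustration).
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X

# A single weight, starting at 0, that autograd will track.
w = torch.zeros(1, requires_grad=True)
lr = 0.05  # step size (assumed value)

for step in range(50):
    pred = X * w                      # forward pass: predictions
    loss = ((pred - y) ** 2).mean()   # loss: one number
    loss.backward()                   # backward pass: fills w.grad
    with torch.no_grad():
        w -= lr * w.grad              # small step against the gradient
    w.grad.zero_()                    # reset the gradient for the next cycle

print(f"learned w = {w.item():.4f}")
```

The learned weight should land very close to 2, which is the slope the data was generated with.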
Here's the part that surprises people coming from other libraries: PyTorch has no model.fit().
There's no magic "train this for me" button. You write the loop. That sounds like more work, and the
first time it's a little intimidating — but it's a gift. Nothing is hidden. You can see and change every
step, which is exactly why researchers reach for PyTorch. And the loop is short and always the same shape,
so once you've written it once, you've written it forever.
flowchart LR
A[Data X, y] --> B[pred = model X]
B --> C[loss = loss_fn pred, y]
C --> D[optimizer.zero_grad]
D --> E[loss.backward]
E --> F[optimizer.step]
F -->|next epoch| A
That diagram is the whole phase. Everything below is just making each box concrete.
2. The canonical loop
Here it is — the most important code in this entire guide. Read it slowly. We'll dissect every line afterward, but first take in the shape of it: setup, then a loop that repeats five steps.
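import torch
from torch import nn

# Some toy data: learn y = 2x. X is (4, 1), y is (4, 1).
# (Example values; any inputs with y = 2x fit the description.)
X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

model = nn.Linear(1, 1)    # one input, one output
loss_fn = nn.MSELoss()     # mean squared error
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimizer type and lr are assumed

for epoch in range(100):   # repeat the ritual 100 times
    pred = model(X)            # 1. forward: predictions
    loss = loss_fn(pred, y)    # 2. measure wrongness
    optimizer.zero_grad()      # 3. clear old gradients
    loss.backward()            # 4. backward: compute gradients
    optimizer.step()           # 5. nudge the weights
    if epoch % 20 == 0:
        print(f"epoch {epoch} | loss {loss.item():.4f}")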
epoch 0 | loss 28.4631
epoch 20 | loss 0.6118
epoch 40 | loss 0.1370
epoch 60 | loss 0.0312
epoch 80 | loss 0.0072
What just happened: The setup ran once — a model, a loss function, an optimizer wired to the model's
parameters. Then the loop ran 100 times, and each pass did the same five steps: predict, measure loss,
clear gradients, compute new gradients, step. The thing to feel is the loss column: it starts at 28.46
(the untrained model is very wrong) and falls toward zero. That falling number is the model learning
y = 2x. Nothing here is special to this toy — swap in a deep network and a real dataset and the loop
body is identical.
💡 Notice loss.item() in the print. loss is a tensor (a zero-dim one, from
Phase 2); .item() pulls out the plain Python number for printing.
Get in the habit: keeping the raw tensor around every step (say, by appending it to a list of
losses) also quietly holds onto its computation graph and wastes memory.
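A small sketch of what .item() actually changes. The tensors here are made-up values, not the toy model from the loop; the point is only the difference between the loss tensor and the number inside it.

```python
import torch
from torch import nn

# Made-up predictions and targets, just to produce a loss tensor.
pred = torch.tensor([[1.0], [3.0]], requires_grad=True)
target = torch.tensor([[2.0], [5.0]])

loss = nn.MSELoss()(pred, target)   # mean of (1-2)^2 and (3-5)^2

print(loss.dim())            # 0: a zero-dimensional tensor
print(loss.requires_grad)    # True: still attached to the computation graph
value = loss.item()          # plain Python float, no graph attached
print(type(value), value)

history = []
history.append(loss.item())  # store floats, not tensors, when logging
```

Appending `loss.item()` rather than `loss` keeps the history a list of ordinary floats, so nothing from earlier steps stays alive in memory.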
3. Epochs and batches
Two words you'll see everywhere, and they're simpler than they sound.
📝 An epoch is one full pass over your entire dataset. The loop above ran 100 epochs — it showed the model all four examples, 100 times over. More epochs means more chances to learn (up to a point — past that, the model starts memorizing instead of learning, which is overfitting).
📝 A batch is a chunk of the data processed together in one forward/backward pass. In the toy loop we fed all four examples at once, so the whole dataset was one batch. Real datasets are far too big for that, so each epoch is split into many batches, and you loop over them inside the epoch:
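for epoch in range(100):                    # outer loop: one epoch per pass
    for X_batch, y_batch in data_loader:    # inner loop: one batch at a time
        pred = model(X_batch)
        loss = loss_fn(pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()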
What just happened: The five steps didn't change at all — they just moved one level deeper, inside an
inner loop over batches. Each X_batch is a slice of the data; one trip through the inner loop is one
weight update. One trip through the outer loop is one epoch (every batch seen once). That data_loader
is the piece we haven't built yet — it's the Dataset & DataLoader of
Phase 7, which hands you batches automatically.
Why bother with batches instead of the whole dataset at once? Two reasons. Memory: a million images won't fit on your GPU all at once, but a batch of 64 will. Better learning: updating the weights after every small batch — rather than once per full pass — gives many more, slightly-noisy steps, and that noise actually helps the model find better solutions. So batching isn't a compromise; it's how modern training is supposed to work.
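Until the DataLoader arrives in Phase 7, plain tensor slicing can stand in for it. The sketch below is one way to do that, not the guide's own method: the dataset size, batch size, learning rate, and the shuffle with torch.randperm are all assumptions made for illustration. It counts weight updates to show that one batch means one step.

```python
import torch
from torch import nn

torch.manual_seed(0)

# 10 toy examples of y = 2x; a batch size of 4 gives batches of 4, 4, and 2.
X = torch.arange(1.0, 11.0).unsqueeze(1) / 10   # scaled down to keep steps stable
y = 2 * X
batch_size = 4

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

updates = 0
for epoch in range(3):
    order = torch.randperm(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]   # indices of this batch
        X_batch, y_batch = X[idx], y[idx]

        pred = model(X_batch)
        loss = loss_fn(pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        updates += 1                            # one batch = one weight update

print(updates)  # 3 epochs x 3 batches per epoch = 9
```

With the whole dataset as a single batch, the same three epochs would have produced only three updates.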
4. Order matters — the #1 beginner bug
Look back at the three middle lines:
optimizer.zero_grad()   # 3. clear old gradients
loss.backward()         # 4. compute new gradients
optimizer.step()        # 5. apply them
⚠️ That order is not optional. zero_grad → backward → step, every single time. Getting it wrong
is the most common way a beginner's training silently breaks. Let's spell out exactly what each line does
and what happens without it.
- loss.backward() runs the backward pass from Phase 3. It walks the computation graph and computes, for every weight, the gradient: the direction that would make the loss bigger. After this line, each parameter has its .grad filled in. Without it, there are no gradients at all, so the next line has nothing to act on and the model never changes.
- optimizer.step() uses those .grad values to nudge each weight a small amount in the loss-reducing direction. This is the actual learning. Without it, you compute perfect gradients and then ignore them, and the model stays frozen.
- optimizer.zero_grad() is the one beginners forget, and it's the subtle one. Here's the trap: in PyTorch, backward() adds the new gradients to whatever is already in .grad. It accumulates, it doesn't overwrite (this is the gradient-accumulation behavior from Phase 3). So if you don't reset to zero before each backward(), this step's gradients pile on top of last step's, and the step before that, and so on. Your weight updates get computed from a stale, ever-growing sum of gradients, the steps go haywire, and training quietly falls apart: no error, just a loss that refuses to fall or explodes. Without zero_grad(), the loop runs fine and the result is garbage, which is the worst kind of bug.
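The accumulation behavior is easy to see in isolation. This is a minimal sketch on a single scalar weight, with no model or optimizer involved; the variable names are just for illustration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

(3 * w).backward()            # d(3w)/dw = 3
first = w.grad.item()         # 3.0

(3 * w).backward()            # the same gradient again
second = w.grad.item()        # 6.0: added to the previous 3.0, not replaced

w.grad.zero_()                # what zero_grad() does for every parameter
after_reset = w.grad.item()   # 0.0

print(first, second, after_reset)
```

The second backward() call computes exactly the same gradient as the first, yet .grad doubles. That doubling is what piles up across steps when the reset is missing.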
Here's what forgetting it looks like:
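# BROKEN: no optimizer.zero_grad()
for epoch in range(100):
    pred = model(X)
    loss = loss_fn(pred, y)
    loss.backward()     # gradients ACCUMULATE across every epoch
    optimizer.step()
    if epoch % 20 == 0:
        print(f"epoch {epoch} | loss {loss.item():.4f}")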
epoch 0 | loss 28.4631
epoch 20 | loss 1453.8079
epoch 40 | loss 98211.4453
epoch 60 | loss nan
epoch 80 | loss nan
What just happened: With no zero_grad(), each backward() added its gradients to the leftover pile
from all previous epochs. The "step" the optimizer took kept growing, overshooting wildly, until the
numbers blew up to nan (not-a-number). Same model, same data, same learning rate as the working loop in
section 2 — the only difference is the missing reset line. That's how much one line matters. When your
loss explodes to nan, "did I forget zero_grad()?" should be your first thought.
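If you want to reproduce the comparison yourself, here is a hedged sketch that runs the toy problem twice from the same starting weights, once with the reset and once without. The data values, the zero initialization, and lr=0.05 are assumptions picked so the runs are repeatable. How the failure shows up depends on the data and learning rate: in this particular setup the broken run tends to keep swinging instead of settling, rather than reaching nan.

```python
import torch
from torch import nn

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X

def train(use_zero_grad, epochs=100, lr=0.05):
    """Run the toy loop and return the loss recorded at every epoch."""
    model = nn.Linear(1, 1)
    nn.init.zeros_(model.weight)   # fixed start so both runs are comparable
    nn.init.zeros_(model.bias)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    losses = []
    for epoch in range(epochs):
        pred = model(X)
        loss = loss_fn(pred, y)
        if use_zero_grad:
            optimizer.zero_grad()  # the one line under test
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

good = train(use_zero_grad=True)
bad = train(use_zero_grad=False)

print(f"with zero_grad:    final loss {good[-1]:.4f}")
print(f"without zero_grad: mean of last 50 losses {sum(bad[-50:]) / 50:.4f}")
```

Averaging the last 50 losses for the broken run is deliberate: a loss that keeps swinging can look fine at any single epoch, so one sample is not a fair reading.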
💡 The order is a tiny story: clear the slate (zero_grad), figure out which way to go
(backward), take the step (step). Say it that way once and you'll never reorder it.
5. Tracking progress, and train vs. eval
The loss printout isn't decoration — it's your only window into whether training is working. A healthy
run shows the loss falling and roughly leveling off. If it doesn't fall, something is wrong, and it's
almost always one of three things: the learning rate is off (too high → it explodes; too low → it barely
moves — see Phase 5), there's a bug in the loop (often the zero_grad one),
or the data is bad. Print the loss every epoch from day one; it's the cheapest diagnostic you have.
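The learning-rate failure modes are worth seeing once on the toy problem. The following is a sketch under assumed settings (zero-initialized weights, 100 epochs, and three hand-picked learning rates); `final_loss` is a hypothetical helper written for this example, not part of PyTorch.

```python
import math
import torch
from torch import nn

X = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = 2 * X

def final_loss(lr, epochs=100):
    """Train the toy model with the given learning rate; return the last loss."""
    model = nn.Linear(1, 1)
    nn.init.zeros_(model.weight)   # same start for every learning rate
    nn.init.zeros_(model.bias)
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()

for lr in (0.0001, 0.05, 0.5):     # too low, reasonable, too high
    value = final_loss(lr)
    status = "ok" if math.isfinite(value) else "diverged"
    print(f"lr={lr:<7} final loss={value:.4f} ({status})")
```

The smallest rate should leave the loss close to where it started, the middle one should bring it near zero, and the largest should overflow to inf or nan. A math.isfinite check on the loss is a cheap guard to add to any loop.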
There's also a part of training the toy loop skipped: checking how the model does on data it didn't train on. A model that aces its training data but flops on new data hasn't learned — it's memorized (overfitting again). So you hold out some data and evaluate on it. Two PyTorch habits make that correct:
📝 model.train() and model.eval() flip the model between two modes. Some layers behave
differently while training versus while being evaluated (dropout, batch-norm — you'll meet them later),
so you tell the model which phase it's in. Call model.train() before the training loop and
model.eval() before you evaluate.
📝 torch.no_grad() wraps your evaluation code so PyTorch doesn't build the computation graph or
track gradients. You're only measuring here, not learning — there's nothing to update — so skipping the
gradient bookkeeping makes evaluation faster and lighter.
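model.train()                      # training mode
for epoch in range(100):
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

model.eval()                       # evaluation mode
with torch.no_grad():              # no gradients needed -- just measuring
    test_pred = model(X_test)      # X_test: held-out inputs (name assumed)
    # ... compute accuracy or test loss on held-out data ...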
What just happened: The training loop is exactly the five-step ritual you already know, preceded by
model.train(). Then model.eval() switches modes, and torch.no_grad() turns off gradient tracking
for the measurement pass over held-out test data — no backward(), no step(), because we're judging
the model, not training it. This train-then-evaluate shape is the skeleton of the real classifier you'll
build in Phase 8.
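To see that the two modes really change behavior, here is a small sketch using a dropout layer on its own (dropout comes up properly later; treat this as a preview). The input of ones and p=0.5 are assumptions chosen so the effect is easy to read. It also shows what torch.no_grad() does to gradient tracking.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Dropout(p=0.5)
x = torch.ones(8)

layer.train()                 # training mode: dropout is active
train_out = layer(x)          # each element is 0.0 or 2.0 (kept values scale by 1 / (1 - p))

layer.eval()                  # evaluation mode: dropout does nothing
eval_out = layer(x)           # identical to the input

w = torch.ones(8, requires_grad=True)
with torch.no_grad():
    tracked_inside = (w * x).requires_grad   # False: no graph is built here
tracked_outside = (w * x).requires_grad      # True: normal tracking

print(train_out)
print(eval_out)
print(tracked_inside, tracked_outside)
```

Forgetting model.eval() means held-out data gets scored with dropout still zeroing values at random, which makes the evaluation numbers noisy and pessimistic.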
💡 Here's the payoff to sit with: every model is trained by a scaled-up version of this exact loop.
The image recognizers, the recommendation engines, the giant LLMs behind the chatbots you use — under the
hood, they all run forward → loss → zero_grad → backward → step, over and over, across mountains of
data and hardware. The scale is staggering; the ritual is the one you just learned. Master these five
lines and you understand how all of deep learning actually trains.
Recap
- Training is a loop you write yourself; PyTorch has no model.fit(). The body is always the same five steps: forward → loss → zero_grad → backward → step, repeated many times.
- An epoch is one full pass over the data; a batch is a chunk processed in one step. Real training loops over batches inside each epoch, for memory and for better, noisier learning.
- Order is non-negotiable: optimizer.zero_grad() → loss.backward() → optimizer.step(). Forgetting zero_grad() lets gradients accumulate across steps and silently wrecks training (loss explodes to nan).
- Watch the loss fall; it's your main diagnostic. If it doesn't drop, suspect the learning rate, a loop bug, or the data.
- model.train() vs. model.eval() sets the model's mode; evaluate held-out data under torch.no_grad() since you're measuring, not learning.
- Every model, up to giant LLMs, trains with a scaled-up version of this exact loop.
Quick check
1. What is the correct order of the three middle steps in a PyTorch training loop?
   a. loss.backward() → optimizer.zero_grad() → optimizer.step()
   b. optimizer.zero_grad() → loss.backward() → optimizer.step()
   c. optimizer.step() → loss.backward() → optimizer.zero_grad()
   Answer: b. Clear the slate (zero_grad), compute which way to go (backward), then take the step (step). Both of the other orders shown here break training.
2. You remove optimizer.zero_grad() from your loop. What's the most likely result?
   a. A clear RuntimeError that stops the program immediately
   b. Nothing changes; zero_grad() is optional cleanup
   c. The loss climbs or explodes to nan, because gradients accumulate across steps
   Answer: c. backward() ADDS to .grad rather than overwriting it. Without zero_grad(), gradients pile up every step, the updates overshoot, and the loss blows up, usually with no error at all.
3. Why wrap evaluation code in `with torch.no_grad()`?
   a. You're only measuring, not learning, so there's no need to track gradients; it saves memory and time
   b. It makes the model more accurate on the test set
   c. It is required or backward() will throw an error
   Answer: a. During evaluation there's nothing to update, so building the computation graph is wasted work. no_grad() skips gradient tracking, making the pass faster and lighter.