Optimization and What Is the Best I Can Do

Q: What is a critical point of a function?

A point where the derivative is zero

Q: In gradient descent, what does the learning rate control?

The size of each step taken in the direction opposite the gradient

Q: What is the difference between a local maximum and a global maximum?

A local maximum is the best point in its neighborhood; a global maximum is the best point overall

The hill you are climbing

Picture yourself on a hill in the fog. You cannot see the top or the bottom. You can only feel the slope under your feet. Your goal is to reach the highest point.

What do you do? You look at the slope. If the ground rises to your right, you step right. If it rises to your left, you step left. You keep stepping in the direction that goes up, and eventually you reach a peak.

That is optimization. You have a function - the height of the hill - and you want to find its maximum. The derivative tells you which way is up. You follow it until you can go no higher.

The same idea works for finding minima. If you want the lowest point in a valley, you follow the slope downward.

Where the derivative is zero

At the very top of a hill, the ground is flat. The slope is zero. At the very bottom of a valley, the ground is also flat. The slope is zero.

These flat spots are called critical points. They are the only places where a smooth function can have a maximum or a minimum. To find the best outcome, you find where the derivative is zero and then check whether that point is a peak, a valley, or a flat plain.

f(x) = x^2
f'(x) = 2x
Set f'(x) = 0: 2x = 0, so x = 0.
At x = 0, f(0) = 0. This is the minimum of the parabola.

The derivative told us exactly where the minimum is. No guessing, no plotting a hundred points. One equation, one solution.

Local maxima vs global maxima

Not every flat spot is the best spot. A hill can have many local peaks. Each local peak is higher than the ground immediately around it, but a distant mountain may be taller.

In optimization, a local maximum is the best point in its neighborhood. A global maximum is the best point overall. Finding a local maximum is easy: follow the slope upward until it flattens. Finding the global maximum is hard, because you might have to explore the whole landscape.

This is the central challenge of machine learning. A neural network has millions of parameters. The loss function - the measure of how wrong the network is - is a surface in a million-dimensional space. Training the network means finding the lowest point on that surface. But the surface has many local valleys, and gradient descent can get stuck in one of them.

Gradient descent: the algorithm that learns

Gradient descent is the workhorse of machine learning. It is the algorithm that trains neural networks, fits regression models, and optimizes recommendation systems.

The idea is identical to the hill-climbing metaphor, but in higher dimensions. Instead of a slope (one number), you have a gradient (a vector of partial derivatives). The gradient points in the direction of steepest ascent. To minimize a function, you step in the opposite direction.

new_position = old_position - learning_rate * gradient

The learning rate controls how big a step you take. Too small, and you crawl. Too large, and you overshoot the minimum, bouncing back and forth or even flying off to infinity.

Each step reduces the loss a little. Repeat thousands of times, and the model converges to a local minimum. That is "learning."

A tiny example: fitting a line

Suppose you have data points that roughly follow a line. You want to find the slope and intercept that make the line fit the data best. The "best fit" is the line that minimizes the sum of squared errors - the distance between each data point and the line.

This is an optimization problem. The function you are minimizing is the sum of squared errors. The variables are the slope and the intercept. The derivative of the error with respect to each variable tells you how to nudge that variable to reduce the error.

You do not need to solve this by hand. Libraries do it for you. But the underlying mechanism is gradient descent: compute the gradient, take a step opposite the gradient, repeat.

See it run

Here is a tiny gradient descent implementation that finds the minimum of f(x) = (x - 3)^2. The minimum is at x = 3.

def f(x):
    return (x - 3) ** 2

def derivative_f(x):
    return 2 * (x - 3)

x = 0.0          # starting point
learning_rate = 0.1

for i in range(20):
    grad = derivative_f(x)
    x = x - learning_rate * grad
    print(f"Step {i+1}: x = {x:.4f}, f(x) = {f(x):.4f}")

print("Final x:", x)

What just happened: The function f(x) = (x - 3)^2 has its minimum at x = 3. Starting from x = 0, the gradient descent algorithm took 20 steps, each time moving opposite the gradient (which points toward the minimum). The learning rate of 0.1 controlled the step size. By step 20, x had converged to approximately 3.0000, and f(x) was essentially zero.

For builders

Optimization is not just for math class. It is the engine behind much of the software you write and use.

Machine learning - Training any model is optimization. The model has parameters. The loss function measures error. Gradient descent (or a variant) finds the parameters that minimize the loss.
Hyperparameter tuning - Choosing the learning rate, the batch size, the number of layers: all of these are optimization problems. You want the settings that produce the best validation score.
Performance tuning - Finding the fastest configuration for a database query, the optimal cache size, or the best thread pool size: all optimization.
Operations research - Scheduling, routing, inventory management: all about finding the best allocation of limited resources.
Game AI - Pathfinding in a game is often optimization: find the path with the lowest cost, where cost might be distance, time, or risk.

The key insight: optimization is the search for the best outcome. Calculus makes it systematic by showing that the best outcome happens where the derivative is zero. Follow the gradient, and you will find a peak or a valley. The challenge is knowing whether it is the best one.

What we have built

A derivative is the rate of change at a single point.
A critical point is where the derivative is zero - a candidate for a maximum or minimum.
A local maximum is the best point in its neighborhood. A global maximum is the best point overall.
Gradient descent finds a minimum by stepping opposite the gradient, repeated until convergence.
The learning rate controls step size. Too small is slow. Too large is unstable.
In code, gradient descent trains neural networks, fits models, and tunes hyperparameters.

A quick check before you move on:

[
  {
    "q": "What is a critical point of a function?",
    "choices": ["A point where the function is undefined", "A point where the derivative is zero", "A point where the function reaches its global maximum", "A point where the function is increasing fastest"],
    "answer": 1,
    "explain": "A critical point is a point where the derivative is zero (or undefined). At a smooth maximum or minimum, the slope is flat, so the derivative is zero. Not every critical point is a maximum or minimum, but every smooth maximum or minimum is a critical point."
  },
  {
    "q": "In gradient descent, what does the learning rate control?",
    "choices": ["The number of iterations", "The size of each step taken in the direction opposite the gradient", "The initial value of the parameters", "The dimensionality of the problem"],
    "answer": 1,
    "explain": "The learning rate controls how large a step you take at each iteration. A small learning rate means slow but stable convergence. A large learning rate means faster progress but risk of overshooting the minimum or diverging."
  },
  {
    "q": "What is the difference between a local maximum and a global maximum?",
    "choices": ["A local maximum is higher than a global maximum", "A local maximum is the best point in its neighborhood; a global maximum is the best point overall", "A local maximum is for functions of one variable; a global maximum is for functions of many variables", "There is no difference; they are the same thing"],
    "answer": 1,
    "explain": "A local maximum is higher than all nearby points, but a distant peak may be taller. A global maximum is the highest point in the entire domain. Gradient descent is guaranteed to find a local maximum, but not necessarily the global one."
  }
]

← Phase 1: Derivatives as Right Now Speed · Guide overview · Phase 3: Integrals as the Total So Far →

Check your understanding 3 questions

1. What is a critical point of a function?

2. In gradient descent, what does the learning rate control?

3. What is the difference between a local maximum and a global maximum?