Why is the gradient the direction of steepest ascent?

In my AI class, we’ve been talking about gradient descent. It makes intuitive sense - this is an optimization problem, and we’re just trying to find the minimum of the objective function.

But why does gradient descent work? Why is the gradient the direction of steepest ascent (and, crucially for us, why is the negative gradient the direction of steepest descent, i.e. the fastest way down toward a local minimum)?

Grant Sanderson made a great video about this when he was working with Khan Academy; I recommend checking it out.

Okay so to start - the notion of a partial derivative.

The partial derivative of a function $f$ with respect to $x$ tells you what happens to the output when you nudge the input slightly in the $x$ direction. More generally, take some vector $\vec{v}$: the directional derivative of $f$ along $\vec{v}$, written $\nabla_{\vec{v}} f$, is what happens to $f(m)$ when the input $m$ instead becomes $m + h\vec{v}$, as $h$ tends to 0 (note that $m$ is a vector).
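Written out as a limit, this is just the sentence above in symbols, with the same $m$, $\vec{v}$, and $h$:

$$\nabla_{\vec{v}} f(m) = \lim_{h \to 0} \frac{f(m + h\vec{v}) - f(m)}{h}$$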

Now, if $\vec{v} = \begin{bmatrix}3\\2\end{bmatrix}$, then $\nabla_{\vec{v}} f = 3 \frac{\partial f}{\partial x} + 2 \frac{\partial f}{\partial y}$.
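To make that concrete, here's a small numerical check with a made-up function $f(x, y) = x^2 + 3y$ (the function, the point, and the step size $h$ are all arbitrary choices for illustration, not anything from the class):

```python
import numpy as np

# Made-up example: f(x, y) = x^2 + 3y, so df/dx = 2x and df/dy = 3.
def f(m):
    x, y = m
    return x**2 + 3*y

m = np.array([1.0, 2.0])   # arbitrary point
v = np.array([3.0, 2.0])   # the vector from above
h = 1e-6                   # a small nudge

# Numerical directional derivative: (f(m + h*v) - f(m)) / h
numeric = (f(m + h*v) - f(m)) / h

# 3 * df/dx + 2 * df/dy, evaluated at m = (1, 2)
analytic = 3 * (2 * m[0]) + 2 * 3

print(numeric, analytic)   # both are (approximately) 12
```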

But this is just the dot product of $\vec{v}$ and the gradient of $f$! And dot products can also be written as $\vec{v} \cdot \nabla f = \|\vec{v}\| \, \|\nabla f\| \cos \theta$, where $\theta$ is the angle between the vector and the gradient.
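One detail worth making explicit: to compare directions fairly, we should only look at unit vectors, so the length of $\vec{v}$ doesn't inflate the result. For $\|\vec{v}\| = 1$, the chain above collapses to:

$$\nabla_{\vec{v}} f = \vec{v} \cdot \nabla f = \|\vec{v}\| \, \|\nabla f\| \cos \theta = \|\nabla f\| \cos \theta$$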

$\cos \theta$ is maximal at $\theta = 0$; that's just how the cosine function works. Therefore, this quantity is maximized when $\theta = 0$, i.e. when $\vec{v}$ points in the same direction as the gradient.
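Here's a quick sanity check of that claim: sweep unit vectors around a circle for the same made-up $f$ as before and see which direction gives the biggest slope (again, the function, the point, and the angular resolution are arbitrary illustrative choices):

```python
import numpy as np

# Same made-up f(x, y) = x^2 + 3y at m = (1, 2), where the gradient is (2, 3).
def f(m):
    x, y = m
    return x**2 + 3*y

m = np.array([1.0, 2.0])
grad = np.array([2.0, 3.0])
h = 1e-6

# Try unit vectors pointing in every direction (1-degree steps)
# and keep the one with the largest numerically estimated slope.
best_slope, best_v = -np.inf, None
for alpha in np.linspace(0, 2 * np.pi, 360, endpoint=False):
    v = np.array([np.cos(alpha), np.sin(alpha)])  # unit vector
    slope = (f(m + h * v) - f(m)) / h
    if slope > best_slope:
        best_slope, best_v = slope, v

print(best_v)                        # ~ [0.55, 0.83]
print(grad / np.linalg.norm(grad))   # the gradient direction: [0.5547..., 0.8320...]
```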

Therefore, the maximal rate of ascent - the maximal slope - happens when we nudge the input to $f$ in the direction of the gradient. So the gradient is the direction of steepest ascent, and by the same argument ($\cos \theta$ is most negative at $\theta = \pi$), the negative gradient is the direction of steepest descent - which is exactly the direction gradient descent steps in.
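In code, that's all gradient descent is: repeatedly take a small step along the negative gradient. Here's a minimal sketch with a made-up bowl-shaped function and an arbitrary step size:

```python
import numpy as np

# Minimal gradient descent sketch on a made-up bowl: f(x, y) = (x-1)^2 + (y+2)^2,
# whose minimum sits at (1, -2). Step size and iteration count are arbitrary.
def grad_f(m):
    x, y = m
    return np.array([2 * (x - 1), 2 * (y + 2)])

m = np.array([5.0, 5.0])   # arbitrary starting point
lr = 0.1                   # learning rate (step size)

for _ in range(100):
    m = m - lr * grad_f(m)  # step along the negative gradient: steepest descent

print(m)  # close to [1, -2]
```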