Computer Science | AI & Machine Learning | Medium

Gradient Descent

Also known as: Steepest Descent, First-Order Optimization

Gradient descent is an iterative optimization algorithm that minimizes a function (such as a neural network's loss function) by repeatedly moving the parameters in the direction opposite to the gradient of the function at the current point. Because the gradient points in the direction of steepest ascent, subtracting a small multiple of it (scaled by the learning rate) moves the parameters toward a local minimum, which is also the global minimum when the function is convex. Variants such as Stochastic Gradient Descent (SGD) and Adam are the workhorses of modern deep learning training.

Key Formula

theta[t+1] = theta[t] - learning_rate * gradient_of_loss_at_theta[t]

LaTeX: \theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)

Symbol | Meaning | Unit
\theta_t | Model parameters (weights and biases) at step t | dimensionless
\alpha | Learning rate (step size) | dimensionless
\nabla_\theta L | Gradient of the loss with respect to parameters | dimensionless
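
As a minimal sketch of the update rule above, the following Python applies one step to a list of parameters and then repeats it. The names `gradient_step`, `gradient_descent`, and `grad_bowl`, and the bowl-shaped loss itself, are illustrative assumptions rather than part of any library.

```python
def gradient_step(theta, grad, learning_rate):
    """Return theta - learning_rate * grad, applied element-wise."""
    return [t - learning_rate * g for t, g in zip(theta, grad)]


def gradient_descent(grad_fn, theta0, learning_rate, num_steps):
    """Repeat the update rule num_steps times, starting from theta0."""
    theta = list(theta0)
    for _ in range(num_steps):
        theta = gradient_step(theta, grad_fn(theta), learning_rate)
    return theta


# Hypothetical loss L(x, y) = x^2 + y^2, whose gradient is (2x, 2y).
def grad_bowl(theta):
    return [2.0 * t for t in theta]


result = gradient_descent(grad_bowl, [1.0, -2.0], learning_rate=0.25, num_steps=20)
print(result)  # both entries end up very close to the minimum at (0, 0)
```

With this particular loss and learning rate, each step halves every parameter, so the iterates shrink geometrically toward zero.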

Worked Example

Problem

Minimize the function L(w) = w² + 4w + 4 using gradient descent. Start at w₀ = 3 with learning rate α = 0.1. Perform 3 update steps.

Solution

Step 0 (gradient): dL/dw = 2w + 4

Step 1 (t=0): w₀ = 3
Gradient = 2(3) + 4 = 10
w₁ = 3 − 0.1 × 10 = 3 − 1 = 2.0

Step 2 (t=1): w₁ = 2
Gradient = 2(2) + 4 = 8
w₂ = 2 − 0.1 × 8 = 2 − 0.8 = 1.2

Step 3 (t=2): w₂ = 1.2
Gradient = 2(1.2) + 4 = 6.4
w₃ = 1.2 − 0.1 × 6.4 = 1.2 − 0.64 = 0.56

Note: The true minimum is at w = −2 (where dL/dw = 0), so the algorithm converges toward −2.

Answer

After 3 steps: w₃ = 0.56, still approaching the minimum at w = −2
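
The three updates can be checked with a short Python sketch. The function names `loss` and `grad` and the `history` list are illustrative choices; the function, starting point, and learning rate are taken from the problem statement.

```python
def loss(w):
    return w**2 + 4*w + 4


def grad(w):
    # Derivative of the loss: dL/dw = 2w + 4
    return 2*w + 4


w = 3.0        # starting point w0
alpha = 0.1    # learning rate
history = [w]
for step in range(3):
    w = w - alpha * grad(w)
    history.append(w)
    print(f"step {step + 1}: w = {w:.2f}, L(w) = {loss(w):.4f}")
```

Because L(w) = (w + 2)², each update multiplies the distance to the minimum by (1 − 2α) = 0.8, so w_t = −2 + 5 × 0.8^t. That gives 2.0, 1.2, and 0.56 for t = 1, 2, 3, matching the steps above.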

Gradient Descent Variants Compared

Variant | Batch Size | Update Frequency | Pros / Cons
Batch GD | Full dataset | Once per epoch | Stable but slow on large data
Stochastic GD (SGD) | 1 sample | Once per sample | Fast but noisy updates
Mini-batch GD | 32–512 samples | Once per mini-batch | Best trade-off, most used
Adam | Mini-batch | Adaptive per parameter | Fast convergence, widely used
RMSProp | Mini-batch | Adaptive learning rate | Good for RNNs
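
The first three rows of the table differ only in how many samples feed each update. This can be sketched with a single Python function whose `batch_size` argument selects the variant. The function `minibatch_gd`, the linear model, the toy data, and the default hyperparameters are all assumptions made for illustration. Adam and RMSProp, which also rescale the step per parameter, are not covered by this sketch.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.1, epochs=1000, seed=0):
    """Fit y ~ a*x + b by minimizing mean squared error.

    batch_size = len(xs) gives Batch GD (one update per epoch),
    batch_size = 1 gives SGD (one update per sample),
    anything in between is mini-batch GD.
    """
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit samples in a new random order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_a = sum(2 * (a * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (a * xs[i] + b - ys[i]) for i in batch) / len(batch)
            a -= learning_rate * grad_a
            b -= learning_rate * grad_b
            updates += 1
    return a, b, updates


# Hypothetical noise-free data generated from y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3 * x + 1 for x in xs]

for name, size in [("batch", len(xs)), ("mini-batch", 4), ("sgd", 1)]:
    a, b, updates = minibatch_gd(xs, ys, batch_size=size)
    print(f"{name:>10}: a = {a:.3f}, b = {b:.3f}, updates = {updates}")
```

On this toy problem all three settings recover a ≈ 3 and b ≈ 1, but over the same 1000 epochs they perform 1000, 5000, and 20000 parameter updates respectively, which is the "Update Frequency" column in action. The data here is noise-free, so the noisiness of SGD updates noted in the table is not visible in this example.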

Interactive Tools

Desmos Graphing Calculator

Visualize loss surfaces and gradient descent trajectories interactively

Khan Academy — Gradient

Prerequisites: understanding gradients in multivariable calculus

Brilliant.org — Optimization

Interactive lessons on optimization algorithms including gradient descent

Figure: Contour plot showing gradient descent steps converging to the minimum of a loss surface. Source: Wikimedia Commons, CC BY-SA

Related Terms

Computer Science

Backpropagation

Backpropagation (backward propagation of errors) is the algorithm used to train neural networks by efficiently computing the gradient of the loss function with respect to every weight in the network. It applies the chain rule of calculus in a reverse pass through the network — from the output layer back to the input layer — so that each weight can be updated in the direction that reduces the loss. Without backpropagation, training deep neural networks with millions of parameters would be computationally infeasible.
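
To make the reverse pass concrete, here is a minimal sketch for a hypothetical network with just two weights: a hidden unit h = tanh(w1·x) and an output y = w2·h with squared-error loss. The architecture, the names `forward_backward` and `numerical_grad`, and the input values are assumptions chosen for illustration. The finite-difference check is only there to confirm the chain-rule gradients.

```python
import math


def forward_backward(w1, w2, x, target):
    """Return (loss, dL/dw1, dL/dw2) for a hypothetical two-weight network."""
    # Forward pass: keep the intermediate values for reuse in the backward pass.
    z = w1 * x
    h = math.tanh(z)
    y = w2 * h
    loss = (y - target) ** 2

    # Backward pass: apply the chain rule from the output back to the input.
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h            # because y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1 - h ** 2)  # d tanh(z)/dz = 1 - tanh(z)^2
    dloss_dw1 = dloss_dz * x            # because z = w1 * x
    return loss, dloss_dw1, dloss_dw2


def numerical_grad(f, value, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
loss, g1, g2 = forward_backward(w1, w2, x, target)
n1 = numerical_grad(lambda v: forward_backward(v, w2, x, target)[0], w1)
n2 = numerical_grad(lambda v: forward_backward(w1, v, x, target)[0], w2)
print(abs(g1 - n1) < 1e-5, abs(g2 - n2) < 1e-5)
```

The backward pass reuses `dloss_dy` for both weights, which is the source of the efficiency: one reverse sweep yields every gradient, whereas the finite-difference check needs two extra forward passes per weight.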

Computer Science

Neural Network

A neural network is a computational model loosely inspired by the structure of biological brains, consisting of layers of interconnected nodes (neurons) that process and transform data. Each neuron computes a weighted sum of its inputs, applies a non-linear activation function, and passes the result to the next layer. Neural networks are the foundation of modern AI and are capable of learning highly complex patterns in images, text, audio, and tabular data.
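
The per-neuron computation described above can be sketched in a few lines of Python. The sigmoid activation, the layer sizes, and all weight and bias values are arbitrary assumptions for illustration. The helper names `neuron` and `layer` are likewise hypothetical.

```python
import math


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical 2-input network: a hidden layer of 3 neurons, then 1 output neuron.
hidden = layer([0.5, -1.0],
               [[0.1, 0.4], [-0.3, 0.2], [0.7, -0.6]],
               [0.0, 0.1, -0.2])
output = layer(hidden, [[0.5, -0.5, 0.25]], [0.0])
print(hidden, output)
```

Each layer's outputs become the next layer's inputs, so stacking `layer` calls is all a forward pass through a fully connected network amounts to in this sketch.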

Computer Science

Overfitting

Overfitting occurs when a machine learning model learns the training data too well — including its noise and random fluctuations — to the point where it performs poorly on new, unseen data. An overfitted model has high training accuracy but low validation/test accuracy, indicating it has memorized patterns specific to the training set rather than generalizing. Overfitting is more likely with complex models, small datasets, or insufficient regularization.
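
The gap between training and validation accuracy can be reproduced on a deliberately contrived toy problem. In this sketch, a 1-nearest-neighbour "memorizer" is compared with a single-threshold model on 1-D data where two training labels have been flipped to simulate noise. The dataset, both models, and the fact that the threshold model happens to match the true rule exactly are all assumptions made to keep the example deterministic.

```python
def true_label(x):
    return int(x >= 50)


# Hypothetical 1-D dataset; two training labels are flipped to simulate noise.
train_x = list(range(0, 100, 10))
train_y = [true_label(x) for x in train_x]
for noisy in (20, 70):
    i = train_x.index(noisy)
    train_y[i] = 1 - train_y[i]

# Validation points sit close to the training points and carry clean labels.
val_x = list(range(2, 100, 10))
val_y = [true_label(x) for x in val_x]


def memorizer(x):
    """1-nearest-neighbour: copies the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


def threshold_model(x):
    """A much simpler model: a single decision threshold."""
    return int(x >= 50)


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


for name, model in [("memorizer", memorizer), ("threshold", threshold_model)]:
    print(name, accuracy(model, train_x, train_y), accuracy(model, val_x, val_y))
```

The memorizer scores 1.0 on the training set but 0.8 on validation, because it reproduces the two flipped labels. The threshold model scores 0.8 on the noisy training set and 1.0 on validation, which is the pattern the definition describes.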

"Gradient" derives from Latin "gradus" (step). The gradient descent algorithm in the context of optimization was described by Augustin-Louis Cauchy in 1847 as the "method of steepest descent." Its application to neural network training was established in the 1980s.

Tags: gradient-descent, optimization, learning-rate, sgd, training