Backpropagation (backward propagation of errors) is the algorithm used to train neural networks by efficiently computing the gradient of the loss function with respect to every weight in the network. It applies the chain rule of calculus in a reverse pass through the network — from the output layer back to the input layer — so that each weight can be updated in the direction that reduces the loss. Without backpropagation, training deep neural networks with millions of parameters would be computationally infeasible.
dL/dw = (dL/da) * (da/dz) * (dz/dw) [chain rule across layers]
LaTeX: \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}
| Symbol | Meaning | Unit |
|---|---|---|
| L | Loss function value | dimensionless |
| w | Weight being updated | dimensionless |
| a | Activation output of the neuron | dimensionless |
| z | Pre-activation weighted sum (z = Wx + b) | dimensionless |
Problem
A single neuron receives input x = 2, weight w = 0.5, bias b = 0, and uses ReLU activation. The target is y = 1. Loss L = (a − y)². Compute the gradient dL/dw.
Solution
Step 1 — Forward pass: z = w·x + b = 0.5 × 2 + 0 = 1.0 a = ReLU(z) = max(0, 1.0) = 1.0 L = (a − y)² = (1.0 − 1)² = 0 Step 2 — For demonstration use y = 0: L = (1.0 − 0)² = 1.0 Step 3 — Backward pass (chain rule): dL/da = 2(a − y) = 2(1.0 − 0) = 2 da/dz = ReLU'(z) = 1 (since z > 0) dz/dw = x = 2 Step 4 — Multiply: dL/dw = 2 × 1 × 2 = 4
Answer
dL/dw = 4 (the weight should be reduced to decrease the loss)
| Phase | Direction | Computation | Purpose |
|---|---|---|---|
| Forward Pass | Input → Output | Compute activations and loss | Obtain prediction |
| Backward Pass | Output → Input | Compute gradients via chain rule | Attribute error to weights |
| Weight Update | All layers | w = w − α · ∂L/∂w | Reduce loss iteratively |
| Vanishing Gradient | Deep layers | Gradients shrink exponentially | Problem in deep RNNs |
| Exploding Gradient | Deep layers | Gradients grow exponentially | Problem in deep RNNs |
Wikimedia Commons, CC BY-SA
A neural network is a computational model loosely inspired by the structure of biological brains, consisting of layers of interconnected nodes (neurons) that process and transform data. Each neuron computes a weighted sum of its inputs, applies a non-linear activation function, and passes the result to the next layer. Neural networks are the foundation of modern AI and are capable of learning highly complex patterns in images, text, audio, and tabular data.
Gradient descent is an iterative optimization algorithm that minimizes a function (such as a neural network's loss function) by repeatedly moving the parameters in the direction opposite to the gradient of the function at the current point. Because the gradient points toward the steepest ascent, subtracting it from the parameters moves the model toward a local (or global) minimum. Variants like Stochastic Gradient Descent (SGD) and Adam are the workhorses of modern deep learning training.
Deep learning is a subset of machine learning that uses neural networks with many hidden layers (hence "deep") to automatically extract hierarchical representations from raw data. Lower layers learn low-level features (edges, phonemes), while deeper layers combine them into increasingly abstract concepts (faces, words). Deep learning has revolutionized computer vision, natural language processing, and speech recognition, achieving human-level or superhuman performance on many benchmarks.
The term "backpropagation" is a contraction of "backward propagation of errors." The algorithm was independently discovered multiple times and was popularized for neural networks by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their landmark 1986 paper in Nature.