Computer ScienceAI & Machine LearningMedium

Backpropagation

Also known as:BackpropReverse-Mode Automatic Differentiation

Backpropagation (backward propagation of errors) is the algorithm used to train neural networks by efficiently computing the gradient of the loss function with respect to every weight in the network. It applies the chain rule of calculus in a reverse pass through the network — from the output layer back to the input layer — so that each weight can be updated in the direction that reduces the loss. Without backpropagation, training deep neural networks with millions of parameters would be computationally infeasible.

Key Formula

dL/dw = (dL/da) * (da/dz) * (dz/dw) [chain rule across layers]

LaTeX: \frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

Symbol	Meaning	Unit
L	Loss function value	dimensionless
w	Weight being updated	dimensionless
a	Activation output of the neuron	dimensionless
z	Pre-activation weighted sum (z = Wx + b)	dimensionless

Worked Example

Problem

A single neuron receives input x = 2, weight w = 0.5, bias b = 0, and uses ReLU activation. The target is y = 1. Loss L = (a − y)². Compute the gradient dL/dw.

Solution

Step 1 — Forward pass: z = w·x + b = 0.5 × 2 + 0 = 1.0 a = ReLU(z) = max(0, 1.0) = 1.0 L = (a − y)² = (1.0 − 1)² = 0 Step 2 — For demonstration use y = 0: L = (1.0 − 0)² = 1.0 Step 3 — Backward pass (chain rule): dL/da = 2(a − y) = 2(1.0 − 0) = 2 da/dz = ReLU'(z) = 1 (since z > 0) dz/dw = x = 2 Step 4 — Multiply: dL/dw = 2 × 1 × 2 = 4

Answer

dL/dw = 4 (the weight should be reduced to decrease the loss)

Forward Pass vs Backward Pass in Backpropagation

Phase	Direction	Computation	Purpose
Forward Pass	Input → Output	Compute activations and loss	Obtain prediction
Backward Pass	Output → Input	Compute gradients via chain rule	Attribute error to weights
Weight Update	All layers	w = w − α · ∂L/∂w	Reduce loss iteratively
Vanishing Gradient	Deep layers	Gradients shrink exponentially	Problem in deep RNNs
Exploding Gradient	Deep layers	Gradients grow exponentially	Problem in deep RNNs

Interactive Tools

TensorFlow Playground

Interactive visualization of how weights change during backpropagation

Open Tool

Khan Academy — Chain Rule

Prerequisites: understanding the chain rule in calculus

Open Tool

Brilliant.org — Neural Networks

Step-by-step visual walkthrough of backpropagation

Open Tool

Wikimedia Commons, CC BY-SA

Related Terms

Computer Science

Neural Network

A neural network is a computational model loosely inspired by the structure of biological brains, consisting of layers of interconnected nodes (neurons) that process and transform data. Each neuron computes a weighted sum of its inputs, applies a non-linear activation function, and passes the result to the next layer. Neural networks are the foundation of modern AI and are capable of learning highly complex patterns in images, text, audio, and tabular data.

Computer Science

Gradient Descent

Gradient descent is an iterative optimization algorithm that minimizes a function (such as a neural network's loss function) by repeatedly moving the parameters in the direction opposite to the gradient of the function at the current point. Because the gradient points toward the steepest ascent, subtracting it from the parameters moves the model toward a local (or global) minimum. Variants like Stochastic Gradient Descent (SGD) and Adam are the workhorses of modern deep learning training.

Computer Science

Deep Learning

Deep learning is a subset of machine learning that uses neural networks with many hidden layers (hence "deep") to automatically extract hierarchical representations from raw data. Lower layers learn low-level features (edges, phonemes), while deeper layers combine them into increasingly abstract concepts (faces, words). Deep learning has revolutionized computer vision, natural language processing, and speech recognition, achieving human-level or superhuman performance on many benchmarks.

The term "backpropagation" is a contraction of "backward propagation of errors." The algorithm was independently discovered multiple times and was popularized for neural networks by David Rumelhart, Geoffrey Hinton, and Ronald Williams in their landmark 1986 paper in Nature.

backpropagationgradientchain-ruletrainingneural-network