Backpropagation is the heart of every neural network: an elegant algorithm that lets machines learn from their own mistakes. You'll see how this mathematical principle drives learning in everything from image recognition to language models.
How Backpropagation Works in Neural Networks
Backpropagation is an algorithm that allows neural networks to learn from errors by gradually propagating gradients back through the network. Without this mechanism, deep learning wouldn’t exist in its current form. Let’s look at exactly how it works and why it’s so important.
Basic Principle of the Forward and Backward Pass
Neural network learning occurs in two phases. First, the forward pass — data flows through the network forward and creates predictions. Then follows the backward pass — the error propagates back and weights are updated.
# Forward pass - simple network with one hidden layer
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward_pass(X, W1, b1, W2, b2):
    # Hidden layer
    z1 = X.dot(W1) + b1
    a1 = sigmoid(z1)
    # Output layer
    z2 = a1.dot(W2) + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2
Computing Error and Gradients
The key is understanding how gradients are calculated. We use the chain rule from calculus: the derivative of a composite function decomposes into a product of the derivatives of its parts.
For mean squared error and sigmoid activation, the gradient looks like this:
def compute_gradients(X, y, z1, a1, z2, a2, W1, W2):
    m = X.shape[0]  # number of samples
    # Gradient for output layer: dL/da2 * sigmoid'(z2)
    dz2 = (a2 - y) * a2 * (1 - a2)  # MSE derivative times sigmoid derivative
                                    # (constant factor 2 folded into the learning rate)
    dW2 = (1/m) * a1.T.dot(dz2)
    db2 = (1/m) * np.sum(dz2, axis=0)
    # Gradient for hidden layer (chain rule)
    da1 = dz2.dot(W2.T)
    dz1 = da1 * a1 * (1 - a1)  # derivative of sigmoid
    dW1 = (1/m) * X.T.dot(dz1)
    db1 = (1/m) * np.sum(dz1, axis=0)
    return dW1, db1, dW2, db2
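A good way to verify chain-rule gradients like these is a numerical gradient check: nudge each weight by a small epsilon and compare the finite-difference slope of the loss with the analytic value. The sketch below is self-contained and illustrative (the shapes, the seed, and the `loss_for` helper are assumptions, not part of this article), and it uses the exact derivative with constant factors included:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Illustrative data and weights (arbitrary shapes and seed)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
y = rng.random(size=(4, 1))
W1, b1 = 0.1 * rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(5, 1)), np.zeros(1)

def loss_for(W1_try):
    # MSE loss as a function of W1 alone (helper for finite differences)
    a1 = sigmoid(X.dot(W1_try) + b1)
    a2 = sigmoid(a1.dot(W2) + b2)
    return np.mean((a2 - y) ** 2)

# Analytic gradient for W1, exact factors included: dL/da2 = (2/m)(a2 - y)
a1 = sigmoid(X.dot(W1) + b1)
a2 = sigmoid(a1.dot(W2) + b2)
m = X.shape[0]
dz2 = (2 / m) * (a2 - y) * a2 * (1 - a2)
dz1 = dz2.dot(W2.T) * a1 * (1 - a1)
dW1_analytic = X.T.dot(dz1)

# Numerical gradient via central differences
dW1_numeric = np.zeros_like(W1)
eps = 1e-5
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        dW1_numeric[i, j] = (loss_for(Wp) - loss_for(Wm)) / (2 * eps)

print(np.max(np.abs(dW1_analytic - dW1_numeric)))  # should be tiny
```

If the two disagree by more than roughly the square of epsilon, the chain-rule derivation has a bug.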
Weight Updates Using Gradient Descent
Once we have the gradients, we can update the weights. Gradient descent adjusts each weight in the opposite direction to where the gradient points — this gradually gets us to the minimum of the loss function.
def update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, learning_rate):
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    return W1, b1, W2, b2

# Complete training cycle
def train_step(X, y, W1, b1, W2, b2, learning_rate=0.01):
    # Forward pass
    z1, a1, z2, a2 = forward_pass(X, W1, b1, W2, b2)
    # Compute loss
    loss = np.mean((a2 - y)**2)
    # Backward pass
    dW1, db1, dW2, db2 = compute_gradients(X, y, z1, a1, z2, a2, W1, W2)
    # Update weights
    W1, b1, W2, b2 = update_weights(W1, b1, W2, b2,
                                    dW1, db1, dW2, db2, learning_rate)
    return W1, b1, W2, b2, loss
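To see the whole cycle in action, here is a self-contained toy run on XOR-style data (the network size, seed, and learning rate are arbitrary illustrative choices, not from the article). It repeats the same forward pass, chain-rule backward pass, and gradient-descent update inline:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# XOR-style toy problem: 2 inputs, 8 hidden units, 1 output (arbitrary sizes)
rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)

first_loss = None
for step in range(20000):
    # Forward pass
    a1 = sigmoid(X.dot(W1) + b1)
    a2 = sigmoid(a1.dot(W2) + b2)
    loss = np.mean((a2 - y) ** 2)
    if first_loss is None:
        first_loss = loss
    # Backward pass (same chain rule as compute_gradients)
    m = X.shape[0]
    dz2 = (a2 - y) * a2 * (1 - a2)
    dW2, db2 = a1.T.dot(dz2) / m, dz2.sum(axis=0) / m
    dz1 = dz2.dot(W2.T) * a1 * (1 - a1)
    dW1, db1 = X.T.dot(dz1) / m, dz1.sum(axis=0) / m
    # Gradient descent update
    lr = 1.0
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(first_loss, "->", loss)  # loss decreases as training proceeds
```

Watching the loss shrink across iterations is the quickest sanity check that all three steps are wired together correctly.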
Problems with Vanishing and Exploding Gradients
In deep networks, vanishing gradients can occur: as gradients are multiplied through many layers during backpropagation, they shrink exponentially and the early layers barely learn. The opposite problem is exploding gradients, where gradients grow exponentially and destabilize training.
Solutions include:
- Gradient clipping — limiting the maximum gradient magnitude
- Better activation functions — ReLU instead of sigmoid
- Batch normalization — normalizing inputs to each layer
- Residual connections — direct connections between distant layers
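The vanishing problem is easy to demonstrate: the sigmoid derivative never exceeds 0.25, so even in the best case the gradient shrinks at least fourfold per layer. A minimal illustration (the 20-layer depth is an arbitrary choice):

```python
import numpy as np

def sigmoid_derivative(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), maximum 0.25 at x = 0
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)

grad = 1.0
for layer in range(20):
    # Best case for sigmoid: pre-activation 0, where the derivative is 0.25
    grad *= sigmoid_derivative(0.0)

print(grad)  # 0.25**20, roughly 9e-13: effectively vanished
```

This is why switching to ReLU (derivative 1 for positive inputs) or adding residual connections makes deep networks trainable.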
# Gradient clipping in practice
def clip_gradients(gradients, max_norm=1.0):
    # Compute the global L2 norm across all gradient arrays
    total_norm = 0
    for grad in gradients:
        total_norm += np.sum(grad**2)
    total_norm = np.sqrt(total_norm)
    # Rescale everything if the norm exceeds max_norm
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for i, grad in enumerate(gradients):
            gradients[i] = grad * clip_coef
    return gradients
Backpropagation Optimization
Modern implementations use advanced optimizers that adjust the learning rate or add momentum:
class AdamOptimizer:
    def __init__(self, learning_rate=0.001, beta1=0.9, beta2=0.999):
        self.lr = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = {}  # first moment
        self.v = {}  # second moment
        self.t = 0   # time step

    def update(self, params, gradients):
        self.t += 1
        for key in params:
            if key not in self.m:
                self.m[key] = np.zeros_like(params[key])
                self.v[key] = np.zeros_like(params[key])
            # Update biased first/second moment estimates
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * gradients[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * gradients[key]**2
            # Bias correction
            m_hat = self.m[key] / (1 - self.beta1**self.t)
            v_hat = self.v[key] / (1 - self.beta2**self.t)
            # Update parameters
            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + 1e-8)
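To see the update rule in isolation, here is a minimal sketch (not from the article) applying the same formulas to a single parameter minimizing f(w) = (w - 3)^2; the learning rate and step count are arbitrary:

```python
import numpy as np

# Adam's update rule on one scalar parameter, minimizing f(w) = (w - 3)^2
w, lr, beta1, beta2, eps = 0.0, 0.1, 0.9, 0.999, 1e-8
m = v = 0.0
for t in range(1, 501):
    grad = 2 * (w - 3)           # f'(w)
    m = beta1 * m + (1 - beta1) * grad      # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2   # second moment (scale)
    m_hat = m / (1 - beta1**t)   # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # close to the minimum at w = 3
```

Because the step is normalized by the second moment, Adam moves at roughly the learning rate per step regardless of the raw gradient scale, which is why it needs far less learning-rate tuning than plain gradient descent.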
Backpropagation in Practice with PyTorch
In real projects, we use frameworks that implement backpropagation automatically:
import torch
import torch.nn as nn

# Network definition
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden = nn.Linear(input_size, hidden_size)
        self.output = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.sigmoid(self.hidden(x))
        x = self.sigmoid(self.output(x))
        return x

# Training with automatic backpropagation
model = SimpleNet(10, 20, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Toy data (added so the example runs; shapes match the network above)
X = torch.randn(100, 10)
y = torch.rand(100, 1)

for epoch in range(1000):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    # Backward pass - automatically!
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Summary
Backpropagation is the algorithm that allows neural networks to learn by minimizing error through gradient descent. It consists of three steps: a forward pass to compute predictions, a backward pass to propagate gradients, and a weight update. In practice, we use optimizers like Adam and frameworks like PyTorch that automate the entire process, but understanding the principles of backpropagation remains essential for debugging and optimizing deep learning models.