Training a neural network is the process of finding the optimal set of weights and biases that allow the network to accurately map inputs to outputs. This is an optimization problem, and it's solved using two core concepts: Gradient Descent and Backpropagation.

1. Loss Function: Measuring Error

First, we need a way to measure how "wrong" our network's predictions are. This is the job of a loss function (or cost function). The function takes the network's predictions and the true target values and outputs a single number—the loss—that quantifies the error. A higher loss means worse performance. Common loss functions include Mean Squared Error (MSE) for regression and Cross-Entropy for classification.

The goal of training is to minimize this loss function.
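As a concrete sketch, both loss functions mentioned above can be written in a few lines of NumPy (the `mse` and `cross_entropy` helpers here are illustrative names, not from any particular library):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared differences."""
    return np.mean((y_true - y_pred) ** 2)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred)) / len(y_true)

# Regression: predictions close to the targets give a small loss
print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.5, 2.0, 2.5])))

# Classification: a confident, correct prediction gives a small loss
y_true = np.array([[0, 1, 0]])          # true class is index 1
y_pred = np.array([[0.1, 0.8, 0.1]])    # network assigns it probability 0.8
print(cross_entropy(y_true, y_pred))
```

Either number shrinking toward zero means the predictions are getting closer to the targets—which is exactly what training aims for.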

2. Gradient Descent: Finding the Minimum

Imagine the loss function as a huge, hilly landscape, where the lowest point represents the set of weights and biases that gives the minimum error. Our goal is to find this lowest point.

Gradient Descent is an iterative optimization algorithm that helps us do this.

  1. We start with a random set of weights (placing a ball somewhere on the landscape).
  2. We calculate the gradient of the loss function. The gradient is a vector that points in the direction of the steepest ascent (uphill).
  3. To go downhill, we take a small step in the opposite direction of the gradient. This step size is controlled by a parameter called the learning rate.
  4. We update the weights to this new position.
  5. We repeat steps 2-4 until we reach a minimum (the ball settles at the bottom of a valley).
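The loop above can be sketched in a few lines of plain Python. This is a deliberately tiny example—a one-parameter loss (w - 3)² whose gradient we can write down by hand—not a full network:

```python
def loss(w):
    return (w - 3.0) ** 2          # a simple "landscape" with its minimum at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)         # derivative of the loss: points uphill

w = 10.0                           # step 1: start at an arbitrary point
learning_rate = 0.1                # controls the size of each step

for step in range(100):
    g = gradient(w)                # step 2: compute the gradient
    w = w - learning_rate * g      # steps 3-4: step in the opposite (downhill) direction

print(w)  # settles very close to 3.0, the bottom of the valley
```

Note the minus sign in the update: moving *against* the gradient is what takes the ball downhill.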

3. Backpropagation: The Engine of Learning

Gradient Descent tells us how to update our weights (by moving against the gradient), but it doesn't tell us how to calculate that gradient efficiently. The gradient needs to measure how a tiny change in every single weight in the network affects the final loss.

Backpropagation (short for "backward propagation of errors") is the algorithm that does this. After making a prediction (the "forward pass"), backpropagation works as follows:

  1. It calculates the error at the final output layer.
  2. It then moves backward through the network, layer by layer.
  3. At each layer, it uses the chain rule from calculus to calculate how much the weights in that layer contributed to the overall error. This contribution is the gradient for those weights.

By systematically propagating the error backward, backpropagation efficiently calculates the gradient for every single parameter in the network, telling Gradient Descent exactly how to adjust each weight to reduce the overall loss.

4. Gradient Descent Variants

Calculating the gradient over the entire dataset for every single update (Batch Gradient Descent) is slow for large datasets. We use faster variants instead:

  • Stochastic Gradient Descent (SGD): Instead of the whole dataset, SGD updates the weights after processing just one training example (or a small "mini-batch"). This is much faster and the updates are "noisy," which can sometimes help the model escape shallow local minima.
  • Adam (Adaptive Moment Estimation): This is the most popular and often the default choice for an optimizer. Adam is an adaptive learning rate algorithm. It maintains a separate learning rate for each network parameter and adapts it as learning progresses. It combines the benefits of other optimizers (like Momentum and RMSprop) and generally converges faster and more reliably than standard SGD.
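For intuition, the Adam update rule itself fits in a short function. This is a simplified sketch of the published algorithm (the `adam_step` name and the toy quadratic objective are illustrative), not a framework implementation:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: the step size adapts per parameter via moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # 1st moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # 2nd moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) with Adam
w = np.array([5.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2.0 * w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)

print(w)  # ends up very close to 0
```

Because the update divides by the running gradient magnitude, each parameter effectively gets its own step size—the adaptivity described above.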

In a framework like Keras, you simply choose your optimizer and loss function when you compile the model:

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax')
])

# Choose the optimizer (Adam is a great default) and the loss function
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

When you call model.fit(), the framework automatically handles the entire backpropagation and weight update process for you.