Backpropagation in CNN

10 minute read

Understanding Backpropagation in Convolutional Neural Networks (CNNs)

Backpropagation in CNNs is a fundamental yet conceptually challenging topic, and it is crucial for understanding how these networks learn and optimise their parameters. It's noted that detailed content on this specific topic isn't always readily available, even in textbooks.

The core idea is to apply the gradient descent algorithm to find the optimal values for the network’s trainable parameters that minimise the loss function. Backpropagation is the process of calculating the gradients of the loss with respect to these parameters.

Simple CNN Architecture Example

The source uses a simple CNN architecture as a running example to explain the backpropagation process. The architecture includes the following layers in sequence:

  1. Convolutional Layer: An input image (X), e.g., 6x6 pixels, is convolved with a filter (W1), e.g., 3x3, and a bias (B1) is added to produce an intermediate result (Z1). The shape changes from 6x6 to 4x4 after convolution with a 3x3 filter and unit stride.
  2. ReLU Activation: A Rectified Linear Unit (ReLU) operation is applied element-wise to Z1, introducing non-linearity and producing A1. The shape remains 4x4.
  3. Max-Pooling Layer: Max-pooling is applied to A1, e.g., using a 2x2 window and stride 2, to reduce the spatial dimensions and produce P1. The shape changes from 4x4 to 2x2.
  4. Flatten Layer: The 2x2 P1 output is flattened into a 1D vector (F), which has 4 elements.
  5. Fully Connected Layer (Single Neuron): The flattened vector F is fed into a single output neuron. This involves a dot product with a weight vector (W2, 1x4) and adding a bias (B2) to calculate Z2.
  6. Sigmoid Activation: A sigmoid activation function is applied to Z2, producing A2, which is the network’s prediction (y_hat). For a single image, this output is a scalar (1x1) representing the probability in a binary classification problem.
  7. Loss Function: The prediction A2 (y_hat) is compared to the true label (Y) using a loss function. For a binary classification problem, binary cross-entropy is used. The loss (L) quantifies the error.

A logical diagram helps visualise this flow: X -> (Conv/W1+B1) -> Z1 -> (ReLU) -> A1 -> (MaxPool) -> P1 -> (Flatten) -> F -> (W2+B2) -> Z2 -> (Sigmoid) -> A2 (y_hat) -> L.

Trainable Parameters

In this simple architecture, the trainable parameters are located in the Convolutional layer and the Fully Connected layer. These are:

  • The 3x3 filter weights (W1), comprising 9 weights, plus a bias (B1).
  • The 1x4 weight vector (W2) connecting the flattened vector F to the output neuron, comprising 4 weights (since F has 4 elements), plus a bias (B2).

In total, there are 15 trainable parameters (9 for W1 + 1 for B1 + 4 for W2 + 1 for B2) that the network needs to learn during training to minimise the loss.
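
As a quick check, here is a minimal NumPy sketch (the variable names are mine, not the source's) that creates parameters with these shapes and counts them:

```python
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.standard_normal((3, 3))   # conv filter: 9 weights
B1 = np.zeros(1)                   # conv bias:   1 parameter
W2 = rng.standard_normal((1, 4))   # fully connected weights: 4 parameters
B2 = np.zeros(1)                   # fully connected bias:    1 parameter

total = W1.size + B1.size + W2.size + B2.size
print(total)  # 15
```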

Forward Propagation Equations

Understanding forward propagation is necessary for backpropagation. The key equations from the source are:

  • Z1 = Convolution(X, W1) + B1
  • A1 = ReLU(Z1)
  • P1 = MaxPool(A1)
  • F = Flatten(P1)
  • Z2 = DotProduct(W2, F) + B2 (or W2 ⋅ F + B2)
  • A2 = Sigmoid(Z2) (This is the prediction, y_hat)
  • L = Loss(Y, A2) (e.g., Binary Cross-Entropy)
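
The forward pass for this toy architecture can be sketched directly in NumPy. This is an illustrative sketch, not code from the source: the helper names (conv2d_valid, maxpool2x2) are mine, the convolution is assumed to be a 'valid' cross-correlation with unit stride, and the input values are random placeholders.

```python
import numpy as np

def conv2d_valid(X, W):
    """'Valid' cross-correlation of X (6x6) with filter W (3x3) -> 4x4 output."""
    fh, fw = W.shape
    out = np.zeros((X.shape[0] - fh + 1, X.shape[1] - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[i:i+fh, j:j+fw] * W)
    return out

def maxpool2x2(A):
    """2x2 max-pooling with stride 2: 4x4 -> 2x2."""
    out = np.zeros((A.shape[0] // 2, A.shape[1] // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = A[2*i:2*i+2, 2*j:2*j+2].max()
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy input, label, and parameters matching the shapes in the example.
rng = np.random.default_rng(0)
X  = rng.standard_normal((6, 6))          # input image
Y  = 1.0                                  # true label
W1 = rng.standard_normal((3, 3)); B1 = 0.0
W2 = rng.standard_normal((1, 4)); B2 = 0.0

Z1 = conv2d_valid(X, W1) + B1             # 4x4
A1 = np.maximum(Z1, 0.0)                  # ReLU -> 4x4
P1 = maxpool2x2(A1)                       # 2x2
F  = P1.reshape(4, 1)                     # flatten -> 4x1
Z2 = W2 @ F + B2                          # 1x1
A2 = sigmoid(Z2)                          # prediction y_hat
L  = -(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))  # binary cross-entropy
```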

Backpropagation: Calculating Gradients

The goal of backpropagation is to calculate the derivative of the loss function (L) with respect to each trainable parameter (W1, B1, W2, B2). This is achieved using the chain rule of differentiation, working backwards from the loss function through the layers.

For example, to find ∂L/∂W1, you trace the dependencies: L depends on A2, A2 on Z2, Z2 on F, F on P1, P1 on A1, A1 on Z1, and Z1 on W1. So, by the chain rule:

∂L/∂W1 = (∂L/∂A2) * (∂A2/∂Z2) * (∂Z2/∂F) * (∂F/∂P1) * (∂P1/∂A1) * (∂A1/∂Z1) * (∂Z1/∂W1)

Calculating each of these terms and multiplying them together gives the required gradient. Similarly, the gradients for the other parameters and intermediate layer outputs can be calculated.

The source breaks down the backpropagation process layer by layer, starting from the end:

Backpropagation on the Fully Connected Layer (W2, B2)

This is analogous to backpropagation in a standard multi-layer perceptron. The gradients for W2 and B2 are needed to update them via gradient descent. The chain of dependencies is: L -> A2 -> Z2 -> W2 (or B2).

The required derivatives are ∂L/∂W2 and ∂L/∂B2. These are calculated using the chain rule:

  • ∂L/∂W2 = (∂L/∂A2) * (∂A2/∂Z2) * (∂Z2/∂W2)
  • ∂L/∂B2 = (∂L/∂A2) * (∂A2/∂Z2) * (∂Z2/∂B2)

The source calculates the product (∂L/∂A2) * (∂A2/∂Z2) first.

  • ∂L/∂A2 is the derivative of the binary cross-entropy loss with respect to A2 (y_hat).
  • ∂A2/∂Z2 is the derivative of the sigmoid function applied to Z2.
  • The product (∂L/∂A2) * (∂A2/∂Z2) simplifies to A2 - Y (the prediction error).
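
For reference, this simplification follows directly from the two derivatives above. With binary cross-entropy L = -(Y·log(A2) + (1 - Y)·log(1 - A2)):

∂L/∂A2 = -Y/A2 + (1 - Y)/(1 - A2)

∂A2/∂Z2 = A2 · (1 - A2)

(∂L/∂A2) * (∂A2/∂Z2) = -Y·(1 - A2) + (1 - Y)·A2 = A2 - Y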

Now, applying the final terms:

  • ∂Z2/∂W2: From Z2 = W2 ⋅ F + B2, the derivative with respect to W2 is F.
  • ∂Z2/∂B2: From Z2 = W2 ⋅ F + B2, the derivative with respect to B2 is 1.

Combining these, for a single image instance:

  • ∂L/∂W2 = (A2 - Y) * F^T (where F^T is the transpose of the flattened input, used so that the matrix product has the same shape as W2). The source notes that with batch processing, shape considerations require F^T. The shape of ∂L/∂W2 should match the shape of W2, which is 1x4 in the example, and the calculation (1x1) * (1x4) for a single image, or (1xM) * (Mx4) for a batch of M images, yields a 1x4 result.
  • ∂L/∂B2 = A2 - Y. The shape of ∂L/∂B2 should match B2’s shape (1x1), and this result is consistent.

Thus, the gradients for the fully connected layer parameters are calculated.
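
Continuing the NumPy sketch from the forward pass (same assumed variable names), these two gradients are:

```python
# Error signal at the output: (dL/dA2) * (dA2/dZ2) simplifies to A2 - Y.
dZ2 = A2 - Y                  # 1x1

# Gradients of the fully connected parameters.
dW2 = dZ2 @ F.T               # (1x1) @ (1x4) -> 1x4, same shape as W2
dB2 = dZ2                     # 1x1, same shape as B2
```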

Backpropagation on the Flatten Layer

The flatten layer does not have any trainable parameters. Backpropagation through this layer involves reversing the forward operation. During the forward pass, the 2x2 P1 matrix is reshaped into a 4x1 vector F. When backpropagating, the gradient flowing back to the flatten layer (which is ∂L/∂F, part of the chain rule calculation for upstream layers) is reshaped back into the original shape of the max-pooling output P1 (2x2 in this example).

The derivative of Z2 with respect to F (∂Z2/∂F) is simply W2. This term is part of the gradient calculation for layers before the flatten layer, such as ∂L/∂P1.
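
In the same sketch, the gradient reaching the flatten layer is obtained from W2 and then reshaped back to the 2x2 shape of P1:

```python
# Chain rule through Z2 = W2·F + B2: dZ2/dF is W2, so dL/dF = W2^T · dL/dZ2.
dF = W2.T @ dZ2               # (4x1) @ (1x1) -> 4x1, same shape as F

# Undo the flatten: reshape the gradient back to P1's original shape.
dP1 = dF.reshape(P1.shape)    # 2x2
```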

Backpropagation on the Max-Pooling Layer

Similar to the flatten layer, the max-pooling layer has no trainable parameters. Backpropagation here involves undoing the forward operation, but in a specific way. During the forward pass, only the maximum value from each pooling window is passed through. When backpropagating the gradient ∂L/∂P1 (the 2x2 matrix of gradients flowing back from the flatten layer, conceptually reshaped from ∂L/∂F) to the input of the max-pooling layer (A1, which was 4x4), the gradient is only passed through the locations that were selected as the maximums during the forward pass. All other locations in A1 receive a gradient of zero.

Mathematically, if ∂L/∂P1 represents the gradients flowing back to the output of the max-pooling layer, the gradient passed back to the input A1 (∂L/∂A1) is constructed by taking the elements of ∂L/∂P1 and placing them in the corresponding positions within a zero matrix of A1’s original shape, where the maximums were located in the forward pass. Positions in A1 that were not the maximums within their respective windows receive a gradient of zero.
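
One minimal way to route these gradients (continuing the same sketch) is to rebuild the mask of winning positions from the forward pass and scatter ∂L/∂P1 into it:

```python
# dL/dA1: zeros everywhere except at the position that held the maximum
# of each 2x2 window during the forward pass.
dA1 = np.zeros_like(A1)                       # 4x4
for i in range(P1.shape[0]):
    for j in range(P1.shape[1]):
        window = A1[2*i:2*i+2, 2*j:2*j+2]
        mask = (window == window.max())       # winning position(s) in this window
        dA1[2*i:2*i+2, 2*j:2*j+2] = mask * dP1[i, j]
```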

Backpropagation on the ReLU Layer

The ReLU activation function A1 = ReLU(Z1) also does not have trainable parameters. Backpropagation involves calculating the derivative of the ReLU function.

The derivative of ReLU(Z) with respect to Z is 1 if Z > 0 and 0 if Z <= 0. When backpropagating the gradient ∂L/∂A1 to the input of the ReLU layer (Z1), the gradient ∂L/∂Z1 is calculated by multiplying the incoming gradient ∂L/∂A1 element-wise by the derivative of ReLU(Z1).

∂L/∂Z1 = ∂L/∂A1 * ReLU'(Z1)

Where ReLU'(Z1) is a matrix of the same shape as Z1, containing 1s where the original Z1 was positive and 0s where it was zero or negative. This means only positive activations in the forward pass contribute to the gradient flowing backwards.
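
In the sketch, this is a single element-wise multiplication with a 0/1 mask built from Z1:

```python
# ReLU'(Z1): 1 where Z1 was positive, 0 otherwise.
relu_mask = (Z1 > 0).astype(float)   # 4x4 matrix of 0s and 1s

# Gradient only flows through the positions that were positive in the forward pass.
dZ1 = dA1 * relu_mask                # 4x4
```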

Backpropagation on the Convolutional Layer (W1, B1)

This is the most complex part of CNN backpropagation as it involves updating the weights and bias of the filter. The required gradients are ∂L/∂W1 and ∂L/∂B1. The incoming gradient to this layer is ∂L/∂Z1.

Calculating ∂L/∂B1: The bias B1 is added element-wise to the result of the convolution Convolution(X, W1) to get Z1. This means B1 affects every element of Z1. The source shows that the derivative of L with respect to B1 (∂L/∂B1) is simply the sum of all elements in the gradient matrix ∂L/∂Z1.

∂L/∂B1 = Sum(∂L/∂Z1)

Calculating ∂L/∂W1: This is the most intricate calculation. The derivative ∂L/∂W1 is part of the chain rule involving ∂L/∂Z1 and ∂Z1/∂W1. The source explains this by looking at how each element of the output Z1 is produced by the convolution operation between X and W1. The calculation of ∂L/∂W1 involves an operation that resembles a convolution between the input image X and the gradient matrix ∂L/∂Z1.

Specifically, the source demonstrates how the elements of ∂L/∂W1 are calculated by summing products of elements from X and ∂L/∂Z1 in a structured way. This pattern is identified as a convolution operation between the input X and the gradient ∂L/∂Z1. The source concludes that ∂L/∂W1 is obtained by convolving the input image X with the gradient matrix ∂L/∂Z1.

∂L/∂W1 = Convolution(X, ∂L/∂Z1)

The source notes that the exact form of this convolution might require careful consideration of padding, stride, and filter flipping depending on the exact definition of convolution used, but the conceptual idea is clear: the gradient for the filter weights is derived by convolving the input with the error signal coming back to the output of the convolution layer.
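
To close the sketch, both convolutional-layer gradients follow directly from the formulas above; a 'valid' cross-correlation with unit stride and no padding is assumed, matching the forward pass sketched earlier:

```python
# Bias gradient: B1 is added to every element of Z1, so its gradients sum up.
dB1 = dZ1.sum()                      # scalar

# Filter gradient: "convolve" X with dL/dZ1. Each element dW1[i, j] sums the
# products of the 4x4 gradient map with the 4x4 window of X it touched.
dW1 = np.zeros_like(W1)              # 3x3
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        dW1[i, j] = np.sum(X[i:i+4, j:j+4] * dZ1)
```

With all four gradients in hand, a plain gradient-descent step would then be W1 -= lr * dW1, B1 -= lr * dB1, W2 -= lr * dW2, B2 -= lr * dB2 for some learning rate lr.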

In summary, backpropagation in CNNs involves applying the chain rule backwards through the network. While some layers without parameters simply pass or selectively pass gradients (Flatten, Max-Pooling, ReLU), the layers with parameters (Convolutional, Fully Connected) calculate gradients by combining the incoming gradient with derivatives of their own operations. The calculation for convolutional layer weights is a convolution-like operation itself.