Backpropagation In Convolutional Neural Networks
Jefkine, 5 September 2016

Introduction
Convolutional neural networks (CNNs) are a biologically-inspired variation of multilayer perceptrons (MLPs). Neurons in CNNs share weights, unlike in MLPs where each neuron has a separate weight vector. This sharing of weights ends up reducing the overall number of trainable weights, hence introducing sparsity.
Utilizing the weight sharing strategy, neurons are able to perform convolutions on the data with the convolution filter being formed by the weights. This is then followed by a pooling operation which, as a form of non-linear down-sampling, progressively reduces the spatial size of the representation, thus reducing the amount of computation and parameters in the network.
Existing between the convolution and the pooling layer is an activation function such as the ReLU layer; a non-saturating activation applied element-wise, i.e. $f(x) = \max(0, x)$, thresholding at zero. After several convolutional and pooling layers, the image size (feature map size) is reduced and more complex features are extracted.
Eventually, with a small enough feature map, the contents are squashed into a one-dimensional vector and fed into a fully-connected MLP for processing. The last layer of this fully-connected MLP, seen as the output, is a loss layer which is used to specify how the network training penalizes the deviation between the predicted and true labels.
Before we begin, let's take a look at the mathematical definitions of convolution and cross-correlation:
Cross-correlation
Given an input image $I$ and a filter (kernel) $K$ of dimensions $k_1 \times k_2$, the cross-correlation operation is given by:
$$(I \otimes K)_{ij} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} I(i + m, j + n)\, K(m, n) \tag{1}$$

Convolution
Given an input image $I$ and a filter (kernel) $K$ of dimensions $k_1 \times k_2$, the convolution operation is given by:
$$(I * K)_{ij} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} I(i - m, j - n)\, K(m, n) \tag{2}$$

$$(I * K)_{ij} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} I(i + m, j + n)\, K(-m, -n) \tag{3}$$

From Eq. 3 it is easy to see that convolution is the same as cross-correlation with a flipped kernel, i.e. cross-correlation with the kernel rotated by $180^\circ$, since $K(-m, -n) = \mathrm{rot}_{180^\circ}\{K\}(m, n)$.
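To make this relationship concrete, here is a minimal NumPy sketch (not part of the original derivation); function and variable names are illustrative only, and only the single-channel, "valid"-boundary case is covered:

```python
import numpy as np

def cross_correlate2d(I, K):
    """Eq. 1: out[i, j] = sum_m sum_n I[i + m, j + n] * K[m, n] ("valid" region only)."""
    k1, k2 = K.shape
    H, W = I.shape
    out = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i + k1, j:j + k2] * K)
    return out

def convolve2d(I, K):
    """Eq. 2/3: convolution is cross-correlation with the kernel rotated by 180 degrees."""
    return cross_correlate2d(I, np.rot90(K, 2))

# tiny sanity check: a symmetric kernel makes the two operations coincide
I = np.arange(25, dtype=float).reshape(5, 5)
K_sym = np.array([[0., 1., 0.], [1., 4., 1.], [0., 1., 0.]])
assert np.allclose(convolve2d(I, K_sym), cross_correlate2d(I, K_sym))
```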
Convolutional Neural Networks - CNNs
CNNs consist of convolutional layers which are characterized by an input map $I$, a bank of filters $K$ and biases $b$.
In the case of images, we could have as input an image with height $H$, width $W$ and $C = 3$ channels (red, blue and green) such that $I \in \mathbb{R}^{H \times W \times C}$. Subsequently, for a bank of $D$ filters we have $K \in \mathbb{R}^{k_1 \times k_2 \times C \times D}$ and biases $b \in \mathbb{R}^{D}$, one for each filter.
The output from this convolution procedure is as follows:
$$(I * K)_{ij} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} \sum_{c=1}^{C} K_{m,n,c} \cdot I_{i+m, j+n, c} + b \tag{4}$$

The convolution operation carried out here is the same as cross-correlation, except that the kernel is “flipped” (horizontally and vertically).
For the purposes of simplicity, we shall use the case where the input image is grayscale, i.e. a single channel $C = 1$. Eq. 4 then transforms to:
$$(I * K)_{ij} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} K_{m,n} \cdot I_{i+m, j+n} + b \tag{5}$$

Notation
To help us explore the forward and backpropagation, we shall make use of the following notation:
- $l$ is the $l^{th}$ layer, where $l = 1$ is the first layer and $l = L$ is the last layer.
- Input $x$ is of dimension $H \times W$ and has $i$ by $j$ as the iterators.
- Filter or kernel $w$ is of dimension $k_1 \times k_2$ and has $m$ by $n$ as the iterators.
- $w_{m,n}^{l}$ is the weight matrix connecting neurons of layer $l$ with neurons of layer $l-1$.
- $b^{l}$ is the bias unit at layer $l$.
- $x_{i,j}^{l}$ is the convolved input vector at layer $l$ plus the bias, represented as
  $x_{i,j}^{l} = \sum_{m} \sum_{n} w_{m,n}^{l} o_{i+m, j+n}^{l-1} + b^{l}$
- $o_{i,j}^{l}$ is the output vector at layer $l$, given by
  $o_{i,j}^{l} = f(x_{i,j}^{l})$
- $f(\cdot)$ is the activation function. Application of the activation function to the convolved input vector at layer $l$ is given by $f(x_{i,j}^{l})$.
Forward Propagation
To perform a convolution operation, the kernel is flipped $180^\circ$ and slid across the input feature map in equal and finite strides. At each location, the product between each element of the kernel and the input feature map element it overlaps is computed and the results summed up to obtain the output at that current location.
This procedure is repeated using different kernels to form as many output feature maps as desired. The concept of weight sharing is used as demonstrated in the diagram below:
Units in the convolutional layer illustrated above have receptive fields of size 4 in the input feature map and are thus only connected to 4 adjacent neurons in the input layer. This is the idea of sparse connectivity in CNNs, where there exists a local connectivity pattern between neurons in adjacent layers.
The color codes of the weights joining the input layer to the convolutional layer show how the kernel weights are distributed (shared) amongst neurons in the adjacent layers. Weights of the same color are constrained to be identical.
The convolution process here is usually expressed as a cross-correlation but with a flipped kernel. In the diagram below we illustrate a kernel that has been flipped both horizontally and vertically:
The convolution equation of the input at layer l is given by:
$$x_{i,j}^{l} = \mathrm{rot}_{180^\circ}\left\{ w_{m,n}^{l} \right\} * o_{i,j}^{l-1} + b_{i,j}^{l} \tag{6}$$

$$x_{i,j}^{l} = \sum_{m} \sum_{n} w_{m,n}^{l} o_{i+m, j+n}^{l-1} + b_{i,j}^{l} \tag{7}$$

$$o_{i,j}^{l} = f(x_{i,j}^{l}) \tag{8}$$

This is illustrated below:
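As a concrete companion to Eq. 7 and Eq. 8, here is a minimal sketch of the forward pass for a single convolutional layer. It assumes a single channel, stride 1, a "valid" output region and ReLU as the activation $f$; the names are illustrative and not taken from any particular library:

```python
import numpy as np

def relu(x):
    """Non-saturating activation f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def conv_forward(o_prev, w, b):
    """One convolutional layer (Eq. 7 and Eq. 8):
    x[i, j] = sum_m sum_n w[m, n] * o_prev[i + m, j + n] + b,   o[i, j] = f(x[i, j])."""
    k1, k2 = w.shape
    H, W = o_prev.shape
    x = np.zeros((H - k1 + 1, W - k2 + 1))
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x[i, j] = np.sum(w * o_prev[i:i + k1, j:j + k2]) + b
    return x, relu(x)  # x is kept because f'(x) is needed during backpropagation
```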
Error
For a total of $P$ predictions, given the predicted network outputs $y_p$ and their corresponding targeted values $t_p$, the mean squared error is given by:

$$E = \frac{1}{2} \sum_{p} \left( t_p - y_p \right)^2 \tag{9}$$
Learning will be achieved by adjusting the weights such that $y_p$ is as close as possible to, or equal to, the corresponding $t_p$. In the classical backpropagation algorithm, the weights are changed according to the gradient descent direction of an error surface $E$.
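In code, the error of Eq. 9 and its gradient with respect to the predictions (the quantity backpropagation starts from) might look like the following sketch, assuming NumPy arrays of predictions and targets:

```python
import numpy as np

def mse_error(y, t):
    """Eq. 9: E = 1/2 * sum_p (t_p - y_p)^2, together with dE/dy_p = -(t_p - y_p)."""
    E = 0.5 * np.sum((t - y) ** 2)
    dE_dy = -(t - y)  # gradient that is propagated backwards into the network
    return E, dE_dy
```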
Backpropagation
For backpropagation there are two updates performed, for the weights and the deltas. Let's begin with the weight update.
We are looking to compute $\frac{\partial E}{\partial w_{m',n'}^{l}}$, which can be interpreted as the measurement of how the change in a single pixel $w_{m',n'}$ in the weight kernel affects the loss function $E$.
During forward propagation, the convolution operation ensures that the yellow pixel $w_{m',n'}$ in the weight kernel makes a contribution to all the products (between each element of the weight kernel and the input feature map element it overlaps). This means that pixel $w_{m',n'}$ will eventually affect all the elements in the output feature map.
Convolution between the input feature map of dimension $H \times W$ and the weight kernel of dimension $k_1 \times k_2$ produces an output feature map of size $(H - k_1 + 1)$ by $(W - k_2 + 1)$. The gradient component for the individual weights can be obtained by applying the chain rule in the following way:
$$\frac{\partial E}{\partial w_{m',n'}^{l}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \frac{\partial E}{\partial x_{i,j}^{l}} \frac{\partial x_{i,j}^{l}}{\partial w_{m',n'}^{l}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \delta_{i,j}^{l} \frac{\partial x_{i,j}^{l}}{\partial w_{m',n'}^{l}} \tag{10}$$

In Eq. 10, $x_{i,j}^{l}$ is equivalent to $\sum_{m} \sum_{n} w_{m,n}^{l} o_{i+m, j+n}^{l-1} + b^{l}$, and expanding this part of the equation gives us:
$$\frac{\partial x_{i,j}^{l}}{\partial w_{m',n'}^{l}} = \frac{\partial}{\partial w_{m',n'}^{l}} \left( \sum_{m} \sum_{n} w_{m,n}^{l} o_{i+m, j+n}^{l-1} + b^{l} \right) \tag{11}$$

Further expanding the summations in Eq. 11 and taking the partial derivatives for all the components results in zero values for all except the components where $m = m'$ and $n = n'$ in $w_{m,n}^{l} o_{i+m, j+n}^{l-1}$, as follows:
$$\begin{aligned}
\frac{\partial x_{i,j}^{l}}{\partial w_{m',n'}^{l}} &= \frac{\partial}{\partial w_{m',n'}^{l}} \left( w_{0,0}^{l} o_{i+0, j+0}^{l-1} + \dots + w_{m',n'}^{l} o_{i+m', j+n'}^{l-1} + \dots + b^{l} \right) \\
&= \frac{\partial}{\partial w_{m',n'}^{l}} \left( w_{m',n'}^{l} o_{i+m', j+n'}^{l-1} \right) \\
&= o_{i+m', j+n'}^{l-1}
\end{aligned} \tag{12}$$

Substituting Eq. 12 in Eq. 10 gives us the following results:
$$\frac{\partial E}{\partial w_{m',n'}^{l}} = \sum_{i=0}^{H-k_1} \sum_{j=0}^{W-k_2} \delta_{i,j}^{l}\, o_{i+m', j+n'}^{l-1} \tag{13}$$

$$\frac{\partial E}{\partial w_{m',n'}^{l}} = \mathrm{rot}_{180^\circ}\left\{ \delta_{i,j}^{l} \right\} * o_{m',n'}^{l-1} \tag{14}$$

The dual summation in Eq. 13 is a result of weight sharing in the network (the same weight kernel is slid over all of the input feature map during convolution). The summations represent a collection of all the gradients $\delta_{i,j}^{l}$ coming from all the outputs in layer $l$.
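A minimal NumPy sketch of Eq. 13 follows: the gradient for each kernel weight is the sum of the deltas multiplied by the patch of the previous layer's output they overlap, i.e. a "valid" cross-correlation of $o^{l-1}$ with $\delta^{l}$. The bias gradient shown is the usual sum of deltas; it is included as an assumption, since the article does not derive it explicitly:

```python
import numpy as np

def conv_weight_grad(o_prev, delta, k1, k2):
    """Eq. 13: dE/dw[m, n] = sum_i sum_j delta[i, j] * o_prev[i + m, j + n]."""
    dw = np.zeros((k1, k2))
    dH, dW = delta.shape
    for m in range(k1):
        for n in range(k2):
            dw[m, n] = np.sum(delta * o_prev[m:m + dH, n:n + dW])
    db = np.sum(delta)  # every x[i, j] contains the bias, so its gradient is the sum of deltas
    return dw, db
```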
Obtaining gradients w.r.t. the filter maps, we have a cross-correlation which is transformed to a convolution by “flipping” the delta matrix $\delta_{i,j}^{l}$ (horizontally and vertically), the same way we flipped the filters during the forward propagation.
An illustration of the flipped delta matrix is shown below:
The diagram below shows the gradients ($\delta_{11}, \delta_{12}, \delta_{21}, \delta_{22}$) generated during backpropagation:
The convolution operation used to obtain the new set of weights is shown below:
During the reconstruction process, the deltas ($\delta_{11}, \delta_{12}, \delta_{21}, \delta_{22}$) are used. These deltas are provided by an equation of the form:
$$\delta_{i,j}^{l} = \frac{\partial E}{\partial x_{i,j}^{l}} \tag{15}$$
At this point we are looking to compute $\frac{\partial E}{\partial x_{i',j'}^{l}}$, which can be interpreted as the measurement of how the change in a single pixel $x_{i',j'}$ in the input feature map affects the loss function $E$.
From the diagram above, we can see that the region in the output affected by pixel $x_{i',j'}$ from the input is the region bounded by the dashed lines, where the top left corner pixel is given by $(i' - k_1 + 1,\, j' - k_2 + 1)$ and the bottom right corner pixel is given by $(i', j')$.
Using the chain rule and introducing sums gives us the following equation:
$$\frac{\partial E}{\partial x_{i',j'}^{l}} = \sum_{i,j \in Q} \frac{\partial E}{\partial x_{Q}^{l+1}} \frac{\partial x_{Q}^{l+1}}{\partial x_{i',j'}^{l}} = \sum_{i,j \in Q} \delta_{Q}^{l+1} \frac{\partial x_{Q}^{l+1}}{\partial x_{i',j'}^{l}} \tag{16}$$

$Q$ in the summation above represents the output region bounded by the dashed lines and is composed of the pixels in the output that are affected by the single pixel $x_{i',j'}$ in the input feature map. A more formal way of representing Eq. 16 is:
$$\frac{\partial E}{\partial x_{i',j'}^{l}} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} \frac{\partial E}{\partial x_{i'-m, j'-n}^{l+1}} \frac{\partial x_{i'-m, j'-n}^{l+1}}{\partial x_{i',j'}^{l}} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} \delta_{i'-m, j'-n}^{l+1} \frac{\partial x_{i'-m, j'-n}^{l+1}}{\partial x_{i',j'}^{l}} \tag{17}$$

In the region $Q$, the height ranges from $i' - 0$ to $i' - (k_1 - 1)$ and the width from $j' - 0$ to $j' - (k_2 - 1)$. These can simply be represented by $i' - m$ and $j' - n$ in the summation, since the iterators $m$ and $n$ exist in the ranges $0 \leq m \leq k_1 - 1$ and $0 \leq n \leq k_2 - 1$.
In Eq. 17, $x_{i'-m, j'-n}^{l+1}$ is equivalent to $\sum_{m'} \sum_{n'} w_{m',n'}^{l+1} o_{i'-m+m', j'-n+n'}^{l} + b^{l+1}$, and expanding this part of the equation gives us:
$$\begin{aligned}
\frac{\partial x_{i'-m, j'-n}^{l+1}}{\partial x_{i',j'}^{l}} &= \frac{\partial}{\partial x_{i',j'}^{l}} \left( \sum_{m'} \sum_{n'} w_{m',n'}^{l+1} o_{i'-m+m', j'-n+n'}^{l} + b^{l+1} \right) \\
&= \frac{\partial}{\partial x_{i',j'}^{l}} \left( \sum_{m'} \sum_{n'} w_{m',n'}^{l+1} f\left( x_{i'-m+m', j'-n+n'}^{l} \right) + b^{l+1} \right)
\end{aligned} \tag{18}$$

Further expanding the summation in Eq. 18 and taking the partial derivatives for all the components results in zero values for all except the components where $m' = m$ and $n' = n$, so that $f(x_{i'-m+m', j'-n+n'}^{l})$ becomes $f(x_{i',j'}^{l})$ and $w_{m',n'}^{l+1}$ becomes $w_{m,n}^{l+1}$ in the relevant part of the expanded summation, as follows:
$$\begin{aligned}
\frac{\partial x_{i'-m, j'-n}^{l+1}}{\partial x_{i',j'}^{l}} &= \frac{\partial}{\partial x_{i',j'}^{l}} \left( w_{0,0}^{l+1} f\left( x_{i'-m+0, j'-n+0}^{l} \right) + \dots + w_{m,n}^{l+1} f\left( x_{i',j'}^{l} \right) + \dots + b^{l+1} \right) \\
&= \frac{\partial}{\partial x_{i',j'}^{l}} \left( w_{m,n}^{l+1} f\left( x_{i',j'}^{l} \right) \right) \\
&= w_{m,n}^{l+1} \frac{\partial}{\partial x_{i',j'}^{l}} \left( f\left( x_{i',j'}^{l} \right) \right) \\
&= w_{m,n}^{l+1} f'\left( x_{i',j'}^{l} \right)
\end{aligned} \tag{19}$$

Substituting Eq. 19 in Eq. 17 gives us the following results:
$$\frac{\partial E}{\partial x_{i',j'}^{l}} = \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} \delta_{i'-m, j'-n}^{l+1}\, w_{m,n}^{l+1}\, f'\left( x_{i',j'}^{l} \right) \tag{20}$$

For backpropagation, we make use of the flipped kernel, and as a result we will now have a convolution that is expressed as a cross-correlation with a flipped kernel:
$$\begin{aligned}
\frac{\partial E}{\partial x_{i',j'}^{l}} &= \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} \delta_{i'-m, j'-n}^{l+1}\, w_{m,n}^{l+1}\, f'\left( x_{i',j'}^{l} \right) \\
&= \mathrm{rot}_{180^\circ}\left\{ \sum_{m=0}^{k_1 - 1} \sum_{n=0}^{k_2 - 1} \delta_{i'+m, j'+n}^{l+1}\, w_{m,n}^{l+1} \right\} f'\left( x_{i',j'}^{l} \right) \\
&= \delta_{i',j'}^{l+1} * \mathrm{rot}_{180^\circ}\left\{ w_{m,n}^{l+1} \right\} f'\left( x_{i',j'}^{l} \right)
\end{aligned} \tag{21}$$
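The sketch below implements the final delta equation above (the "full" convolution of $\delta^{l+1}$ with the $180^\circ$-rotated kernel, scaled element-wise by $f'(x^{l})$) for a single-channel layer, assuming ReLU as $f$, stride 1 and a "valid" forward convolution. The full convolution is realised by zero-padding the deltas of layer $l+1$; the names are illustrative only:

```python
import numpy as np

def relu_grad(x):
    """Derivative of f(x) = max(0, x): 1 where x > 0, 0 elsewhere."""
    return (x > 0).astype(x.dtype)

def conv_backward_deltas(delta_next, w, x):
    """Deltas for layer l: full convolution of the layer l+1 deltas with the
    180-degree rotated kernel, scaled element-wise by f'(x^l)."""
    k1, k2 = w.shape
    # zero-pad so that terms falling outside the delta map contribute nothing
    delta_pad = np.pad(delta_next, ((k1 - 1, k1 - 1), (k2 - 1, k2 - 1)))
    w_rot = np.rot90(w, 2)  # kernel flipped horizontally and vertically
    H, W = x.shape
    delta = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            delta[i, j] = np.sum(delta_pad[i:i + k1, j:j + k2] * w_rot)
    return delta * relu_grad(x)
```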
Pooling Layer

The function of the pooling layer is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. No learning takes place on the pooling layers [2].
Pooling units are obtained using functions like max-pooling, average pooling and even L2-norm pooling. At the pooling layer, forward propagation results in an $N \times N$ pooling block being reduced to a single value - the value of the “winning unit”. Backpropagation of the pooling layer then computes the error which is acquired by this single “winning unit”.
To keep track of the “winning unit”, its index is noted during the forward pass and used for gradient routing during backpropagation. Gradient routing is done in the following ways (a short sketch follows the list):
- Max-pooling - the error is just assigned to where it comes from - the “winning unit” - because the other units in the previous layer’s pooling blocks did not contribute to it; hence all the other units are assigned values of zero.
- Average pooling - the error is multiplied by $\frac{1}{N \times N}$ and assigned to the whole pooling block (all units get this same value).
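Below is a minimal sketch (illustrative names, NumPy only, non-overlapping $N \times N$ blocks assumed) of both routing schemes; the max-pooling forward pass records the winning index so the backward pass knows where to send the gradient:

```python
import numpy as np

def max_pool_forward(o, N):
    """Reduce each non-overlapping N x N block to its maximum and remember the winner's index."""
    H, W = o.shape
    out = np.zeros((H // N, W // N))
    winners = np.zeros(out.shape, dtype=int)   # flat index of the "winning unit" in each block
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            block = o[i * N:(i + 1) * N, j * N:(j + 1) * N]
            winners[i, j] = np.argmax(block)
            out[i, j] = block.flat[winners[i, j]]
    return out, winners

def max_pool_backward(delta_out, winners, N):
    """Route each incoming gradient to its "winning unit"; every other unit gets zero."""
    delta_in = np.zeros((delta_out.shape[0] * N, delta_out.shape[1] * N))
    for i in range(delta_out.shape[0]):
        for j in range(delta_out.shape[1]):
            m, n = np.unravel_index(winners[i, j], (N, N))
            delta_in[i * N + m, j * N + n] = delta_out[i, j]
    return delta_in

def avg_pool_backward(delta_out, N):
    """Spread each incoming gradient uniformly: every unit in the block gets delta / (N * N)."""
    return np.kron(delta_out, np.ones((N, N))) / (N * N)
```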
Conclusion
Convolutional neural networks employ a weight sharing strategy that leads to a significant reduction in the number of parameters that have to be learned. The presence of larger receptive field sizes of neurons in successive convolutional layers, coupled with the presence of pooling layers, also leads to translation invariance. As we have observed, the derivations of forward and backward propagation will differ depending on which layer we are propagating through.
References
1. Dumoulin, Vincent, and Francesco Visin. “A guide to convolution arithmetic for deep learning.” stat 1050 (2016): 23.
2. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541–551 (1989).
3. Wikipedia page on Convolutional neural network.
4. Convolutional Neural Networks (LeNet), deeplearning.net.
5. Convolutional Neural Networks, UFLDL Tutorial.