Road to ML Engineer #20 - Convolutional Neural Networks

Last Edited: 9/18/2024

This blog post discusses convolutional neural networks in deep learning.


We have already seen how powerful simple feedforward neural networks are, but we have also seen how limited they can be when handling complexity. When training dense layers in more complex models like VAEs and GANs, even on small images, computation and training became significantly slower and required more hyperparameter tuning. Hence, we need a new, more efficient way of processing larger, higher-dimensional data. The more efficient method we will discuss in this article is the Convolutional Neural Network (CNN).

Kernel Convolution

Kernel convolution is when you take a small kernel (or filter), slide it over the image, and at each position take the linear combination (sum of products) of the kernel values and the pixel values underneath, creating a new image. It is easier to understand visually, so I have attached a visualization of convolution below.

[Figure: kernel convolution of a 2x2 kernel over a 3x3 image]

The image above shows a kernel convolution with a 2x2 kernel on a 3x3 image. In particular, it applies a blur kernel (or filter), which is equivalent to taking an equally weighted average of nearby pixels. You can imagine changing the kernel size and values to extract edges and other features as well. (If you're interested in edge detection, I recommend watching the video Finding the Edges (Sobel Operator) - Computerphile.)
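To make this concrete, here is a small NumPy sketch that slides a 2x2 averaging kernel over a hypothetical 3x3 image (the pixel values are made up for illustration):

import numpy as np

## A made-up 3x3 image and a 2x2 averaging (blur) kernel
image = np.array([[1., 2., 3.],
                  [4., 5., 6.],
                  [7., 8., 9.]])
kernel = np.full((2, 2), 0.25)

## Slide the kernel over the image and take the sum of products at each position
output = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        output[i, j] = np.sum(image[i:i+2, j:j+2] * kernel)

print(output)  # [[3. 4.], [6. 7.]]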

Kernels as Neurons

In a feedforward neural network, we set up neurons that perform linear combinations with all the input activations. This means that each neuron will have the same number of weights as the number of input activations. So, if we had an image of size 784 (28x28), each neuron in the first hidden layer would have 784 weights, corresponding to each input activation, in order to extract features.

However, if we apply kernel convolution instead of fully connected neurons, we only need weights corresponding to the entries of the kernel. This allows us to share the same weights across different pixel locations and drastically reduces the number of weights. Another major advantage is that convolution is more robust to shifts in the image, as the same weights are applied at every location, unlike neurons, which have a different weight for every pixel.
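As a rough illustration of the difference in parameter count (the layer sizes below are arbitrary choices of mine, not from any particular model):

## Weights needed to connect a flattened 28x28 image to 128 dense neurons
dense_weights = 784 * 128   # 100,352 weights (plus biases)

## Weights needed for 32 separate 3x3 kernels, shared across every pixel position
conv_weights = 32 * 3 * 3   # 288 weights (plus biases)

print(dense_weights, conv_weights)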

By stacking kernels and applying non-linear activation functions after the convolution, we can create a convolutional layer. By stacking these convolutional layers as hidden layers to capture different levels of features, we create Convolutional Neural Networks. When defining a convolutional layer, you can decide how many pixels the filter slides across the image at each step (stride), and how many pixels of border you add around the image to manipulate the output dimensions (padding).
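For example, a single convolutional layer with an explicit stride and padding might look like this in PyTorch (the channel counts and sizes here are arbitrary, just for illustration):

import torch
import torch.nn as nn

## 16 kernels of size 3x3, moving 2 pixels at a time, with 1 pixel of zero padding
conv = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=2, padding=1)

## A batch of 8 single-channel 28x28 images -> output of shape (8, 16, 14, 14)
out = torch.relu(conv(torch.randn(8, 1, 28, 28)))
print(out.shape)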

Backpropagation

To train the kernel weights, we need to compute the partial derivative of the loss function with respect to the kernel weights and the input features for further backpropagation. Let's first express the convolution operation mathematically.

$$ O = X * F $$

Here, $O$ represents the output of the convolution, $X$ is the input features, $*$ is the symbol for the convolution operation, and $F$ is the filter or kernel. When applying a discrete convolution like the example above, the computations are as follows:

$$ \begin{aligned} O_{1,1} &= X_{1,1} F_{1,1} + X_{1,2} F_{1,2} + X_{2,1} F_{2,1} + X_{2,2} F_{2,2} \\ O_{1,2} &= X_{1,2} F_{1,1} + X_{1,3} F_{1,2} + X_{2,2} F_{2,1} + X_{2,3} F_{2,2} \\ O_{2,1} &= X_{2,1} F_{1,1} + X_{2,2} F_{1,2} + X_{3,1} F_{2,1} + X_{3,2} F_{2,2} \\ O_{2,2} &= X_{2,2} F_{1,1} + X_{2,3} F_{1,2} + X_{3,2} F_{2,1} + X_{3,3} F_{2,2} \end{aligned} $$

First, let's compute the loss gradient with respect to the kernel weights, which can be expressed as:

$$ \frac{\partial L}{\partial F_i} = \sum_{k=1}^{M} \frac{\partial L}{\partial O_k} \frac{\partial O_k}{\partial F_i} $$

The above can be expanded as follows for $F_{1,1}$:

$$ \frac{\partial L}{\partial F_{1,1}} = \frac{\partial L}{\partial O_{1,1}} \frac{\partial O_{1,1}}{\partial F_{1,1}} + \frac{\partial L}{\partial O_{1,2}} \frac{\partial O_{1,2}}{\partial F_{1,1}} + \frac{\partial L}{\partial O_{2,1}} \frac{\partial O_{2,1}}{\partial F_{1,1}} + \frac{\partial L}{\partial O_{2,2}} \frac{\partial O_{2,2}}{\partial F_{1,1}} $$

Because we are simply multiplying $X$ and $F$, each partial derivative $\frac{\partial O}{\partial F_{1,1}}$ is just the corresponding $X$. Hence, we can rewrite the above derivative for $F_{1,1}$ as shown below:

$$ \frac{\partial L}{\partial F_{1,1}} = \frac{\partial L}{\partial O_{1,1}} X_{1,1} + \frac{\partial L}{\partial O_{1,2}} X_{1,2} + \frac{\partial L}{\partial O_{2,1}} X_{2,1} + \frac{\partial L}{\partial O_{2,2}} X_{2,2} $$

This applies to all the filter values $F$. Do you notice something from the above equation? Yes, the partial derivative of the loss function with respect to the kernel weights is just the convolution of $X$ with the loss gradient with respect to the output.

$$ \frac{\partial L}{\partial F} = X * \frac{\partial L}{\partial O} $$

Next, let's compute the loss gradient with respect to the input features $X$, which can be expressed as:

$$ \frac{\partial L}{\partial X_i} = \sum_{k=1}^{M} \frac{\partial L}{\partial O_k} \frac{\partial O_k}{\partial X_i} $$

We can expand the above for some $X$ values:

$$ \begin{aligned} \frac{\partial L}{\partial X_{1,1}} &= \frac{\partial L}{\partial O_{1,1}} F_{1,1} \\ \frac{\partial L}{\partial X_{1,2}} &= \frac{\partial L}{\partial O_{1,1}} F_{1,2} + \frac{\partial L}{\partial O_{1,2}} F_{1,1} \end{aligned} $$

It's hard to see from the above alone, but we can confirm that it is equivalent to the full convolution of the kernel, rotated 180 degrees, with the loss gradient with respect to the output. (Full convolution applies the kernel wherever there is any overlap with the input; the kernel does not need to overlap the input entirely. You can also interpret it as applying zero padding to the input before a normal convolution.)

$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial O} *_{full} F_{rotated} $$

Note: The above applies only to a normal convolution with a stride of 1 and no padding. For convolutions with padding, and for other types of convolutions such as dilated convolutions and pixel shuffle convolutions, the derivatives need to be computed in ways specific to those operations. (For more details, I recommend checking CNN Backpropagation by Sadiq, R. (2021) and other resources online.)
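To sanity-check the two gradient formulas for the stride-1, no-padding case, here is a small numerical sketch using SciPy with made-up values (note that the "convolution" used in CNNs corresponds to SciPy's cross-correlation):

import numpy as np
from scipy.signal import correlate2d

## A made-up 3x3 input, 2x2 kernel, and upstream gradient dL/dO
X = np.arange(9, dtype=float).reshape(3, 3)
F = np.array([[1., 2.], [3., 4.]])
dO = np.array([[0.1, 0.2], [0.3, 0.4]])   # pretend this came from the next layer

## Forward pass: CNN-style convolution is a valid cross-correlation
O = correlate2d(X, F, mode="valid")                  # shape (2, 2)

## dL/dF: convolution of the input X with dL/dO
dF = correlate2d(X, dO, mode="valid")                # shape (2, 2), same as F

## dL/dX: full convolution of dL/dO with the 180-degree-rotated kernel
dX = correlate2d(dO, np.rot90(F, 2), mode="full")    # shape (3, 3), same as X

## Spot check against the expanded formulas, e.g. dL/dX_{1,1} = dL/dO_{1,1} F_{1,1}
print(np.isclose(dX[0, 0], dO[0, 0] * F[0, 0]))      # True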

Code Implementation

Let's try implementing convolutional neural networks (CNNs). We will use the MNIST dataset to implement a multiclass CNN-based classifier that classifies handwritten digits.

Step 1 & 2. Data Exploration and Preprocessing

As we have already explored the MNIST dataset, we can jump straight to data preprocessing. Unlike feedforward neural networks, we do not need to flatten the images; instead, we can operate directly on the 2D images. However, we need to introduce an additional dimension for the channels. (TensorFlow expects channels-last, while PyTorch expects channels-first.) Below is the code for preprocessing.

import numpy as np
from tensorflow import keras

## TensorFlow Reshape (channels-last: samples x height x width x channels)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)
 
## PyTorch Reshape (channels-first: samples x channels x height x width)
X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1], X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1], X_test.shape[2])
 
## Standardize the pixel values with z-score normalization
def zscore(X, axis=None):
    X_mean = X.mean(axis=axis, keepdims=True)
    X_std = np.std(X, axis=axis, keepdims=True)
    return (X - X_mean) / X_std
 
X_train = zscore(X_train)
X_test = zscore(X_test)
 
## One-hot encode the labels
y_train = keras.utils.to_categorical(y_train)
y_test = keras.utils.to_categorical(y_test)
 
from sklearn.model_selection import train_test_split
 
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=10000, random_state=101)
 
## For PyTorch Only
import torch
import torch.nn as nn

## Convert the numpy arrays to float tensors
X_train, X_val, X_test = map(lambda X: torch.tensor(X, dtype=torch.float32), (X_train, X_val, X_test))
y_train, y_val, y_test = map(lambda y: torch.tensor(y, dtype=torch.float32), (y_train, y_val, y_test))
 
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
val_dataset = torch.utils.data.TensorDataset(X_val, y_val)
test_dataset = torch.utils.data.TensorDataset(X_test, y_test)
 
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=32, shuffle=True)
val_loader = torch.utils.data.DataLoader(dataset=val_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=1, shuffle=True)

Step 3. Model

Below is an example implementation of a CNN-based image classifier in TensorFlow and PyTorch.
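As a reference, here is a minimal sketch of what such a model might look like in both frameworks. The layer sizes, kernel sizes, and strides below are my own illustrative choices rather than the only reasonable ones; strided convolutions are used for downsampling since we have only covered convolutional and dense layers so far.

## TensorFlow Model (a minimal sketch; hyperparameters are illustrative)
from tensorflow import keras
from tensorflow.keras import layers

tf_model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, strides=2, activation="relu"),  # 28x28x1 -> 13x13x32
    layers.Conv2D(64, kernel_size=3, strides=2, activation="relu"),  # 13x13x32 -> 6x6x64
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
tf_model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

## PyTorch Model (a minimal sketch; hyperparameters are illustrative)
import torch
import torch.nn as nn

class CNNClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2),   # 1x28x28 -> 32x13x13
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2),  # 32x13x13 -> 64x6x6
            nn.ReLU(),
        )
        self.classifier = nn.Linear(64 * 6 * 6, 10)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        return self.classifier(x)   # logits; pair with nn.CrossEntropyLoss

torch_model = CNNClassifier()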

Here, I will omit the training results and Step 4 (model evaluation) since there isn't much to discuss. I highly recommend you try it yourself as practice. (Spoiler: It learns to classify digits extremely well with fewer parameters than feedforward neural networks.)

Tip for Dimension Calculation

If you are not familiar with convolutional layers, you might be confused about the output dimensions of the layers when certain kernel sizes, strides, and paddings are used. In such cases, you can use the following formula to determine the output dimension:

$$ D_{out} = \frac{D_{in} + 2p - k}{s} + 1 $$

Here, $D_{out}$ is the dimension of the output after convolution, $D_{in}$ is the input dimension before the convolution, $p$ is the padding, $k$ is the kernel size, and $s$ is the stride.
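For example, a tiny helper function (hypothetical, just to apply the formula above) makes it easy to check that a 28x28 input with a 3x3 kernel and a stride of 2 gives an output dimension of 13 with no padding, or 14 with a padding of 1. When the division is not exact, frameworks round the result down.

def conv_output_dim(d_in, k, s=1, p=0):
    ## (D_in + 2p - k) / s + 1, rounded down when the division is not exact
    return (d_in + 2 * p - k) // s + 1

print(conv_output_dim(28, k=3, s=2, p=0))   # 13
print(conv_output_dim(28, k=3, s=2, p=1))   # 14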

Conclusion

In this article, we covered how kernels can be used as neurons to reduce the number of parameters and form convolutional layers and convolutional neural networks. We also discussed how to compute the gradient of the convolution operation and how to implement convolutional neural networks in TensorFlow and PyTorch.

While convolutional layers are quite useful, there is one problem: we cannot use convolutional layers to expand dimensions like dense layers can. Technically, we could expand dimensions by adding more padding, but it is not ideal to add excessive padding with no useful information. In the next article, we will discuss how the concept of convolution can be used to expand dimensions.