Road to ML Engineer #21 - Transposed Convolution

Last Edited: 9/21/2024

This blog post discusses transposed convolution in deep learning.

ML

In the previous article, we covered CNNs and their issue with expanding dimensions. Here, we will discuss the solution to this problem: Transposed Convolution.

Transposed Convolution

In kernel convolution, we compute a linear combination of pixel values and kernel values. However, this approach does not help expand dimensions unless padding is applied. Instead of taking a linear combination over a window, we can multiply each pixel value by every kernel value and combine the scaled, shifted copies of the kernel in the following manner.

[Figure: Transposed convolution of a 2x2 input (left) with a 2x2 kernel (right), producing a 3x3 output]

The above example applies transposed convolution on a 2x2 image (left) with a 2x2 kernel (right). We can confirm that transposed convolution can expand the image’s dimensions from 2x2 to 3x3. In the previous article, however, there was one hidden operation that also expanded the dimensions. This occurred during backpropagation for input activations, which involved full convolution with the rotated kernel.

[Figure: Full convolution of the input with the 180°-rotated kernel]

The illustration above shows full convolution with the rotated kernel. It turns out that this backward operation of convolution is equivalent to transposed convolution and can be used in the forward pass for dimensional expansion. (In other words, convolution uses transposed convolution to compute the gradients with respect to its input.)
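To make both views concrete, below is a minimal NumPy sketch (the helper names and example values are my own, not taken from the figures). It computes the transposed convolution by scattering scaled copies of the kernel and then checks that a full convolution with the 180°-rotated kernel produces the same 3x3 output.

import numpy as np

def transposed_conv2d(x, f):
    # Scatter-and-add view: every input pixel scales the whole kernel,
    # and the scaled kernels are summed at the pixel's offset.
    h, w = x.shape
    kh, kw = f.shape
    out = np.zeros((h + kh - 1, w + kw - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + kh, j:j + kw] += x[i, j] * f
    return out

def full_conv_rotated(x, f):
    # Full convolution view: pad the input by (kernel size - 1) on every side,
    # then cross-correlate it with the 180-degree rotated kernel.
    kh, kw = f.shape
    f_rot = np.rot90(f, 2)
    x_pad = np.pad(x, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    out = np.zeros((x_pad.shape[0] - kh + 1, x_pad.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x_pad[i:i + kh, j:j + kw] * f_rot)
    return out

x = np.array([[1., 2.], [3., 4.]])  # 2x2 input
f = np.array([[1., 5.], [2., 3.]])  # 2x2 kernel
print(transposed_conv2d(x, f))                                        # 3x3 output
print(np.allclose(transposed_conv2d(x, f), full_conv_rotated(x, f)))  # True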

Backpropagation

To train the kernel weights, we need to compute the partial derivative of the loss function with respect to the kernel weights and the input features for further backpropagation. Let's first express the transposed convolution operation mathematically:

O = X \circledast F

Here, O represents the output of the transposed convolution, X is the input features, \circledast is the symbol for the transposed convolution operation, and F is the filter or kernel. When applying transposed convolution as in the example above, the computations are detailed below:

O_{1,1} = X_{1,1} F_{1,1} \\
O_{1,2} = X_{1,2} F_{1,1} + X_{1,1} F_{1,2} \\
O_{1,3} = X_{1,2} F_{1,2} \\
O_{2,1} = X_{1,1} F_{2,1} + X_{2,1} F_{1,1} \\
O_{2,2} = X_{1,1} F_{2,2} + X_{1,2} F_{2,1} + X_{2,1} F_{1,2} + X_{2,2} F_{1,1} \\
O_{2,3} = X_{1,2} F_{2,2} + X_{2,2} F_{1,2} \\
O_{3,1} = X_{2,1} F_{2,1} \\
O_{3,2} = X_{2,1} F_{2,2} + X_{2,2} F_{2,1} \\
O_{3,3} = X_{2,2} F_{2,2}
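These expansions can also be checked against an existing implementation. For instance, a quick sanity check with PyTorch's conv_transpose2d (the example values below are my own) reproduces the same nine sums:

import torch
import torch.nn.functional as F

x = torch.tensor([[1., 2.], [3., 4.]]).reshape(1, 1, 2, 2)  # add batch and channel dims
w = torch.tensor([[1., 5.], [2., 3.]]).reshape(1, 1, 2, 2)  # 2x2 kernel
out = F.conv_transpose2d(x, w)  # shape (1, 1, 3, 3)
# For example, out[0, 0, 1, 1] equals X_{1,1}F_{2,2} + X_{1,2}F_{2,1} + X_{2,1}F_{1,2} + X_{2,2}F_{1,1}.
print(out[0, 0])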

First, let's compute the loss gradient with respect to the kernel weights, which can be expressed as:

\frac{\partial L}{\partial F_i} = \sum_{k=1}^{M} \frac{\partial L}{\partial O_k} \frac{\partial O_k}{\partial F_i}

The above can be expanded as follows for F_{1,1}:

\frac{\partial L}{\partial F_{1,1}} = \frac{\partial L}{\partial O_{1,1}} \frac{\partial O_{1,1}}{\partial F_{1,1}} + \frac{\partial L}{\partial O_{1,2}} \frac{\partial O_{1,2}}{\partial F_{1,1}} + \frac{\partial L}{\partial O_{2,1}} \frac{\partial O_{2,1}}{\partial F_{1,1}} + \frac{\partial L}{\partial O_{2,2}} \frac{\partial O_{2,2}}{\partial F_{1,1}}

Since we are simply multiplying X and F, the partial derivative ∂O/∂F_{1,1} is just the corresponding X. Hence, we can rewrite the above derivative for F_{1,1} as shown below:

\frac{\partial L}{\partial F_{1,1}} = \frac{\partial L}{\partial O_{1,1}} X_{1,1} + \frac{\partial L}{\partial O_{1,2}} X_{1,2} + \frac{\partial L}{\partial O_{2,1}} X_{2,1} + \frac{\partial L}{\partial O_{2,2}} X_{2,2}

This applies to all the filter values F. Hence, the partial derivative of the loss function with respect to the kernel weights is just the convolution of X with the loss gradient with respect to the output.

\frac{\partial L}{\partial F} = X * \frac{\partial L}{\partial O}

Next, let's compute the loss gradient with respect to the input features X, which can be expressed as:

\frac{\partial L}{\partial X_i} = \sum_{k=1}^{M} \frac{\partial L}{\partial O_k} \frac{\partial O_k}{\partial X_i}

We can expand the above for each of the X values:

\frac{\partial L}{\partial X_{1,1}} = \frac{\partial L}{\partial O_{1,1}} F_{1,1} + \frac{\partial L}{\partial O_{1,2}} F_{1,2} + \frac{\partial L}{\partial O_{2,1}} F_{2,1} + \frac{\partial L}{\partial O_{2,2}} F_{2,2} \\
\frac{\partial L}{\partial X_{1,2}} = \frac{\partial L}{\partial O_{1,2}} F_{1,1} + \frac{\partial L}{\partial O_{1,3}} F_{1,2} + \frac{\partial L}{\partial O_{2,2}} F_{2,1} + \frac{\partial L}{\partial O_{2,3}} F_{2,2} \\
\frac{\partial L}{\partial X_{2,1}} = \frac{\partial L}{\partial O_{2,1}} F_{1,1} + \frac{\partial L}{\partial O_{2,2}} F_{1,2} + \frac{\partial L}{\partial O_{3,1}} F_{2,1} + \frac{\partial L}{\partial O_{3,2}} F_{2,2} \\
\frac{\partial L}{\partial X_{2,2}} = \frac{\partial L}{\partial O_{2,2}} F_{1,1} + \frac{\partial L}{\partial O_{2,3}} F_{1,2} + \frac{\partial L}{\partial O_{3,2}} F_{2,1} + \frac{\partial L}{\partial O_{3,3}} F_{2,2}

The above shows that the partial derivative of the loss function with respect to the input values is also expressed as a convolution of ∂L/∂O with the filter F.

\frac{\partial L}{\partial X} = \frac{\partial L}{\partial O} * F
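Both gradients can be written as valid cross-correlations with ∂L/∂O. Continuing the NumPy sketch from earlier (d_out below is a stand-in for ∂L/∂O, and the names and example values are my own), they look like this:

import numpy as np

def valid_corr2d(a, b):
    # Valid cross-correlation: slide b over a with no padding and sum the products.
    bh, bw = b.shape
    out = np.zeros((a.shape[0] - bh + 1, a.shape[1] - bw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(a[i:i + bh, j:j + bw] * b)
    return out

x = np.array([[1., 2.], [3., 4.]])        # 2x2 input
f = np.array([[1., 5.], [2., 3.]])        # 2x2 kernel
d_out = np.arange(1., 10.).reshape(3, 3)  # stand-in for dL/dO

d_f = valid_corr2d(d_out, x)  # dL/dF: each entry matches the sums derived for F_{1,1} above
d_x = valid_corr2d(d_out, f)  # dL/dX: each entry matches the sums derived for X_{1,1} above
print(d_f)
print(d_x)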

We can observe that the backward pass (with respect to the input) and forward pass of convolution and transposed convolution are flipped. (Convolution uses transposed convolution for the backward pass, while transposed convolution uses convolution.) This is why transposed convolution is often referred to as the opposite of convolution or deconvolution.

Code Implementation

Now that we have a way to expand dimensions with transposed convolution, we can build architectures like autoencoders and GANs entirely from convolutional layers. In this article, we will build a DCGAN (Deep Convolutional Generative Adversarial Network) on MNIST. Since we have already covered steps 1 and 2 (data exploration and preprocessing) in the previous article, we will dive straight into model construction.

Note: DCGAN's generator uses the tanh activation function at its output, whose range is -1 to 1. Therefore, we will normalize the data to that range using the following function instead of z-score standardization.

def min_max(x, axis=None):
    # Scale x linearly to the range [-1, 1] to match tanh's output range.
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    return 2 * (x - x_min) / (x_max - x_min) - 1
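For example, assuming x_train holds the MNIST images loaded in the previous article (the variable name is an assumption), the normalization is applied like this:

x_train = min_max(x_train.astype("float32"))  # pixel values now lie in [-1, 1]
print(x_train.min(), x_train.max())           # -1.0 1.0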

Step 3. Model

Below is an example implementation of DCGAN in TensorFlow and PyTorch.
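As a rough outline of what the PyTorch side can look like, the sketch below shows one possible generator and discriminator for 28x28 MNIST images. The latent dimension of 100, the layer widths, and the class names are my own choices rather than the exact code used for the results below.

import torch
import torch.nn as nn

latent_dim = 100  # assumed latent size

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128 * 7 * 7),
            nn.BatchNorm1d(128 * 7 * 7),
            nn.ReLU(),
            nn.Unflatten(1, (128, 7, 7)),
            # (7 - 1) * 2 - 2 * 1 + 4 = 14
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            # (14 - 1) * 2 - 2 * 1 + 4 = 28
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # outputs in [-1, 1], matching the min-max normalization
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1),    # 28 -> 14
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 14 -> 7
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 1),  # real/fake logit (pair with BCEWithLogitsLoss)
        )

    def forward(self, x):
        return self.net(x)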

You might notice that training takes significantly more time despite using convolutions. This is likely because we use a larger latent dimension that is projected into a much larger feature map. (We also introduced batch normalization.) Therefore, it is not fair to compare the results directly with the GAN we built last time. To speed up the training, I recommend using a GPU.

Step 4. Model Evaluation

After training the model, we can take the generator from the DCGAN and pass in noise of the appropriate size to generate new images. Let’s see how the images look after training a PyTorch implementation of DCGAN for 30 epochs.
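Continuing with the sketch above (Generator and latent_dim are the names assumed there), generating a batch of images could look like this:

import torch

generator = Generator()  # in practice, the trained generator from the DCGAN
generator.eval()         # switch batch normalization to inference mode
with torch.no_grad():
    noise = torch.randn(16, latent_dim)   # 16 random latent vectors
    fake_images = generator(noise)        # shape (16, 1, 28, 28), values in [-1, 1]
    fake_images = (fake_images + 1) / 2   # rescale to [0, 1] for plotting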

[Figure: Digits generated by the DCGAN after 30 epochs of training]

Even after training for only 30 epochs, we can already see much clearer images, resembling handwritten digits. As a challenge, you could build a GAN with a different architecture.

Tip for Dimension Calculation

If you're not familiar with transposed convolutional layers, you might be confused about the output dimensions of the layers when certain kernel sizes, strides, and padding are used. In such cases, you can use the following formula to determine the output dimension:

D_{out} = (D_{in} - 1)s - 2p + k

Here, D_{out} is the dimension of the output after the transposed convolution, D_{in} is the input dimension before the transposed convolution, p is the padding, k is the kernel size, and s is the stride.
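For example, the first transposed convolutional layer in the generator sketch above upsamples a 7x7 feature map with k = 4, s = 2, and p = 1:

D_{out} = (7 - 1) \cdot 2 - 2 \cdot 1 + 4 = 14

Applying the same configuration once more takes 14 to 28, which is how two such layers recover the full 28x28 MNIST resolution.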

Conclusion

In the past two articles, we discussed two new layers: the convolutional layer and the transposed convolutional layer, which can be used to create predictors, classifiers, feature extractors, and generative models. While these models have improved significantly in efficiency and quality, they still require a GPU to train at a reasonable speed, even on small images. When we want to train on large images, we still need various techniques and an unimaginable level of hardware resources. Therefore, in the next article, I will discuss how to take advantage of what others have done to train large models.

Resources