CNNs

Outline: Using PyTorch for CNNs and Kernel Math

Convolution Layers in PyTorch

Unlike linear layers, convolution layers require more parameters to initialize. The first two arguments are `in_channels` and `out_channels`. From a data perspective, a channel is one stream of data: a single time series of sound data has 1 channel, while a color image is a 3-channel input for a 2D convolution.

This is analogous to the in_neurons and out_neurons of a linear layer, and the counts must be kept consistent throughout your network; e.g., a convolution layer with an `out_channels` of 3 must be fed into a convolution layer with an `in_channels` of 3.
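As a minimal sketch of this (the channel counts of 8 and 16 and the 32x32 input are arbitrary), stacking two layers only works when the first layer’s `out_channels` matches the second layer’s `in_channels`:

import torch
import torch.nn as nn

# Hypothetical three-channel input, e.g., a small batch of color images
x = torch.randn(1, 3, 32, 32)

# The first layer's out_channels (8) must match the second layer's in_channels (8)
conv_a = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
conv_b = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3)

y = conv_b(conv_a(x))
print(y.size())  # torch.Size([1, 16, 28, 28])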

The next input is the kernel size. This is how much of the input the layer “sees” at any given time. For higher-dimensional kernels, you can give each dimension a different size, but normally they are all the same.
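For example, a 2D layer can take a single integer for a square kernel or a tuple for a rectangular one (the sizes below are arbitrary):

import torch.nn as nn

square_kernel = nn.Conv2d(3, 5, kernel_size=3)       # a 3x3 kernel
rect_kernel = nn.Conv2d(3, 5, kernel_size=(3, 5))    # 3 tall, 5 wide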

Next is stride, which is how “fast” the kernel moves across the data: a longer stride is analogous to a passing glance, whereas a shorter one is like an in-depth search.
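A quick sketch of that difference, using an arbitrary 1D input of length 50 and a kernel size of 3:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 50)

fast = nn.Conv1d(3, 5, kernel_size=3, stride=5)  # big steps, few outputs
slow = nn.Conv1d(3, 5, kernel_size=3, stride=1)  # small steps, many outputs

print(fast(x).size())  # torch.Size([1, 5, 10])
print(slow(x).size())  # torch.Size([1, 5, 48])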

The final main parameter is padding, which is how much non-impactful data (usually zeros) is added to the edges of your input. It is used for two reasons: first, it lets you emphasize the edges of your data more; second, it lets you adjust the number of features leaving a convolution layer.
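For instance (again with arbitrary sizes), a kernel size of 3 with a padding of 1 and a stride of 1 keeps the output length identical to the input length:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 50)

no_pad = nn.Conv1d(3, 5, kernel_size=3, stride=1, padding=0)
same_pad = nn.Conv1d(3, 5, kernel_size=3, stride=1, padding=1)

print(no_pad(x).size())    # torch.Size([1, 5, 48]) - edges shaved off
print(same_pad(x).size())  # torch.Size([1, 5, 50]) - length preserved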

A nice animation that shows these is found here.

import torch
import torch.nn as nn

# Dimensions of the example inputs
batches: int = 2
channels: int = 3
length_1d: int = 50
height_2d: int = 40
width_2d: int = 40

# Random data shaped (batch, channels, length) for 1D and (batch, channels, height, width) for 2D
input_1d = torch.randn(batches, channels, length_1d)
input_2d = torch.randn(batches, channels, height_2d, width_2d)

# Convolution layer parameters
channels_out: int = 5
kernel_size: int = 3
stride: int = 5
padding: int = 3
conv1d = nn.Conv1d(channels, channels_out, kernel_size, stride, padding)
conv2d = nn.Conv2d(channels, channels_out, kernel_size, stride, padding)

print('Size of 1D Output:')
print(conv1d(input_1d).size())

print('Size of 2D Output:')
print(conv2d(input_2d).size())

Size of 1D Output:
torch.Size([2, 5, 11])
Size of 2D Output:
torch.Size([2, 5, 9, 9])

As you can see, the output channels are hard coded into the convolution layer, but the size of the data dimensions has changed. This is where convolution arithmetic comes into play.

\(L_{out} = \left\lfloor\frac{L_{in} + 2 \times \text{padding} - \text{kernel size}}{\text{stride}}\right\rfloor + 1\)
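The same arithmetic can be written as a small helper function (a sketch, not anything built into PyTorch) and checked against the sizes printed above:

def conv_out_length(l_in: int, kernel_size: int, stride: int, padding: int) -> int:
    """Length of one spatial dimension after a convolution (dilation of 1 assumed)."""
    return (l_in + 2 * padding - kernel_size) // stride + 1

print(conv_out_length(50, kernel_size=3, stride=5, padding=3))  # 11, matches the 1D output
print(conv_out_length(40, kernel_size=3, stride=5, padding=3))  # 9, matches each 2D dimension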

While this changing size doesn’t matter from a network perspective (as long as you are only using convolution layers), it has two main implications. The first is that you don’t want to shrink your data down to nothing. The second is that you usually want to transition to a linear layer, and you need to know how many features are being fed into it to construct it properly. One way to handle this is to calculate the size for every layer; the other is to use the equation to force a known behavior. For example, a kernel size of 3, a padding of 1, and a stride of 2 will always halve the length of your input, making the math easy.
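Here is a sketch of that second approach (the channel counts of 8 and 16 and the 10-class output are made up): every convolution uses a kernel size of 3, a padding of 1, and a stride of 2, so the 40x40 input halves to 20x20 and then to 10x10, which makes the number of features entering the linear layer easy to compute by hand.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),   # 40x40 -> 20x20
    nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # 20x20 -> 10x10
    nn.Flatten(),                                          # 16 * 10 * 10 = 1600 features
    nn.Linear(16 * 10 * 10, 10),
)

x = torch.randn(2, 3, 40, 40)
print(model(x).size())  # torch.Size([2, 10])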