Convolutional layer

Convolutional layers are some of the primary building blocks of convolutional neural networks (CNNs), a class of neural network most commonly applied to images, video, audio, and other data that have the property of uniform translational symmetry.[1]

The convolution operation in a convolutional layer involves sliding a small window (called a kernel or filter) across the input data and computing the dot product between the values in the kernel and the input at each position.[2]

Kernels, also known as filters, are small matrices of weights that are learned during the training process.
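
The following is a minimal NumPy sketch of this sliding-window dot product, written with plain Python loops rather than the optimized routines that deep-learning libraries actually use; the 3x3 kernel values below are arbitrary, whereas in a trained network they would be learned.

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and take a dot product at each position
    # (cross-correlation, which is what deep-learning libraries call "convolution").
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))    # no padding, stride 1
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = image[i:i + kh, j:j + kw]    # receptive field at this position
            out[i, j] = np.sum(window * kernel)   # elementwise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.ones((3, 3)) / 9.0                    # arbitrary values; learned in practice
print(conv2d(image, kernel))                      # 3x3 feature map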

Commonly used convolutions are 1D (for audio and text), 2D (for images), and 3D (for volumetric data and video).
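
For illustration, a sketch of the three cases using PyTorch layer classes; the channel counts, kernel sizes, and input shapes below are arbitrary placeholders.

import torch
import torch.nn as nn

conv1d = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=5)  # audio/text: (batch, channels, length)
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)  # images: (batch, channels, height, width)
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3)  # volumes/video: (batch, channels, depth, height, width)

print(conv1d(torch.randn(1, 1, 100)).shape)        # torch.Size([1, 8, 96])
print(conv2d(torch.randn(1, 3, 32, 32)).shape)     # torch.Size([1, 8, 30, 30])
print(conv3d(torch.randn(1, 3, 8, 32, 32)).shape)  # torch.Size([1, 8, 6, 30, 30])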

Padding involves adding extra pixels around the edges of the input data.

It serves two main purposes: preserving the spatial size of the output, which would otherwise shrink with every convolution, and letting pixels near the border contribute to as many output positions as pixels in the interior. Common padding strategies include valid (no padding), same, and full padding; common padding algorithms include zero padding, reflection (mirror) padding, and circular padding. The exact output sizes produced by different combinations of padding, kernel size, and stride are somewhat involved, for which we refer to (Dumoulin and Visin, 2018)[3] for details.
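
The following PyTorch sketch illustrates how the padding choice affects output size for an arbitrary 32x32 example; in general, for input size n, padding p, kernel size k, and stride s, the output size is floor((n + 2p - k) / s) + 1.

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)                      # one 32x32 single-channel image

valid = nn.Conv2d(1, 1, kernel_size=3, padding=0)  # valid: no padding, output shrinks
same = nn.Conv2d(1, 1, kernel_size=3, padding=1)   # same: zero-pad by 1, size preserved
full = nn.Conv2d(1, 1, kernel_size=3, padding=2)   # full: every overlapping position

print(valid(x).shape)  # torch.Size([1, 1, 30, 30])
print(same(x).shape)   # torch.Size([1, 1, 32, 32])
print(full(x).shape)   # torch.Size([1, 1, 34, 34])

# Other padding algorithms are selected with padding_mode,
# e.g. nn.Conv2d(1, 1, 3, padding=1, padding_mode='reflect') or 'circular'.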

The depthwise separable convolution was first developed by Laurent Sifre during an internship at Google Brain in 2013 as an architectural variation on AlexNet to improve convergence speed and model size.[4]
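
A rough sketch of the idea, assuming the standard factorization of a convolution into a per-channel depthwise convolution followed by a 1x1 pointwise convolution; the channel counts below are arbitrary.

import torch
import torch.nn as nn

c_in, c_out = 32, 64

# Standard convolution: each output channel mixes all input channels spatially.
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)

# Depthwise separable convolution: a per-channel spatial (depthwise) convolution,
# then a 1x1 (pointwise) convolution that mixes channels.
depthwise = nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in)
pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

x = torch.randn(1, c_in, 16, 16)
print(standard(x).shape)              # torch.Size([1, 64, 16, 16])
print(pointwise(depthwise(x)).shape)  # torch.Size([1, 64, 16, 16])

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                      # 18496 parameters
print(count(depthwise) + count(pointwise))  # 2432 parameters, far fewer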

The concept of convolution in neural networks was inspired by the visual cortex in biological brains.

Early work by Hubel and Wiesel in the 1960s on the cat's visual system laid the groundwork for artificial convolution networks.[9][10]

In 1998, Yann LeCun et al. introduced LeNet-5, an early influential CNN architecture for handwritten digit recognition, trained on the MNIST dataset.[11]

(Olshausen & Field, 1996)[12] discovered that simple cells in the mammalian primary visual cortex implement localized, oriented, bandpass receptive fields, which could be recreated by fitting sparse linear codes for natural scenes.[13]: Fig 3

The field saw a resurgence in the 2010s with the development of deeper architectures and the availability of large datasets and powerful GPUs.

The success of AlexNet, developed by Alex Krizhevsky et al. in 2012, was a catalytic event in modern deep learning.