Residual neural network

A residual neural network (also referred to as a residual network or ResNet) is a deep learning architecture in which the layers learn residual functions with reference to the layer inputs. It was developed in 2015 for image recognition and won the ImageNet Large Scale Visual Recognition Challenge of that year.[2][3] As a point of terminology, "residual connection" refers to the specific architectural motif of $x \mapsto f(x) + x$, where $f$ is an arbitrary neural network module.

The residual connection stabilizes the training and convergence of deep neural networks with hundreds of layers, and is a common motif in deep neural networks, such as transformer models (e.g., BERT, and GPT models such as ChatGPT), the AlphaGo Zero system, the AlphaStar system, and the AlphaFold system.

In a multilayer neural network model, consider a subnetwork with a certain number of stacked layers (e.g., 2 or 3). Denote the underlying function performed by this subnetwork as $H(x)$, where $x$ is the input to the subnetwork. Residual learning re-parameterizes this subnetwork and lets the parameter layers represent a "residual function" $F(x) = H(x) - x$. The output $y$ of this subnetwork is then represented as:

$$y = F(x) + x$$

The operation of "$+\,x$" is implemented via a "skip connection" that performs an identity mapping to connect the input of the subnetwork with its output. This connection is referred to as a "residual connection" in later work.

[1] A deep residual network is constructed by simply stacking these blocks.
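
For illustration, the following is a minimal PyTorch sketch of a residual block computing $y = F(x) + x$ and of a deep network built by stacking such blocks. The names (ResidualBlock, hidden_dim, num_blocks) are illustrative, and a fully connected residual function is used for brevity rather than the convolutional layers of the original ResNet.

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Computes y = F(x) + x, where F is a small stack of parameterized layers."""
    def __init__(self, dim: int):
        super().__init__()
        # F: two linear layers with a nonlinearity in between (illustrative choice)
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.f(x) + x  # residual connection: identity skip plus residual function

# A deep residual network is obtained by simply stacking residual blocks.
hidden_dim, num_blocks = 64, 50
net = nn.Sequential(*[ResidualBlock(hidden_dim) for _ in range(num_blocks)])
y = net(torch.randn(8, hidden_dim))  # forward pass through 50 stacked blocks
```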

An LSTM with a forget gate essentially functions as a highway network.

To stabilize the variance of the layers' inputs, it is recommended to replace the residual connections $x + f(x)$ with $x/L + f(x)$, where $L$ is the total number of residual layers.
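
A minimal sketch of this scaled variant, assuming the $x/L$ form given above; the class name ScaledResidualBlock and the choice of residual function are illustrative.

```python
import torch
from torch import nn

class ScaledResidualBlock(nn.Module):
    """Residual block with the skip branch scaled by 1/L, where L is the total number of residual layers."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.scale = 1.0 / num_layers  # 1/L

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.scale + self.f(x)  # x/L + f(x)
```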

The introduction of identity mappings facilitates signal propagation in both forward and backward paths.

If the output of the $\ell$-th residual block is the input to the $(\ell+1)$-th residual block (assuming no activation function between blocks), then the $(\ell+1)$-th input is:

$$x_{\ell+1} = F(x_\ell) + x_\ell$$

Applying this formulation recursively, e.g.:

$$x_{\ell+2} = F(x_{\ell+1}) + x_{\ell+1} = F(x_{\ell+1}) + F(x_\ell) + x_\ell$$

yields the general relationship:

$$x_L = x_\ell + \sum_{i=\ell}^{L-1} F(x_i)$$

where $L$ is the index of a later residual block and $\ell$ is the index of an earlier one.
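
A short numerical check of this telescoped relationship, using arbitrary fixed residual functions (the specific choice of $F_i$ below is illustrative):

```python
import torch

torch.manual_seed(0)
num_blocks, dim = 6, 4
# Arbitrary residual functions F_i (here: fixed random linear maps followed by tanh)
weights = [torch.randn(dim, dim) for _ in range(num_blocks)]
F = [lambda x, W=W: torch.tanh(x @ W) for W in weights]

x = torch.randn(dim)
outputs = [x]
for i in range(num_blocks):
    outputs.append(F[i](outputs[-1]) + outputs[-1])   # x_{i+1} = F(x_i) + x_i

# x_L should equal x_0 plus the sum of all residual branch outputs
x_L = outputs[-1]
telescoped = outputs[0] + sum(F[i](outputs[i]) for i in range(num_blocks))
print(torch.allclose(x_L, telescoped))  # True
```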

The residual learning formulation provides the added benefit of mitigating the vanishing gradient problem to some extent.

However, the vanishing gradient issue is not the root cause of the degradation problem, since it is already addressed through the use of normalization layers.

To observe the effect of residual blocks on backpropagation, consider the partial derivative of a loss function $E$ with respect to some residual block input $x_\ell$. Using the equation above:[6]

$$\frac{\partial E}{\partial x_\ell} = \frac{\partial E}{\partial x_L}\frac{\partial x_L}{\partial x_\ell} = \frac{\partial E}{\partial x_L}\left(1 + \frac{\partial}{\partial x_\ell}\sum_{i=\ell}^{L-1} F(x_i)\right) = \frac{\partial E}{\partial x_L} + \frac{\partial E}{\partial x_L}\frac{\partial}{\partial x_\ell}\sum_{i=\ell}^{L-1} F(x_i)$$

This formulation suggests that the gradient computation of a shallower layer, $\frac{\partial E}{\partial x_\ell}$, always has a later term $\frac{\partial E}{\partial x_L}$ that is directly added. Even if the gradients of the $F(x_i)$ terms are small, the total gradient $\frac{\partial E}{\partial x_\ell}$ does not vanish thanks to the added term $\frac{\partial E}{\partial x_L}$.
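
A small PyTorch autograd sketch illustrating this effect: with a residual connection, the gradient at the block input retains the directly added upstream term even when the residual branch contributes almost nothing (the 1e-3 scaling factor is purely for illustration).

```python
import torch

def residual_block(x):
    # Residual branch deliberately made near-zero so its gradient contribution is tiny
    return 1e-3 * torch.tanh(x) + x

def plain_block(x):
    # Same branch without the identity skip
    return 1e-3 * torch.tanh(x)

x = torch.randn(5, requires_grad=True)
y = x
for _ in range(20):          # 20 stacked blocks
    y = residual_block(y)
y.sum().backward()
print(x.grad.norm())         # stays close to sqrt(5): the identity path carries the gradient

x2 = torch.randn(5, requires_grad=True)
y2 = x2
for _ in range(20):
    y2 = plain_block(y2)
y2.sum().backward()
print(x2.grad.norm())        # collapses toward zero: the gradient vanishes without the skip
```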

The basic block is the simplest building block studied in the original ResNet.[1] It consists of two sequential 3x3 convolutional layers and a residual connection.
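
A minimal PyTorch sketch of such a basic block, assuming the common convolution-BatchNorm-ReLU ordering of ResNet-style implementations; the class name BasicBlock and the fixed channel count are illustrative.

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus an identity residual connection (same shape in and out)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))   # first 3x3 convolution
        out = self.bn2(self.conv2(out))            # second 3x3 convolution
        return self.relu(out + x)                  # add the skip, then the final activation

block = BasicBlock(64)
y = block(torch.randn(1, 64, 56, 56))  # output shape matches the input: (1, 64, 56, 56)
```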

A bottleneck block[1] consists of three sequential convolutional layers and a residual connection.
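
A corresponding sketch of a bottleneck block, following the 1x1 (reduce), 3x3, 1x1 (restore) layout described in the figure caption below; the class name Bottleneck and the 4x reduction factor are illustrative choices.

```python
import torch
from torch import nn

class Bottleneck(nn.Module):
    """1x1 conv (dimension reduction), 3x3 conv, 1x1 conv (dimension restoration), plus a skip."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)   # reduce channels
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, bias=False)   # restore channels
        self.bn1, self.bn2, self.bn3 = nn.BatchNorm2d(mid), nn.BatchNorm2d(mid), nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + x)  # identity skip; a 1x1 projection on x would be needed if shapes differed

y = Bottleneck(256)(torch.randn(1, 256, 28, 28))  # shape preserved: (1, 256, 28, 28)
```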

The pre-activation residual block applies the activation and normalization functions before the weight layers of the residual function, so that the block computes $x_{\ell+1} = F(\phi(x_\ell)) + x_\ell$, where $\phi$ denotes the activation and normalization. This design reduces the number of non-identity mappings between residual blocks.
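
A sketch of a basic block under this pre-activation ordering (normalization and ReLU before each convolution, and no activation after the addition); the class name PreActBlock is illustrative.

```python
import torch
from torch import nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: norm and ReLU come before each convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))   # phi(x) = ReLU(BN(x)), then convolution
        out = self.conv2(self.relu(self.bn2(out)))
        return out + x                             # the addition is a pure identity mapping: no post-activation
```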

[10] The original ResNet paper made no claim of being inspired by biological systems.

[11][12] A study published in Science in 2023[13] disclosed the complete connectome of an insect brain (specifically that of a fruit fly larva).

This study discovered "multilayer shortcuts" that resemble the skip connections in artificial neural networks, including ResNets.

[14]: Fig 3  McCulloch and Pitts (1943) proposed artificial neural networks and considered those with residual connections.

[15]: Fig 1.h In 1961, Frank Rosenblatt described a three-layer multilayer perceptron (MLP) model with skip connections.

They termed it a "short-cut connection".

Sepp Hochreiter discovered the vanishing gradient problem in 1991[20] and argued that it explained why the then-prevalent forms of recurrent neural networks did not work for long sequences.

He and Schmidhuber later designed the LSTM architecture to solve this problem,[4][21] which has a "cell state" $c_t$ that can function as a generalized residual connection.

[24] However, stacking too many layers led to a steep reduction in training accuracy,[25] known as the "degradation" problem.

[1] In theory, adding layers to deepen a network should not result in a higher training loss, but this is what happened with VGGNet.

[1] If the extra layers can be set as identity mappings, however, then the deeper network would represent the same function as its shallower counterpart.

[6] In 2014, the state of the art was training deep neural networks with 20 to 30 layers.

[24] The ResNet research team attempted to train deeper networks by empirically testing various training methods, until they came upon the ResNet architecture.

The stochastic depth method randomly drops a subset of residual blocks during training, letting the signal propagate through their identity skip connections. Also known as DropPath, this regularizes training for deep models, such as vision transformers.
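
A minimal sketch of this idea, assuming a per-sample keep/drop decision applied to the residual branch during training; the function name drop_path and the drop probability are illustrative.

```python
import torch

def drop_path(residual: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Randomly zero the residual branch per sample; scale survivors to keep the expectation unchanged."""
    if not training or drop_prob == 0.0:
        return residual
    keep_prob = 1.0 - drop_prob
    # one keep/drop decision per sample in the batch, broadcast over the remaining dimensions
    mask = torch.rand(residual.shape[0], *([1] * (residual.dim() - 1)), device=residual.device) < keep_prob
    return residual * mask / keep_prob

# Inside a residual block's forward pass, the skip path is always kept:
#   return x + drop_path(self.f(x), drop_prob=0.1, training=self.training)
```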

Squeeze-and-excitation networks add squeeze-and-excitation (SE) modules to ResNet.[31] An SE module is applied after a convolution and takes a tensor of shape (H, W, C) (height, width, channels) as input. Each channel is averaged over its spatial dimensions by global average pooling, producing a vector of shape (C,).

This is then passed through a multilayer perceptron (with an architecture such as linear-ReLU-linear-sigmoid) before it is multiplied with the original tensor.
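
A compact PyTorch sketch of such an SE module (channels-first layout, with an illustrative reduction factor of 16 for the hidden layer of the MLP):

```python
import torch
from torch import nn

class SEModule(nn.Module):
    """Squeeze (global average pooling) then excite (per-channel gating via a small MLP)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); squeeze each channel to a single number
        squeezed = x.mean(dim=(2, 3))              # (N, C)
        weights = self.mlp(squeezed)               # per-channel gates in (0, 1)
        return x * weights[:, :, None, None]       # rescale the original tensor channel-wise

y = SEModule(64)(torch.randn(2, 64, 32, 32))  # same shape as the input
```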

A residual block in a deep residual network. Here, the residual connection skips two layers.
Two variants of convolutional residual blocks.[1] Left: a basic block that has two 3x3 convolutional layers. Right: a bottleneck block that has a 1x1 convolutional layer for dimension reduction, a 3x3 convolutional layer, and another 1x1 convolutional layer for dimension restoration.
Block diagram of ResNet (2015). It shows a ResNet block with and without the 1x1 convolution. The 1x1 convolution (with stride) can be used to change the shape of the array, which is necessary for a residual connection through an upsampling/downsampling layer.
The original ResNet-18 architecture. Up to 152 layers were trained in the original publication (as "ResNet-152").[8]
The Transformer architecture includes residual connections.
The long short-term memory (LSTM) cell can process data sequentially and keep its hidden state through time. The cell state can function as a generalized residual connection.
Standard (left) and unfolded (right) basic recurrent neural network.
ResNeXt block diagram.