[1][2][3] It uses skip connections modulated by learned gating mechanisms to regulate information flow, inspired by long short-term memory (LSTM) recurrent neural networks.
[4][5] The advantage of the Highway Network over other deep learning architectures is its ability to overcome or mitigate the vanishing gradient problem,[6] thus improving optimization.
[1][2] Highway Networks have found use in text sequence labeling and speech recognition tasks.
[7][8] In 2014, the state of the art was training deep neural networks with 20 to 30 layers.
[9] Stacking too many layers led to a steep reduction in training accuracy,[10] known as the "degradation" problem.
In addition to the layer transformation H(x, W_H), the model has two gates: the transform gate T(x, W_T) and the carry gate C(x, W_C). The latter two gates are non-linear transfer functions (specifically sigmoid by convention), while H can be any desired transfer function; the carry gate is conventionally defined as C(x, W_C) = 1 − T(x, W_T).
The structure of a hidden layer in the Highway Network follows the equation:

y = H(x, W_H) · T(x, W_T) + x · C(x, W_C)
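A minimal sketch of such a layer in PyTorch, assuming the common formulation in which the carry gate is tied to the transform gate (C = 1 − T) and H is a ReLU-activated affine transform; the class name, sizes, and bias value are illustrative:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""

    def __init__(self, dim: int, gate_bias: float = -1.0):
        super().__init__()
        self.H = nn.Linear(dim, dim)   # affine transform feeding H
        self.T = nn.Linear(dim, dim)   # transform gate (sigmoid)
        # A negative gate bias keeps the transform gate mostly closed at the
        # start, so the layer initially behaves close to an identity mapping.
        nn.init.constant_(self.T.bias, gate_bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.H(x))      # H: here a ReLU-activated affine map
        t = torch.sigmoid(self.T(x))   # transform gate in (0, 1)
        return h * t + x * (1.0 - t)   # carry gate tied as C = 1 - T

# Example: stack ten highway layers on 64-dimensional inputs.
x = torch.randn(8, 64)
net = nn.Sequential(*[HighwayLayer(64) for _ in range(10)])
y = net(x)   # shape (8, 64)
```

Initializing the transform-gate bias to a negative value keeps the carry path open early in training, mirroring the positively biased forget gates discussed below.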
Sepp Hochreiter analyzed the vanishing gradient problem in 1991 and identified it as a key reason why deep learning did not work well.
[6] To overcome this problem, Long Short-Term Memory (LSTM) recurrent neural networks[4] have residual connections with a weight of 1.0 in every LSTM cell (called the constant error carrousel) to compute y_{t+1} = y_t + f(x_t). This enables training very deep recurrent neural networks over a very long time span t. A later LSTM version published in 2000[5] modulates these identity connections with so-called "forget gates", so that their weights are not fixed to 1.0 but can be learned.
In experiments, the forget gates were initialized with positive bias weights,[5] so that they started out open, which addresses the vanishing gradient problem.
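A minimal sketch of this kind of initialization in PyTorch; the bias value of 1.0 and the layer sizes are illustrative, and the slicing relies on PyTorch packing the LSTM gate parameters in the order (input, forget, cell, output):

```python
import torch
import torch.nn as nn

# Gated cell-state update of an LSTM with forget gates:
#   c_t = f_t * c_{t-1} + i_t * g_t
# With f_t close to 1, the cell state is carried through almost unchanged,
# recovering the constant error carrousel c_t = c_{t-1} + i_t * g_t.
lstm = nn.LSTM(input_size=32, hidden_size=64, num_layers=1)

# Initialize the forget-gate biases to a positive value so the gates start
# out open. The forget-gate slice is the second quarter of each bias vector.
hidden = lstm.hidden_size
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:
            param[hidden:2 * hidden].fill_(1.0)

x = torch.randn(20, 8, 32)          # (time steps, batch, features)
output, (h_n, c_n) = lstm(x)
```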
The Highway Network was reported to be "the first very deep feedforward network with hundreds of layers".
Networks with 50 or 100 layers had lower training error than their plain-network counterparts, but no lower training error than their 20-layer counterparts (on the MNIST dataset; Figure 1 in [16]).
No improvement in test accuracy was reported for networks deeper than 19 layers (on the CIFAR-10 dataset; Table 1 in [16]).
The ResNet paper,[17] however, provided strong experimental evidence of the benefits of going deeper than 20 layers.
It argued that unmodulated identity mappings are crucial, noting that any modulation of the skip connection can still lead to vanishing signals in forward and backward propagation (Section 3 in [17]).
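A sketch of this argument, with notation introduced here for illustration: suppose the skip connection of layer l is scaled by a gate value λ_l instead of being a pure identity, so that

x_{l+1} = λ_l · x_l + F(x_l).

Unrolling this recursion from layer l to a much deeper layer L shows that the contribution of x_l carried along the skip path is multiplied by the product λ_l · λ_{l+1} · … · λ_{L−1}, and the same product appears as a factor in the corresponding gradient term during backpropagation. If the gate values stay below 1, this product shrinks exponentially with depth (and grows exponentially if they stay above 1), whereas pure identity skips keep it exactly equal to 1 at any depth.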
Similarly, a Highway Net whose gates are opened through strongly positive bias weights behaves like a ResNet.
The skip connections used in modern neural networks (e.g., Transformers) are predominantly identity mappings.
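For example, here is a minimal sketch of a pre-norm Transformer block in PyTorch (the class name and sizes are illustrative): the input is added back to each sublayer's output unchanged, with no gate or learned scaling on the identity path.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer-style block whose skip connections are plain identity mappings."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # Identity skip: x is added back unmodulated (no gate, weight fixed at 1).
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x

x = torch.randn(2, 16, 64)   # (batch, sequence length, features)
y = PreNormBlock(64)(x)      # same shape as x
```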