In machine learning, the vanishing gradient problem is the problem of greatly diverging gradient magnitudes between earlier and later layers encountered when training neural networks with backpropagation.
In such methods, neural network weights are updated in proportion to the partial derivative of the loss function with respect to each weight.
[1] As the number of forward propagation steps in a network increases, for instance due to greater network depth, the gradients of earlier weights are calculated with increasingly many multiplications.
This difference in gradient magnitude might introduce instability in the training process, slow it, or halt it entirely.
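For intuition, here is a minimal numerical sketch, assuming a purely illustrative chain of scalar sigmoid units, of how the chain rule turns depth into a long product of small factors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative only: a chain of scalar sigmoid units a_k = sigmoid(w_k * a_{k-1}).
# The gradient of the output with respect to an early activation is the product
# of the local derivatives sigmoid'(z_k) * w_k; since |sigmoid'| <= 0.25, the
# product shrinks rapidly unless the weights are large.
rng = np.random.default_rng(0)
depth = 30
weights = rng.normal(0.0, 1.0, size=depth)

a = 0.5        # input activation
grad = 1.0     # derivative of the output with respect to itself
for w in weights:
    z = w * a
    a = sigmoid(z)
    grad *= a * (1.0 - a) * w   # local derivative sigmoid'(z) * w

print(f"gradient w.r.t. the input after {depth} layers: {grad:.3e}")
```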
Backpropagation allowed researchers to train supervised deep artificial neural networks from scratch, initially with little success.
[5][6] Recurrent networks are also affected; they are trained by unfolding them into very deep feedforward networks, where a new layer is created for each time step of an input sequence processed by the network (the combination of unfolding and backpropagation is termed backpropagation through time).
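As a hedged illustration of that unfolding, the sketch below unrolls a toy recurrence with PyTorch autograd so that every time step becomes one layer of the backward pass; the weights, input sequence, and loss are made-up choices, not taken from the sources cited here:

```python
import torch

# Unrolled recurrence: one "layer" per time step, as in backpropagation through time.
T, d = 50, 8
W_rec = (0.5 * torch.randn(d, d)).requires_grad_()
W_in  = (0.5 * torch.randn(d, d)).requires_grad_()
u = torch.randn(T, d)            # illustrative input sequence
h = torch.zeros(d)               # initial hidden state

for t in range(T):               # unfolding: each iteration is one layer of the deep net
    h = torch.tanh(h @ W_rec + u[t] @ W_in)

loss = h.sum()                   # toy loss on the final state
loss.backward()                  # backpropagation through time
print(W_rec.grad.norm())         # gradient has flowed through all T unrolled steps
```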
This section is based on the paper On the difficulty of training Recurrent Neural Networks by Pascanu, Mikolov, and Bengio.
Training the network requires us to define a loss function to be minimized.
For a concrete example, consider the recurrent network $x_{t+1} = W_{\text{rec}}\,\sigma(x_t) + W_{\text{in}} u_{t+1} + b$, where $\sigma$ is the sigmoid activation function[note 2], applied to each vector coordinate separately, and $b$ is the bias vector.
The one-step Jacobian is $\frac{\partial x_{t+1}}{\partial x_t} = W_{\text{rec}}\operatorname{diag}(\sigma'(x_t))$; since $|\sigma'| \le 1/4$, the product of many such Jacobians shrinks exponentially unless $W_{\text{rec}}$ is large enough to compensate, which is the vanishing gradient problem in this model.[note 3] For the prototypical exploding gradient problem, the next model is clearer.
Viewing the recurrent network as a dynamical system, if the state puts the system very close to an unstable equilibrium, then a tiny variation in the initial state or in the parameters can push the trajectory into a different attractor, so the gradient explodes; deep inside a basin of attraction, the same tiny variations are forgotten and the gradient vanishes.
Only near the boundary between basins is the gradient well behaved. Indeed, it is the only well-behaved gradient, which explains why early research focused on learning or designing recurrent systems that could perform long-range computations (such as outputting the first input seen at the very end of an episode) by shaping their stable attractors.
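The following sketch, with illustrative sizes and weight scales rather than values from the paper, computes the product of one-step Jacobians for the sigmoid recurrence above and shows its norm decaying with the number of time steps:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def jacobian_product_norm(steps, d=20, seed=0):
    """Norm of the product of one-step Jacobians W_rec @ diag(sigmoid'(x_t))
    for the recurrence x_{t+1} = W_rec @ sigmoid(x_t) + b (illustrative values)."""
    rng = np.random.default_rng(seed)
    W_rec = rng.standard_normal((d, d)) / np.sqrt(d)
    b = 0.1 * rng.standard_normal(d)
    x = rng.standard_normal(d)
    J = np.eye(d)
    for _ in range(steps):
        s = sigmoid(x)
        J = W_rec @ np.diag(s * (1.0 - s)) @ J   # chain rule across one time step
        x = W_rec @ s + b
    return np.linalg.norm(J)

# Because |sigmoid'| <= 1/4, each factor's norm is well below 1 here, so the
# product's norm decays roughly exponentially with the number of time steps.
for steps in (1, 10, 25, 50):
    print(steps, jacobian_product_norm(steps))
```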
Batch normalization is a standard method for solving both the exploding and the vanishing gradient problems.
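A minimal sketch of one batch-normalization layer, assuming training-time batch statistics only and illustrative shapes:

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Normalize pre-activations z of shape (batch, features) to zero mean and
    unit variance per feature, then rescale and shift with learnable gamma, beta."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    z_hat = (z - mean) / np.sqrt(var + eps)
    return gamma * z_hat + beta

# Keeping each layer's pre-activations in a fixed range keeps the activation
# function away from its saturated, near-zero-derivative regions.
z = np.random.default_rng(0).normal(5.0, 3.0, size=(64, 128))   # badly scaled batch
gamma, beta = np.ones(128), np.zeros(128)
out = batch_norm(z, gamma, beta)
print(out.mean(), out.std())
```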
[10][11] In a multi-level hierarchy of networks (Schmidhuber, 1992), each level is pre-trained in turn through unsupervised learning and the whole stack is then fine-tuned through backpropagation.
Then the network is trained further by supervised backpropagation to classify labeled data.
The deep belief network model by Hinton et al. (2006) involves learning the distribution of a high-level representation using successive layers of binary or real-valued latent variables.
It uses a restricted Boltzmann machine to model each new layer of higher-level features.
Each new layer guarantees an increase in the lower bound of the log-likelihood of the data, thus improving the model if trained properly.
[13] Hinton reports that his models are effective feature extractors over high-dimensional, structured data.
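The sketch below is a hedged illustration of that greedy, layer-wise idea: each restricted Boltzmann machine is trained with one-step contrastive divergence on the representation produced by the level below it. Layer sizes, learning rate, and epoch count are illustrative, and the code is a simplification rather than Hinton's published procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(v, n_hidden, lr=0.1, epochs=10):
    """One-step contrastive divergence (CD-1) for a binary restricted Boltzmann machine."""
    n_visible = v.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)          # visible biases
    b = np.zeros(n_hidden)           # hidden biases
    for _ in range(epochs):
        h_prob = sigmoid(v @ W + b)                        # positive phase
        h_samp = (rng.random(h_prob.shape) < h_prob) * 1.0
        v_recon = sigmoid(h_samp @ W.T + a)                # one Gibbs step back
        h_recon = sigmoid(v_recon @ W + b)
        W += lr * (v.T @ h_prob - v_recon.T @ h_recon) / len(v)
        a += lr * (v - v_recon).mean(axis=0)
        b += lr * (h_prob - h_recon).mean(axis=0)
    return W, b

# Greedy pretraining: each layer's hidden probabilities become the next layer's data.
data = (rng.random((256, 64)) < 0.3) * 1.0
layer_sizes, layers = [32, 16], []
for n_hidden in layer_sizes:
    W, b = train_rbm(data, n_hidden)
    layers.append((W, b))
    data = sigmoid(data @ W + b)     # representation fed to the next level
```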
[14] Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) has increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized.
Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way"[15], since the original models tackling the vanishing gradient problem by Hinton and others were trained on a Xeon processor, not on GPUs.
With residual connections, each block computes $x_{\ell+1} = x_\ell + f_\ell(x_\ell)$, so the gradient of the output with respect to the activations at layer $\ell$ is $\frac{\partial x_L}{\partial x_\ell} = \prod_{k=\ell}^{L-1}\left(I + \frac{\partial f_k(x_k)}{\partial x_k}\right)$, which contains an identity term and therefore does not vanish even in arbitrarily deep networks.
Feedforward networks with residual connections can be regarded as an ensemble of relatively shallow nets.
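A short sketch of a residual block, with illustrative layer sizes and an arbitrary choice for the residual function f:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A generic residual block: output = x + f(x).  The identity path gives the
    backward pass a term that is not scaled by the layer's weights, so the
    gradient reaching earlier layers cannot be driven to zero by f alone."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)        # skip connection

net = nn.Sequential(*[ResidualBlock(32) for _ in range(50)])   # very deep stack
x = torch.randn(8, 32, requires_grad=True)
net(x).sum().backward()
print(x.grad.norm())                # gradient survives 50 blocks
```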
[17] Rectifiers such as ReLU suffer less from the vanishing gradient problem, because they only saturate in one direction.
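A quick numerical comparison, assuming an illustrative Gaussian distribution of pre-activations, of the local derivative each nonlinearity contributes to the chain-rule product:

```python
import numpy as np

z = np.random.default_rng(0).normal(0.0, 2.0, size=100_000)   # illustrative pre-activations
sig = 1.0 / (1.0 + np.exp(-z))
print("mean |sigmoid'|:", np.mean(sig * (1.0 - sig)))      # always <= 0.25, tiny when saturated
print("mean |relu'|   :", np.mean((z > 0).astype(float)))  # exactly 1 on the active half
```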
[18] Weight initialization is another approach that has been proposed to reduce the vanishing gradient problem in deep networks.
Kumar suggested that the distribution of initial weights should vary according to the activation function used, and proposed initializing the weights of networks with the logistic activation function from a Gaussian distribution with zero mean and a standard deviation of 3.6/sqrt(N), where N is the number of neurons in a layer.
[19] Recently, Yilmaz and Poli[20] performed a theoretical analysis of how gradients are affected by the mean of the initial weights in deep neural networks using the logistic activation function, and found that gradients do not vanish if the mean of the initial weights is set according to the formula max(−1, −8/N).
This simple strategy allows networks with 10 or 15 hidden layers to be trained very efficiently and effectively using standard backpropagation.
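The sketch below implements both suggestions as initializers; reading N as the fan-in of the layer is an assumption here, and the standard deviation used with the Yilmaz and Poli mean is an illustrative choice, since only the mean is specified above:

```python
import numpy as np

def kumar_init(n_in, n_out, rng):
    """Zero-mean Gaussian with std 3.6/sqrt(N) for logistic-sigmoid layers,
    following the suggestion attributed to Kumar (N taken as fan-in, an assumption)."""
    return rng.normal(0.0, 3.6 / np.sqrt(n_in), size=(n_in, n_out))

def yilmaz_poli_init(n_in, n_out, rng, std=0.1):
    """Gaussian whose mean is max(-1, -8/N), per the Yilmaz and Poli result cited
    above; the standard deviation here is an illustrative choice."""
    return rng.normal(max(-1.0, -8.0 / n_in), std, size=(n_in, n_out))

rng = np.random.default_rng(0)
W1 = kumar_init(256, 128, rng)
W2 = yilmaz_poli_init(256, 128, rng)
print(W1.std(), W2.mean())
```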
Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid[21] to solve problems like image reconstruction and face localization.
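A simplified Rprop-style update, not Behnke's exact variant, showing how only the sign of the gradient is used while per-weight step sizes adapt:

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One Rprop update: per-weight step sizes grow while the gradient sign is
    stable and shrink when it flips; the gradient magnitude is never used."""
    same_sign = grad * prev_grad > 0
    flipped = grad * prev_grad < 0
    step = np.where(same_sign, np.minimum(step * eta_plus, step_max), step)
    step = np.where(flipped, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(flipped, 0.0, grad)    # skip the update after a sign change
    w = w - np.sign(grad) * step
    return w, grad, step

# Toy quadratic objective to show the update in action (illustrative only).
w = np.array([5.0, -3.0]); prev_grad = np.zeros(2); step = np.full(2, 0.1)
for _ in range(100):
    grad = 2.0 * w                         # gradient of ||w||^2
    w, prev_grad, step = rprop_step(w, grad, prev_grad, step)
print(w)                                   # close to the minimum at 0
```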
Neural networks can also be optimized with a universal search algorithm over the space of the network's weights, for example by random guessing or, more systematically, by a genetic algorithm.
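A minimal sketch of the simplest such search, plain random guessing over weight vectors on an illustrative toy problem:

```python
import numpy as np

def random_search(loss, dim, trials=10_000, scale=1.0, seed=0):
    """Sample random weight vectors and keep the best one.  No gradients are
    computed, so vanishing gradients are irrelevant, but the search scales
    poorly with the number of weights."""
    rng = np.random.default_rng(seed)
    best_w, best_loss = None, np.inf
    for _ in range(trials):
        w = scale * rng.standard_normal(dim)
        l = loss(w)
        if l < best_loss:
            best_w, best_loss = w, l
    return best_w, best_loss

# Toy target: fit the weights of a tiny linear model (illustrative only).
X = np.random.default_rng(1).standard_normal((100, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w
w, l = random_search(lambda w: np.mean((X @ w - y) ** 2), dim=5)
print(l)   # best loss found by pure random guessing
```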