Weight initialization

A neural network contains trainable parameters that are modified during training: weight initialization is the pre-training step of assigning initial values to these parameters.

The choice of weight initialization method affects the speed of convergence, the scale of neural activation within the network, the scale of gradient signals during backpropagation, and the quality of the final model.

Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and activation function saturation.
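
For intuition, here is a small illustrative sketch (not from the article; width, depth, and scales are arbitrary) of how the initialization scale controls whether activations shrink or blow up in a deep linear stack:

```python
import numpy as np

# Toy demonstration: propagate an input through a deep stack of random
# linear layers. Each weight matrix has entries drawn from
# N(0, scale^2 / width). With scale < 1 the activations shrink layer by
# layer (vanishing), with scale > 1 they grow (exploding); scale = 1
# roughly preserves their magnitude.
rng = np.random.default_rng(0)
width, depth = 256, 50
x0 = rng.standard_normal(width)

for scale in (0.9, 1.0, 1.1):
    x = x0
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * (scale / np.sqrt(width))
        x = W @ x
    print(f"scale={scale}: activation norm after {depth} layers = {np.linalg.norm(x):.3e}")
```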

Note that although this article is titled "weight initialization", a neural network's trainable parameters include both weights and biases, and this article describes how both are initialized.

Similarly, the trainable parameters of convolutional neural networks (CNNs) are called kernels and biases, and this article describes their initialization as well.

We discuss the main methods of initialization in the context of a multilayer perceptron (MLP).

Specific strategies for initializing other network architectures are discussed in later sections.

For an MLP, there are only two kinds of trainable parameters, called weights and biases.
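
As an illustrative sketch (the sizes and the tanh activation are arbitrary choices), a single MLP layer with its two kinds of trainable parameters might look like this:

```python
import numpy as np

# Hypothetical single MLP layer: a weight matrix W of shape
# (fan_out, fan_in) and a bias vector b of shape (fan_out,).
# The forward map is y = phi(W @ x + b).
fan_in, fan_out = 784, 128
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))  # weights
b = np.zeros(fan_out)                                               # biases

x = rng.standard_normal(fan_in)  # example input
y = np.tanh(W @ x + b)           # layer output with a tanh activation
```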

Recurrent neural networks typically use activation functions with bounded range, such as sigmoid and tanh, since unbounded activation may cause exploding values.

(Le, Jaitly, Hinton, 2015)[1] suggested initializing the weights in the recurrent parts of the network to the identity matrix and the biases to zero.

[2] For neurons with ReLU activation, one can initialize the bias to a small positive value like 0.1, so that the gradient is likely nonzero at initialization, avoiding the dying ReLU problem.
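
A hedged sketch of the two suggestions above, with illustrative sizes and a He-style scale assumed for the non-recurrent ReLU weights:

```python
import numpy as np

hidden, fan_in = 128, 256
rng = np.random.default_rng(0)

# Identity initialization for the recurrent weight matrix, zero bias,
# as suggested for recurrent networks by Le, Jaitly and Hinton (2015).
W_rec = np.eye(hidden)
b_rec = np.zeros(hidden)

# Small positive bias for a ReLU layer, so most units start active and
# their gradients are nonzero at initialization (against "dying ReLU").
# The sqrt(2 / fan_in) scale is an illustrative He-style choice.
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(hidden, fan_in))
b = np.full(hidden, 0.1)
```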

(Saxe et al. 2013)[9] proposed orthogonal initialization: initializing weight matrices as uniformly random (according to the Haar measure) semi-orthogonal matrices, multiplied by a factor that depends on the activation function of the layer.

It was designed so that if one initializes a deep linear network this way, then its training time until convergence is independent of depth.

[10] Sampling a uniformly random semi-orthogonal matrix can be done by initializing a matrix X by IID sampling its entries from a standard normal distribution, then calculating W = X(XᵀX)^(−1/2), the orthogonal factor of the polar decomposition of X (equivalently, W = UVᵀ, where X = UΣVᵀ is the singular value decomposition of X).
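
A minimal sketch of sampling such a matrix, using the SVD-based construction described above (the function name and gain handling are illustrative):

```python
import numpy as np

def orthogonal_init(fan_out: int, fan_in: int, gain: float = 1.0, rng=None):
    # Sample a Haar-distributed semi-orthogonal matrix by taking the
    # orthogonal (polar) factor of a Gaussian random matrix, then scale
    # it by a gain chosen according to the layer's activation function.
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((fan_out, fan_in))        # IID standard normal entries
    u, _, vt = np.linalg.svd(x, full_matrices=False)  # X = U S V^T
    return gain * (u @ vt)                            # semi-orthogonal matrix U V^T
```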

Layer-sequential unit-variance (LSUV) initialization is a data-dependent initialization method, and can be used in convolutional neural networks.

It first initializes weights of each convolution or fully connected layer with orthonormal matrices.

Then, proceeding from the first to the last layer, it runs a forward pass on a random minibatch, and divides the layer's weights by the standard deviation of its output, so that its output has variance approximately 1.
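
A minimal sketch of this procedure for a plain stack of fully connected layers (the method also applies to convolutions; the architecture and random minibatch below are placeholders):

```python
import torch
from torch import nn

@torch.no_grad()
def lsuv_init(model: nn.Sequential, minibatch: torch.Tensor) -> nn.Sequential:
    # Sketch of layer-sequential unit-variance initialization on a stack
    # of Linear layers and activations. The full method iterates the
    # rescaling until the output variance is close enough to 1.
    x = minibatch
    for layer in model:
        if isinstance(layer, nn.Linear):
            nn.init.orthogonal_(layer.weight)  # start from an orthonormal weight matrix
            nn.init.zeros_(layer.bias)
            std = layer(x).std()               # std of this layer's output on the minibatch
            layer.weight.div_(std)             # rescale so the output has variance ~1
        x = layer(x)                           # propagate the minibatch to the next layer
    return model

# Usage sketch with an arbitrary architecture and a random "data" minibatch.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
model = lsuv_init(model, torch.randn(256, 64))
```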

[16][17] In 2015, the introduction of residual connections allowed very deep neural networks to be trained, much deeper than the ~20 layers of the previous state of the art (such as the VGG-19).

Residual connections gave rise to their own weight initialization problems and strategies.

Fixup initialization is designed specifically for networks with residual connections and without batch normalization.[18] Similarly, T-Fixup initialization is designed for Transformers without layer normalization.

[20] Random walk initialization was designed for MLPs so that, during backpropagation, the L2 norm of the gradient at each layer performs an unbiased random walk as one moves from the last layer to the first.

[23][5] In self-normalizing neural networks, the SELU activation function is designed so that, when the weights are initialized with zero mean and variance 1/fan_in (LeCun initialization), the activations converge towards zero mean and unit variance as they propagate through the layers, removing the need for explicit normalization layers.
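
A short sketch of the LeCun-style initialization assumed above (the function name is illustrative):

```python
import numpy as np

def lecun_normal(fan_out: int, fan_in: int, rng=None):
    # LeCun-style initialization: zero-mean Gaussian weights with
    # variance 1 / fan_in, the scheme paired with the SELU activation
    # in self-normalizing networks.
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(fan_out, fan_in))
```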

[24] Random weight initialization has been used since Frank Rosenblatt's perceptrons.

An early work that described weight initialization specifically was (LeCun et al., 1998).

[5] Before the 2010s era of deep learning, it was common to initialize models by "pre-training" with an unsupervised learning algorithm other than backpropagation, since it was difficult to train deep neural networks directly by backpropagation.

[27] (Martens, 2010)[20] proposed Hessian-free optimization, a second-order method for directly training deep networks.

The work generated considerable excitement that initializing networks without a pre-training phase was possible.

[28] However, a 2013 paper demonstrated that, with well-chosen hyperparameters, gradient descent with momentum combined with a good weight initialization was sufficient for training deep neural networks, a combination that is still in use as of 2024.

[31] There is a tension between using careful weight initialization to decrease the need for normalization, and using normalization to decrease the need for careful weight initialization, with each approach having its tradeoffs.