Data normalization (or feature scaling) includes methods that rescale input data so that the features have the same range, mean, variance, or other statistical properties.
Activation normalization, on the other hand, is specific to deep learning, and includes methods that rescale the activations of hidden neurons inside neural networks.
Normalization is often used to speed up and stabilize training and to improve generalization. Normalization techniques are often theoretically justified as reducing internal covariate shift, smoothing optimization landscapes, and increasing regularization, though they are mainly justified by empirical success.[1]
Batch normalization (BatchNorm)[2] operates on the activations of a layer for each mini-batch.
Concretely, consider a feedforward network that processes an input $x^{(0)}$ through a sequence of modules, $x^{(l+1)} = F^{(l)}(x^{(l)})$, where each network module can be a linear transform, a nonlinear activation function, a convolution, etc.
BatchNorm is a module that can be inserted at any point in the feedforward network.
Suppose a BatchNorm module receives a mini-batch of activation vectors $x_{(1)}, \dots, x_{(B)}$. The BatchNorm module computes the coordinate-wise mean and variance of these vectors:
$$\mu_i = \frac{1}{B} \sum_{b=1}^{B} x_{(b), i}, \qquad \sigma_i^2 = \frac{1}{B} \sum_{b=1}^{B} \left(x_{(b), i} - \mu_i\right)^2,$$
that is, by taking the $i$-th coordinate of each vector in the batch, and computing the mean and variance of these numbers. It then normalizes each coordinate and applies a learned affine transform:
$$\hat{x}_{(b), i} = \frac{x_{(b), i} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}, \qquad y_{(b), i} = \gamma_i \hat{x}_{(b), i} + \beta_i,$$
where $\epsilon$ is a small positive constant added to the variance for numerical stability, to avoid division by zero, and $\gamma_i, \beta_i$ are learned per-coordinate scale and shift parameters.[3]
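A minimal sketch of this computation in NumPy, for training mode only; the function name and the batch layout (one row per sample) are illustrative, not taken from any particular library:

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (B, D) batch of activation vectors, one row per sample.
    mu = x.mean(axis=0)                     # coordinate-wise mean over the batch
    var = x.var(axis=0)                     # coordinate-wise variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize each coordinate
    return gamma * x_hat + beta             # learned per-coordinate scale and shift

# Example: a batch of 4 vectors with 3 coordinates each.
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)
y = batchnorm_forward(x, gamma, beta)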
BatchNorm can be interpreted as removing the purely linear transformations, so that its layers focus solely on modelling the nonlinear aspects of the data, which may be beneficial, as a neural network can always be augmented with a linear transformation layer on top.[7][8]
The original paper[2] recommended using BatchNorm only after a linear transform, not after a nonlinear activation.
In that case, the bias term of the linear transform does not matter, since it would be canceled by the subsequent mean subtraction, so the layer can be taken to be of the form $x \mapsto Wx$.[2]
For convolutional neural networks (CNNs), BatchNorm must preserve the translation-invariance of these models, meaning that it must treat all outputs of the same kernel as if they are different data points within a batch: the mean and variance are computed once per channel, over the batch and the spatial dimensions.
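A sketch of this convolutional variant, assuming activations in (batch, channels, height, width) layout, with statistics shared across batch and spatial positions for each channel:

import numpy as np

def batchnorm2d_forward(x, gamma, beta, eps=1e-5):
    # x: (N, C, H, W). Treat every spatial position of every sample as one
    # "data point", so statistics are computed per channel over (N, H, W).
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)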
Many variants and improvements of BatchNorm have been proposed.[11] A particular problem with BatchNorm is that during training, the mean and variance are calculated on the fly for each batch, while a running estimate (usually an exponential moving average) is accumulated; during inference, the mean and variance are frozen to the estimates computed during training.
The disparity between these batch statistics and the frozen statistics can be decreased by simulating the moving average during inference.[11][12]
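A sketch of the usual bookkeeping, with an exponential moving average of the batch statistics maintained during training and reused at inference; the momentum value is illustrative, and the learned scale and shift are omitted for brevity:

import numpy as np

def batchnorm_train_step(x, state, momentum=0.9, eps=1e-5):
    # Use the current batch's statistics, and fold them into running estimates.
    mu, var = x.mean(axis=0), x.var(axis=0)
    state["mean"] = momentum * state["mean"] + (1 - momentum) * mu
    state["var"] = momentum * state["var"] + (1 - momentum) * var
    return (x - mu) / np.sqrt(var + eps)

def batchnorm_inference(x, state, eps=1e-5):
    # The running estimates are frozen; no batch statistics are computed.
    return (x - state["mean"]) / np.sqrt(state["var"] + eps)

state = {"mean": np.zeros(3), "var": np.ones(3)}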
Layer normalization (LayerNorm)[13] is a popular alternative to BatchNorm.
Compared to BatchNorm, LayerNorm's performance is not affected by batch size, since it normalizes each sample independently of the rest of the batch.
For a given data input and layer, LayerNorm computes the mean $\mu$ and variance $\sigma^2$ over all the activations $x_1, \dots, x_D$ in that layer,
$$\mu = \frac{1}{D} \sum_{i=1}^{D} x_i, \qquad \sigma^2 = \frac{1}{D} \sum_{i=1}^{D} (x_i - \mu)^2,$$
and then normalizes and rescales each activation:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad y_i = \gamma_i \hat{x}_i + \beta_i.$$
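A sketch of LayerNorm for a single activation vector; names are illustrative:

import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # x: (D,) activations of one layer for one sample; statistics are
    # computed over the feature dimension, not over the batch.
    mu = x.mean()
    var = x.var()
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta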
In recurrent neural networks[13] and transformers,[14] LayerNorm is applied individually to each timestep.
Root mean square layer normalization (RMSNorm)[15] changes LayerNorm by removing the mean subtraction and the learned shift, dividing each activation by the root mean square instead:
$$y_i = \frac{x_i}{\mathrm{RMS}(x)} \gamma_i, \qquad \mathrm{RMS}(x) = \sqrt{\frac{1}{D} \sum_{i=1}^{D} x_i^2}.$$
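A sketch of the corresponding computation, which drops the mean subtraction and the shift:

import numpy as np

def rmsnorm(x, gamma, eps=1e-8):
    # Divide by the root mean square of the activations instead of
    # subtracting the mean and dividing by the standard deviation.
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return gamma * x / rms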
Adaptive variants compute the scale and shift of a normalization layer from other data, rather than learning them as fixed parameters.[17] For example, in a DiT, the conditioning information (such as a text encoding vector) is processed by a multilayer perceptron into a scale $\gamma$ and a shift $\beta$, which are then applied in the LayerNorm layers of the transformer (adaptive layer norm, adaLN).
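A rough sketch of the idea; the two-layer MLP, its nonlinearity, and the dimensions are illustrative assumptions, not the DiT architecture itself:

import numpy as np

def adaptive_layernorm(x, cond, W1, W2, eps=1e-5):
    # An MLP maps the conditioning vector to a scale and shift for LayerNorm.
    h = np.maximum(cond @ W1, 0.0)        # illustrative hidden layer (ReLU)
    gamma, beta = np.split(h @ W2, 2)     # W2 outputs 2*D values
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return gamma * x_hat + beta

# Illustrative shapes: D activations, C-dimensional conditioning, H hidden units.
D, C, H = 8, 4, 16
x, cond = np.random.randn(D), np.random.randn(C)
W1, W2 = np.random.randn(C, H), np.random.randn(H, 2 * D)
y = adaptive_layernorm(x, cond, W1, W2)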
In spectral normalization, weight matrices are rescaled by dividing them by their spectral norm, i.e. their largest singular value.[19] The spectral norm can be efficiently approximated by power iteration: starting from an initial guess $x$, iterate
$$x \mapsto \frac{W^\top W x}{\left\| W^\top W x \right\|_2}$$
until convergence to $x^*$; then $\|W x^*\|_2$ approximates $\|W\|_s$.
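A sketch of this power iteration in NumPy; the iteration count is illustrative, and in practice a single step per training update, carried over between updates, is often enough:

import numpy as np

def spectral_norm(W, n_iter=50):
    # Power iteration on W^T W: x converges to the top right-singular
    # vector, and ||W x|| converges to the largest singular value.
    x = np.random.randn(W.shape[1])
    for _ in range(n_iter):
        x = W.T @ (W @ x)
        x /= np.linalg.norm(x)
    return np.linalg.norm(W @ x)

W = np.random.randn(4, 3)
W_sn = W / spectral_norm(W)   # spectrally normalized weight matrix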
Several normalization techniques are designed specifically for CNNs.[24] Group normalization (GroupNorm)[25] is one such technique, used solely for CNNs: it divides the channels of each sample into groups and computes the mean and variance over each group together with the spatial dimensions.
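A sketch of GroupNorm for activations in (N, C, H, W) layout, assuming the number of channels is divisible by the number of groups:

import numpy as np

def groupnorm(x, gamma, beta, num_groups, eps=1e-5):
    # x: (N, C, H, W). Statistics are computed per sample, over each
    # group of C // num_groups channels together with H and W.
    N, C, H, W = x.shape
    g = x.reshape(N, num_groups, C // num_groups, H, W)
    mu = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g_hat = (g - mu) / np.sqrt(var + eps)
    x_hat = g_hat.reshape(N, C, H, W)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)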
Adaptive instance normalization (AdaIN) is a variant of instance normalization (which normalizes each channel of each sample separately over its spatial positions), designed specifically for neural style transfer with CNNs, rather than for CNNs in general.
Given a content image and a style image processed by the same CNN, AdaIN first computes the per-channel mean and variance of the activations of the content image, normalizes them, and then replaces those statistics with the per-channel mean and variance of the activations of the style image:
$$\mathrm{AdaIN}(x, y) = \sigma(y) \left( \frac{x - \mu(x)}{\sigma(x)} \right) + \mu(y),$$
where $x$ and $y$ denote the activations of the content and style images respectively.
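A sketch of AdaIN on feature maps of the content and style images, both assumed to be in (C, H, W) layout:

import numpy as np

def adain(content, style, eps=1e-5):
    # content, style: (C, H, W) activations from the same CNN layer.
    # Per-channel statistics over the spatial dimensions.
    mu_c = content.mean(axis=(1, 2), keepdims=True)
    std_c = content.std(axis=(1, 2), keepdims=True)
    mu_s = style.mean(axis=(1, 2), keepdims=True)
    std_s = style.std(axis=(1, 2), keepdims=True)
    # Normalize the content features, then impose the style statistics.
    return std_s * (content - mu_c) / (std_c + eps) + mu_s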
The original 2017 transformer used the "post-LN" convention, in which LayerNorm is applied after each residual addition. It was difficult to train, and required careful hyperparameter tuning and a "warm-up" in learning rate, where it starts small and is gradually increased.
The pre-LN convention, proposed several times in 2018,[28] instead applies LayerNorm before each sublayer, inside the residual branch; it was found to be easier to train, requiring no warm-up and leading to faster convergence.[29]
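The difference, sketched for a single residual sublayer; here sublayer stands for self-attention or the feedforward block, and layernorm_fn for a LayerNorm as defined above:

def post_ln_block(x, sublayer, layernorm_fn):
    # Post-LN (original 2017 transformer): normalize after the residual add.
    return layernorm_fn(x + sublayer(x))

def pre_ln_block(x, sublayer, layernorm_fn):
    # Pre-LN: normalize inside the residual branch, before the sublayer.
    return x + sublayer(layernorm_fn(x))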
FixNorm[30] and ScaleNorm[31] both normalize activation vectors in a transformer.
The FixNorm method divides the output vectors from a transformer by their L2 norms, then multiplies them by a learned scalar parameter.
ScaleNorm replaces all LayerNorms inside a transformer with division by the L2 norm, followed by multiplication by a learned scalar parameter.
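A sketch of the shared operation, dividing an activation vector by its L2 norm and scaling by a single learned scalar; the small constant in the denominator is an illustrative safeguard against zero vectors:

import numpy as np

def scale_norm(x, g, eps=1e-8):
    # Divide the activation vector by its L2 norm, then scale by a
    # single learned scalar g.
    return g * x / (np.linalg.norm(x) + eps)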