Batch normalization

Batch normalization was initially believed to mitigate the problem of internal covariate shift, in which parameter initialization and changes in the distribution of the inputs to each layer affect the learning rate of the network.[1]

Recently, some scholars have argued that batch normalization does not reduce internal covariate shift, but rather smooths the objective function, which in turn improves performance.

Others maintain that batch normalization achieves length-direction decoupling, and thereby accelerates neural networks.

Although a clear-cut, precise definition of internal covariate shift seems to be missing, the phenomenon observed in experiments is the change in the means and variances of the inputs to internal layers during training.

The method of batch normalization was therefore proposed to reduce these unwanted shifts, in order to speed up training and to produce more reliable models.

Besides reducing internal covariate shift, batch normalization is believed to introduce many other benefits.

With this additional operation, the network can use a higher learning rate without vanishing or exploding gradients.

Furthermore, batch normalization seems to have a regularizing effect such that the network improves its generalization properties, and it is thus unnecessary to use dropout to mitigate overfitting.

It has also been observed that the network becomes more robust to different initialization schemes and learning rates while using batch normalization.

Ideally, the normalization would be conducted over the entire training set, but since this step is used jointly with stochastic optimization methods, it is impractical to use such global information; in practice, the normalization is therefore restricted to each mini-batch during training.
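As a concrete illustration, the following NumPy sketch (the variable names and toy data are illustrative assumptions, not taken from any particular library) applies the mini-batch transform to a batch of activations: each feature is centered and scaled by the mini-batch mean and variance, then rescaled by the learnable parameters gamma and beta.

import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) array with its own mini-batch statistics."""
    mu = x.mean(axis=0)                        # per-feature mini-batch mean
    var = x.var(axis=0)                        # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalized activations
    y = gamma * x_hat + beta                   # learnable scale and shift
    return y, x_hat, mu, var

# Example: a mini-batch of 32 samples with 4 features, far from zero mean and unit variance
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 4))
y, x_hat, mu, var = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))           # approximately 0 and 1 per feature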

The described BN transform is a differentiable operation, and the gradient of the loss l with respect to the different parameters can be computed directly with the chain rule.
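For reference, writing \mu_B and \sigma_B^2 for the mini-batch mean and variance, \hat{x}_i = (x_i - \mu_B)/\sqrt{\sigma_B^2 + \epsilon} for the normalized activation, y_i = \gamma \hat{x}_i + \beta for the output, and \ell for the loss, the chain rule yields the standard gradients of the original derivation, restated here for convenience:

\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i}\,\gamma, \qquad \frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i},

\frac{\partial \ell}{\partial \sigma_B^2} = -\frac{1}{2} \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i - \mu_B)\,(\sigma_B^2 + \epsilon)^{-3/2},

\frac{\partial \ell}{\partial \mu_B} = -\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{1}{m} \sum_{i=1}^{m} -2\,(x_i - \mu_B),

\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2}\,\frac{2\,(x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B}\,\frac{1}{m}.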

During inference, the mini-batch statistics are no longer available; instead, the normalization step in this stage is computed with the population statistics, so that the output depends on the input in a deterministic manner.
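One common way to obtain these population statistics, sketched below, is to keep exponential moving averages of the mini-batch statistics during training and to freeze them for inference (the momentum value and helper names here are assumptions made for this example):

import numpy as np

def batch_norm_inference(x, gamma, beta, running_mu, running_var, eps=1e-5):
    """Normalize with fixed population statistics: deterministic in x."""
    x_hat = (x - running_mu) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

def update_running_stats(running_mu, running_var, mu, var, momentum=0.9):
    """Exponential moving average of the mini-batch statistics, updated during training."""
    running_mu = momentum * running_mu + (1.0 - momentum) * mu
    running_var = momentum * running_var + (1.0 - momentum) * var
    return running_mu, running_var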

Although batch normalization has become popular due to its strong empirical performance, the working mechanism of the method is not yet well-understood.

In the third model, the injected noise has non-zero mean and non-unit variance, i.e., it explicitly introduces covariate shift.

Secondly, the quadratic form of the loss Hessian with respect to an activation, taken in the gradient direction, can be bounded in terms of the corresponding quadratic form of the unnormalized loss, which indicates that batch normalization also makes the loss smoother in this second-order sense.

In addition to the smoother landscape, it is further shown that batch normalization could result in a better initialization, in the sense that the initial weights of the batch-normalized network are provably at least as close to the optimal solution as those of the unnormalized network.

Some scholars argue that the above analysis cannot fully capture the performance of batch normalization, because the proof only concerns the largest eigenvalue, or equivalently, one direction in the landscape at all points.

It is suggested that the complete eigenspectrum needs to be taken into account to make a conclusive analysis.

If the shift introduced by the changes in previous layers were small, then the correlation between the gradients of a layer computed before and after the updates to all previous layers would be close to 1.
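As an illustration of this measurement (a minimal sketch with a two-layer linear network and squared loss; the data, architecture, and step size are invented for the example and are not those of the cited experiments), one can compare a layer's gradient before and after updating only the preceding layer:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 10))
y = rng.normal(size=(128, 1))
W1 = 0.1 * rng.normal(size=(10, 16))
W2 = 0.1 * rng.normal(size=(16, 1))
lr = 0.1

def grads(W1, W2):
    """Gradients of the mean squared loss of the linear network X @ W1 @ W2."""
    h = X @ W1                      # first-layer activations
    r = h @ W2 - y                  # residuals
    gW2 = h.T @ r / len(X)
    gW1 = X.T @ (r @ W2.T) / len(X)
    return gW1, gW2

# Second-layer gradient before and after an update of the first layer only
gW1, gW2_before = grads(W1, W2)
gW2_after = grads(W1 - lr * gW1, W2)[1]
cosine = np.sum(gW2_before * gW2_after) / (
    np.linalg.norm(gW2_before) * np.linalg.norm(gW2_after))
print("gradient correlation (cosine similarity):", cosine)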

Interestingly, it is shown that the standard VGG and DLN (deep linear network) models both have higher correlations of gradients than their batch-normalized counterparts, indicating that the additional batch normalization layers are not reducing internal covariate shift.

Even though batch normalization was originally introduced to alleviate gradient vanishing and explosion problems, a deep network with batch normalization in fact suffers from gradient explosion at initialization time, no matter what nonlinearity it uses.[3]

On the surface, this gradient explosion contradicts the smoothness property explained in the previous section, but in fact the two are consistent.
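The following sketch illustrates the claim empirically (a toy construction, assuming a plain fully-connected ReLU network with batch normalization at every layer and arbitrary choices of width, depth, and initialization; it is not the construction used in the cited analysis): backpropagating through many normalization layers at random initialization makes the gradient norm grow rapidly toward the earlier layers.

import numpy as np

rng = np.random.default_rng(0)
N, width, depth, eps = 64, 256, 50, 1e-5

# Forward pass through a deep fully-connected ReLU network with batch
# normalization (gamma = 1, beta = 0) at random initialization
a = rng.normal(size=(N, width))
cache = []
for _ in range(depth):
    W = rng.normal(size=(width, width)) / np.sqrt(width)
    z = a @ W
    mu, var = z.mean(axis=0), z.var(axis=0)
    z_hat = (z - mu) / np.sqrt(var + eps)
    a = np.maximum(z_hat, 0.0)
    cache.append((W, z_hat, var))

# Backward pass: track the gradient norm with respect to each layer's input
g = rng.normal(size=(N, width))          # arbitrary gradient at the output
norms = []
for W, z_hat, var in reversed(cache):
    g = g * (z_hat > 0.0)                # through ReLU
    # standard batch-norm backward pass (per feature, over the batch)
    g = (g - g.mean(axis=0) - z_hat * (g * z_hat).mean(axis=0)) / np.sqrt(var + eps)
    g = g @ W.T                          # through the linear layer
    norms.append(np.linalg.norm(g))

# Ratio of the gradient norm at the first layer to that at the last layer
print("gradient norm amplification:", norms[-1] / norms[0])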

Another possible reason for the success of batch normalization is that it decouples the length and direction of the weight vectors and thus facilitates better training.

With the reparametrization interpretation, it could then be proved that applying batch normalization to the ordinary least squares problem achieves a linear convergence rate in gradient descent, which is faster than the regular gradient descent with only sub-linear convergence.
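A minimal sketch of this idea (illustrative only: it uses the plain Euclidean norm for the decoupling, synthetic data, and an arbitrary step size rather than the exact reparametrization analyzed in the cited work) rewrites the weight vector of a least-squares problem as w = g * v / ||v|| and runs gradient descent on the length g and the direction v separately:

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def loss(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def grad(w):
    return X.T @ (X @ w - y) / n

# Length-direction decoupled parametrization: w = g * v / ||v||
g, v = 1.0, rng.normal(size=d)
lr = 0.1
for _ in range(500):
    v_norm = np.linalg.norm(v)
    w = g * v / v_norm
    gw = grad(w)                                            # gradient w.r.t. the effective weight w
    gg = gw @ (v / v_norm)                                  # chain rule: d loss / d g
    gv = (g / v_norm) * (gw - (gw @ v / v_norm ** 2) * v)   # chain rule: d loss / d v
    g -= lr * gg
    v -= lr * gv

w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print("decoupled GD loss:", loss(g * v / np.linalg.norm(v)), "optimal loss:", loss(w_star))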

It is proven that gradient descent on the resulting generalized Rayleigh quotient converges at a linear rate, with the rate governed by the spectrum of the associated matrices.
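For context, and as general background rather than the exact statement of the cited proof, the generalized Rayleigh quotient of a symmetric matrix B and a symmetric positive-definite matrix A is

\rho(w) = \frac{w^{\top} B\, w}{w^{\top} A\, w},

and its stationary points are the generalized eigenvectors of the pair (B, A), so optimizing \rho amounts to a generalized eigenvalue problem.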

The problem of learning halfspaces refers to the training of the Perceptron, which is the simplest form of neural network.
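In its usual formulation (stated here as general background, not as the exact setup of the cited analysis), learning a halfspace means finding a weight vector w that minimizes an expected loss of the form

\min_{w}\; \mathbb{E}_{(x, y)}\!\left[\varphi\!\left(-y\, w^{\top} x\right)\right], \qquad y \in \{-1, +1\},

where \varphi is a non-decreasing loss function; the choice \varphi(s) = \max(0, s) recovers the classical perceptron training criterion.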

With the Gaussian assumption, it can be shown that all critical points lie on the same line, for any choice of loss function.

Combining this global property with length-direction decoupling, it could thus be proved that this optimization problem converges linearly.

The GDNP algorithm thus slightly modifies the batch normalization step for ease of mathematical analysis.

Although the proof rests on the assumption of Gaussian input, experiments also show that GDNP can accelerate optimization without this constraint.