Knowledge distillation

In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one.

Evaluating a large model is just as computationally expensive even when the model uses little of its knowledge capacity.

As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).

Knowledge distillation has been successfully used in several applications of machine learning such as object detection,[2] acoustic models,[3] and natural language processing.[4]

Recently, it has also been introduced to graph neural networks applicable to non-grid data.[5]

Transferring knowledge from a large model to a small one requires teaching the latter without loss of validity.

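If both models are trained on the same data, the smaller model may have insufficient capacity to learn a concise knowledge representation compared to the large model.

However, some information about a concise knowledge representation is encoded in the pseudolikelihoods that the large model assigns to its outputs: when the model correctly predicts a class, it assigns a large value to the output variable corresponding to that class, and smaller values to the other output variables.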

The distribution of values among the outputs for a record provides information on how the large model represents knowledge.

Therefore, the goal of economical deployment of a valid model can be achieved by training only the large model on the data, exploiting its better ability to learn concise knowledge representations, and then distilling that knowledge into the smaller model by training it to learn the soft output of the large model.[1]

Given a large model as a function of the vector variable $\mathbf{x}$, trained for a specific classification task, the final layer of the network is typically a softmax of the form

$$y_i(\mathbf{x}|t) = \frac{e^{z_i(\mathbf{x})/t}}{\sum_j e^{z_j(\mathbf{x})/t}}$$

where $t$ is the temperature, a parameter which is set to 1 for a standard softmax.

The softmax operator converts the logit values $z_i(\mathbf{x})$ to pseudo-probabilities; higher temperature values generate softer distributions of pseudo-probabilities among the output classes.
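The effect of the temperature can be illustrated with a minimal sketch in plain Python. The function names `softmax_with_temperature` and `entropy`, and the example logits, are assumptions made for illustration and do not come from any particular library.

```python
import math


def softmax_with_temperature(logits, t=1.0):
    """Return exp(z_i / t) / sum_j exp(z_j / t) for a list of logits."""
    if t <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical logits for a three-class problem.
logits = [5.0, 2.0, -1.0]
for t in (1.0, 4.0, 20.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

With $t = 1$ the function reduces to the standard softmax. As $t$ grows, the ranking of the classes is preserved while the distribution moves toward uniform, so its entropy rises toward $\log 3$ for three classes.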

Knowledge distillation consists of training a smaller network, called the distilled model, on a data set called the transfer set (which may differ from the data set used to train the large model), using as the loss function the cross-entropy between the output $\mathbf{y}(\mathbf{x}|t)$ of the distilled model and the output $\hat{\mathbf{y}}(\mathbf{x}|t)$ produced by the large model on the same record (or the average of the individual outputs, if the large model is an ensemble), with a high value of softmax temperature $t$ for both models:[1]

$$E(\mathbf{x}|t) = -\sum_i \hat{y}_i(\mathbf{x}|t) \log y_i(\mathbf{x}|t).$$

In this context, a high temperature increases the entropy of the output, therefore providing more information to learn for the distilled model compared to hard targets, and at the same time reducing the variance of the gradient between different records, thus allowing a higher learning rate.[1]

If ground truth is available for the transfer set, the process can be strengthened by adding to the loss the cross-entropy between the output of the distilled model (computed with $t = 1$) and the known label $\bar{\mathbf{y}}$:

$$E(\mathbf{x}|t) = -t^2 \sum_i \hat{y}_i(\mathbf{x}|t) \log y_i(\mathbf{x}|t) - \sum_i \bar{y}_i \log y_i(\mathbf{x}|1)$$

where the component of the loss with respect to the large model is weighted by a factor of $t^2$ since, as the temperature increases, the gradient of the loss with respect to the model weights scales by a factor of $\frac{1}{t^2}$.[1]

Under the assumption that the logits have zero mean, it is possible to show that model compression is a special case of knowledge distillation.
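The loss and the zero-mean claim can be checked numerically with a small sketch in plain Python. It assumes a single record, a one-hot ground-truth label given as a class index, and hypothetical zero-mean logits for the distilled ("student") and large ("teacher") models; all function and variable names are illustrative. Differentiating the $t^2$-weighted soft term with respect to the distilled model's logit $z_i$ gives $t\,(y_i - \hat{y}_i)$, which for large $t$ and zero-mean logits approaches $(z_i - \hat{z}_i)/N$, the gradient of a squared-error logit-matching objective.

```python
import math


def log_softmax(logits, t=1.0):
    """Log of the temperature softmax, computed stably via log-sum-exp."""
    scaled = [z / t for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def distillation_loss(student_logits, teacher_logits, t, label=None):
    """t^2-weighted soft cross-entropy, plus hard cross-entropy at t = 1
    when a ground-truth class index is supplied."""
    log_y_t = log_softmax(student_logits, t)
    y_hat_t = [math.exp(v) for v in log_softmax(teacher_logits, t)]
    loss = -(t ** 2) * sum(p * lq for p, lq in zip(y_hat_t, log_y_t))
    if label is not None:
        loss += -log_softmax(student_logits, 1.0)[label]
    return loss


def soft_loss_gradient(student_logits, teacher_logits, t):
    """Analytic gradient of the t^2-weighted soft term with respect to the
    student logits: t * (y_i - y_hat_i)."""
    y = [math.exp(v) for v in log_softmax(student_logits, t)]
    y_hat = [math.exp(v) for v in log_softmax(teacher_logits, t)]
    return [t * (a - b) for a, b in zip(y, y_hat)]


# Hypothetical zero-mean logits for a three-class problem.
student = [1.0, -2.0, 1.0]
teacher = [3.0, -1.0, -2.0]
n = len(student)

print(distillation_loss(student, teacher, t=4.0, label=0))
# At high temperature the gradient approaches (z_i - z_hat_i) / N.
print(soft_loss_gradient(student, teacher, t=1000.0))
print([(z - zh) / n for z, zh in zip(student, teacher)])
```

The last two printed vectors should agree closely, which is the sense in which matching logits is a limiting case of matching soft targets.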

A related methodology is model compression or pruning, in which a trained network is reduced in size.

This was first done in 1965 by Alexey Ivakhnenko and Valentin Lapa in the USSR.

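Superfluous hidden units were pruned using a separate validation set.[10]

Other neural network compression methods include Biased Weight Decay[11] and Optimal Brain Damage.[6]

An early example of neural network distillation was published by Jürgen Schmidhuber in 1991, in the field of recurrent neural networks (RNNs).

The problem was sequence prediction, which was addressed with two RNNs: one (the automatizer) predicted the sequence, and the other (the chunker) predicted the errors of the automatizer.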

Simultaneously, the automatizer predicted the internal states of the chunker.

Once the automatizer predicted the chunker's internal states well, it would start fixing the errors, and soon the chunker became obsolete, leaving just one RNN in the end.[17]

Compressing the knowledge of multiple models into a single neural network was called model compression in 2006: compression was achieved by training a smaller model on large amounts of pseudo-data labelled by a higher-performing ensemble, optimizing to match the logit of the compressed model to the logit of the ensemble.[18]

The knowledge distillation preprint of Geoffrey Hinton et al. (2015)[1] formulated the concept and showed some results achieved in the task of image classification.

Knowledge distillation is also related to the concept of behavioral cloning discussed by Faraz Torabi et al.