The term "softmax" derives from the amplifying effects of the exponential on any maxima in the input vector.
For example, the standard softmax of (1, 2, 8) is approximately (0.001, 0.002, 0.997), which amounts to assigning almost all of the total unit weight in the result to the position of the vector's maximal element (of 8).
A higher temperature results in a more uniform output distribution (i.e. with higher entropy; it is "more random"), while a lower temperature results in a sharper output distribution, with one value dominating.
The term "softmax" is also used for the closely related LogSumExp function, which is a smooth maximum.
However, if the difference is small relative to the temperature, the value is not close to the arg max.
As the temperature tends to zero, though, all differences eventually become large relative to the shrinking temperature, which gives another interpretation for the limit behavior.
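A minimal NumPy sketch of this temperature behaviour (the helper softmax_t and the sample inputs are ours):

    import numpy as np

    def softmax_t(z, temperature=1.0):
        # Divide the inputs by the temperature before the usual softmax.
        z = np.asarray(z, dtype=float) / temperature
        e = np.exp(z - z.max())      # shift by the maximum for numerical safety
        return e / e.sum()

    z = [1.0, 2.0, 8.0]
    print(softmax_t(z, 100.0))   # ~[0.32, 0.33, 0.35]: nearly uniform (high entropy)
    print(softmax_t(z, 1.0))     # ~[0.001, 0.002, 0.997]: the standard softmax
    print(softmax_t(z, 0.1))     # ~[0, 0, 1]: approaches the arg max as T -> 0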
In statistical mechanics, the softmax function is known as the Boltzmann distribution (or Gibbs distribution): the index set {1, ..., k} consists of the microstates of the system; the inputs z_i are the energies of the corresponding states; the denominator is known as the partition function, often denoted by Z; and the factor β is called the coldness (or thermodynamic beta, or inverse temperature).
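A short NumPy sketch of this correspondence (the energies E and coldness beta below are arbitrary illustrative values):

    import numpy as np

    E = np.array([0.5, 1.0, 2.0])   # energies of three microstates (arbitrary units)
    beta = 1.0                      # coldness (inverse temperature)

    weights = np.exp(-beta * E)     # Boltzmann factors
    Z = weights.sum()               # partition function
    p = weights / Z                 # Boltzmann/Gibbs distribution = softmax of -beta * E
    print(p, p.sum())               # probabilities summing to 1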
The softmax function is used in various multiclass classification methods, such as multinomial logistic regression (also known as softmax regression),[2]: 206–209 [6] multiclass linear discriminant analysis, naive Bayes classifiers, and artificial neural networks.
[7] Specifically, in multinomial logistic regression and linear discriminant analysis, the input to the function is the result of K distinct linear functions, and the predicted probability for the jth class given a sample vector x and a weighting vector w is:

P(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{x}^{\mathsf{T}} \mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^{\mathsf{T}} \mathbf{w}_k}}

This amounts to applying the linear operator defined by w to the input x, thus transforming the original, probably highly-dimensional, input to vectors in a K-dimensional space.
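A minimal NumPy sketch of this prediction step (the weight matrix W, the input x, and the dimensions are arbitrary illustrative values, not a fitted model):

    import numpy as np

    rng = np.random.default_rng(0)
    K, d = 3, 5                      # number of classes, input dimension
    W = rng.normal(size=(K, d))      # one weight vector w_k per class
    x = rng.normal(size=d)           # sample vector

    scores = W @ x                   # K distinct linear functions of x
    e = np.exp(scores - scores.max())
    probs = e / e.sum()              # predicted P(y = j | x) for j = 1..K
    print(probs, probs.sum())        # sums to 1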
The standard softmax function is often used in the final layer of a neural network-based classifier.
Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
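As a concrete illustration, the per-example cross-entropy can be computed directly from the logits; a minimal sketch (the function name and the log-sum-exp rewriting are ours):

    import numpy as np

    def cross_entropy_from_logits(z, target):
        # -log softmax(z)[target], written in the numerically safer log-sum-exp form
        z = np.asarray(z, dtype=float)
        m = z.max()
        log_probs = z - (m + np.log(np.sum(np.exp(z - m))))
        return -log_probs[target]

    print(cross_entropy_from_logits([1.0, 2.0, 8.0], target=2))  # small loss, ~0.0034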
To ensure numerically stable computation, it is common to subtract the maximum value of the input vector before exponentiating.
This approach, while not altering the output or the derivative theoretically, enhances stability by directly controlling the maximum exponent value computed.
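A minimal NumPy sketch of the max-subtraction trick (the helper name softmax_stable is ours):

    import numpy as np

    def softmax_stable(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())    # largest exponent is now exp(0) = 1, so no overflow
        return e / e.sum()

    z = [1000.0, 1001.0, 1002.0]
    # np.exp(1002.0) overflows to inf in float64, but the shifted version is fine:
    print(softmax_stable(z))       # ~[0.090, 0.245, 0.665]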
In the field of reinforcement learning, a softmax function can be used to convert values into action probabilities.
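A brief sketch of such softmax ("Boltzmann") action selection (the action-value estimates q and the temperature tau are illustrative):

    import numpy as np

    q = np.array([1.0, 1.5, 0.2])       # estimated action values
    tau = 0.5                           # temperature controlling exploration

    e = np.exp((q - q.max()) / tau)
    action_probs = e / e.sum()          # higher-valued actions are more likely

    rng = np.random.default_rng(0)
    action = rng.choice(len(q), p=action_probs)
    print(action_probs, action)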
The computational effort for the softmax became a major limiting factor in the development of larger neural language models, motivating various remedies to reduce training times.
[9] The hierarchical softmax (introduced by Morin and Bengio in 2005) uses a binary tree structure where the outcomes (vocabulary words) are the leaves and the intermediate nodes are suitably selected "classes" of outcomes, forming latent variables.
[11] In practice, results depend on choosing a good strategy for clustering the outcomes into classes.
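A toy sketch of the idea, using a fixed hand-built binary tree over four "words" (the path encoding and the sigmoid-per-node parametrisation are illustrative, not Morin and Bengio's exact formulation):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Each word is a leaf of a depth-2 binary tree, encoded by its left/right path.
    paths = {"w0": [0, 0], "w1": [0, 1], "w2": [1, 0], "w3": [1, 1]}

    rng = np.random.default_rng(0)
    d = 4
    node_vecs = {"root": rng.normal(size=d),   # one parameter vector per internal node
                 "left": rng.normal(size=d),
                 "right": rng.normal(size=d)}
    h = rng.normal(size=d)                     # hidden/context vector

    def hierarchical_prob(word):
        # Probability of a word = product of binary decisions along its path,
        # so only the nodes on the path are evaluated, not the whole vocabulary.
        nodes = ["root", "left" if paths[word][0] == 0 else "right"]
        p = 1.0
        for node, go_right in zip(nodes, paths[word]):
            s = sigmoid(node_vecs[node] @ h)
            p *= s if go_right else (1.0 - s)
        return p

    print(sum(hierarchical_prob(w) for w in paths))  # ~1.0: a valid distribution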
[9] A second kind of remedy is based on approximating the softmax (during training) with modified loss functions that avoid calculating the full normalization factor.
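One rough sketch of this idea, normalising only over the true class and a handful of randomly drawn negative classes rather than the full output vocabulary (a simplified illustration, not any specific published estimator):

    import numpy as np

    def sampled_softmax_loss(logits, target, num_sampled, rng):
        # Approximate -log softmax(logits)[target] using a small random subset
        # of the other classes instead of the full normalization over all classes.
        K = len(logits)
        negatives = rng.choice([k for k in range(K) if k != target],
                               size=num_sampled, replace=False)
        subset = np.concatenate(([logits[target]], logits[negatives]))
        m = subset.max()
        return -(subset[0] - (m + np.log(np.sum(np.exp(subset - m)))))

    rng = np.random.default_rng(0)
    logits = rng.normal(size=10_000)    # e.g. one logit per vocabulary word
    print(sampled_softmax_loss(logits, target=42, num_sampled=64, rng=rng))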
The standard softmax method involves several loops over the inputs, and is therefore bottlenecked by memory bandwidth.
The FlashAttention method is a communication-avoiding algorithm that fuses these operations into a single loop, increasing the arithmetic intensity.
During the forward pass, the outputs and the softmax normalization statistics are cached, and during the backward pass, attention matrices are rematerialized from these, making it a form of gradient checkpointing.
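The core trick can be illustrated with the "online" (single-pass) softmax, which maintains a running maximum and a running sum of exponentials and rescales the sum whenever the maximum changes; a minimal scalar sketch (real kernels operate on tiles of the attention matrix and fuse the surrounding matrix multiplications):

    import numpy as np

    def online_softmax(z):
        # Maintain a running maximum m and a running sum of exponentials s
        # in a single pass, rescaling s whenever the maximum changes.
        m, s = -np.inf, 0.0
        for x in z:
            m_new = max(m, x)
            s = s * np.exp(m - m_new) + np.exp(x - m_new)
            m = m_new
        return np.exp(np.asarray(z) - m) / s   # outputs from the final statistics

    print(online_softmax([1.0, 2.0, 8.0]))     # matches the standard softmax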
Geometrically, the softmax function maps the vector space R^K to the (relative) interior of the standard (K − 1)-simplex (a (K − 1)-dimensional subset of K-dimensional space), due to the linear constraint that all outputs sum to 1, meaning the range lies on a hyperplane.
The standard logistic function is the special case for a 1-dimensional axis in 2-dimensional space, say the x-axis in the (x, y) plane.
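A quick numerical check of this reduction (a minimal NumPy sketch): for two classes, the softmax probability of the first class equals the standard logistic applied to the difference of the two inputs.

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def logistic(x):
        return 1.0 / (1.0 + np.exp(-x))

    z1, z2 = 2.0, 0.5
    print(softmax([z1, z2])[0])      # probability of the first class
    print(logistic(z1 - z2))         # the same value: the logistic of the difference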
[15] The use of the softmax in decision theory is credited to R. Duncan Luce,[16]: 1 who used the axiom of independence of irrelevant alternatives in rational choice theory to deduce the softmax in Luce's choice axiom for relative preferences.
We wish to treat the outputs of the network as probabilities of alternatives (e.g. pattern classes), conditioned on the inputs.
We explain two modifications: probability scoring, which is an alternative to squared error minimisation, and a normalised exponential (softmax) multi-input generalisation of the logistic non-linearity.
This transformation can be considered a multi-input generalisation of the logistic, operating on the whole output layer.
Computation of the (1, 2, 8) example above using Python code:
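A minimal NumPy sketch of that computation:

    import numpy as np

    z = np.array([1.0, 2.0, 8.0])
    softmax_z = np.exp(z) / np.sum(np.exp(z))
    print(softmax_z)   # approximately [0.001, 0.002, 0.997]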
The softmax function generates probability predictions densely distributed over its support; alternatives such as sparsemax or α-entmax can be used when sparse probability predictions are desired.[19] The Gumbel-softmax reparametrization trick can also be used when sampling from a discrete distribution needs to be mimicked in a differentiable manner.
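A minimal sketch of drawing one Gumbel-softmax sample (the logits and temperature tau are illustrative; lower temperatures give samples closer to one-hot vectors):

    import numpy as np

    def gumbel_softmax_sample(logits, tau, rng):
        # Add Gumbel(0, 1) noise to the logits, then apply a tempered softmax.
        g = -np.log(-np.log(rng.uniform(size=len(logits))))
        y = (np.asarray(logits) + g) / tau
        e = np.exp(y - y.max())
        return e / e.sum()

    rng = np.random.default_rng(0)
    logits = [1.0, 2.0, 8.0]
    print(gumbel_softmax_sample(logits, tau=0.5, rng=rng))   # a random, nearly one-hot vector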