Multinomial logistic regression

Multinomial logistic regression is known by a variety of other names, including polytomous LR,[2][3] multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.

Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable.

The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken).

The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case.

As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.[5]

If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable.

This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives.

For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility.
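A small numerical sketch of this property, using the logit choice probabilities derived later in the article (the utilities below are made-up values purely for illustration):

```python
import numpy as np

def choice_probs(utilities):
    """Multinomial logit choice probabilities from a vector of utilities."""
    e = np.exp(utilities - np.max(utilities))  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical utilities for car, bus, bicycle (made-up values).
u_car, u_bus, u_bike = 1.0, 0.5, 0.2

p_without_bike = choice_probs(np.array([u_car, u_bus]))
p_with_bike = choice_probs(np.array([u_car, u_bus, u_bike]))

# The car : bus odds ratio is exp(u_car - u_bus) in both choice sets (IIA).
print(p_without_bike[0] / p_without_bike[1])  # ~1.6487
print(p_with_bike[0] / p_with_bike[1])        # same value
```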

An example of a problem case arises if the choices include a car and a blue bus. If a red bus that is essentially identical to the blue bus is added, the model keeps the car-to-blue-bus odds unchanged, so the new alternative draws probability proportionally from the car and the blue bus, even though intuitively it should mainly split the existing bus riders between the two buses.

If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives.

Other models like the nested logit or the multinomial probit may be used in such cases as they allow for violation of the IIA.[6]

There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression.

The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that computes a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product:

$$\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol\beta_k \cdot \mathbf{X}_i,$$

where $\mathbf{X}_i$ is the vector of explanatory variables describing observation i, $\boldsymbol\beta_k$ is a vector of weights (or regression coefficients) corresponding to outcome k, and score(X_i, k) is the score associated with assigning observation i to category k. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score.
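As a concrete sketch of this scoring rule (the feature values and weights below are arbitrary placeholders, not fitted coefficients):

```python
import numpy as np

# One observation described by three explanatory variables (placeholder values).
X_i = np.array([0.5, -1.2, 3.0])

# One weight vector per outcome k = 0, 1, 2 (placeholder values).
beta = np.array([
    [ 0.4,  0.1, -0.3],   # weights for outcome 0
    [-0.2,  0.8,  0.5],   # weights for outcome 1
    [ 0.0, -0.5,  0.9],   # weights for outcome 2
])

scores = beta @ X_i            # score(X_i, k) = beta_k . X_i for each outcome k
predicted = np.argmax(scores)  # the predicted outcome has the highest score
print(scores, predicted)
```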

The difference between the multinomial logit model and the numerous other methods that share this basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) lies in how the optimal weights are determined during training and in how the score is interpreted. In the multinomial logit model the score can be converted directly into a probability, which makes it possible to combine the model's predictions with other probabilistic predictions in a principled way; without such a means of combination, errors in chained predictions tend to multiply.

This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts.[citation needed]
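A quick illustration of how errors compound when predictions are chained (the accuracies and the independence assumption are purely illustrative):

```python
# If a pipeline chains five submodels and each one is correct with probability p,
# then (assuming independent errors) all five are simultaneously correct with
# probability p ** 5.
for p in (0.9, 0.8):
    print(p, round(p ** 5, 2))   # 0.9 -> 0.59, 0.8 -> 0.33
```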

The basic setup is the same as in logistic regression, the only difference being that the dependent variable is categorical rather than binary, i.e. there are K possible outcomes rather than just two.

The following description is somewhat shortened; for more details, consult the logistic regression article.

These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to K. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of N "experiments", although an "experiment" may consist of nothing more than gathering data.

In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome.

As in other forms of linear regression, multinomial logistic regression uses a linear predictor function $f(k,i)$ to predict the probability that observation i has outcome k, of the following form:

$$f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},$$

where $\beta_{m,k}$ is a regression coefficient associated with the mth explanatory variable and the kth outcome. The regression coefficients and explanatory variables are normally grouped into vectors of size M + 1, so that the predictor function can be written more compactly as

$$f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,$$

where $\boldsymbol\beta_k$ is the set of regression coefficients associated with outcome k, and $\mathbf{x}_i$ (a row vector) is the set of explanatory variables associated with observation i, prepended by a 1 in entry 0.
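A minimal sketch of this compact vector form, with the intercept handled by prepending a 1 to the feature vector (all numbers are placeholders):

```python
import numpy as np

raw_features = np.array([2.0, -0.7])          # M = 2 explanatory variables (placeholders)
x_i = np.concatenate(([1.0], raw_features))   # prepend 1 in entry 0 for the intercept

# beta_k = (beta_0k, beta_1k, beta_2k) for one outcome k (placeholder values).
beta_k = np.array([0.3, -1.1, 0.6])

f_ki = beta_k @ x_i   # f(k, i) = beta_0k + beta_1k * x_1i + beta_2k * x_2i
print(f_ki)
```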

The unknown parameters in each vector βk are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of maximum likelihood using regularization of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the weights, but other distributions are also possible).
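Concretely, the equivalence between the squared regularizer and a Gaussian prior can be seen by taking the negative logarithm of a zero-mean Gaussian prior with variance $\sigma^2$ on each weight (a standard derivation; $\sigma^2$ is whatever prior variance one assumes):

$$-\ln \prod_{k,m} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\!\left(-\frac{\beta_{m,k}^{2}}{2\sigma^{2}}\right) = \frac{1}{2\sigma^{2}} \sum_{k,m} \beta_{m,k}^{2} + \text{const},$$

so maximizing the posterior is the same as minimizing the negative log-likelihood plus the squared penalty $\tfrac{\lambda}{2}\sum_{k,m}\beta_{m,k}^{2}$ with $\lambda = 1/\sigma^{2}$.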

Treated as a log-linear model, the logarithm of the probability of seeing a given output is modeled as the linear predictor plus a normalization term, the logarithm of the partition function:

$$\ln \Pr(Y_i = k) = \boldsymbol\beta_k \cdot \mathbf{X}_i - \ln Z.$$

Exponentiating both sides turns the additive term into a multiplicative factor, so that the probability is just the Gibbs measure:

$$\Pr(Y_i = k) = \frac{1}{Z}\, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}.$$

The quantity Z is called the partition function for the distribution. We can compute its value by applying the constraint that all probabilities must sum to 1:

$$Z = \sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i},$$

and therefore

$$\Pr(Y_i = k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}},$$

which is the softmax function applied to the vector of scores. Note that Z is "constant" in the sense that it is not a function of Y_i, the variable over which the probability distribution is defined. Because all probabilities must sum to 1, one of them is completely determined once all the rest are known, so only K − 1 of them carry independent information.
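A short sketch of this computation (placeholder weights and features; implementation details such as subtracting the maximum score are choices made here for numerical stability, not part of the model):

```python
import numpy as np

def class_probabilities(beta, X_i):
    """Pr(Y_i = k) as the Gibbs measure / softmax of the scores beta_k . X_i."""
    scores = beta @ X_i
    scores = scores - scores.max()   # subtracting a constant cancels in Z (stability)
    unnormalized = np.exp(scores)    # e^{beta_k . X_i}
    Z = unnormalized.sum()           # partition function
    return unnormalized / Z          # probabilities, guaranteed to sum to 1

# Placeholder weights (K = 3 outcomes) and one observation with a leading intercept 1.
beta = np.array([[ 0.4,  0.1, -0.3],
                 [-0.2,  0.8,  0.5],
                 [ 0.0, -0.5,  0.9]])
X_i = np.array([1.0, 0.5, -1.2])

probs = class_probabilities(beta, X_i)
print(probs, probs.sum())   # the probabilities sum to 1
```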

It is also possible to formulate multinomial logistic regression as a latent variable model, following the two-way latent variable model described for binary logistic regression.

That is, for each possible outcome k an unobserved continuous latent variable (a "utility") is imagined,

$$Y_{i,k}^{\ast} = \boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k,$$

where the error terms $\varepsilon_k$ are independent and identically distributed with a standard type-1 extreme value (Gumbel) distribution, and the observed outcome is the one whose latent variable is largest:

$$Y_i = \arg\max_k \, Y_{i,k}^{\ast}.$$

Equivalently, $\Pr(Y_i = k) = \Pr(Y_{i,k}^{\ast} > Y_{i,j}^{\ast} \text{ for all } j \neq k)$. Actually finding the values of these probabilities is somewhat difficult, and is a problem of computing a particular order statistic (the first, i.e. maximum) of a set of values; it can be shown, however, that with extreme-value errors the resulting probabilities are exactly the softmax probabilities given above.
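A small simulation sketch of this equivalence (the scores are placeholder values): adding standard Gumbel noise to the scores and taking the argmax reproduces, empirically, the softmax probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([1.0, 0.5, -0.3])   # beta_k . X_i for K = 3 outcomes (placeholders)

# Latent-variable simulation: utility = score + standard type-1 extreme value noise,
# and the chosen outcome is the one with the largest utility.
n = 200_000
utilities = scores + rng.gumbel(size=(n, scores.size))
empirical = np.bincount(np.argmax(utilities, axis=1), minlength=scores.size) / n

# Closed-form multinomial logit (softmax) probabilities for comparison.
softmax = np.exp(scores) / np.exp(scores).sum()
print(empirical)   # approximately equal to ...
print(softmax)     # ... the softmax probabilities
```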

The likelihood of the observed outcomes is the product of their predicted probabilities, so the negative log-likelihood function is the well-known cross-entropy:

$$-\log L = -\sum_{i=1}^{N} \ln \Pr(Y_i = y_i) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbf{1}\{y_i = k\}\, \ln \Pr(Y_i = k),$$

where $y_i$ denotes the observed outcome of observation i and $\mathbf{1}\{\cdot\}$ is the indicator function.

In natural language processing, multinomial LR classifiers are commonly used as an alternative to naive Bayes classifiers because they do not assume statistical independence of the random variables (commonly known as features) that serve as predictors.

In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights, which are typically chosen by maximum a posteriori (MAP) estimation, must be learned using an iterative procedure (see the discussion of estimating the coefficients above).
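A minimal sketch of such an iterative procedure, using plain gradient descent on the L2-regularized negative log-likelihood and synthetic made-up data (real implementations typically use quasi-Newton or coordinate methods):

```python
import numpy as np

def fit_multinomial_logit(X, y, K, lam=1e-2, lr=0.5, steps=2000):
    """Iterative MAP estimation for multinomial logistic regression.

    X: (N, M+1) design matrix with a leading column of 1s.
    y: (N,) integer outcomes in 0..K-1.
    Minimizes the negative log-likelihood plus an L2 (Gaussian-prior) penalty.
    """
    N, D = X.shape
    beta = np.zeros((K, D))
    onehot = np.eye(K)[y]                               # indicator of observed outcomes
    for _ in range(steps):
        scores = X @ beta.T                             # (N, K) linear predictors
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)       # softmax probabilities
        grad = (probs - onehot).T @ X / N + lam * beta  # gradient of the penalized NLL
        beta -= lr * grad                               # gradient descent step
    return beta

# Synthetic illustration: draw outcomes from a known model, then recover the weights.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_beta = np.array([[0.0,  2.0, -1.0],
                      [0.0, -1.0,  2.0],
                      [0.0,  0.0,  0.0]])
y = np.argmax(X @ true_beta.T + rng.gumbel(size=(500, 3)), axis=1)
beta_hat = fit_multinomial_logit(X, y, K=3)
```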