Perplexity

In information theory, perplexity is a measure of uncertainty in the value of a sample from a discrete probability distribution.

The larger the perplexity, the less likely it is that an observer can guess the value which will be drawn from the distribution.

Perplexity was originally introduced in 1977 in the context of speech recognition by Frederick Jelinek, Robert Leroy Mercer, Lalit R. Bahl, and James K. Baker.[1]

The perplexity PP of a discrete probability distribution p is a concept widely used in information theory, machine learning, and statistical modeling.

It is defined as

$$\mathit{PP}(p) = 2^{H(p)} = 2^{-\sum_x p(x) \log_2 p(x)},$$

where H(p) is the entropy (in bits) of the distribution, and x ranges over the events.

It can be thought of as a measure of uncertainty or "surprise" related to the outcomes.

In this context, a perplexity of k indicates that there is as much uncertainty as there would be when rolling a fair k-sided die.
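A minimal sketch of this definition (added for illustration; the function name is my own), confirming that a fair k-sided die has perplexity k:

```python
import math

def perplexity(probs):
    """Perplexity of a discrete distribution: 2 raised to its entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# A fair 6-sided die: entropy is log2(6) bits, so the perplexity is exactly 6.
print(perplexity([1/6] * 6))  # 6.0 (up to floating-point error)
```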

It is, however, generally not a straightforward representation of the relevant probability.

For example, if you have two choices, one with probability 0.9, your chances of a correct guess using the optimal strategy are 90 percent.
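A worked calculation (not in the original text) makes the gap concrete: for this distribution the perplexity is

$$2^{-0.9 \log_2 0.9 \,-\, 0.1 \log_2 0.1} \approx 2^{0.47} \approx 1.38,$$

which is neither 2 (the number of outcomes) nor about 1.11 (the reciprocal of the 0.9 probability of a correct guess).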

Entropy measures the expected or "average" number of bits required to encode the outcome of the random variable using an optimal variable-length code.

It can also be regarded as the expected information gain from learning the outcome of the random variable, providing insight into the uncertainty and complexity of the underlying probability distribution.

A model of an unknown probability distribution p may be proposed based on a training sample that was drawn from p. Given a proposed probability model q, one may evaluate q by asking how well it predicts a separate test sample x1, x2, ..., xN also drawn from p. The perplexity of the model q is defined as

$$\mathit{PP}(q) = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 q(x_i)},$$

where N is the size of the test sample.

Better models q of the unknown distribution p will tend to assign higher probabilities q(xi) to the test events.

This is equivalent to saying that better models have higher likelihoods for the test data, which leads to a lower perplexity value.

The exponent above may be regarded as the average number of bits needed to represent a test event xi if one uses an optimal code based on q. Low-perplexity models do a better job of compressing the test sample, requiring few bits per test element on average because q(xi) tends to be high.

Equivalently, the perplexity of q can be written as $2^{H(\tilde{p}, q)}$, where $H(\tilde{p}, q) = -\sum_x \tilde{p}(x) \log_2 q(x)$ is the cross-entropy and $\tilde{p}$ denotes the empirical distribution of the test sample (i.e., $\tilde{p}(x) = n/N$ if x appeared n times in the test sample of size N).
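As a sketch of this definition (the model probabilities and names below are hypothetical), the following snippet evaluates a proposed model q on a test sample both directly and via the cross-entropy with the empirical distribution; the two forms give the same value:

```python
import math
from collections import Counter

def model_perplexity(q, test_sample):
    """2 ** (average negative log2-probability that q assigns to the test events)."""
    n = len(test_sample)
    avg_neg_log2 = -sum(math.log2(q[x]) for x in test_sample) / n
    return 2 ** avg_neg_log2

def cross_entropy_perplexity(q, test_sample):
    """Equivalent form: 2 ** H(p_tilde, q), with p_tilde the empirical distribution."""
    n = len(test_sample)
    p_tilde = {x: count / n for x, count in Counter(test_sample).items()}
    cross_entropy = -sum(p * math.log2(q[x]) for x, p in p_tilde.items())
    return 2 ** cross_entropy

q = {"a": 0.5, "b": 0.3, "c": 0.2}        # proposed model (hypothetical values)
test = ["a", "a", "b", "c", "a", "b"]     # test sample drawn from the unknown p
print(model_perplexity(q, test))          # ~2.76
print(cross_entropy_perplexity(q, test))  # same value
```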

In natural language processing (NLP), a corpus is a structured collection of texts or documents, and a language model is a probability distribution over entire texts or documents.

Consequently, in NLP, the more commonly used measure is perplexity per token (word or, more frequently, sub-word), defined as

$$2^{-\frac{1}{N}\sum_{i=1}^{n} \log_2 q(x_i)},$$

where x1, ..., xn are the texts in the corpus and N is the total number of tokens in the corpus.

Suppose the average text xi in the corpus has a probability of $2^{-7.95\,N_i}$ according to the language model, where $N_i$ is the number of tokens in xi; this gives a model perplexity of $2^{7.95} = 247$ per token.

In other words, the model is as confused on test data as if it had to choose uniformly and independently among 247 possibilities for each token.

There are two standard evaluation metrics for language models: perplexity and word error rate (WER).

The simpler of these measures, WER, is the percentage of erroneously recognized words E (deletions, insertions, and substitutions) relative to the total number of words N in a speech recognition task, i.e.

$$\mathit{WER} = \frac{E}{N} \times 100\%.$$
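A small sketch of WER (the helper name is mine; the error count comes from a standard word-level Levenshtein alignment):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref) * 100

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~16.7
```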

The second metric, perplexity (per token), is an information-theoretic measure that evaluates the similarity between a proposed model m and the original distribution p. It can be computed as the inverse of the (geometric) average probability of the test set T:

$$\mathit{PP}(T) = \left( \prod_{i=1}^{N} m(w_i) \right)^{-1/N},$$

where N is the number of tokens in T and m(w_i) is the probability the model assigns to the i-th token.
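And a sketch of per-token perplexity for a toy unigram model m (all names and probabilities here are made up for illustration):

```python
import math

def per_token_perplexity(m, tokens):
    """Inverse of the geometric average probability that model m assigns to the tokens."""
    n = len(tokens)
    total_log2_prob = sum(math.log2(m[w]) for w in tokens)
    return 2 ** (-total_log2_prob / n)

m = {"the": 0.3, "cat": 0.2, "sat": 0.2, "on": 0.2, "mat": 0.1}  # toy unigram model
tokens = "the cat sat on the mat".split()
print(per_token_perplexity(m, tokens))  # ~4.9
```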

Since 2007, significant advancements in language modeling have emerged, particularly with the advent of deep learning techniques.

Perplexity was employed to compare different models on the same dataset and to guide the optimization of hyperparameters, although it has been found sensitive to factors such as linguistic features and sentence length.[2]

Despite its pivotal role in language model development, perplexity has shown limitations, particularly as an inadequate predictor of speech recognition performance and with respect to overfitting and generalization,[3][4] raising questions about the benefits of blindly optimizing perplexity alone.

The lowest perplexity published for the Brown Corpus (1 million words of American English of varying topics and genres) as of 1992 was about 247 per word/token, corresponding to a cross-entropy of $\log_2 247 = 7.95$ bits per word or 1.75 bits per letter,[5] obtained using a trigram model.

While this figure represented the state of the art (SOTA) at the time, advancements in techniques such as deep learning have led to significant improvements in perplexity on other benchmarks, such as the One Billion Word Benchmark.[6]

In the context of the Brown Corpus, simply guessing that the next word is "the" will achieve an accuracy of 7 percent, contrasting with the 1/247 = 0.4 percent that might be expected from a naive use of perplexity.

This difference underscores the importance of the statistical model used and the nuanced nature of perplexity as a measure of predictiveness.