It can be thought of as an alternative way of expressing probability, much like odds or log-odds, but which has particular mathematical advantages in the setting of information theory.
The Shannon information can be interpreted as quantifying the level of "surprise" of a particular outcome.
As it is such a basic quantity, it also appears in several other settings, such as the length of a message needed to transmit the event given an optimal source coding of the random variable.
The Shannon information is closely related to entropy, which is the expected value of the self-information of a random variable, quantifying how surprising the random variable is "on average".
This is the average amount of self-information an observer would expect to gain about a random variable when measuring it.
The term 'perplexity' has been used in language modelling to quantify the uncertainty inherent in a set of prospective events.
Claude Shannon's definition of self-information was chosen to meet several axioms:
1. An event with probability 100% is perfectly unsurprising and yields no information.
2. The less probable an event is, the more surprising it is and the more information it yields.
3. If two independent events are measured separately, the total amount of information is the sum of the self-informations of the individual events.
The detailed derivation is below, but it can be shown that there is a unique function of probability that meets these three axioms, up to a multiplicative scaling factor. Broadly, given a real number $b > 1$ and an event $x$ with probability $P$, the information content is defined as $\mathrm{I}(x) := -\log_b[\Pr(x)] = -\log_b(P)$.
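As a minimal sketch (not taken from the source), the defining formula and the three axioms can be checked numerically; the function name self_information and the base-2 default are choices made for this example.

```python
import math

def self_information(p, base=2):
    """Information content -log_b(p) of an event with probability p (illustrative helper)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability in [0, 1]")
    if p == 0:
        return math.inf  # an impossible event is "infinitely surprising"
    return -math.log(p, base)

# The three axioms, checked on sample values:
assert self_information(1.0) == 0.0                    # certainty carries no information
assert self_information(0.1) > self_information(0.9)   # rarer events are more surprising
a, b = 0.5, 0.25                                        # independent events multiply...
assert math.isclose(self_information(a * b),
                    self_information(a) + self_information(b))  # ...while informations add
```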
Formally, given a discrete random variable $X$ with probability mass function $p_X$, the self-information of measuring $X$ as outcome $x$ is defined as $\mathrm{I}_X(x) := -\log[p_X(x)]$. This notation is not universal: since $I(X;Y)$ is also often used for the related quantity of mutual information, many authors use a lowercase $h_X(x)$ for self-entropy instead, mirroring the use of the capital $H(X)$ for the entropy.
For a given probability space, measurements of rarer events are intuitively more "surprising", and yield more information content, than more common values.
While standard probabilities are represented by real numbers in the interval $[0, 1]$, self-informations are represented by extended real numbers in the interval $[0, \infty]$.
In particular, we have the following, for any choice of logarithmic base:
- If a particular event has probability 1 of occurring, its self-information is $-\log(1) = 0$: its occurrence is "perfectly non-surprising" and yields no information.
- If a particular event has probability 0 of occurring, its self-information is $-\log(0) = \infty$: its occurrence is "infinitely surprising".
From this, we can get a few general properties: self-information is nonnegative, it is zero only for certain events, and it grows without bound as events become rarer.

The Shannon information is closely related to the log-odds. In particular, given some event $x$, suppose that $p(x)$ is the probability of $x$ occurring and that $p(\lnot x) = 1 - p(x)$ is the probability of $x$ not occurring. Then the log-odds can be written as a difference of two Shannon informations: $\log\text{-odds}(x) = \log\!\left(\frac{p(x)}{p(\lnot x)}\right) = \mathrm{I}(\lnot x) - \mathrm{I}(x)$.
In other words, the log-odds can be interpreted as the level of surprise when the event doesn't happen, minus the level of surprise when the event does happen.
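A brief numerical check of that relationship, assuming an arbitrary example probability of 0.2:

```python
import math

p = 0.2                              # arbitrary probability of the event x
log_odds = math.log2(p / (1 - p))    # log-odds of x, in base 2
surprise_not_x = -math.log2(1 - p)   # I(not x): surprise when x does not happen
surprise_x = -math.log2(p)           # I(x): surprise when x does happen
assert math.isclose(log_odds, surprise_not_x - surprise_x)
```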
The Shannon entropy of the random variable $X$ above is defined as $\mathrm{H}(X) = \sum_x -p_X(x)\log p_X(x) = \sum_x p_X(x)\,\mathrm{I}_X(x)$, by definition equal to the expected information content of measurement of $X$.[5] For continuous random variables the corresponding concept is differential entropy.
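To illustrate, a short sketch that computes the entropy of a discrete distribution as the probability-weighted average of the self-informations; the three-outcome distribution is invented for the example.

```python
import math

def self_information(p, base=2):
    return -math.log(p, base)

def entropy(pmf, base=2):
    """Expected self-information of a discrete distribution given as {outcome: probability}."""
    return sum(p * self_information(p, base) for p in pmf.values() if p > 0)

# A biased three-outcome example distribution (values chosen arbitrarily).
pmf = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(pmf))  # 1.5 shannons: on average, each observation is this surprising
```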
The term 'surprisal' (as a log-probability measure) was coined by Myron Tribus in his 1961 book Thermostatics and Thermodynamics.
Consider the Bernoulli trial of tossing a fair coin $X$. The probabilities of the coin landing as heads $\mathrm{H}$ and tails $\mathrm{T}$ are one half each, $p_X(\mathrm{H}) = p_X(\mathrm{T}) = \tfrac{1}{2}$. Upon measuring the variable as heads, the associated information gain is $\mathrm{I}_X(\mathrm{H}) = -\log_2 p_X(\mathrm{H}) = -\log_2\tfrac{1}{2} = 1$, so the information gain of a fair coin landing as heads is 1 shannon.
Suppose we have a fair six-sided die. The value of a dice roll is a discrete uniform random variable $X \sim \mathrm{DU}[1, 6]$ with probability mass function $p_X(k) = \tfrac{1}{6}$ for $k \in \{1, \dots, 6\}$, so the information content of any particular roll, such as rolling a 4, is $\mathrm{I}_X(4) = -\log_2\tfrac{1}{6} \approx 2.585$ shannons.
Suppose we have two independent, identically distributed random variables $X, Y \sim \mathrm{DU}[1, 6]$, each corresponding to an independent fair 6-sided dice roll. Because the rolls are independent, the joint outcome $(X, Y) = (2, 4)$ has probability $\tfrac{1}{36}$, and its information content $\mathrm{I}_{X,Y}(2, 4) = -\log_2\tfrac{1}{36} \approx 5.17$ shannons is the sum of the information contents of the two individual rolls.
In the case of independent fair 6-sided dice rolls, the random variable $(X, Y)$ is uniform over its $36$ equally likely outcomes, so every joint outcome carries the same $\approx 5.17$ shannons of information. At the other extreme, a random variable whose support is a single point degenerates to a constant random variable with probability distribution deterministically given by $X = b$; since $p_X(b) = 1$, any measurement of it has information content $\mathrm{I}_X(b) = -\log_2 1 = 0$. In general, no information is gained from measuring a known value.[2]
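The dice figures above can be reproduced in a few lines (a sketch assuming the same fair, independent dice):

```python
import math

p_single = 1 / 6               # probability of any given face of a fair die
p_joint = p_single * p_single  # probability of a specific pair such as (2, 4)

print(-math.log2(p_single))    # ~2.585 Sh for one roll
print(-math.log2(p_joint))     # ~5.170 Sh for the pair, i.e. twice the single-roll value
print(-math.log2(1.0))         # 0.0 Sh (printed as -0.0) for a deterministic outcome
```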
Generalizing all of the above cases, consider a categorical discrete random variable with support $\mathcal{S} = \{s_i\}_{i=1}^{N}$ and probability mass function $p_X(k) = p_i$ for $k = s_i$ and $0$ otherwise. Without loss of generality, we can assume the categorical distribution is supported on the set $[N] = \{1, 2, \dots, N\}$; the information content of observing outcome $i$ is then $\mathrm{I}_X(i) = -\log_2 p_i$.
From these examples, it is possible to calculate the information content of any set of independent discrete random variables (DRVs) with known distributions by additivity.
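As a sketch of that additivity computation, with the component distributions and observed outcomes chosen arbitrarily for illustration:

```python
import math

def info(p, base=2):
    """Self-information -log_b(p) of an outcome with probability p."""
    return -math.log(p, base)

# Probabilities of the observed outcome of each independent discrete random variable.
observed = [1 / 2,   # a fair coin landing heads
            1 / 6,   # a fair die showing a 4
            1 / 4]   # a categorical variable hitting a category of probability 1/4

total = sum(info(p) for p in observed)  # additivity over independent variables
joint = info(math.prod(observed))       # same number via the joint probability
assert math.isclose(total, joint)
print(total)                            # ~5.585 Sh
```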
For example, quoting a character (the Hippy Dippy Weatherman) of comedian George Carlin: "Weather forecast for tonight: dark. Continued dark overnight, with widely scattered light by morning." Assuming one does not reside near the polar regions, the amount of information conveyed in that forecast is zero, because it is known in advance of receiving it that darkness always comes with the night.
Accordingly, the amount of self-information contained in a message conveying content informing an occurrence of event $\omega_n$ depends only on the probability $\mathrm{P}(\omega_n)$ of that event: $\mathrm{I}(\omega_n) = f(\mathrm{P}(\omega_n))$ for some function $f(\cdot)$ to be determined. If a message reports the occurrence of event $C$, the intersection of two independent events $A$ and $B$, then the amount of information of the message announcing $C$ would be expected to equal the sum of the amounts of information of the individual component messages for $A$ and $B$ respectively: $\mathrm{I}(C) = \mathrm{I}(A \cap B) = \mathrm{I}(A) + \mathrm{I}(B)$. Because of the independence of $A$ and $B$, the probability of $C$ is $\mathrm{P}(C) = \mathrm{P}(A)\,\mathrm{P}(B)$, so $f$ must satisfy $f(\mathrm{P}(A)\,\mathrm{P}(B)) = f(\mathrm{P}(A)) + f(\mathrm{P}(B))$; the functions with this property (given monotonicity) are the logarithms, which yields $\mathrm{I}(\omega_n) = -\log \mathrm{P}(\omega_n)$ up to the choice of base. The smaller the probability of event $\omega_n$, the larger the quantity of self-information associated with the message that the event indeed occurred.
As a quick illustration, the information content associated with an outcome of 4 heads (or any specific outcome) in 4 consecutive tosses of a coin would be 4 shannons (probability 1/16), and the information content associated with getting a result other than the one specified would be ~0.09 shannons (probability 15/16).
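These two figures follow directly from the definition; as a quick check (illustrative code only):

```python
import math

p_all_heads = (1 / 2) ** 4          # one specific outcome of 4 fair coin tosses
p_anything_else = 1 - p_all_heads   # any of the other 15 equally likely outcomes

print(-math.log2(p_all_heads))      # 4.0 shannons
print(-math.log2(p_anything_else))  # ~0.093 shannons
```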