Quantities of information

The choice of logarithmic base in the following formulae determines the unit of information entropy that is used.

The most common unit of information is the bit, or more correctly the shannon,[2] based on the binary logarithm.

Although "bit" is more frequently used in place of "shannon", its name is not distinguished from the bit as used in data processing to refer to a binary value or stream regardless of its entropy (information content).[3]

Shannon derived a measure of information content called the self-information or "surprisal" of a message $m$:

$$I(m) = \log\frac{1}{p(m)} = -\log p(m),$$

where $p(m) = \Pr(M = m)$ is the probability that message $m$ is chosen from all possible choices in the message space $M$.

The base of the logarithm only affects a scaling factor and, consequently, the units in which the measured information content is expressed.

If the logarithm is base 2, the measure of information is expressed in units of shannons or more often simply "bits" (a bit in other contexts is rather defined as a "binary digit", whose average information content is at most 1 shannon).
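As a minimal sketch of the scaling claim, the following Python snippet evaluates the self-information of one outcome under two logarithm bases. The function name `self_information` and the probability 0.125 are illustrative assumptions, not part of the original text.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Return log_base(1/p), the self-information of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log(1.0 / p) / math.log(base)

p = 0.125  # hypothetical probability of the message

bits = self_information(p, base=2)       # base 2: shannons, or "bits"
nats = self_information(p, base=math.e)  # base e: natural-log units

# Changing the base only rescales the result; this ratio is ln 2 for every p.
ratio = nats / bits
```

The ratio between the two values does not depend on `p`, which is the sense in which the base fixes only the unit.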

Messages that convey information about a certain event (P = 1), or about one that is already known with certainty (for instance, through a back-channel), provide no information, as the above equation indicates.

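A compound message made up of two unrelated messages carries a quantity of information equal to the sum of the information in each message individually. This can be derived from the definition by considering a compound message $m \& n$ that provides information regarding the values of two random variables $M$ and $N$ and is the concatenation of the elementary messages $m$ and $n$, whose information contents are $I(m) = -\log p(m)$ and $I(n) = -\log p(n)$ respectively. If the messages $m$ and $n$ depend only on $M$ and $N$ respectively, and the processes $M$ and $N$ are independent, then since $p(m \& n) = p(m)\,p(n)$ (the definition of statistical independence), it follows that

$$I(m \& n) = -\log\bigl(p(m)\,p(n)\bigr) = I(m) + I(n).$$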

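As an example, consider a weather forecast that ends: "Continued darkness until widely scattered light in the morning." Such a forecast contains almost no information, since what it describes is all but certain.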

However, a forecast of a snowstorm would certainly contain information, since a snowstorm does not happen every evening.

There would be an even greater amount of information in an accurate forecast of snow for a warm location, such as Miami.

The amount of information in a forecast of snow for a location where it never snows (an impossible event, with probability zero) is the highest possible: it is infinite.
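The additivity of information for independent messages, and the two limiting cases of a certain and an impossible event, can be checked numerically. The sketch below assumes base-2 logarithms. The function `surprisal_bits` and the probabilities are hypothetical, chosen only for illustration, and the impossible event is mapped to infinity by convention.

```python
import math

def surprisal_bits(p: float) -> float:
    """Self-information, in shannons (bits), of an outcome with probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p == 0.0:
        return math.inf  # impossible event: unbounded surprisal
    return math.log2(1.0 / p)

# Hypothetical probabilities of two independent messages m and n.
p_m, p_n = 0.25, 0.125

# Independence gives P(m & n) = P(m) * P(n), so the surprisals add.
compound = surprisal_bits(p_m * p_n)
separate = surprisal_bits(p_m) + surprisal_bits(p_n)

certain = surprisal_bits(1.0)     # a certain event carries no information
impossible = surprisal_bits(0.0)  # an impossible event carries infinite information
```

Here `compound` and `separate` agree, which is the additivity property in its simplest numerical form.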

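The entropy of a discrete message space $M$ is a measure of the amount of uncertainty one has about which message will be chosen. It is defined as the average self-information of a message $m$ from that message space:

$$H(M) = \mathbb{E}[I(M)] = \sum_{m \in M} p(m)\, I(m) = -\sum_{m \in M} p(m) \log p(m).$$

Sometimes the function $H$ is expressed in terms of the probabilities of the distribution:

$$H(p_1, p_2, \ldots, p_k) = -\sum_{i=1}^{k} p_i \log p_i,$$

where each $p_i \ge 0$ and $\sum_{i=1}^{k} p_i = 1$. An important special case of this is the binary entropy function:

$$H_b(p) = H(p, 1 - p) = -p \log p - (1 - p) \log(1 - p).$$

The joint entropy of two discrete random variables $X$ and $Y$ is defined as the entropy of the joint distribution of $X$ and $Y$:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y).$$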
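The entropy, the binary entropy function, and the joint entropy can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names, the base-2 logarithm, and the example distributions are all assumptions, and terms with zero probability are skipped under the usual convention that $0 \log 0 = 0$.

```python
import math
from typing import Sequence

def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits; outcomes with p = 0 contribute nothing."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def binary_entropy_bits(p: float) -> float:
    """Entropy of the two-outcome distribution (p, 1 - p)."""
    return entropy_bits([p, 1.0 - p])

fair_coin = binary_entropy_bits(0.5)     # maximal uncertainty for two outcomes
biased_coin = binary_entropy_bits(0.9)   # more predictable, so lower entropy
uniform_four = entropy_bits([0.25] * 4)  # four equiprobable messages

# Joint entropy: treat the joint table p(x, y) as one distribution over pairs.
joint = [[0.25, 0.25],
         [0.25, 0.25]]  # hypothetical pair of independent fair bits
joint_entropy = entropy_bits([p for row in joint for p in row])
```

For the independent fair bits in `joint`, the joint entropy equals the sum of the two individual entropies.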

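The conditional entropy of $X$ given $Y$ is the average, over the values $y$ of $Y$, of the entropy of $X$ given $Y = y$:

$$H(X \mid Y) = -\sum_{x, y} p(x, y) \log p(x \mid y).$$

A basic property of the conditional entropy is that

$$H(X \mid Y) = H(X, Y) - H(Y).$$

The Kullback–Leibler divergence (or information divergence, information gain, or relative entropy) is a way of comparing two distributions, a "true" probability distribution $p$ and an arbitrary probability distribution $q$. If we compress data in a manner that assumes $q$ is the distribution underlying the data when, in reality, $p$ is the correct distribution, the Kullback–Leibler divergence is the average number of additional bits per datum necessary for compression, or, mathematically,

$$D_{\mathrm{KL}}\bigl(p(X) \,\|\, q(X)\bigr) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$$

It is in some sense the "distance" from $q$ to $p$, although it is not a true metric because it is not symmetric.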
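The lack of symmetry is easy to see numerically. The following sketch computes the Kullback–Leibler divergence in bits for two small distributions. The function name `kl_divergence_bits` and both distributions are hypothetical, and returning infinity when the assumed distribution gives zero probability to a possible outcome is a common convention adopted here as an assumption.

```python
import math
from typing import Sequence

def kl_divergence_bits(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) in bits for two distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 log 0 is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass where q puts none
        total += pi * math.log2(pi / qi)
    return total

p = [0.5, 0.5]  # hypothetical "true" distribution
q = [0.9, 0.1]  # hypothetical distribution assumed by the compressor

forward = kl_divergence_bits(p, q)  # extra bits per datum when coding p as if it were q
reverse = kl_divergence_bits(q, p)  # swapping the arguments gives a different value
same = kl_divergence_bits(p, p)     # zero when the assumed distribution is correct
```

Swapping the arguments changes the result, which is why the divergence cannot serve as a metric.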

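Mutual information is a measure of how much information can be obtained about one random variable by observing another. The mutual information of $X$ relative to $Y$ (which represents conceptually the average amount of information about $X$ that can be gained by observing $Y$) is given by:

$$I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}.$$

A basic property of the mutual information is that

$$I(X; Y) = H(X) - H(X \mid Y).$$

That is, knowing $Y$, we can save an average of $I(X; Y)$ bits in encoding $X$ compared to not knowing $Y$.

Mutual information can be expressed as the average Kullback–Leibler divergence $D_{\mathrm{KL}}$ (information gain) of the posterior probability distribution of $X$ given the value of $Y$ to the prior distribution on $X$:

$$I(X; Y) = \mathbb{E}_{p(y)}\bigl[D_{\mathrm{KL}}\bigl(p(X \mid Y = y) \,\|\, p(X)\bigr)\bigr].$$

In other words, this is a measure of how much, on the average, the probability distribution on $X$ will change if we are given the value of $Y$. This is often recalculated as the divergence from the product of the marginal distributions to the actual joint distribution:

$$I(X; Y) = D_{\mathrm{KL}}\bigl(p(X, Y) \,\|\, p(X)\, p(Y)\bigr).$$

Mutual information is closely related to the log-likelihood ratio test in the context of contingency tables and the multinomial distribution, and to Pearson's χ² test: mutual information can be considered a statistic for assessing independence between a pair of variables, and has a well-specified asymptotic distribution.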
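The two routes to mutual information described here, as a divergence between the joint distribution and the product of its marginals and as a difference of entropies, can be compared on a small example. The sketch below is illustrative only: the joint table, the helper names, and the use of base-2 logarithms are assumptions.

```python
import math
from typing import List

def entropy_bits(probs: List[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def mutual_information_bits(joint: List[List[float]]) -> float:
    """I(X;Y) as the divergence from p(x)p(y) to p(x,y); rows index x, columns index y."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Hypothetical joint distribution of two correlated binary variables.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

h_xy = entropy_bits([p for row in joint for p in row])
h_x_given_y = h_xy - entropy_bits(py)               # H(X|Y) = H(X,Y) - H(Y)
mi_from_entropies = entropy_bits(px) - h_x_given_y  # I(X;Y) = H(X) - H(X|Y)
mi_from_divergence = mutual_information_bits(joint)

# For independent variables the joint equals the product of the marginals.
independent = [[0.25, 0.25],
               [0.25, 0.25]]
mi_independent = mutual_information_bits(independent)
```

The two estimates for the correlated table agree up to floating-point error, and the independent table shares no information.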

The basic measures of discrete entropy have been extended by analogy to continuous spaces by replacing sums with integrals and probability mass functions with probability density functions.

Although, in both cases, mutual information expresses the number of bits of information common to the two sources in question, the analogy does not imply identical properties; for example, differential entropy may be negative.
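A uniform density gives a simple illustration of negative differential entropy. For a uniform density on an interval of width $w$, the density is $1/w$ and the differential entropy is $\log w$, which is negative whenever $w < 1$. The sketch below evaluates this closed form in bits; the function name and the chosen widths are hypothetical.

```python
import math

def uniform_differential_entropy_bits(width: float) -> float:
    """Differential entropy, in bits, of a uniform density on an interval of the given width."""
    if width <= 0:
        raise ValueError("width must be positive")
    # The density is 1/width on the interval, so the integral of
    # -f(x) * log2(f(x)) over the interval reduces to log2(width).
    return math.log2(width)

narrow = uniform_differential_entropy_bits(0.5)  # width below 1: negative entropy
unit = uniform_differential_entropy_bits(1.0)    # width 1: zero entropy
wide = uniform_differential_entropy_bits(4.0)    # width above 1: positive entropy
```

A discrete entropy can never be negative, so the value for the narrow interval shows one way in which the analogy between the discrete and continuous cases breaks down.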