Information bottleneck method

It is designed for finding the best tradeoff between accuracy and complexity (compression) when summarizing (e.g. clustering) a random variable X, given a joint probability distribution p(X,Y) between X and an observed relevant variable Y, and is described by its authors as providing "a surprisingly rich framework for discussing a variety of problems in signal processing and learning".[1]

Applications include distributional clustering and dimension reduction, and more recently it has been suggested as a theoretical foundation for deep learning.

It generalizes the classical notion of a minimal sufficient statistic by relaxing the sufficiency condition to capture some fraction of the mutual information with the relevant variable Y.
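Formally, writing T for the compressed representation of X, the tradeoff is expressed by minimizing the bottleneck functional

\min_{p(t\mid x)} \; I(X;T) \;-\; \beta\, I(T;Y),

where \beta is a Lagrange multiplier that balances compression of X against preservation of information about Y.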

The information bottleneck can also be viewed as a rate distortion problem, with a distortion function that measures how well Y is predicted from a compressed representation T compared to its direct prediction from X.
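In this rate-distortion view, the distortion measure that arises is the Kullback–Leibler divergence between the predictive distributions obtained from X and from T,

d(x,t) \;=\; D_{KL}\!\left[\,p(y\mid x)\,\|\,p(y\mid t)\,\right],

and, since T depends on Y only through X, the expected distortion equals I(X;Y) - I(T;Y), so minimizing it is equivalent to maximizing I(T;Y).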

Controlling the information bottleneck is one way to control generalization error in deep learning; the corresponding generalization bound scales with the degree of information bottleneck, unlike other generalization bounds that scale with the number of parameters, VC dimension, Rademacher complexity, stability or robustness.

The theory of the information bottleneck has recently been used to study deep neural networks (DNNs).

In this setting each hidden layer is treated as a compressed representation T of the input, and the two mutual information quantities I(X;T) and I(T;Y) respectively quantify the amount of information that the hidden layer contains about the input and the output.
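As an illustration of how such quantities are estimated in practice, the following sketch bins hidden-layer activations and computes plug-in mutual information estimates from empirical histograms; the function names, array layout and bin count are choices made here for the example, not part of any particular published implementation.

```python
import numpy as np

def discrete_mutual_information(a, b):
    """Plug-in MI estimate (in nats) between two arrays of integer labels."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for ai, bi in zip(a, b):
        joint[ai, bi] += 1
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal of a
    pb = joint.sum(axis=0, keepdims=True)   # marginal of b
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def layer_information(hidden, x_labels, y_labels, n_bins=30):
    """Estimate I(X;T) and I(T;Y) for one hidden layer T by binning activations.

    hidden:   (n_samples, n_units) activations of the layer
    x_labels: (n_samples,) integer identifier of each distinct input x
    y_labels: (n_samples,) integer class label y
    """
    edges = np.linspace(hidden.min(), hidden.max(), n_bins + 1)
    binned = np.digitize(hidden, edges)                        # discretize activations
    _, t_ids = np.unique(binned, axis=0, return_inverse=True)  # one id per binned vector
    return (discrete_mutual_information(x_labels, t_ids),      # I(X;T)
            discrete_mutual_information(t_ids, y_labels))      # I(T;Y)
```

Estimates of this kind are sensitive to the binning resolution, which is one reason the results discussed below have been disputed.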

Shwartz-Ziv and Tishby[3] argued that training a DNN exhibits an initial fitting phase followed by a compression phase in which I(X;T) decreases. Saxe et al.[4] countered this claim, stating that the compression phenomenon in DNNs is not universal and that it depends on the particular activation function.

Shwartz-Ziv and Tishby disputed these claims, arguing that Saxe et al. had not observed compression due to weak estimates of the mutual information.

Recently, Noshad et al. used a rate-optimal estimator of mutual information to explore this controversy, observing that the optimal hash-based estimator reveals the compression phenomenon in a wider range of networks with ReLU and max-pooling activations.[5]

On the other hand, Goldfeld et al. have argued that the observed compression is a result of geometric, rather than information-theoretic, phenomena,[6] a view that has also been shared by other authors.

For jointly multivariate Gaussian X and Y, the optimal compressed representation is a noisy linear projection of the input, T = AX + \xi, whose rows are scaled left eigenvectors of the matrix \Sigma_{X|Y}\Sigma_X^{-1}. Ordering the eigenvalues \lambda_1 \le \lambda_2 \le \dots with left eigenvectors v_1, v_2, \dots, the number of active eigenvectors in the projection, or order of approximation, is determined by the critical values \beta_i^{c} = (1-\lambda_i)^{-1}: the i-th eigenvector enters the projection once \beta exceeds \beta_i^{c}. We finally get A = \left[\alpha_1 v_1^{T};\, \alpha_2 v_2^{T};\, \dots\right], in which the weights are given by

\alpha_i = \sqrt{\frac{\beta(1-\lambda_i)-1}{\lambda_i r_i}},

where r_i = v_i^{T}\Sigma_X v_i.
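A sketch of this computation, assuming zero-mean jointly Gaussian data with known covariance matrices (the names sigma_x, sigma_y, sigma_xy and the handling of degenerate eigenvalues are choices made for this illustration):

```python
import numpy as np

def gaussian_ib_projection(sigma_x, sigma_y, sigma_xy, beta):
    """Rows of the linear projection A in T = A X + noise for jointly Gaussian X, Y.

    Follows the closed-form characterization above: rows are scaled left
    eigenvectors of Sigma_{X|Y} Sigma_X^{-1}, activated once beta exceeds
    their critical value.
    """
    # Conditional covariance of X given Y and the (generally asymmetric) matrix
    # whose left eigenvectors give the projection directions.
    sigma_x_given_y = sigma_x - sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T)
    omega = sigma_x_given_y @ np.linalg.inv(sigma_x)

    # Left eigenvectors of omega are eigenvectors of omega.T; eigenvalues lie in [0, 1].
    eigvals, eigvecs = np.linalg.eig(omega.T)
    order = np.argsort(eigvals.real)                     # lambda_1 <= lambda_2 <= ...
    eigvals, eigvecs = eigvals.real[order], eigvecs.real[:, order]

    rows = []
    for lam, v in zip(eigvals, eigvecs.T):
        # Skip degenerate directions and those not yet active at this beta
        # (critical value beta_c = 1 / (1 - lambda)).
        if not (0.0 < lam < 1.0) or beta <= 1.0 / (1.0 - lam):
            continue
        r = v @ sigma_x @ v                              # r_i = v_i^T Sigma_X v_i
        alpha = np.sqrt((beta * (1.0 - lam) - 1.0) / (lam * r))
        rows.append(alpha * v)
    return np.vstack(rows) if rows else np.zeros((0, sigma_x.shape[0]))
```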

Applying the Gaussian information bottleneck to time series (processes) yields solutions related to optimal predictive coding.

This procedure is formally equivalent to linear Slow Feature Analysis.[9]

Optimal temporal structures in linear dynamic systems can be revealed in the so-called past-future information bottleneck, an application of the bottleneck method to non-Gaussian sampled data.[10]
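Schematically, the past of the process is compressed into a bottleneck variable that retains as much information as possible about its future (notation chosen here for illustration):

\min_{p(t\mid x_{\text{past}})} \; I(X_{\text{past}};T) \;-\; \beta\, I(T;X_{\text{future}}).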

The concept, as treated by Creutzig, Tishby et al., is not without complication, as the exercise comprises two independent phases: first, estimation of the unknown parent probability densities from which the data samples are drawn, and second, the use of these densities within the information-theoretic framework of the bottleneck.

Since the bottleneck method is framed in probabilistic rather than statistical terms, the underlying probability density at the sample points must first be estimated. This is a well-known problem with a number of solutions, which are discussed in Silverman's Density Estimation for Statistics and Data Analysis.

Tishby et al.[1] presented the following iterative set of equations to determine the clusters, which is ultimately a generalization of the Blahut–Arimoto algorithm developed in rate distortion theory.
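Writing t for the compressed (cluster) variable, these self-consistent equations take the form

\begin{aligned}
p(t\mid x) &= \frac{p(t)}{Z(x,\beta)}\,\exp\!\left(-\beta\,D_{KL}\!\left[\,p(y\mid x)\,\|\,p(y\mid t)\,\right]\right),\\
p(t) &= \sum_x p(t\mid x)\,p(x),\\
p(y\mid t) &= \frac{1}{p(t)}\sum_x p(y\mid x)\,p(t\mid x)\,p(x),
\end{aligned}

where Z(x,\beta) is a normalization (partition) function; the three lines are iterated until the distributions converge.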

The application of this type of algorithm in neural networks appears to originate in entropy arguments arising in the application of Gibbs distributions in deterministic annealing.

The Kullback–Leibler divergence D_{KL}\left[\,p(y\mid x)\,\|\,p(y\mid t)\,\right] between the conditional distributions generated by a sample x and by its compressed representation t is applied to assess the fidelity of the compressed vector with respect to the reference (or categorical) data Y.

The weighting by the negative exponent of the distance means that prior cluster probabilities are downweighted in line 1 when the Kullback–Leibler divergence is large; thus successful clusters grow in probability while unsuccessful ones decay.

First, estimate the underlying probability densities at the sample points and compute the matrix-valued Kullback–Leibler divergence function derived from the sample spacings and transition probabilities.

Second, apply the last two lines of the 3-line algorithm to obtain the cluster and conditional category probabilities.
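A compact sketch of this procedure, assuming the joint distribution has already been estimated as a discrete table p_xy over the sample values of X and the categories Y (the function name, initialization and iteration count are choices made for this illustration):

```python
import numpy as np

def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
    """Iterate the three self-consistent equations for a discrete joint p(x, y).

    p_xy: (n_x, n_y) joint probability table with all row sums positive.
    Returns p(t|x), p(t) and p(y|t).
    """
    rng = np.random.default_rng(seed)
    n_x, _ = p_xy.shape
    p_x = p_xy.sum(axis=1)                              # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]                   # rows are p(y|x)

    # Random soft assignment of samples to clusters as a starting point.
    p_t_given_x = rng.random((n_x, n_clusters))
    p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    eps = 1e-12
    for _ in range(n_iter):
        # Lines 2 and 3: cluster marginal and cluster-conditional category probabilities.
        p_t = p_t_given_x.T @ p_x                                   # p(t)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x  # sum_x p(y|x) p(t|x) p(x)
        p_y_given_t /= p_t[:, None]

        # Line 1: prior cluster probabilities reweighted by exp(-beta * KL divergence).
        kl = (p_y_given_x[:, None, :] *
              np.log((p_y_given_x[:, None, :] + eps) /
                     (p_y_given_t[None, :, :] + eps))).sum(axis=2)  # shape (n_x, n_clusters)
        logits = np.log(p_t + eps)[None, :] - beta * kl
        p_t_given_x = np.exp(logits - logits.max(axis=1, keepdims=True))
        p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

    return p_t_given_x, p_t, p_y_given_t
```

Sweeping beta then traces out the accuracy–complexity tradeoff described above, with larger values favouring clusters that preserve more information about Y.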

The following case examines clustering in a four-quadrant multiplier with random inputs.

This function has two spatially separated clusters for each category and so demonstrates that the method can handle such distributions.
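A minimal way to generate data of this kind, assuming inputs drawn uniformly from [-1, 1] and a binary category given by the sign of their product (details that are not specified in the text above):

```python
import numpy as np

def four_quadrant_samples(n, seed=0):
    """Random inputs (u, v) with a binary category given by the sign of u * v."""
    rng = np.random.default_rng(seed)
    u, v = rng.uniform(-1.0, 1.0, size=(2, n))
    category = np.sign(u * v).astype(int)   # +1 in quadrants I and III, -1 in II and IV
    return np.column_stack([u, v]), category
```

Each category then occupies two diagonally opposite quadrants, giving the two spatially separated clusters per category mentioned above.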

The contour at the unity likelihood ratio level is shown, tracing the decision boundary along which a new sample is equally likely to belong to either category.

This algorithm is somewhat analogous to a neural network with a single hidden layer.

However, unlike a standard neural network, the algorithm relies entirely on probabilities as inputs rather than the sample values themselves, while internal and output values are all conditional probability density distributions.

The Blahut–Arimoto three-line algorithm converges rapidly, often in tens of iterations, and by varying the tradeoff parameter β the balance between compression and preservation of category information can be tuned.

Figure: Decision contours.