[4] In 2000, Geoffrey Hinton et al. described an imaging system that combined segmentation and recognition into a single inference process using parse trees.
So-called credibility networks described the joint distribution over the latent variables and over the possible parse trees.
[1] In Hinton's original idea one minicolumn would represent and detect one multidimensional entity.
Not every property transforms predictably, however: for example, transforming a circle into an ellipse means that its perimeter can no longer be computed as π times the diameter.
Equivariant properties such as a spatial relationship are captured in a pose, data that describes an object's translation, rotation, scale and reflection.
[1] Unsupervised capsnets learn a global linear manifold between an object and its pose as a matrix of weights.
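The following NumPy sketch, which is illustrative and not drawn from the cited papers, shows what such pose data can look like in two dimensions: rotation, scale, reflection and translation combined into a single affine matrix, so that instantiating an object under a given pose is a purely linear operation of the kind a capsule's weight matrix can capture.

```python
# Illustrative only: a 2-D "pose" as one homogeneous-coordinate affine matrix.
import numpy as np

def pose_matrix(theta, scale, tx, ty, reflect=False):
    """Build a 3x3 pose matrix combining rotation, scale, reflection, translation."""
    c, s = np.cos(theta), np.sin(theta)
    r = np.array([[c, -s], [s, c]])
    if reflect:
        r = r @ np.diag([1.0, -1.0])   # reflect across the x-axis
    m = np.eye(3)
    m[:2, :2] = scale * r
    m[:2, 2] = [tx, ty]
    return m

# A canonical unit square, as homogeneous column vectors.
square = np.array([[0, 1, 1, 0],
                   [0, 0, 1, 1],
                   [1, 1, 1, 1]], dtype=float)

# Applying the pose is a single matrix multiplication: a linear relationship
# between the canonical object and its instantiated appearance.
pose = pose_matrix(theta=np.pi / 6, scale=2.0, tx=3.0, ty=-1.0)
print(pose @ square)
```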
Capsnet proponents argue that max-pooling provides invariance (discarding positional information) rather than equivariance (disentangling that information).[1] A capsule is a set of neurons that individually activate for various properties of a type of object, such as position, size and hue.
[1][3] Artificial neurons traditionally output a scalar, real-valued activation that loosely represents the probability of an observation.
Capsnets replace scalar-output feature detectors with vector-output capsules and max-pooling with routing-by-agreement.
A minimal cluster of two capsules considering a six-dimensional entity would agree within 10% by chance only once in a million trials.
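Assuming the six dimensions are independent and that chance agreement within 10% in any one dimension has probability 0.1, this figure follows directly:

$$0.1^{6} = 10^{-6},$$

that is, one chance in a million.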
This agreement-based detection is similar to the Hough transform, the randomized Hough transform (RHT) and RANSAC from classical digital image processing.
[1] For each possible parent, each child computes a prediction vector by multiplying its output by a weight matrix (trained by backpropagation).
[1] The more children whose predictions are close to a parent's output, the more quickly the coefficients grow, driving convergence.
[3] The coefficients' initial logits are the log prior probabilities that a child belongs to a parent.
The priors depend on the location and type of the child and parent capsules, but not on the current input.
At each iteration, the coefficients are adjusted via a "routing" softmax so that they continue to sum to 1 (expressing the probability that a given capsule is the parent of a given child). Similarly, the probability that a feature is present in the input is represented by the length of a capsule's output vector, which a nonlinear "squashing" function keeps below 1, shrinking short vectors almost to zero and long vectors to just under unit length.
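One squashing function with these properties, used in the dynamic-routing formulation, can be written as

$$\mathbf{v}_j = \frac{\lVert\mathbf{s}_j\rVert^{2}}{1+\lVert\mathbf{s}_j\rVert^{2}}\,\frac{\mathbf{s}_j}{\lVert\mathbf{s}_j\rVert},$$

where $\mathbf{s}_j$ is the total input to capsule $j$ and $\mathbf{v}_j$ is its output; short inputs are squashed to nearly zero length and long inputs to just under unit length.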
[3] This dynamic routing mechanism provides the necessary deprecation of alternatives ("explaining away") that is needed for segmenting overlapped objects.
Capsules in the next higher level are fed the sum of the predictions from all capsules in the lower layer, each weighted by a coupling coefficient.
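The sketch below pulls these pieces together in NumPy. The array shapes, variable names and the three-iteration count are assumptions made for illustration, not a reference implementation of any cited paper.

```python
# Illustrative sketch of routing-by-agreement between one layer of child
# capsules and one layer of parent capsules.
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink short vectors toward zero and long vectors toward unit length."""
    norm_sq = np.sum(s ** 2, axis=axis, keepdims=True)
    norm = np.sqrt(norm_sq + eps)
    return (norm_sq / (1.0 + norm_sq)) * (s / norm)

def route(u_hat, num_iterations=3):
    """Dynamic routing.

    u_hat: prediction vectors, shape (num_children, num_parents, parent_dim),
           i.e. each child's prediction of each parent's output (already the
           product of the child's output and a trained weight matrix).
    Returns the parent output vectors, shape (num_parents, parent_dim).
    """
    num_children, num_parents, _ = u_hat.shape
    # Initial logits; in a trained network these would be learned log priors
    # that depend on the type and location of child and parent capsules.
    b = np.zeros((num_children, num_parents))
    for _ in range(num_iterations):
        # Coupling coefficients: each child's couplings over parents sum to 1.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        # Each parent receives the coupling-weighted sum of the predictions.
        s = np.einsum('ij,ijk->jk', c, u_hat)
        v = squash(s)
        # Agreement (dot product) between a child's prediction and a parent's
        # output raises that logit, so agreeing children couple more strongly
        # to that parent on the next iteration.
        b = b + np.einsum('ijk,jk->ij', u_hat, v)
    return v

# Toy usage: 6 child capsules predicting 3 parent capsules of dimension 4.
rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))
parents = route(u_hat)
print(parents.shape)  # (3, 4)
```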
The structure in layers I and II is somewhat similar to that of the cerebral cortex, if stellate cells are assumed to be involved in transposing input vectors.
[1] The length of the instantiation vector represents the probability that a capsule's entity is present in the scene.
[1] An additional reconstruction loss encourages entities to encode their inputs' instantiation parameters.
The final activity vector is then used to reconstruct the input image via a decoder consisting of 3 fully connected layers.
The reconstruction minimizes the sum of squared differences between the outputs of the logistic units and the pixel intensities.
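A minimal sketch of this decoder and loss follows, assuming a 28×28 grayscale input; the capsule and layer sizes are illustrative choices, and only a forward pass with randomly initialized weights is shown, not training.

```python
# Illustrative sketch of the reconstruction decoder and its loss.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Final-layer activity vectors: one 16-D capsule per class (10 classes here).
activity = rng.normal(size=(10, 16))

# Mask out every capsule except the one for the target class, then feed the
# flattened result to three fully connected layers.
target_class = 3
masked = np.zeros_like(activity)
masked[target_class] = activity[target_class]
x = masked.ravel()                       # 160-D input to the decoder

w1 = rng.normal(scale=0.01, size=(160, 512))
w2 = rng.normal(scale=0.01, size=(512, 1024))
w3 = rng.normal(scale=0.01, size=(1024, 784))

h1 = np.maximum(0.0, x @ w1)             # ReLU
h2 = np.maximum(0.0, h1 @ w2)            # ReLU
reconstruction = sigmoid(h2 @ w3)        # logistic outputs, one per pixel

# Reconstruction loss: sum of squared differences with the pixel intensities.
image = rng.uniform(size=784)            # stand-in for a real input image
loss = np.sum((reconstruction - image) ** 2)
print(loss)
```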
Capsule activations effectively invert the graphics rendering process, going from pixels to features.
[1] Human vision examines a sequence of focal points (directed by saccades), processing only a fraction of the scene at its highest resolution.
A minicolumn is a structure containing 80-120 neurons, with a diameter of about 28-40 μm, spanning all layers in the cerebral cortex.
All neurons in such a minicolumn have the same receptive field, and they output their activations as action potentials, or spikes.
[8] Capsnets explore the intuition that the human visual system creates a tree-like structure for each focal point and coordinates these trees to recognize objects.
However, with capsnets each tree is "carved" from a fixed network (by adjusting coefficients) rather than assembled on the fly.