In neural networks, a pooling layer is a kind of network layer that downsamples and aggregates information that is dispersed among many vectors into fewer vectors.
It removes redundant information, reducing the amount of computation and memory required; it makes the model more robust to small variations in the input; and it increases the receptive field of neurons in later layers of the network.
Pooling is most commonly used in convolutional neural networks (CNN).
As notation, consider an input tensor $x \in \mathbb{R}^{H \times W \times C}$, where $H$ is the height, $W$ the width, and $C$ the number of channels, pooled with a filter size $f$ and a stride $s$. Sometimes it is necessary to use a different filter size and stride for the horizontal and vertical directions, written $f_H, f_W$ and $s_H, s_W$.
Max Pooling (MaxPool) is commonly used in CNNs to reduce the spatial dimensions of feature maps.
If the horizontal and vertical filter sizes and strides differ, then in general each entry of the output is the maximum over the corresponding patch of the input:

$y_{i,j,c} = \max\{\, x_{i s_H + \Delta_H,\; j s_W + \Delta_W,\; c} : 0 \le \Delta_H < f_H,\ 0 \le \Delta_W < f_W \,\}$

If $(H - f_H)/s_H$ or $(W - f_W)/s_W$ is not an integer, then for computing the entries of the output tensor on the boundaries, max pooling would attempt to take as inputs variables that lie off the tensor. In this case, how those non-existent variables are handled depends on the padding conditions, such as zero-padding the border or discarding the incomplete windows.
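As a concrete illustration of the definition above, the following is a minimal NumPy sketch of 2-dimensional max pooling with a square filter; the function name and the choice of padding with negative infinity are assumptions made for this example, not a reference implementation.

```python
import numpy as np

def max_pool2d(x, f=2, s=2, pad=0):
    """Max-pool a float (H, W, C) tensor with square filter size f and stride s.

    With pad = 0 ("valid" pooling), windows that would run off the tensor are
    simply dropped. With pad > 0, the borders are padded with -inf so that the
    padded positions can never win the maximum.
    """
    if pad > 0:
        x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)),
                   mode="constant", constant_values=-np.inf)
    H, W, C = x.shape
    H_out = (H - f) // s + 1
    W_out = (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i * s:i * s + f, j * s:j * s + f, :]
            y[i, j, :] = patch.max(axis=(0, 1))  # maximum over the f x f patch, per channel
    return y

x = np.arange(36, dtype=float).reshape(6, 6, 1)
print(max_pool2d(x, f=2, s=2).squeeze())  # 3 x 3 output, each entry the max of a 2 x 2 patch
```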
Global Max Pooling (GMP) takes the maximum over the entire spatial extent of each channel, producing one value per channel: $y_c = \max_{i,j} x_{i,j,c}$. It is often used just before the final fully connected layers in a CNN classification head.
Global Average Pooling (GAP) is defined similarly to GMP, but with the mean in place of the maximum: $y_c = \frac{1}{HW} \sum_{i,j} x_{i,j,c}$.[2] Similarly to GMP, it is often used just before the final fully connected layers in a CNN classification head.
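A minimal NumPy sketch of the two global poolings, assuming an (H, W, C) layout for the feature map; the function names are assumptions for this example.

```python
import numpy as np

def global_max_pool(x):
    # x: (H, W, C) feature map -> (C,) vector, one maximum per channel
    return x.max(axis=(0, 1))

def global_avg_pool(x):
    # x: (H, W, C) feature map -> (C,) vector, one mean per channel
    return x.mean(axis=(0, 1))

feature_map = np.random.rand(7, 7, 512)    # e.g. the final convolutional feature map
print(global_max_pool(feature_map).shape)  # (512,), fed to the fully connected classifier head
print(global_avg_pool(feature_map).shape)  # (512,)
```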
Mixed pooling is a linear combination of max pooling and average pooling, $y = \lambda\,\mathrm{MaxPool}(x) + (1-\lambda)\,\mathrm{AvgPool}(x)$, where $\lambda \in [0, 1]$ is either a hyperparameter, a learnable parameter, or randomly sampled anew every time.[11]
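A short sketch of this interpolation between max pooling and average pooling, treating λ as a fixed hyperparameter; the helper names are assumptions for this example.

```python
import numpy as np

def window_pool(x, reduce, f=2, s=2):
    """Apply a per-window reduction (np.max or np.mean) over f x f patches with stride s."""
    H, W, C = x.shape
    H_out, W_out = (H - f) // s + 1, (W - f) // s + 1
    y = np.empty((H_out, W_out, C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            y[i, j, :] = reduce(x[i * s:i * s + f, j * s:j * s + f, :], axis=(0, 1))
    return y

def mixed_pool2d(x, lam=0.5, f=2, s=2):
    # lam is either a hyperparameter, a learned scalar, or resampled anew on every call
    return lam * window_pool(x, np.max, f, s) + (1 - lam) * window_pool(x, np.mean, f, s)

x = np.random.rand(8, 8, 3)
print(mixed_pool2d(x, lam=0.3).shape)  # (4, 4, 3)
```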
In Vision Transformers (ViT), there are the following common kinds of pooling.
BERT-like pooling uses a dummy [CLS] token ("classification").
For classification, the output vector at the [CLS] position serves as the classification token; it is processed by a LayerNorm-feedforward-softmax module into a probability distribution over classes, which is the network's prediction.[15]
Later papers demonstrated that GAP and multihead attention pooling (MAP) both perform better than BERT-like pooling.[14][16]
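A minimal NumPy sketch contrasting BERT-like [CLS] pooling with GAP over the patch tokens; the shapes, the random weights, and the single linear layer standing in for the feed-forward block are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d_model, num_classes = 197, 768, 1000    # 196 patch tokens + 1 [CLS] (assumed sizes)
tokens = rng.normal(size=(num_tokens, d_model))      # encoder output, [CLS] at index 0

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W = rng.normal(size=(d_model, num_classes)) * 0.02   # classifier weights (assumed initialisation)
b = np.zeros(num_classes)

# BERT-like pooling: read out only the [CLS] token, then LayerNorm -> linear -> softmax.
cls_probs = softmax(layer_norm(tokens[0]) @ W + b)

# GAP: average all patch tokens instead of using the [CLS] token.
gap_probs = softmax(layer_norm(tokens[1:].mean(axis=0)) @ W + b)

print(cls_probs.shape, gap_probs.shape)  # (1000,) each; both sum to 1
```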
In graph neural networks (GNN), there are also two forms of pooling: global and local.
Global pooling (or readout) layers produce a fixed-size representation of the entire graph, typically by summing or averaging the node features; local pooling layers coarsen the graph via downsampling.
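For instance, a global readout can be as simple as a permutation-invariant reduction over all node features; a minimal sketch follows (the choice between sum and mean is an assumption of this example).

```python
import numpy as np

def global_readout(X, mode="mean"):
    # X: (N, F) node features of one graph -> (F,) graph-level representation,
    # invariant to the ordering of the N nodes.
    return X.sum(axis=0) if mode == "sum" else X.mean(axis=0)

X = np.random.rand(5, 16)        # 5 nodes, 16 features each
print(global_readout(X).shape)   # (16,), regardless of the number of nodes
```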
Several learnable local pooling strategies have been proposed.[19]
For each case, the input is the initial graph, represented by a matrix $\mathbf{X}$ of node features and the graph adjacency matrix $\mathbf{A}$; the output is a new matrix $\mathbf{X}'$ of node features and a new adjacency matrix $\mathbf{A}'$.
Top-k pooling first computes a projection score for every node, $\mathbf{y} = \mathbf{X}\mathbf{p} / \lVert\mathbf{p}\rVert$, where $\mathbf{p}$ is a learnable projection vector. The layer then keeps $\mathbf{i} = \mathrm{top}_k(\mathbf{y})$, the subset of nodes with the top-k highest projection scores, and sets $\mathbf{X}' = (\mathbf{X} \odot \mathrm{sigmoid}(\mathbf{y}))_{\mathbf{i}}$ and $\mathbf{A}' = \mathbf{A}_{\mathbf{i},\mathbf{i}}$, where $\odot$ denotes elementwise multiplication. In other words, the nodes with the top-k highest projection scores are retained in the new adjacency matrix $\mathbf{A}'$, and the sigmoid gate keeps the projection vector $\mathbf{p}$ trainable by backpropagation.
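A minimal NumPy sketch of this top-k pooling update; the variable names are assumptions, and in a real GNN the projection vector p would be trained by backpropagation rather than drawn at random.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def top_k_pool(X, A, p, k):
    """X: (N, F) node features, A: (N, N) adjacency, p: (F,) projection vector."""
    y = X @ p / np.linalg.norm(p)              # scalar projection score per node
    idx = np.argsort(y)[-k:]                   # indices of the k highest-scoring nodes
    X_new = X[idx] * sigmoid(y[idx])[:, None]  # gate the kept features; keeps p trainable
    A_new = A[np.ix_(idx, idx)]                # adjacency restricted to the kept nodes
    return X_new, A_new

rng = np.random.default_rng(1)
N, F, k = 6, 4, 3
X = rng.normal(size=(N, F))
A = (rng.random((N, N)) < 0.4).astype(float)
p = rng.normal(size=F)
X_new, A_new = top_k_pool(X, A, p, k)
print(X_new.shape, A_new.shape)  # (3, 4) (3, 3)
```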
Self-attention pooling instead computes the scores with a GNN layer, $\mathbf{y} = \mathrm{GNN}(\mathbf{X}, \mathbf{A})$, where $\mathrm{GNN}$ is a generic permutation-equivariant GNN layer (e.g., GCN, GAT, MPNN). As in top-k pooling, $\mathbf{i} = \mathrm{top}_k(\mathbf{y})$ is the subset of nodes with the top-k highest scores, and the layer sets $\mathbf{X}' = (\mathbf{X} \odot \mathbf{y})_{\mathbf{i}}$ and $\mathbf{A}' = \mathbf{A}_{\mathbf{i},\mathbf{i}}$. Unlike top-k pooling, the self-attention scores computed in self-attention pooling account for both the graph features and the graph topology.
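A sketch of self-attention pooling under the same conventions, using a single GCN-style layer as the generic GNN that scores the nodes; the tanh nonlinearity and the symmetric normalisation are illustrative assumptions.

```python
import numpy as np

def gcn_score(X, A, w):
    """One GCN-style layer producing a scalar score per node.

    Because the normalised adjacency mixes neighbouring features, each score
    depends on the graph topology as well as on the node's own features.
    """
    A_hat = A + np.eye(len(A))                 # add self-loops
    d = A_hat.sum(axis=1)
    A_hat = A_hat / np.sqrt(np.outer(d, d))    # symmetric normalisation
    return np.tanh(A_hat @ X @ w)              # (N,) attention scores

def self_attention_pool(X, A, w, k):
    y = gcn_score(X, A, w)
    idx = np.argsort(y)[-k:]                   # keep the k highest-scoring nodes
    X_new = X[idx] * y[idx][:, None]           # features gated by their attention score
    A_new = A[np.ix_(idx, idx)]
    return X_new, A_new

rng = np.random.default_rng(2)
N, F, k = 6, 4, 3
X = rng.normal(size=(N, F))
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.maximum(A, A.T)                         # make the toy graph undirected
X_new, A_new = self_attention_pool(X, A, rng.normal(size=F), k)
print(X_new.shape, A_new.shape)                # (3, 4) (3, 3)
```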
The convergence of many neurons onto a single downstream neuron in the visual system was given a functional explanation as "local pooling", which makes vision translation-invariant.
(Hartline, 1940)[20] gave supporting evidence for the theory with electrophysiological experiments on the receptive fields of retinal ganglion cells.
The Hubel and Wiesel experiments showed that the visual system in cats is similar to a convolutional neural network, with some cells summing over inputs from the lower layer.[21]
During the 1970s, to explain the effects of depth perception, researchers such as Julesz and Chang (1976)[23] proposed that the visual system implements a disparity-selective mechanism by global pooling, where the outputs from matching pairs of retinal regions in the two eyes are pooled in higher-order cells.
In artificial neural networks, max pooling was used in 1990 for speech processing (1-dimensional convolution).