In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
[1][2][3] The BoW model can be defined as the "histogram representation based on independent features".
A good descriptor should have the ability to handle intensity, rotation, scale and affine variations to some extent.
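To make the histogram representation concrete, here is a minimal numpy sketch of computing a BoW vector: each local descriptor is quantized to its nearest codeword in a (here hand-made, toy) codebook, and the occurrence counts form the image's feature vector. The descriptors and codebook below are illustrative only.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize each local descriptor to its nearest codeword and
    count occurrences, yielding the bag-of-visual-words vector."""
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # nearest visual word per patch
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # normalize to a distribution

# Toy example: 6 two-dimensional descriptors, 3 codewords.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.9],
                 [0.05, 0.1], [1.1, 0.0], [0.1, 1.1]])
print(bow_histogram(desc, codebook))
```

In practice the codebook is learned by clustering (typically k-means) over descriptors extracted from a training set, and real descriptors such as SIFT are 128-dimensional rather than 2-dimensional.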
Computer vision researchers have developed several learning methods to leverage the BoW model for image related tasks, such as object categorization.
The categorization decision is made by choosing the class that maximizes the posterior probability of the observed visual words. Since the Naive Bayes classifier is simple yet effective, it is usually used as a baseline method for comparison.
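A baseline of this kind can be sketched as a multinomial Naive Bayes classifier over BoW count vectors; the numpy implementation and toy data below are illustrative, not a reference implementation.

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Fit a multinomial Naive Bayes model on BoW count vectors X
    with class labels y, using Laplace smoothing alpha."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.stack([X[y == c].sum(0) + alpha for c in classes])
    log_lik = np.log(counts / counts.sum(1, keepdims=True))
    return classes, log_prior, log_lik

def predict_nb(model, X):
    """Pick the class maximizing the log-posterior of the word counts."""
    classes, log_prior, log_lik = model
    return classes[(X @ log_lik.T + log_prior).argmax(1)]

# Toy data: class 0 favours visual word 0, class 1 favours word 2.
X = np.array([[5, 1, 0], [4, 2, 1], [0, 1, 5], [1, 0, 4]])
y = np.array([0, 0, 1, 1])
model = train_nb(X, y)
print(predict_nb(model, np.array([[6, 1, 0], [0, 2, 6]])))
```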
The local feature approach of using BoW model representations learned by machine learning classifiers with different kernels (e.g., the EMD kernel and the χ² kernel)[12] has achieved very impressive results in the PASCAL Visual Object Classes Challenge.
The pyramid match kernel[13] is a fast kernel function (satisfying Mercer's condition) that maps BoW features, or sets of features in high dimension, to multi-dimensional multi-resolution histograms; it can be computed in time linear in the number of features, rather than the quadratic time of classic set-matching kernels.
The pyramid match kernel builds multi-resolution histograms by binning data points into discrete regions of increasing size.
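The binning-and-weighting scheme can be sketched for one-dimensional point sets: bin widths double at each level, intersections of the histograms count matches, and only matches newly formed at a coarser level contribute, with exponentially decreasing weight. This is a simplified illustration of the idea, not the authors' implementation.

```python
import numpy as np

def pyramid_match(x, y, levels=4, base=1.0):
    """Approximate pyramid match score for 1-D point sets x and y.
    Bin width doubles each level; new matches found at coarser
    levels receive exponentially smaller weights."""
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max()) + 1e-9
    prev, score = 0.0, 0.0
    for i in range(levels):
        width = base * (2 ** i)
        nbins = int(np.ceil((hi - lo) / width))
        edges = lo + width * np.arange(nbins + 1)
        hx, _ = np.histogram(x, bins=edges)
        hy, _ = np.histogram(y, bins=edges)
        inter = np.minimum(hx, hy).sum()    # histogram intersection
        score += (inter - prev) / (2 ** i)  # weight only the new matches
        prev = inter
    return score

a = np.array([0.2, 1.5, 3.7])
b = np.array([0.25, 2.9])
print(pyramid_match(a, a))  # a set matched against itself recovers all 3 points
```

Because only histogram operations are needed, the cost is linear in the number of points, which is the source of the speed-up over explicit pairwise matching.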
[13][14] One notorious disadvantage of BoW is that it ignores the spatial relationships among the patches, which are very important in image representation.
The hierarchical shape and appearance model for human action[18] introduces a new part layer (Constellation model) between the mixture proportion and the BoW features, which captures the spatial relationships among parts in the layer.
For discriminative models, spatial pyramid match[19] performs pyramid matching by partitioning the image into increasingly fine sub-regions and computing histograms of local features inside each sub-region.
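The partitioning step can be sketched as follows: given patch locations and their visual-word indices, a per-cell BoW histogram is computed for each grid cell at each level and the weighted histograms are concatenated, with coarser levels down-weighted. The grid weights follow the usual spatial pyramid scheme; the data here is synthetic.

```python
import numpy as np

def spatial_pyramid(points, words, n_words, levels=2, size=1.0):
    """Concatenate BoW histograms over increasingly fine grids.
    `points` are (x, y) patch locations in [0, size); `words` are
    their visual-word indices. Coarser levels get smaller weights."""
    feats = []
    for l in range(levels + 1):
        cells = 2 ** l
        # Level 0 and level 1 share the smallest weight; the finest
        # level gets weight 1/2, as in the spatial pyramid scheme.
        w = 1.0 / 2 ** (levels - l + 1) if l > 0 else 1.0 / 2 ** levels
        ix = np.minimum((points[:, 0] / size * cells).astype(int), cells - 1)
        iy = np.minimum((points[:, 1] / size * cells).astype(int), cells - 1)
        for cx in range(cells):
            for cy in range(cells):
                mask = (ix == cx) & (iy == cy)
                h = np.bincount(words[mask], minlength=n_words)
                feats.append(w * h)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
pts = rng.random((50, 2))          # 50 synthetic patch locations
words = rng.integers(0, 8, 50)     # their visual-word assignments
feat = spatial_pyramid(pts, words, n_words=8)
```

With two levels and a vocabulary of size V, the final vector has V·(1 + 4 + 16) = 21·V dimensions, so the spatial encoding comes at the cost of a much longer feature vector.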
[4] A systematic comparison of classification pipelines found that encoding first- and second-order statistics (the Vector of Locally Aggregated Descriptors (VLAD)[22] and the Fisher Vector (FV)) considerably increased classification accuracy compared to BoW, while also decreasing the codebook size and thus the computational effort of codebook generation.
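The difference from plain BoW can be illustrated with VLAD: instead of counting assignments, it aggregates, for each codeword, the residuals of the descriptors assigned to it. The numpy sketch below uses the commonly applied signed-square-root (power) and L2 normalizations; the toy codebook and descriptors are illustrative.

```python
import numpy as np

def vlad(descriptors, codebook):
    """VLAD encoding: for each codeword, sum the residuals of the
    descriptors assigned to it, flatten, then power- and
    L2-normalize (a common post-processing choice)."""
    k, d = codebook.shape
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)
    v = np.zeros((k, d))
    for i in range(k):
        sel = descriptors[assign == i]
        if len(sel):
            v[i] = (sel - codebook[i]).sum(0)  # residual aggregation
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))        # power normalization
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.9]])
v = vlad(desc, codebook)
```

Note that the encoding has k·d dimensions regardless of the number of descriptors, which is why a small codebook can suffice.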
[23] Moreover, a recent detailed comparison of coding and pooling methods[21] for BoW has shown that second-order statistics combined with sparse coding and an appropriate pooling such as power normalization can further outperform Fisher Vectors and even approach the results of simple Convolutional Neural Network models on some object recognition datasets, such as the Oxford Flower 102 dataset.
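Power normalization itself is a small operation; a common form is the signed square root followed by L2 normalization, sketched below (the exponent p = 0.5 is the usual choice, but other values appear in the literature):

```python
import numpy as np

def power_normalize(v, p=0.5):
    """Signed power normalization sign(v)*|v|**p, followed by L2
    normalization; p=0.5 gives the common signed square root."""
    v = np.sign(v) * np.abs(v) ** p
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

print(power_normalize(np.array([4.0, -9.0])))
```

Its effect is to dampen the influence of frequently occurring visual words, which tend to dominate raw pooled statistics.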