Visual words, as used in image retrieval systems,[1] are small parts of an image that carry information about its features (such as color, shape, or texture) or about changes in the pixels, as captured by filtering or by low-level feature descriptors (such as SIFT or SURF).
Text-search engines can quickly find documents among hundreds or even millions of candidates by using a vector space model.[2]
The visual words approach pursues the same capability for images by treating them as textual documents.
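To make the document analogy concrete, here is a minimal sketch of how an image might be compared once it has been reduced to a list of visual word IDs. The image lists, the vocabulary size, and the helper names `term_vector` and `cosine_similarity` are hypothetical, and cosine similarity is only one common way to compare vectors in a vector space model.

```python
import math
from collections import Counter

def term_vector(word_ids, vocab_size):
    """Count how often each visual word ID occurs in one image."""
    counts = Counter(word_ids)
    return [counts.get(i, 0) for i in range(vocab_size)]

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical images, each already reduced to a list of visual word IDs.
VOCAB_SIZE = 5
image_a = [0, 0, 1, 3]
image_b = [0, 1, 1, 3]
image_c = [2, 2, 2, 4]

vec_a = term_vector(image_a, VOCAB_SIZE)
vec_b = term_vector(image_b, VOCAB_SIZE)
vec_c = term_vector(image_c, VOCAB_SIZE)

sim_ab = cosine_similarity(vec_a, vec_b)  # images sharing most of their words
sim_ac = cosine_similarity(vec_a, vec_c)  # images sharing no words
```

In this toy setup `image_a` and `image_b` share most of their words and score high, while `image_a` and `image_c` share none and score zero.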
Doing so requires mapping image features to a finite set of words, and several solutions exist. One is to divide the feature space into ranges, each with common characteristics, so that features falling in the same range are treated as the same word.
Nonetheless, this solution raises issues of its own, such as how to choose the division strategy and the size of each range in the feature space.
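The range-based idea and its sensitivity to range size can be illustrated with a small sketch. It assumes descriptors are tuples of values in a known interval and uses equal-width ranges per dimension; the function name `quantize` and its parameters are hypothetical, and real systems may divide the space quite differently.

```python
def quantize(descriptor, low, high, bins_per_dim):
    """Map a descriptor to a tuple of range indices, one per dimension.

    Descriptors that land in the same tuple are treated as the same word.
    """
    width = (high - low) / bins_per_dim
    cell = []
    for value in descriptor:
        index = int((value - low) // width)
        # Clamp so that values on or beyond the boundaries stay in range.
        index = min(max(index, 0), bins_per_dim - 1)
        cell.append(index)
    return tuple(cell)

# Two nearly identical (hypothetical) descriptors.
d1 = (0.24, 0.80)
d2 = (0.26, 0.80)

fine_1 = quantize(d1, 0.0, 1.0, bins_per_dim=4)    # ranges of width 0.25
fine_2 = quantize(d2, 0.0, 1.0, bins_per_dim=4)
coarse_1 = quantize(d1, 0.0, 1.0, bins_per_dim=2)  # ranges of width 0.5
coarse_2 = quantize(d2, 0.0, 1.0, bins_per_dim=2)
```

With four ranges per dimension the two descriptors straddle a boundary and become different words, while with two ranges they collapse into the same word, which is the kind of dependence on division strategy and range size described above.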
Another solution proposed by researchers is to use a clustering mechanism that groups features carrying common information into a finite number of terms.
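As a rough illustration of the clustering route, the sketch below uses a basic k-means, which is an assumption here since the text does not name a specific algorithm. Each resulting centroid plays the role of one visual word, and a descriptor is assigned the word of its nearest centroid. The names `build_vocabulary` and `nearest`, the toy 2-D descriptors, and the deterministic farthest-point initialisation are all choices made for this example only.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def build_vocabulary(descriptors, k, iterations=20):
    """Cluster descriptors with a basic k-means; each centroid is one visual word."""
    # Deterministic farthest-point initialisation (a simplification for this sketch).
    centroids = [descriptors[0]]
    while len(centroids) < k:
        farthest = max(descriptors, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(farthest)

    for _ in range(iterations):
        # Assignment step: group descriptors by their nearest centroid.
        groups = [[] for _ in range(k)]
        for p in descriptors:
            groups[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group.
        new_centroids = []
        for old, group in zip(centroids, groups):
            if group:
                new_centroids.append(tuple(sum(axis) / len(group) for axis in zip(*group)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Hypothetical 2-D descriptors forming two loose groups.
descriptors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
               (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]

vocabulary = build_vocabulary(descriptors, k=2)
words = [nearest(p, vocabulary) for p in descriptors]
```

Unlike fixed ranges, the word boundaries here follow the data: the number of terms is set by `k`, and the centroids settle wherever the descriptors concentrate.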