Document clustering

Document clustering has applications in automatic document organization, topic extraction, and fast information retrieval or filtering.

The first is the hierarchical family of algorithms, which includes single linkage, complete linkage, group average and Ward's method.

By aggregating or dividing, documents can be clustered into a hierarchical structure, which is suitable for browsing.
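The aggregating (bottom-up) variant can be sketched as follows. This is a minimal illustration, assuming documents have already been turned into numeric vectors and that Euclidean distance is an acceptable dissimilarity; the function name `agglomerative` and the toy vectors are hypothetical, and Ward's method is left out for brevity.

```python
from itertools import combinations
from math import dist

def agglomerative(points, n_clusters, linkage="single"):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    # Start with every point (document vector) in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # All pairwise distances between members of the two clusters.
        pairwise = [dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # distance of the closest pair
            return min(pairwise)
        if linkage == "complete":    # distance of the farthest pair
            return max(pairwise)
        if linkage == "average":     # group average
            return sum(pairwise) / len(pairwise)
        raise ValueError(f"unknown linkage: {linkage}")

    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations() guarantees a < b
    return [sorted(c) for c in clusters]

# Toy 2-D "document vectors" forming two tight groups.
docs = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0)]
result = agglomerative(docs, n_clusters=2, linkage="average")
print(result)
```

Recording the sequence of merges instead of stopping at a fixed number of clusters would yield the full hierarchy. The quadratic pair search is kept for readability and would not scale to large collections.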

Dimensionality reduction methods can be considered a subtype of soft clustering; for documents, these include latent semantic indexing (truncated singular value decomposition on term histograms)[2] and topic models.
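To illustrate the idea behind latent semantic indexing, the sketch below approximates a rank-2 truncated singular value decomposition of a small document-by-term count matrix. It assumes power iteration with deflation as the numerical method, chosen only because it needs no external library; the helper names and the toy counts are hypothetical, and a production system would typically rely on a linear algebra package.

```python
import math

def matvec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def truncated_svd(a, k, iterations=200):
    """Approximate the top-k singular triplets (sigma, u, v) of a."""
    residual = [row[:] for row in a]
    components = []
    for _ in range(k):
        at = transpose(residual)
        # Arbitrary non-zero starting vector.
        v = [1.0 / (j + 1) for j in range(len(residual[0]))]
        for _ in range(iterations):
            w = matvec(at, matvec(residual, v))  # w = A^T A v
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        av = matvec(residual, v)
        sigma = math.sqrt(sum(x * x for x in av))
        u = [x / sigma for x in av]
        components.append((sigma, u, v))
        # Deflate: subtract the rank-one part that was just found.
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return components

# Rows are documents, columns are counts of: cat, dog, pet, stock, market.
counts = [
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 0, 2, 1],
    [0, 0, 0, 1, 2],
]
components = truncated_svd(counts, k=2)
# Coordinates of each document in the 2-D latent space (sigma * u).
coords = [[sigma * u[i] for sigma, u, _ in components] for i in range(len(counts))]
for row in coords:
    print([round(x, 3) for x in row])
```

In this toy matrix the first two documents load only on the first latent dimension and the last two only on the second, which is the sense in which the reduced coordinates behave like soft cluster memberships.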

A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information.

Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories.

We can also avoid computing similar information repeatedly by reducing each token to its base form using stemming and lemmatization dictionaries.
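A rough sketch of this normalization step follows. The lemma dictionary and suffix list are hypothetical, hand-written stand-ins; this is not the Porter stemmer or any published algorithm, only an illustration of mapping several surface forms to one base form.

```python
# Hypothetical, hand-written lemma dictionary; real resources are far larger.
LEMMAS = {"mice": "mouse", "ran": "run", "better": "good"}

# Naive suffix list, checked in this order (an assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")

def normalize(token):
    """Map a token to a base form: dictionary lookup first, then suffix stripping."""
    token = token.lower()
    if token in LEMMAS:
        return LEMMAS[token]
    for suffix in SUFFIXES:
        # Keep at least three characters of stem to avoid over-stripping.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = ["Clusters", "clustered", "clustering", "mice", "ran"]
normalized = [normalize(t) for t in tokens]
print(normalized)
```

After this step the three inflected forms of "cluster" count as a single term, so their statistics are computed once rather than three times.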

For instance, common words such as "the" might not be very helpful for revealing the essential characteristics of a text.
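Filtering such words can be sketched in a few lines. The stop word set below is a small illustrative subset chosen for this example, not a standard list.

```python
# Illustrative subset only; practical stop word lists contain many more entries.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in the stop word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "The cat is in the garden".split()
filtered = remove_stop_words(tokens)
print(filtered)
```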

Computing term frequencies or tf-idf

After pre-processing the text data, we can then proceed to generate features.
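One common way to compute such features is sketched below. It assumes the variant in which term frequency is the count divided by document length and inverse document frequency is log(N / df), without smoothing; other weighting schemes exist, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            # tf = relative frequency in this document, idf = log(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = [
    ["cat", "sat", "mat"],
    ["cat", "chased", "dog"],
    ["stock", "market", "fell"],
]
weights = tf_idf(docs)
print(weights[0])
```

Here "cat" occurs in two of the three documents and therefore receives a lower weight than "sat", which occurs in only one. A term present in every document would receive a weight of zero under this variant.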

It is sometimes helpful to visualize the results by plotting the clusters in a low-dimensional (typically two-dimensional) space.
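As a dependency-free illustration, the sketch below draws already-projected 2-D points on a character grid, using each point's cluster label as its symbol. The function `ascii_scatter`, the grid size and the sample points are assumptions made for this example; in practice a plotting library would normally be used instead.

```python
def ascii_scatter(points, labels, width=20, height=8):
    """Render 2-D points on a character grid, one symbol per cluster label."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Fall back to a span of 1.0 when all points share a coordinate.
    x_min, x_span = min(xs), (max(xs) - min(xs)) or 1.0
    y_min, y_span = min(ys), (max(ys) - min(ys)) or 1.0
    grid = [[" "] * width for _ in range(height)]
    for (x, y), label in zip(points, labels):
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        # Flip vertically so that larger y values are drawn higher up.
        grid[height - 1 - row][col] = str(label)
    return "\n".join("".join(line) for line in grid)

points = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (5.0, 5.1), (5.1, 5.0)]
labels = [0, 0, 0, 1, 1]
plot = ascii_scatter(points, labels)
print(plot)
```

Points that fall into the same grid cell overwrite one another, so the picture shows where clusters lie rather than how many documents each contains.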