Brown clustering

Brown clustering is a method for clustering words proposed by Brown, Vincent Della Pietra, Peter V. de Souza, Jennifer Lai, and Robert Mercer.[1]

The method, which is based on bigram language models,[2] is typically applied to text, grouping words into clusters that are assumed to be semantically related because they occur in similar contexts.

Brown clustering was introduced by Brown, Vincent Della Pietra, Peter de Souza, Jennifer Lai, and Robert Mercer of IBM in the context of language modeling.

[5] Jurafsky and Martin give the example of a flight reservation system that needs to estimate the likelihood of the bigram "to Shanghai", without having seen this in a training set.
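The effect of word classes on this example can be illustrated with a small sketch. It assumes a class-based bigram model of the form P(w2 | w1) = P(c2 | c1) * P(w2 | c2), where c1 and c2 are the classes of w1 and w2. The toy corpus, the hand-assigned `cluster_of` mapping, and the function names are all invented for illustration; in practice the classes would come from the clustering itself rather than being written by hand.

```python
from collections import Counter

# Hypothetical toy training set; the bigram "to shanghai" never occurs in it.
corpus = [
    "flights to london".split(),
    "flights to paris".split(),
    "flights from shanghai".split(),
    "flights from london".split(),
]

# Hand-assigned classes (an assumption for this sketch, not learned here).
cluster_of = {
    "flights": "NOUN",
    "to": "PREP",
    "from": "PREP",
    "london": "CITY",
    "paris": "CITY",
    "shanghai": "CITY",
}


def bigrams(sentence):
    """Yield adjacent word pairs from one tokenized sentence."""
    return zip(sentence, sentence[1:])


# Word-level counts.
word_bigrams = Counter(bg for s in corpus for bg in bigrams(s))
word_context = Counter(w1 for s in corpus for w1, _ in bigrams(s))
word_counts = Counter(w for s in corpus for w in s)

# Class-level counts, obtained by mapping each word to its class.
class_bigrams = Counter(
    (cluster_of[a], cluster_of[b]) for s in corpus for a, b in bigrams(s)
)
class_context = Counter(cluster_of[a] for s in corpus for a, _ in bigrams(s))
class_counts = Counter(cluster_of[w] for s in corpus for w in s)


def mle_bigram_prob(prev, word):
    """Unsmoothed word-level estimate of P(word | prev)."""
    if word_context[prev] == 0:
        return 0.0
    return word_bigrams[(prev, word)] / word_context[prev]


def class_bigram_prob(prev, word):
    """Class-based estimate: P(class(word) | class(prev)) * P(word | class(word))."""
    c_prev, c_word = cluster_of[prev], cluster_of[word]
    if class_context[c_prev] == 0:
        return 0.0
    transition = class_bigrams[(c_prev, c_word)] / class_context[c_prev]
    emission = word_counts[word] / class_counts[c_word]
    return transition * emission


print(mle_bigram_prob("to", "shanghai"))    # word-level estimate
print(class_bigram_prob("to", "shanghai"))  # class-based estimate
```

On this toy data the word-level estimate is zero because "to shanghai" was never observed, while the class-based estimate is non-zero because other city names were seen after "to" and "shanghai" shares their class.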

As a result, the output can be thought of not only as a binary tree[6] but perhaps more helpfully as a sequence of merges, terminating with one big class of all words.
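The correspondence between the two views can be shown with a minimal sketch that replays a sequence of merges and recovers the binary tree it implies. The merge list below is hypothetical and written by hand; the sketch does not compute which classes should be merged, and the names `replay_merges` and `leaf_paths` are illustrative. Reading each word's position as a bit string of left and right branches is one common convention for presenting such a tree, assumed here.

```python
def replay_merges(words, merges):
    """Replay a merge sequence and return the binary tree it implies.

    A leaf is a word (str); an internal node is a (left, right) tuple.
    Each merge is (left_id, right_id, new_id).
    """
    # Start with every word in its own class, keyed by the word itself.
    clusters = {w: w for w in words}
    for left, right, merged in merges:
        clusters[merged] = (clusters.pop(left), clusters.pop(right))
    # The sequence should terminate with one class containing all words.
    if len(clusters) != 1:
        raise ValueError("merge sequence did not end in a single class")
    (tree,) = clusters.values()
    return tree


def leaf_paths(tree, prefix=""):
    """Map each word to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(tree, str):
        return {tree: prefix}
    left, right = tree
    paths = leaf_paths(left, prefix + "0")
    paths.update(leaf_paths(right, prefix + "1"))
    return paths


words = ["london", "paris", "to", "from"]
# Hypothetical merge order, chosen by hand for illustration.
merges = [
    ("london", "paris", "c1"),
    ("to", "from", "c2"),
    ("c1", "c2", "c3"),
]

tree = replay_merges(words, merges)
paths = leaf_paths(tree)
print(tree)
print(paths)
```

Stopping the replay before the final merge leaves several classes instead of one, which is why the merge sequence can be the more convenient view when a fixed number of classes is wanted.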