LSA assumes that words that are close in meaning will occur in similar pieces of text (the distributional hypothesis).
A matrix containing word counts per document (rows represent unique words and columns represent each document) is constructed from a large piece of text, and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns.[1]
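As a concrete illustration (a minimal sketch, not part of the original text; the library choice, toy corpus, and parameter values are arbitrary), the count matrix and the dimensionality reduction can be produced with scikit-learn:

    # Build a small term-document count matrix and reduce it with truncated SVD (LSA).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs are pets",
    ]

    # CountVectorizer yields a documents x terms matrix; LSA conventionally uses the
    # transpose (terms x documents), but the factorization is the same either way.
    counts = CountVectorizer().fit_transform(docs)

    svd = TruncatedSVD(n_components=2)       # keep 2 latent dimensions
    doc_vectors = svd.fit_transform(counts)  # low-dimensional representation of each document
    print(doc_vectors.shape)                 # (3, 2)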
An information retrieval technique using latent semantic structure was patented in 1988[2] by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter.
This is called a singular value decomposition (SVD): $X = U \Sigma V^{T}$. The matrix products giving us the term and document correlations then become $XX^{T} = U \Sigma \Sigma^{T} U^{T}$ and $X^{T}X = V \Sigma^{T} \Sigma V^{T}$. Since $\Sigma \Sigma^{T}$ and $\Sigma^{T} \Sigma$ are diagonal, $U$ must contain the eigenvectors of $XX^{T}$ and $V$ must contain the eigenvectors of $X^{T}X$.
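These identities can be checked numerically; the NumPy sketch below uses an arbitrary toy matrix and also shows the rank-k truncation used by LSA:

    # Numerical check of the SVD identities above, plus a rank-k truncation (toy data).
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(6, 4)).astype(float)   # toy term-document matrix

    U, s, Vt = np.linalg.svd(X, full_matrices=False)     # X = U S V^T
    S = np.diag(s)

    # X X^T = U S S^T U^T   and   X^T X = V S^T S V^T
    assert np.allclose(X @ X.T, U @ S @ S.T @ U.T)
    assert np.allclose(X.T @ X, Vt.T @ S.T @ S @ Vt)

    # Keeping only the k largest singular values gives the best rank-k approximation.
    k = 2
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]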
Pseudo-document and pseudo-term vectors can likewise be translated into the low-dimensional space. The new low-dimensional space can typically be used to compare documents (data clustering, document classification), to find similar documents across languages, and to find relations between terms. Synonymy (different words describing the same idea) and polysemy (the same word carrying multiple meanings) are fundamental problems in natural language processing. LSA has also been used to assist in performing prior art searches for patents.
In recent years progress has been made to reduce the computational complexity of SVD; for instance, by using a parallel ARPACK algorithm to perform parallel eigenvalue decomposition, it is possible to reduce the SVD computation cost while providing comparable prediction quality.
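For example, SciPy's sparse svds routine (which uses ARPACK by default) computes only the leading k singular triplets rather than the full decomposition; the cited speed-ups rely on a parallel ARPACK variant, so this single-machine sketch only illustrates the underlying idea, with matrix size, density, and k as arbitrary example values:

    # Compute only the k largest singular triplets of a sparse matrix with ARPACK.
    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    X = sparse_random(10000, 2000, density=0.001, format="csr", random_state=0)

    U, s, Vt = svds(X, k=100, solver="arpack")   # partial SVD: 100 singular values
    order = np.argsort(s)[::-1]                  # svds does not sort descending
    U, s, Vt = U[:, order], s[order], Vt[order, :]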
A deep neural network essentially builds a graphical model of the word-count vectors obtained from a large set of documents.
This way of extending the efficiency of hash-coding to approximate matching is much faster than locality sensitive hashing, which is the fastest current method.
Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
A key feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.[22]
LSI is also an application of correspondence analysis, a multivariate statistical technique developed by Jean-Paul Benzécri[23] in the early 1970s, to a contingency table built from word counts in documents.
The method, also called latent semantic analysis (LSA), uncovers the underlying latent semantic structure in the usage of words in a body of text, which can then be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches.
LSI helps overcome synonymy, one of the most problematic constraints of Boolean keyword queries and vector space models, by increasing recall.
In fact, several experiments have demonstrated a number of correlations between the way LSI and humans process and categorize text.
This enables LSI to elicit the semantic content of information written in any language without requiring the use of auxiliary structures, such as dictionaries and thesauri.[27]
LSI automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise (e.g., misspelled words, typographical errors, and unreadable characters).
Once a term-document matrix is constructed, local and global weighting functions can be applied to it to condition the data.
Empirical studies with LSI report that the Log and Entropy weighting functions work well, in practice, with many data sets.
In other words, each entry $a_{ij}$ of the matrix $A$ is computed as $a_{ij} = g_i \, \log(\mathrm{tf}_{ij} + 1)$, where the entropy global weight is $g_i = 1 + \sum_{j} \frac{p_{ij} \log p_{ij}}{\log n}$ with $p_{ij} = \mathrm{tf}_{ij} / \mathrm{gf}_i$; here $\mathrm{tf}_{ij}$ is the frequency of term $i$ in document $j$, $\mathrm{gf}_i$ is the total number of times term $i$ occurs in the whole collection, and $n$ is the number of documents. A rank-reduced singular value decomposition is performed on the matrix to determine patterns in the relationships between the terms and concepts contained in the text.
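A minimal NumPy sketch of this log-entropy weighting (the function and variable names are illustrative, not from the text):

    # Apply log-entropy weighting to a dense term-document count matrix (terms x documents).
    import numpy as np

    def log_entropy_weight(tf):
        tf = np.asarray(tf, dtype=float)
        n_docs = tf.shape[1]
        gf = tf.sum(axis=1, keepdims=True)                    # global term frequencies
        p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
        with np.errstate(divide="ignore", invalid="ignore"):
            plogp = np.where(p > 0, p * np.log(p), 0.0)
        g = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)  # entropy weight g_i
        return g * np.log(tf + 1.0)                           # a_ij = g_i * log(tf_ij + 1)

    A = log_entropy_weight([[2, 0, 1],
                            [1, 1, 0],
                            [0, 3, 1]])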
The SVD is then truncated to reduce the rank by keeping only the largest k ≪ r diagonal entries in the singular value matrix S, where k is typically on the order of 100 to 300 dimensions.
The SVD operation, along with this reduction, has the effect of preserving the most important semantic information in the text while reducing noise and other undesirable artifacts of the original space of A.
A drawback to computing vectors in this way, when adding new searchable documents, is that terms that were not known during the SVD phase for the original index are ignored.
These terms will have no impact on the global weights and learned correlations derived from the original collection of text.
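For illustration, here is a minimal sketch of this fold-in step; the construction $\hat{d} = \Sigma_k^{-1} U_k^{T} d$ is the fold-in commonly used with LSI (assumed here rather than quoted from the text), and any term outside the original vocabulary has no row in $U_k$, so it is simply dropped:

    # Folding a new document into an existing rank-k LSI space (illustrative toy data).
    import numpy as np

    vocab = {"cat": 0, "dog": 1, "mat": 2, "log": 3}    # vocabulary of the original index

    # Term-document matrix (terms x documents) of the original collection and its SVD.
    X = np.array([[2., 0., 1.],
                  [1., 1., 0.],
                  [0., 3., 1.],
                  [0., 1., 2.]])
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2
    U_k, s_k = U[:, :k], s[:k]

    def fold_in(tokens):
        """Map a new document (list of tokens) into the k-dimensional space."""
        d = np.zeros(len(vocab))
        for tok in tokens:
            if tok in vocab:                  # unknown terms are silently ignored
                d[vocab[tok]] += 1.0
        return np.diag(1.0 / s_k) @ U_k.T @ d  # d_hat = S_k^{-1} U_k^T d

    d_hat = fold_in(["cat", "mat", "quantum"])  # "quantum" was not in the original index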
It is generally acknowledged that the ability to work with text on a semantic basis is essential to modern information retrieval systems.
As a result, the use of LSI has significantly expanded in recent years as earlier challenges in scalability and performance have been overcome.
In eDiscovery, the ability to cluster, categorize, and search large collections of unstructured text on a conceptual basis is essential.
LSI requires relatively high computational performance and memory in comparison to other information retrieval techniques.
A fully scalable (unlimited number of documents, online training) implementation of LSI is contained in the open-source gensim software package.[55]
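An illustrative use of gensim's LSI model (the tiny corpus and the number of topics are arbitrary example values; the calls follow gensim's documented interface):

    # Train an LSI model with gensim; add_documents() supports online training later.
    from gensim import corpora, models

    texts = [["human", "computer", "interaction"],
             ["graph", "trees", "computer"],
             ["graph", "minors", "trees"]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    tfidf = models.TfidfModel(corpus)                    # optional term weighting
    lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

    print(lsi.print_topics(2))
    # lsi.add_documents(more_bow_docs)                   # fold in new documents online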
Checking the proportion of variance retained, similar to PCA or factor analysis, to determine the optimal dimensionality is not suitable for LSI.[56]
When LSI topics are used as features in supervised learning methods, one can use prediction error measurements to find the ideal dimensionality.
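A sketch of such a dimensionality search with scikit-learn (the candidate values, the classifier, and the placeholder inputs X_counts and y are illustrative assumptions, not prescribed by the text):

    # Choose the LSI dimensionality by cross-validated prediction quality
    # when the topic vectors are used as features for a supervised model.
    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def best_dimensionality(X_counts, y, candidates=(50, 100, 200, 300)):
        scores = {}
        for k in candidates:
            model = make_pipeline(TruncatedSVD(n_components=k),
                                  LogisticRegression(max_iter=1000))
            scores[k] = cross_val_score(model, X_counts, y, cv=5).mean()
        return max(scores, key=scores.get), scores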