Search engine indexing

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval.

Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science.

Popular search engines focus on the full-text indexing of online, natural language documents.

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.

Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power.
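
The following minimal sketch (in Python, with a toy three-document corpus chosen purely for illustration) contrasts the two approaches: answering a query by scanning every document touches the whole corpus, while an inverted index reduces the query to a single lookup.

```python
from collections import defaultdict

# A toy corpus; real collections hold millions of documents.
corpus = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick brown dogs and lazy foxes",
}

def scan_search(term):
    """Without an index: inspect every document for every query."""
    return [doc_id for doc_id, text in corpus.items() if term in text.split()]

# With an index: tokenize once, then answer queries with a single lookup.
inverted_index = defaultdict(set)
for doc_id, text in corpus.items():
    for token in text.split():
        inverted_index[token].add(doc_id)

def index_search(term):
    return sorted(inverted_index.get(term, set()))

print(scan_search("quick"))   # [1, 3]
print(index_search("quick"))  # [1, 3]
```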

A major challenge in the design of search engines is the management of serial computing processes.

Distributing the index across several machines operating in unison increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.
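
For example, a newly crawled document must be added to the index while the index simultaneously continues to answer queries. The sketch below illustrates this collision under the deliberately simplified assumption of a single in-memory index guarded by one lock; real distributed engines rely on far more elaborate synchronization and replication schemes.

```python
import threading
from collections import defaultdict

# A single shared index and one lock: an assumption for illustration only.
index = defaultdict(set)
index_lock = threading.Lock()

def add_document(doc_id, text):
    """Update task: index a newly added document."""
    tokens = text.split()
    with index_lock:              # writers block readers while updating
        for token in tokens:
            index[token].add(doc_id)

def search(term):
    """Query task: must keep responding while updates happen."""
    with index_lock:              # readers see a consistent snapshot
        return sorted(index.get(term, set()))

add_document(1, "distributed parallel architecture")
print(search("parallel"))  # [1]
```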

To reduce computer storage memory requirements, the inverted index is stored differently from a two-dimensional array.

The index is similar to the term-document matrices employed by latent semantic analysis.
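
A rough sketch of the difference, assuming a toy corpus and whitespace tokenization: the dense term-document matrix allocates a cell for every term-document pair, most of which are zero, whereas the equivalent postings lists record only the documents in which each term actually occurs.

```python
documents = ["brown fox", "lazy dog", "brown dog"]
vocabulary = sorted({t for d in documents for t in d.split()})

# Dense term-document matrix: one cell per (term, document) pair,
# almost all of them zero for a realistic vocabulary and corpus.
matrix = [[1 if term in doc.split() else 0 for doc in documents]
          for term in vocabulary]

# Sparse postings: store only the documents in which a term occurs.
postings = {term: [i for i, doc in enumerate(documents) if term in doc.split()]
            for term in vocabulary}

print(vocabulary)   # ['brown', 'dog', 'fox', 'lazy']
print(matrix)       # [[1, 0, 1], [0, 1, 1], [1, 0, 0], [0, 1, 0]]
print(postings)     # {'brown': [0, 2], 'dog': [1, 2], 'fox': [0], 'lazy': [1]}
```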

In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time.
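
As an illustration only, the sketch below stores the term dictionary in a simple binary search tree: the per-node child pointers are the additional storage, and a lookup descends the tree instead of scanning the whole vocabulary.

```python
class TermNode:
    """One node of a binary search tree keyed on the term string."""
    def __init__(self, term, postings):
        self.term = term
        self.postings = postings      # document identifiers for this term
        self.left = None
        self.right = None

def insert(node, term, postings):
    if node is None:
        return TermNode(term, postings)
    if term < node.term:
        node.left = insert(node.left, term, postings)
    elif term > node.term:
        node.right = insert(node.right, term, postings)
    else:
        node.postings |= postings     # merge postings for an existing term
    return node

def lookup(node, term):
    while node is not None and node.term != term:
        node = node.left if term < node.term else node.right
    return node.postings if node else set()

root = None
for term, docs in [("fox", {1}), ("brown", {1, 3}), ("lazy", {2, 3})]:
    root = insert(root, term, docs)

print(lookup(root, "brown"))  # {1, 3}
```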

The delineation between the forward index, which stores the words per document as they are parsed, and the inverted index enables asynchronous system processing, which partially circumvents the inverted index update bottleneck.
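
A minimal sketch of that division of labour, with made-up document identifiers and terms: the forward index is filled while documents are parsed, and a separate step later inverts it into term-to-document postings.

```python
from collections import defaultdict

# Forward index: produced while documents are parsed (document -> terms).
forward_index = {
    1: ["quick", "brown", "fox"],
    2: ["lazy", "dog"],
    3: ["quick", "dog"],
}

# Inversion can run asynchronously, after parsing has finished, which is
# what lets the two stages be decoupled from one another.
def invert(forward):
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for term in forward[doc_id]:
            inverted[term].append(doc_id)
    return dict(inverted)

print(invert(forward_index))
# {'quick': [1, 3], 'brown': [1], 'fox': [1], 'lazy': [2], 'dog': [2, 3]}
```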

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge.

This space requirement may be even larger for a fault-tolerant distributed storage architecture.

Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching.

Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences.

Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token.

Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
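
As a hedged illustration, the tokenizer below records a handful of the attributes listed above (normalized form, case, token position, character offset, and length); the exact attribute set and the regular expression used to split words are assumptions for the example, not a description of any particular engine.

```python
import re

def tokenize(text):
    """Record a few of the per-token attributes an indexer might keep."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group()
        if word.isupper():
            case = "upper"
        elif word.islower():
            case = "lower"
        elif word[0].isupper() and word[1:].islower():
            case = "proper"
        else:
            case = "mixed"
        tokens.append({
            "token": word.lower(),    # normalized form used for lookup
            "case": case,
            "position": position,     # token offset within the document
            "offset": match.start(),  # character offset
            "length": len(word),
        })
    return tokens

for t in tokenize("Indexing HTML pages"):
    print(t)
```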

Many documents on the web contain side sections that are not about the primary material of the document; for example, articles on the Wikipedia website display a side menu with links to other web pages.

Some indexers like Google and Bing ensure that the search engine does not treat the text of such sections as a relevant source for the document.[22]
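
A possible approach, sketched here with Python's standard html.parser and the assumption that navigation content is wrapped in nav elements, is to skip such sections while collecting the text that will actually be tokenized.

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Collect page text while skipping navigation sections such as side menus."""
    SKIPPED = {"nav", "script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside skipped sections
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

page = "<nav><a href='/a'>Other article</a></nav><p>Main topic text.</p>"
extractor = MainTextExtractor()
extractor.feed(page)
print(extractor.chunks)  # ['Main topic text.']
```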

Meta tag indexing plays an important role in organizing and categorizing web content.

The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization.
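
As an illustrative sketch using Python's standard html.parser, the reader below pulls the keywords and description meta tags out of a page's head, showing how index terms can be obtained without tokenizing the body text at all.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Pull indexable fields from meta tags rather than from the body text."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content", "")

page = """<html><head>
<meta name="keywords" content="indexing, inverted index, search">
<meta name="description" content="How search engines index documents.">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(page)
print(reader.meta["keywords"])     # indexing, inverted index, search
print(reader.meta["description"])  # How search engines index documents.
```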

The fact that these keywords were subjectively specified led to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s.

Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information.

In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research into full-text indexing technologies.