Ranking of query results is one of the fundamental problems in information retrieval (IR),[1] the scientific and engineering discipline behind search engines.
Ranking in information retrieval is an important concept in computer science and is used in many different applications, such as search engine queries and recommender systems.[3] A majority of search engines use ranking algorithms to provide users with accurate and relevant results.[4] The notion of page rank dates back to the 1940s, and the idea originated in the field of economics.
Jon Kleinberg, a computer scientist at Cornell University, developed an almost identical approach to PageRank, called Hypertext Induced Topic Search (HITS), which treated web pages as "hubs" and "authorities".[7] All of the above methods are somewhat similar, in that they all exploit the structure of links between pages and require an iterative computation.
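As a concrete illustration of this iterative, link-based computation, the following is a minimal sketch of a HITS-style hub/authority update in Python; the toy link graph, iteration count, and function name are illustrative assumptions rather than a definitive implementation.

```python
def hits(graph, iterations=50):
    """Sketch of HITS: graph maps each page to the list of pages it links to."""
    pages = list(graph)
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        auths = {p: sum(hubs[q] for q in pages if p in graph[q]) for p in pages}
        norm = sum(v * v for v in auths.values()) ** 0.5 or 1.0
        auths = {p: v / norm for p, v in auths.items()}
        # Hub score: sum of authority scores of pages the page links to.
        hubs = {p: sum(auths[q] for q in graph[p]) for p in pages}
        norm = sum(v * v for v in hubs.values()) ** 0.5 or 1.0
        hubs = {p: v / norm for p, v in hubs.items()}
    return hubs, auths

# Example: a tiny web graph (illustrative only).
hubs, auths = hits({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```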
The Boolean model (BIR) is a simple baseline query model that follows the underlying principles of Boolean algebra: each query is expressed as an algebraic expression of terms and operators, and documents are not retrieved unless they completely match the query.
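A minimal sketch of Boolean retrieval over an inverted index is shown below; the toy documents and the query "retrieval AND ranking" are assumptions made purely for illustration.

```python
docs = {
    1: "information retrieval and ranking",
    2: "ranking of web pages",
    3: "probabilistic models of retrieval",
}

# Build an inverted index: term -> set of documents containing that term.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# The Boolean query "retrieval AND ranking" is evaluated with set operations;
# only documents that completely satisfy the expression are returned, unranked.
result = index.get("retrieval", set()) & index.get("ranking", set())
print(sorted(result))  # -> [1]
```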
The similarity score between a query and a document can be found by computing the cosine of the angle between the query weight vector and the document weight vector, i.e., their cosine similarity.
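For example, the score could be computed as sketched below, assuming the query and document are already represented as term-weight vectors (the tf-idf-style weights here are illustrative):

```python
import math

def cosine_similarity(query_vec, doc_vec):
    # Dot product over terms, treating missing terms as weight 0.
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)

query = {"ranking": 0.7, "retrieval": 0.7}
document = {"ranking": 0.5, "retrieval": 0.3, "pages": 0.2}
print(cosine_similarity(query, document))
```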
The probability model of information retrieval was introduced by Maron and Kuhns in 1960 and further developed by Robertson and other researchers.
According to Spärck Jones and Willett (1997): "The rationale for introducing probabilistic concepts is obvious: IR systems deal with natural language, and this is far too imprecise to enable a system to state with certainty which document will be relevant to a particular query."
The “event” in this context of information retrieval refers to a document being relevant to a given query.
The model adopts various methods to determine the probability of relevance between queries and documents.
According to Gerard Salton and Michael J. McGill, the essence of this model is that if estimates for the probability of occurrence of various terms in relevant documents can be calculated, then the probabilities that a document will be retrieved, given that it is relevant, or that it is not, can be estimated.
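A hedged sketch of this idea follows, in the style of a binary-independence scorer: p[t] stands for an estimate of the probability that term t occurs in a relevant document, and q[t] for the probability that it occurs in a non-relevant one. The probability estimates, the example document, and the helper name bim_score are hypothetical.

```python
import math

def bim_score(doc_terms, query_terms, p, q):
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            # Log-odds of observing term t in a relevant versus a
            # non-relevant document.
            score += math.log((p[t] * (1 - q[t])) / (q[t] * (1 - p[t])))
    return score

p = {"ranking": 0.8, "retrieval": 0.6}   # P(term | relevant), illustrative
q = {"ranking": 0.3, "retrieval": 0.4}   # P(term | non-relevant), illustrative
print(bim_score({"ranking", "retrieval", "pages"}, ["ranking", "retrieval"], p, q))
```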
For each such set of retrieved documents (for example, the top-k results at successive cutoff ranks k), precision and recall values can be computed and plotted to give a precision-recall curve.
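For instance, the points of such a curve can be computed from a ranked result list and a set of relevance judgments, as in the sketch below; the documents and judgments are illustrative assumptions.

```python
ranked = ["d3", "d1", "d7", "d2", "d5"]   # system ranking (illustrative)
relevant = {"d1", "d2", "d4"}             # judged relevant documents

points = []
for k in range(1, len(ranked) + 1):
    retrieved = set(ranked[:k])           # the set retrieved at cutoff k
    hits = len(retrieved & relevant)
    precision = hits / k                  # fraction of retrieved docs that are relevant
    recall = hits / len(relevant)         # fraction of relevant docs that are retrieved
    points.append((recall, precision))

print(points)  # each (recall, precision) pair is one point on the curve
```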
If P is the precision and R is the recall, then the F-score is given by:

F = 2PR / (P + R)

The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.
It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process.[16][17]
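A minimal power-iteration sketch of this process is given below, assuming a toy link graph and the commonly used damping factor of 0.85; both are illustrative choices, not values taken from the text.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Sketch of PageRank: graph maps each page to the list of pages it links to."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}    # start with a uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for p in graph:
            # Probability mass flowing into p from the pages that link to it.
            incoming = sum(rank[q] / len(graph[q]) for q in graph if p in graph[q])
            new_rank[p] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```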
Accounting for multiple objectives when constructing the final item ranking results in a time-intensive optimization problem,[18][19] and substantial research effort has focused on speeding up this optimization so that the latency of obtaining the ranking, as perceived by the user, is kept in check.