Evaluation measures for an information retrieval (IR) system assess how well an index, search engine, or database returns results from a collection of resources that satisfy a user's query.
The most important factor in determining a system's effectiveness for users is the overall relevance of results retrieved in response to a query.[1]
The success of an IR system may be judged by a range of criteria including relevance, speed, user satisfaction, usability, efficiency and reliability.[2]
Evaluation measures may be categorised in various ways, including offline or online and user-based or system-based, and include methods such as observed user behaviour, test collections, precision and recall, and scores from prepared benchmark test sets.[3]
Measures are generally used in two settings: online experimentation, which assesses users' interactions with the search system, and offline evaluation, which measures the effectiveness of an information retrieval system on a static offline collection.
Indexing and classification methods to assist with information retrieval have a long history dating back to the earliest libraries and collections. However, systematic evaluation of their effectiveness began in earnest in the 1950s, with the rapid expansion in research production across military, government and education and the introduction of computerised catalogues.
At this time a number of different indexing, classification and cataloguing systems were in operation; they were expensive to produce and it was unclear which was the most effective.
Cyril Cleverdon's Cranfield experiments established a number of key aspects required for IR evaluation: a test collection, a set of queries and a set of pre-determined relevant items, which combined would determine precision and recall.
Cleverdon's approach formed a blueprint for the successful Text Retrieval Conference series that began in 1992.
Evaluation measures are used in studies of information behaviour, usability testing, business costs and efficiency assessments.
An online metric such as the zero result rate (the proportion of queries that return no results) either indicates a recall issue or that the information being searched for is not in the index.
Offline metrics are generally created from relevance judgment sessions where the judges score the quality of the search results.
Both binary (relevant/non-relevant) and multi-level (e.g., relevance from 0 to 5) scales can be used to score each document returned in response to a query.
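As an illustration, such judgments can be held in a simple mapping from query and document identifiers to a relevance grade, in the style of TREC "qrels" files; the query and document names below are hypothetical placeholders.

    # Hypothetical relevance judgments ("qrels") produced by human assessors.
    # Binary scale: 1 = relevant, 0 = non-relevant.
    binary_qrels = {
        "q1": {"doc_a": 1, "doc_b": 0, "doc_c": 1},
    }

    # Multi-level (graded) scale, e.g. 0 (non-relevant) up to 5 (perfect).
    graded_qrels = {
        "q1": {"doc_a": 4, "doc_b": 0, "doc_c": 2},
    }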
Precision is the fraction of the documents retrieved that are relevant to the user's information need.
Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.
It is trivial to achieve recall of 100% by returning all documents in response to any query.
It is trivial to achieve a fall-out of 0% (fall-out being the fraction of non-relevant documents that are retrieved) by returning zero documents in response to any query.
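A minimal sketch of these set-based measures, assuming the retrieved results, the relevant documents and the full collection are available as Python sets (the document identifiers are illustrative):

    def precision(retrieved: set, relevant: set) -> float:
        """Fraction of retrieved documents that are relevant."""
        if not retrieved:
            return 0.0
        return len(retrieved & relevant) / len(retrieved)

    def recall(retrieved: set, relevant: set) -> float:
        """Fraction of relevant documents that are retrieved."""
        if not relevant:
            return 0.0
        return len(retrieved & relevant) / len(relevant)

    def fall_out(retrieved: set, relevant: set, collection: set) -> float:
        """Fraction of non-relevant documents that are retrieved."""
        non_relevant = collection - relevant
        if not non_relevant:
            return 0.0
        return len(retrieved & non_relevant) / len(non_relevant)

    # Illustrative data: retrieving everything would give recall 1.0,
    # and retrieving nothing would give fall-out 0.0, as noted above.
    collection = {"d1", "d2", "d3", "d4", "d5"}
    relevant = {"d1", "d3"}
    retrieved = {"d1", "d2"}
    print(precision(retrieved, relevant))             # 0.5
    print(recall(retrieved, relevant))                # 0.5
    print(fall_out(retrieved, relevant, collection))  # 1/3 ≈ 0.333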
The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score, is

    F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.

This is also known as the F_1 measure, because recall and precision are evenly weighted. Since the F-measure combines information from both precision and recall, it is a way to represent overall performance without presenting two numbers.
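A small worked sketch of the balanced F-score, together with the commonly used generalisation F_β, which weights recall β times as much as precision; the input values are illustrative:

    def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
        """Weighted harmonic mean of precision and recall.
        beta = 1 gives the balanced F1 score; beta > 1 favours recall."""
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    print(f_measure(0.5, 0.5))          # 0.5 (harmonic mean of equal values)
    print(f_measure(0.8, 0.2))          # 0.32, pulled toward the lower value
    print(f_measure(0.8, 0.2, beta=2))  # ≈ 0.235, recall weighted more heavily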
Precision and recall are single-value metrics based on the whole list of documents returned by the system. For systems that return a ranked sequence of documents, it is desirable to also consider the order in which the returned documents are presented. By computing precision and recall at every position in the ranked sequence of documents, one can plot a precision-recall curve, plotting precision p(r) as a function of recall r. Average precision computes the average value of p(r) over the interval from r = 0 to r = 1:

    \operatorname{AveP} = \int_0^1 p(r)\,dr.

This integral is in practice replaced with a finite sum over every position in the ranked sequence of documents:

    \operatorname{AveP} = \sum_{k=1}^{n} P(k)\,\Delta r(k),

where k is the rank in the sequence of retrieved documents, n is the number of retrieved documents, P(k) is the precision at cut-off k in the list, and \Delta r(k) is the change in recall from items k-1 to k.[9][10]
For example, the PASCAL Visual Object Classes challenge (a benchmark for computer vision object detection) until 2010[11] computed the average precision by averaging the precision over a set of eleven evenly spaced recall levels {0, 0.1, 0.2, ..., 1.0}:[9][10]

    \operatorname{AP} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} p_{\mathrm{interp}}(r),

where p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r}) is an interpolated precision, the maximum precision measured at any recall level greater than or equal to r.
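The finite-sum form of average precision can be computed directly from a ranked result list and the set of relevant documents, as in the sketch below (the ranking and judgments are illustrative):

    def average_precision(ranked: list, relevant: set) -> float:
        """AveP as the sum of P(k) * delta_recall(k) over ranks k, which
        equals the mean of P(k) over the ranks of relevant documents."""
        if not relevant:
            return 0.0
        hits = 0
        total = 0.0
        for k, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                total += hits / k  # P(k) at each rank where recall increases
        return total / len(relevant)

    ranked = ["d3", "d7", "d1", "d9", "d4"]
    relevant = {"d3", "d1", "d4"}
    # P(1)=1.0 at d3, P(3)=2/3 at d1, P(5)=3/5 at d4
    # -> AveP = (1.0 + 0.667 + 0.6) / 3 ≈ 0.756
    print(average_precision(ranked, relevant))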
The precision-recall curve p(r) can also be approximated by assuming a particular parametric distribution for the underlying decision values. For example, a binormal precision-recall curve can be obtained by assuming decision values in both classes to follow a Gaussian distribution.[13]
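A sketch of one such binormal curve under assumed parameters: decision values for relevant documents are taken as Gaussian with mean mu1 and standard deviation sigma1, those for non-relevant documents as Gaussian with mean mu0 and standard deviation sigma0, and pi is the proportion of relevant documents in the collection. All parameter values are hypothetical, and this is only an illustration of the idea, not the exact construction from the cited work.

    import numpy as np
    from scipy.stats import norm

    def binormal_precision(recall, mu1=1.0, sigma1=1.0, mu0=0.0, sigma0=1.0, pi=0.1):
        """Precision as a function of recall when both classes' decision
        values are Gaussian (hypothetical parameters)."""
        # Score threshold that yields the requested recall on the relevant class.
        t = mu1 + sigma1 * norm.ppf(1.0 - recall)
        # Fall-out: fraction of non-relevant documents exceeding the threshold.
        fall_out = 1.0 - norm.cdf((t - mu0) / sigma0)
        return pi * recall / (pi * recall + (1.0 - pi) * fall_out)

    r = np.linspace(0.01, 0.99, 5)
    print(np.round(binormal_precision(r), 3))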
For modern (web-scale) information retrieval, recall is no longer a meaningful metric, as many queries have thousands of relevant documents, and few users will be interested in reading all of them.
Precision at k documents (P@k) is still a useful metric (e.g., P@10 or "Precision at 10" corresponds to the proportion of relevant results among the top 10 retrieved documents), but fails to take into account the positions of the relevant documents among the top k.[14] Another shortcoming is that on a query with fewer relevant results than k, even a perfect system will have a score less than 1.[15]
P@k is easier to score manually than measures requiring complete relevance judgments, since only the top k results need to be examined to determine whether they are relevant or not.
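A minimal sketch of precision at k, illustrating the shortcoming mentioned above: with only 3 relevant documents in total, even a system that ranks all of them at the top cannot reach 1.0 for P@10 (document names are illustrative).

    def precision_at_k(ranked: list, relevant: set, k: int) -> float:
        """Fraction of the top-k retrieved documents that are relevant."""
        top_k = ranked[:k]
        return sum(1 for doc in top_k if doc in relevant) / k

    # A "perfect" ranking for a query with only 3 relevant documents:
    ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
    relevant = {"d1", "d2", "d3"}
    print(precision_at_k(ranked, relevant, 10))  # 0.3, not 1.0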
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized as the graded relevance value is reduced logarithmically proportional to the position of the result.
To this end, it sorts the documents of a result list by relevance, producing the maximum possible DCG at position p, also called the ideal DCG (IDCG) at that position. The normalized discounted cumulative gain is then computed as

    \mathrm{nDCG}_p = \frac{\mathrm{DCG}_p}{\mathrm{IDCG}_p}.
All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.
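A sketch of DCG and nDCG using the common log2(i + 1) position discount; the graded judgments below are illustrative, and other discount variants of DCG exist.

    import math

    def dcg(relevances: list) -> float:
        """Discounted cumulative gain with a log2(i + 1) position discount."""
        return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

    def ndcg(relevances: list) -> float:
        """DCG normalised by the ideal DCG of the same judgments."""
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Graded relevance (0-5) of the documents in ranked order for one query.
    ranking = [3, 2, 3, 0, 1]
    print(round(ndcg(ranking), 3))  # ≈ 0.972 for this illustrative ranking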