Boolean model of information retrieval

The (standard) Boolean model of information retrieval (BIR)[1] is a classical information retrieval (IR) model and, at the same time, the first and most widely adopted one.[2]

The BIR is based on Boolean logic and classical set theory in that both the documents to be searched and the user's query are conceived as sets of terms (a bag-of-words model).

Retrieval is based on whether the documents contain the query terms and whether they satisfy the Boolean conditions described by the query.

An index term is a word or expression, which may be stemmed, describing or characterizing a document, such as a keyword given for a journal article.

A document is represented as a series of words or short phrases (index terms); an index term may itself consist of several words.

Index terms should generally be words that carry more meaning and correspond to what the content of an article or document is about.

Therefore, rarer terms such as "Bayesian" are better choices for inclusion in the set of index terms.

This relates to the notion of entropy in information theory.
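The link between rarity and information can be illustrated with a short calculation. The sketch below is only an illustration: the function name `information_content`, the collection size and the document counts are hypothetical, and self-information is used here as one simple information-theoretic measure rather than as a method prescribed by the BIR.

```python
import math

def information_content(doc_freq, num_docs):
    """Self-information, in bits, of a term that occurs in
    doc_freq out of num_docs documents."""
    return -math.log2(doc_freq / num_docs)

# Hypothetical counts for a collection of 1,000 documents.
num_docs = 1000
doc_freq = {"the": 990, "Bayesian": 12}

for term, df in doc_freq.items():
    bits = information_content(df, num_docs)
    print(f"{term}: {bits:.2f} bits")
```

Under these assumed counts the rare term carries far more bits than the common one, which is why it discriminates better between documents.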

There are multiple types of operations that can be applied to index terms used in queries to make them more generic and more relevant.

A query is a selection of index terms that are combined using Boolean operators to form a set of conditions.

We seek the set of documents that satisfy the query.

This operation is called retrieval and consists of two steps: first, for each condition in the query, find the set of documents that satisfy it; second, combine these sets using set union (for OR) and set intersection (for AND), as the query dictates.

Let the set of original (real) documents be, for example, D1, D2 and D3, where

D1 = "Bayes' principle: The principle that, in estimating a parameter, one should initially assume that each possible value has equal probability (a uniform prior distribution)."

D2 = "Bayesian decision theory: A mathematical theory of decision-making which presumes utility and probability functions, and according to which the act to be chosen is the Bayes act, i.e. the one with highest subjective expected utility."

D3 = "Bayesian epistemology: A philosophical theory which holds that the epistemic status of a proposition (i.e. how well proven or well established it is) is best measured by a probability and that the proper way to revise this probability is given by Bayesian conditionalisation or similar procedures.

A Bayesian epistemologist would use probability to define, and explore the relationship between, concepts such as epistemic status, support or explanatory power."

Let the set of index terms be

$$T=\{t_{1}={\text{Bayes' principle}},t_{2}={\text{probability}},t_{3}={\text{decision-making}},t_{4}={\text{Bayesian epistemology}}\}$$

If more than one document has the same representation (the same subset of index terms), such documents are indistinguishable in the BIR (in other words, equivalent), so a query retrieves either all of them or none.
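The example can be worked through with ordinary set operations. The following is a minimal sketch, not a definitive implementation: the labels `D1` to `D3`, the helper `docs_with`, and the query "probability" AND "decision-making" are assumptions chosen for illustration, and each document is represented simply by the index terms that occur verbatim in its text.

```python
# Example texts; the labels D1..D3 are illustrative.
documents = {
    "D1": ("Bayes' principle: The principle that, in estimating a parameter, "
           "one should initially assume that each possible value has equal "
           "probability (a uniform prior distribution)."),
    "D2": ("Bayesian decision theory: A mathematical theory of decision-making "
           "which presumes utility and probability functions, and according to "
           "which the act to be chosen is the Bayes act, i.e. the one with "
           "highest subjective expected utility."),
    "D3": ("Bayesian epistemology: A philosophical theory which holds that the "
           "epistemic status of a proposition (i.e. how well proven or well "
           "established it is) is best measured by a probability and that the "
           "proper way to revise this probability is given by Bayesian "
           "conditionalisation or similar procedures. A Bayesian epistemologist "
           "would use probability to define, and explore the relationship "
           "between, concepts such as epistemic status, support or "
           "explanatory power."),
}
terms = ["Bayes' principle", "probability", "decision-making",
         "Bayesian epistemology"]

# Represent each document as the subset of index terms found in its text.
representation = {
    name: {t for t in terms if t in text}
    for name, text in documents.items()
}

def docs_with(term):
    """Step 1: the set of documents whose representation contains term."""
    return {name for name, doc_terms in representation.items()
            if term in doc_terms}

# Assumed query: "probability" AND "decision-making".
s1 = docs_with("probability")
s2 = docs_with("decision-making")
answer = s1 & s2  # Step 2: AND is intersection (OR would be union, |)

print(sorted(answer))  # prints ['D2']
```

All three documents contain "probability", but only the second contains "decision-making", so the intersection leaves that document alone.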

From a purely formal mathematical point of view, the BIR is straightforward.

From a practical point of view, however, several further problems must be solved that relate to algorithms and data structures, such as the choice of terms (manual selection, automatic selection, or both), stemming, hash tables, and inverted file structures.

When each document is represented by a hash table of its terms, the table grows and shrinks as terms are added and removed, so each document occupies much less space in memory than a fixed-length bit vector.

However, performance is slower because hash table operations are more complex than bit vector operations.

In the average case, the slowdown is modest compared with bit vectors, and the space usage is much more efficient.
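The trade-off between the two representations can be sketched as follows. This is an illustrative comparison only: the four-term vocabulary, the helper `to_bit_vector`, and the use of a Python integer as the bit vector and a Python `set` as the hash table are all assumptions made for the example.

```python
# Assumed vocabulary; each term is assigned a fixed bit position.
vocabulary = ["Bayes' principle", "probability", "decision-making",
              "Bayesian epistemology"]
position = {term: i for i, term in enumerate(vocabulary)}

def to_bit_vector(doc_terms):
    """Fixed-length representation: bit i is set when vocabulary[i] occurs."""
    bits = 0
    for term in doc_terms:
        bits |= 1 << position[term]
    return bits

doc_terms = {"probability", "decision-making"}
query_terms = {"probability"}

# Bit vector: one bit per vocabulary term, present or not, so the length
# depends on the vocabulary size rather than on the document.
doc_bits = to_bit_vector(doc_terms)
query_bits = to_bit_vector(query_terms)
matches_bits = (doc_bits & query_bits) == query_bits

# Hash set: stores only the terms that actually occur in the document.
matches_set = query_terms <= doc_terms

print(matches_bits, matches_set)
```

The bit vector answers a conjunctive query with a single bitwise AND, while the hash set performs one lookup per query term but stores nothing for absent terms.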

Each document can be summarized by a Bloom filter representing the set of words in that document, stored in a fixed-length bitstring, called a signature.

The signature file contains one such superimposed code bitstring for every document in the collection.

Each query can also be summarized by a Bloom filter representing the set of words in the query, stored in a bitstring of the same fixed length.
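One common way to use such signatures for a conjunctive query is to test whether every bit set in the query signature is also set in the document signature. The sketch below assumes a 64-bit signature, three hash functions derived from salted SHA-256 digests, and the hypothetical helper names `bit_positions`, `signature` and `may_contain_all`; none of these choices comes from the description above.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature length
NUM_HASHES = 3       # assumed number of hash functions per word

def bit_positions(word):
    """Derive NUM_HASHES bit positions for a word from salted SHA-256 digests."""
    positions = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
        positions.append(int.from_bytes(digest[:8], "big") % SIGNATURE_BITS)
    return positions

def signature(words):
    """Superimpose the bits of every word into one fixed-length bitstring."""
    sig = 0
    for word in words:
        for pos in bit_positions(word):
            sig |= 1 << pos
    return sig

def may_contain_all(doc_sig, query_sig):
    """True when every query bit is set in the document signature.
    False positives are possible; false negatives are not."""
    return (doc_sig & query_sig) == query_sig

doc_sig = signature({"probability", "decision-making", "utility"})
query_sig = signature({"probability", "decision-making"})
print(may_contain_all(doc_sig, query_sig))  # prints True
```

Because different words can set the same bits, a matching signature only marks the document as a candidate, which would then be checked against its actual terms.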

An inverted index file contains two parts: a vocabulary containing all the terms used in the collection, and for each distinct term an inverted index that lists every document that mentions that term.
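A minimal in-memory sketch of this structure is shown below. The function name `build_inverted_index`, the document labels and their term sets are assumptions made for illustration, and a real inverted file would be stored on disk rather than in a dictionary.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct term to the set of documents that mention it."""
    index = defaultdict(set)
    for doc_id, doc_terms in docs.items():
        for term in doc_terms:
            index[term].add(doc_id)
    return dict(index)

# Illustrative documents, each already reduced to its index terms.
docs = {
    "D1": {"Bayes' principle", "probability"},
    "D2": {"probability", "decision-making"},
    "D3": {"probability", "Bayesian epistemology"},
}

index = build_inverted_index(docs)
vocabulary = sorted(index)  # part 1: all terms used in the collection

# Part 2: per-term document lists; an AND query intersects them.
result = index.get("probability", set()) & index.get("decision-making", set())
print(sorted(result))  # prints ['D2']
```

With this layout a query touches only the lists of the terms it mentions, instead of scanning every document in the collection.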