Truth discovery

Several algorithms have been proposed to tackle this problem, ranging from simple methods like majority voting to more complex ones able to estimate the trustworthiness of data sources.

In the first case, only one true value is allowed for a data item (e.g., the birthday of a person or the capital city of a country).

This, together with our increasing reliance on data to make important decisions, motivates the need to develop good truth discovery algorithms.

[5] Many currently available methods rely on a voting strategy to define the true value of a data item.
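A voting strategy can be illustrated with a minimal sketch of majority voting. The data layout (a dictionary from data item to the value claimed by each source), the function name `majority_vote`, and the sample claims are assumptions made for this example only.

```python
from collections import Counter

def majority_vote(claims):
    """Pick, for each data item, the value claimed by the most sources.

    claims: dict mapping data item -> dict mapping source -> claimed value
    (a hypothetical layout chosen for this sketch).
    """
    truths = {}
    for item, by_source in claims.items():
        counts = Counter(by_source.values())
        # most_common(1) returns a list with the single (value, count) pair
        truths[item] = counts.most_common(1)[0][0]
    return truths

# Hypothetical claims from three sources about two data items.
sample_claims = {
    "capital_of_australia": {"s1": "Canberra", "s2": "Sydney", "s3": "Sydney"},
    "capital_of_france": {"s1": "Paris", "s2": "Paris", "s3": "Lyon"},
}

voted = majority_vote(sample_claims)
print(voted)
```

In this toy data the majority is right about France but wrong about Australia, because every source counts equally regardless of how reliable it is.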

Nevertheless, recent studies have shown that relying only on majority voting can produce wrong results for as many as 30% of the data items.

On the other hand, in the second case (second table), sources 2 and 3 are neither correct nor erroneous: they provide a subset of the true values and, at the same time, do not oppose the truth.
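One way to make the notion of a partially correct source concrete is to score each source by precision (how many of its claimed values are true) and recall (how many of the true values it claims). The sketch below is only illustrative: the author names, the source names, and the helper `precision_recall` are hypothetical and are not taken from the tables referred to above.

```python
def precision_recall(claimed, true_values):
    """Return (precision, recall) of a set of claimed values against the truth."""
    claimed, true_values = set(claimed), set(true_values)
    hits = claimed & true_values
    precision = len(hits) / len(claimed) if claimed else 0.0
    recall = len(hits) / len(true_values) if true_values else 0.0
    return precision, recall

# Hypothetical multi-valued data item: the authors of a book.
true_authors = {"Alice", "Bob", "Carol"}
author_claims = {
    "source1": {"Alice", "Bob", "Carol"},  # complete and correct
    "source2": {"Alice", "Bob"},           # subset of the truth
    "source3": {"Carol"},                  # subset of the truth
    "source4": {"Alice", "Dave"},          # contains a false value
}

scores = {s: precision_recall(v, true_authors) for s, v in author_claims.items()}
for source, (p, r) in scores.items():
    print(f"{source}: precision={p:.2f} recall={r:.2f}")
```

Here `source2` and `source3` have perfect precision but incomplete recall, which captures the idea of providing part of the truth without contradicting it, whereas `source4` loses precision because it asserts a false value.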

[1][3] The characteristics of the most relevant types of single-truth methods, and the ways in which different systems model source trustworthiness, are reported below.
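A common pattern for modelling source trustworthiness is to alternate between two steps: infer the true values with a vote weighted by the current trust scores, then re-estimate each source's trust as the fraction of its claims that agree with the inferred truths. The following is a simplified sketch of that loop, not a specific published algorithm. The log-odds weighting, the initial trust of 0.8, the clamping bounds, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def discover_truths(claims, iterations=10, initial_trust=0.8):
    """Alternate between weighted voting and trust re-estimation.

    claims: dict mapping data item -> dict mapping source -> claimed value.
    """
    sources = {s for by_source in claims.values() for s in by_source}
    trust = {s: initial_trust for s in sources}
    truths = {}
    for _ in range(iterations):
        # Step 1: each source votes with weight log(t / (1 - t)).
        for item, by_source in claims.items():
            scores = defaultdict(float)
            for source, value in by_source.items():
                # Clamp to avoid division by zero and infinite weights.
                t = min(max(trust[source], 0.01), 0.99)
                scores[value] += math.log(t / (1 - t))
            truths[item] = max(scores, key=scores.get)
        # Step 2: trust = share of a source's claims matching current truths.
        for source in sources:
            made = [(item, by_source[source])
                    for item, by_source in claims.items() if source in by_source]
            correct = sum(1 for item, value in made if truths[item] == value)
            trust[source] = correct / len(made)
    return truths, trust

# Toy data: the real value of every item is assumed to be 1.
# s2 and s3 each make their own mistakes and share one wrong value on "d".
toy_claims = {
    "a": {"s1": 1, "s2": 1, "s3": 1},
    "b": {"s1": 1, "s2": 2, "s3": 1},
    "c": {"s1": 1, "s2": 1, "s3": 3},
    "d": {"s1": 1, "s2": 9, "s3": 9},
    "e": {"s1": 1, "s2": 4, "s3": 1},
    "f": {"s1": 1, "s2": 1, "s3": 5},
}

truths, trust = discover_truths(toy_claims)
print(truths)
print(trust)
```

Plain majority voting would pick 9 for item "d". In this sketch the first round does the same, but once s2 and s3 are found to be less reliable on the other items, their combined weight drops below that of s1 and the value for "d" flips to 1.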

[7][10] Other more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness.

[2] More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness.

[2][3] These methods use probabilistic graphical models to automatically determine the set of true values of a given data item and to assess source quality without any supervision.

Typical domains of application include healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction, and knowledge base construction.

[1] Truth discovery algorithms could also be used to change the way web pages are ranked in search engines, moving from current methods based on link analysis, such as PageRank, to procedures that rank web pages by the accuracy of the information they provide.