Record linkage is necessary when joining different data sets based on entities that may or may not share a common identifier (e.g., database key, URI, national identification number); the absence of a shared identifier may be due to differences in record shape, storage location, or curator style or preference.
[4] Howard Borden Newcombe then laid the probabilistic foundations of modern record linkage theory in a 1959 article in Science.
[5] These were formalized in 1969 by Ivan Fellegi and Alan Sunter, in their pioneering work "A Theory for Record Linkage", where they proved that the probabilistic decision rule they described was optimal when the comparison attributes were conditionally independent.
[6] In their work they recognized the growing interest in applying advances in computing and automation to large collections of administrative data, and the Fellegi-Sunter theory remains the mathematical foundation for many record linkage applications.
[citation needed] On the other hand, machine learning or neural network algorithms that do not rely on these assumptions often achieve far higher accuracy when sufficient labeled training data is available.
Computer matching has the advantages of allowing central supervision of processing, better quality control, speed, consistency, and better reproducibility of results.
Many key identifiers for the same entity can be presented quite differently between (and even within) data sets, which can greatly complicate record linkage unless understood ahead of time.
Standardization can be accomplished through simple rule-based data transformations or more complex procedures such as lexicon-based tokenization and probabilistic hidden Markov models.
[9] Several of the packages listed in the Software Implementations section provide some of these features to simplify the process of data standardization.
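A simple rule-based standardizer can be sketched as follows; the lexicon entries and field conventions here are illustrative assumptions, not taken from any particular package:

```python
import re

# Minimal rule-based standardization: lowercase, tokenize, and expand
# common abbreviations via a lexicon. The lexicon below is a toy
# example; real packages ship much larger, domain-specific lexicons.
LEXICON = {"st": "street", "rd": "road", "ave": "avenue", "blvd": "boulevard"}

def standardize(raw):
    tokens = re.findall(r"[a-z0-9]+", raw.lower())   # tokenize, drop punctuation
    return " ".join(LEXICON.get(t, t) for t in tokens)

print(standardize("123 Main St."))   # -> "123 main street"
```

Standardizing both data sets with the same rules before comparison means that trivial formatting differences no longer defeat otherwise exact matches.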
Entity resolution engines then apply rules, based on common-sense logic, to identify hidden relationships across the data. These engines make automated decisions that affect business processes in real time, limiting the need for human intervention.
Handling exceptions such as missing identifiers involves the creation of additional record linkage rules.
One such rule in the case of a missing Social Security number (SSN) might be to compare name, date of birth, sex, and ZIP code with other records in hopes of finding a match. A further rule would then be needed to determine which differences in particular identifiers are acceptable (such as ZIP code, which may change when a person moves) and which are not (such as date of birth).
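A fallback rule of this kind can be sketched as below; the record format and field names are assumptions made for illustration:

```python
import re

def normalize(s):
    """Uppercase and strip non-alphanumerics so trivial formatting
    differences do not defeat an exact comparison."""
    return re.sub(r"[^A-Z0-9]", "", s.upper())

def rule_match(a, b):
    """If both records carry an SSN, compare it directly; otherwise fall
    back to requiring exact agreement on name, date of birth, and sex,
    while tolerating a differing ZIP code."""
    if a.get("ssn") and b.get("ssn"):
        return normalize(a["ssn"]) == normalize(b["ssn"])
    must_agree = ["name", "dob", "sex"]   # differences here are not acceptable
    return all(normalize(a[f]) == normalize(b[f]) for f in must_agree)

a = {"ssn": "", "name": "Ann Lee", "dob": "1980-02-03", "sex": "F", "zip": "60601"}
b = {"ssn": "", "name": "ANN LEE", "dob": "1980-02-03", "sex": "F", "zip": "60614"}
print(rule_match(a, b))   # True: ZIP differs but the required fields agree
```

Note how every tolerated or forbidden difference must be encoded explicitly, which is exactly why such rule sets grow brittle as the data change.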
One study was able to link the Social Security Death Master File with two hospital registries from the Midwestern United States using SSN, NYSIIS-encoded first name, birth month, and sex, but these rules may not work as well with data sets from other geographic regions or with data collected on younger populations.
New data that exhibit characteristics different from those initially expected could require a complete rebuilding of the record linkage rule set, which could be a very time-consuming and expensive endeavor.
Probabilistic record linkage, sometimes called fuzzy matching (also probabilistic merging or fuzzy merging in the context of merging of databases), takes a different approach to the record linkage problem by taking into account a wider range of potential identifiers, computing weights for each identifier based on its estimated ability to correctly identify a match or a non-match, and using these weights to calculate the probability that two given records refer to the same entity.
Many probabilistic record linkage algorithms assign match/non-match weights to identifiers by means of two probabilities called m and u: the m probability is the probability that an identifier agrees in a truly matching pair of records, while the u probability is the probability that it agrees in a non-matching pair purely by chance.
The resulting total weight is then compared to one or more thresholds to determine whether the pair should be linked, non-linked, or set aside for special consideration (e.g., manual validation).
[12] Determining where to set the match/non-match thresholds is a balancing act between obtaining an acceptable sensitivity (or recall, the proportion of truly matching records that are linked by the algorithm) and positive predictive value (or precision, the proportion of records linked by the algorithm that truly do match).
Various manual and automated methods are available to predict the best thresholds, and some record linkage software packages have built-in tools to help the user find the most acceptable values.
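The weighting scheme can be sketched as follows; the m/u values and thresholds below are illustrative assumptions, not estimates from real data:

```python
from math import log2

# Fellegi-Sunter-style scoring: each field contributes an agreement
# weight log2(m/u) or a disagreement weight log2((1-m)/(1-u)).
M_U = {                       # field: (m probability, u probability)
    "surname":    (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "zip":        (0.90, 0.05),
}
UPPER, LOWER = 6.0, 0.0       # link above UPPER, non-link below LOWER

def total_weight(rec_a, rec_b):
    """Sum the log2 agreement/disagreement weights over all identifiers."""
    w = 0.0
    for field, (m, u) in M_U.items():
        if rec_a[field] == rec_b[field]:
            w += log2(m / u)                 # agreement weight
        else:
            w += log2((1 - m) / (1 - u))     # disagreement weight
    return w

def decide(w):
    if w > UPPER:
        return "link"
    if w < LOWER:
        return "non-link"
    return "manual review"

a = {"surname": "smith", "birth_year": 1980, "zip": "60601"}
b = {"surname": "smith", "birth_year": 1980, "zip": "60601"}
print(decide(total_weight(a, b)))   # all fields agree -> "link"
```

Fields with a low u probability (rare chance agreement, such as surname) contribute large positive weights when they agree, which is how the scheme captures each identifier's discriminating power.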
Because comparing every possible pair of records can be very computationally demanding, particularly for large data sets, a technique known as blocking is often used to improve efficiency.
Blocking attempts to restrict comparisons to just those records for which one or more particularly discriminating identifiers agree, which has the effect of increasing the positive predictive value (precision) at the expense of sensitivity (recall).
Blocking based on birth month, a stable identifier that would be expected to change only in the case of data error, would provide a more modest gain in positive predictive value and a smaller loss in sensitivity than blocking on a highly discriminating field, but would create only twelve distinct groups, which, for extremely large data sets, may not provide much net improvement in computation speed.
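Blocking can be sketched as below; the records are fabricated and the choice of birth month as the blocking key follows the example in the text:

```python
from collections import defaultdict
from itertools import combinations

# Toy data set; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "ann", "dob": "1980-02-03"},
    {"id": 2, "name": "ann", "dob": "1981-02-11"},
    {"id": 3, "name": "bob", "dob": "1975-07-20"},
    {"id": 4, "name": "ann", "dob": "1980-02-03"},
]

def candidate_pairs(records, key):
    """Group records by the blocking key, then compare only pairs that
    fall within the same block."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    for block in blocks.values():
        yield from combinations(block, 2)

birth_month = lambda r: r["dob"][5:7]        # at most twelve blocks
pairs = list(candidate_pairs(records, birth_month))
print([(a["id"], b["id"]) for a, b in pairs])   # [(1, 2), (1, 4), (2, 4)]
```

The full cross-product over these four records contains six pairs; blocking on birth month cuts that to three, at the cost of never considering a pair whose birth months disagree (for example, due to a data entry error).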
[14][15] Higher accuracy can often be achieved by using various other machine learning techniques, including a single-layer perceptron,[7] random forest, and SVM.
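As one minimal illustration, a single-layer perceptron can be trained on field-agreement vectors (1 = the field agrees between two records, 0 = it disagrees); the training data below is fabricated and far smaller than any realistic corpus:

```python
# Toy single-layer perceptron for match/non-match classification of
# record pairs, represented as agreement vectors. Pure-Python sketch,
# not a production classifier.
def train_perceptron(data, epochs=20, lr=0.1):
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:                    # y: 1 = match, 0 = non-match
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                   # perceptron update rule
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Agreement on (surname, birth date, ZIP code), with match labels.
training = [([1, 1, 1], 1), ([1, 1, 0], 1), ([0, 0, 1], 0), ([0, 0, 0], 0)]
w, b = train_perceptron(training)
print(predict(w, b, [1, 1, 1]))   # -> 1 (classified as a match)
```

Unlike the hand-tuned m/u weights of the probabilistic approach, the weights here are learned from labeled examples, which is why such methods depend on sufficient training data.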
High-quality record linkage often requires a human–machine hybrid system to safely manage uncertainty in ever-changing streams of chaotic big data.
Interactive record linkage is defined as people iteratively fine tuning the results from the automated methods and managing the uncertainty and its propagation to subsequent analyses.
The techniques used in PPRL[24] must guarantee that no participating organisation, nor any external adversary, can compromise the privacy of the entities that are represented by records in the databases being linked.
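One widely studied family of PPRL techniques encodes identifiers as Bloom filters, so that parties exchange and compare only bit arrays rather than raw values; the sketch below uses illustrative parameters (filter size, number of hash functions) that would need careful tuning in practice:

```python
import hashlib

SIZE, K = 100, 2   # filter length and number of hash functions (assumptions)

def bigrams(s):
    """Split a padded, lowercased string into character bigrams."""
    s = f"_{s.lower()}_"
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bloom(s):
    """Encode a string's bigrams into a bit array via K hash functions."""
    bits = [0] * SIZE
    for g in bigrams(s):
        for k in range(K):
            h = hashlib.sha256(f"{k}:{g}".encode()).hexdigest()
            bits[int(h, 16) % SIZE] = 1
    return bits

def dice(a, b):
    """Dice coefficient of two bit arrays: similar names yield similar
    filters without revealing the underlying values."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

sim = dice(bloom("catherine"), bloom("katherine"))
print(f"catherine vs katherine: {sim:.2f}")   # similar names -> high similarity
```

Basic Bloom-filter encodings of this kind are known to be vulnerable to frequency attacks, which is why PPRL protocols add hardening measures and formal privacy analyses on top of the encoding itself.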
For example, fetal and infant mortality is a general indicator of a country's socioeconomic development, public health, and maternal and child services.
Tracing is often needed for follow-up of industrial cohorts, clinical trials, and longitudinal surveys to obtain the cause of death and/or cancer incidence.