The example table below presents a fictional, non-anonymized database of patient records for a fictitious hospital.
The Name column is an identifier; Age, Gender, State of domicile, and Religion are quasi-identifiers; and Disease is a non-identifying sensitive value.[6]
The following example demonstrates a failing of k-anonymity: there may exist other data records that can be linked on the variables that are allegedly non-identifying.
For instance, suppose an attacker obtains the log kept by the person who was taking vital signs as part of the study and learns that Kishor was at the hospital on April 30 and is 180 cm tall.
This information can be used to link with the "anonymized" database (which may have been published on the Internet) and learn that Kishor has a heart-related disease.
In fact, all values are potentially identifying, depending on their prevalence in the population and on auxiliary data that the attacker may have.[1]
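As a rough sketch of such a linkage attack, the following Python fragment joins an attacker's auxiliary knowledge to a released table on columns that were judged non-identifying. The table contents, column names, and dates are illustrative assumptions for the sketch, not the article's example data.

```python
import pandas as pd

# Hypothetical "anonymized" release: names removed, but admission date and
# height retained because they were judged non-identifying (assumed data).
released = pd.DataFrame({
    "AdmissionDate": ["2019-04-30", "2019-04-30", "2019-05-02"],
    "HeightCm":      [180, 165, 180],
    "Disease":       ["Heart disease", "Viral infection", "Cancer"],
})

# Auxiliary knowledge the attacker obtained from the vital-signs log.
aux = {"AdmissionDate": "2019-04-30", "HeightCm": 180}

# Link the auxiliary record to the release on the supposedly harmless columns.
matches = released[
    (released["AdmissionDate"] == aux["AdmissionDate"])
    & (released["HeightCm"] == aux["HeightCm"])
]

# A single match re-identifies the individual and reveals the sensitive value.
if len(matches) == 1:
    print("Re-identified; sensitive value:", matches["Disease"].iloc[0])
```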
Meyerson and Williams (2004) demonstrated that optimal k-anonymity is an NP-hard problem; however, heuristic methods such as k-Optimize, as given by Bayardo and Agrawal (2005), often yield effective results.[9]
While k-anonymity is a relatively simple-to-implement approach for de-identifying a dataset prior to public release, it is susceptible to many attacks.
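The property itself is straightforward to test: every combination of quasi-identifier values must occur at least k times. The sketch below, with illustrative records and column names assumed for the example, checks this condition and then applies a crude age generalization; practical heuristics such as k-Optimize search for generalizations that lose far less information.

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """True if every combination of quasi-identifier values occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Assumed illustrative records; not the article's example table.
records = pd.DataFrame({
    "Age":     [29, 27, 22, 24],
    "Gender":  ["Female", "Female", "Male", "Male"],
    "State":   ["Tamil Nadu", "Tamil Nadu", "Kerala", "Kerala"],
    "Disease": ["Cancer", "Viral infection", "Heart disease", "Heart disease"],
})

print(is_k_anonymous(records, ["Age", "Gender", "State"], k=2))  # False: every age is unique

# A simple, non-optimal generalization: coarsen Age into decade bands.
records["Age"] = (records["Age"] // 10 * 10).astype(str) + "s"
print(is_k_anonymous(records, ["Age", "Gender", "State"], k=2))  # True after generalization
```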