t-closeness

Because sensitive attributes can still be inferred from the distribution of values in l-diverse data, the t-closeness method was created as a refinement of l-diversity that additionally constrains the distribution of sensitive fields within each group.

The original paper[1] by Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian defines t-closeness as:

The t-closeness Principle: An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

Charu Aggarwal and Philip S. Yu further state in their book on privacy-preserving data mining[2] that, with this definition, the threshold t gives an upper bound on the difference between the distribution of sensitive attribute values within an anonymized group and the global distribution of those values.
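As a concrete illustration of the definition, the following Python sketch checks t-closeness for a categorical sensitive attribute. It assumes the special case of an equal ground distance between values, under which the Earth Mover's Distance used in the original paper reduces to half the L1 (variational) distance between distributions; the function names and data layout here are illustrative, not taken from the paper.

    from collections import Counter

    def distribution(values):
        """Empirical distribution of a list of attribute values."""
        counts = Counter(values)
        total = len(values)
        return {v: c / total for v, c in counts.items()}

    def emd_equal_distance(p, q):
        """EMD between two categorical distributions under an equal
        ground distance, which reduces to half the L1 distance."""
        support = set(p) | set(q)
        return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

    def has_t_closeness(table, classes, t):
        """Check whether every equivalence class is within distance t
        of the whole table's distribution of the sensitive attribute.

        table   -- sensitive-attribute values for all records
        classes -- lists of sensitive values, one per equivalence class
        t       -- closeness threshold
        """
        global_dist = distribution(table)
        return all(
            emd_equal_distance(distribution(cls), global_dist) <= t
            for cls in classes
        )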

They also state that, for numeric attributes, t-closeness anonymization is more effective than many other privacy-preserving data mining methods.
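For numeric (or otherwise ordered) attributes, the paper computes the Earth Mover's Distance with an ordered ground distance, where moving probability mass between the i-th and j-th smallest values costs |i - j|/(m - 1) for a domain of m values. A minimal sketch of that computation, assuming both distributions are given over the same sorted list of m values, might look like this:

    def emd_ordered(p, q):
        """EMD between two distributions over the same ordered domain of
        m values, with ground distance |i - j|/(m - 1). p and q are
        lists of probabilities aligned to the sorted domain values."""
        m = len(p)
        cumulative = 0.0
        total = 0.0
        for i in range(m - 1):
            # Net mass that must flow past position i toward higher values.
            cumulative += p[i] - q[i]
            total += abs(cumulative)
        return total / (m - 1)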

In real data sets, attribute values may be skewed or semantically similar, and either situation undermines the protection that l-diversity provides. When the overall distribution of a sensitive attribute is skewed, an equivalence class can satisfy l-diversity while still giving its members a far higher probability of having a sensitive value than the population as a whole.

Sensitive information may also leak through semantic closeness: while the l-diversity requirement ensures “diversity” of sensitive values in each group, it does not recognize that those values may be semantically similar. For example, an attacker could deduce that an individual has a stomach disease if the equivalence class containing that individual listed only three different stomach diseases.
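To make this concrete, the following hypothetical check (reusing the has_t_closeness sketch above, with invented data) shows an equivalence class that is 3-diverse yet fails a t-closeness threshold of 0.2, because its distribution is far from the table's overall distribution. Note that this simple variational check flags the class only because the distributions differ sharply; capturing semantic closeness in general requires a ground distance that reflects it, which the paper handles through the Earth Mover's Distance.

    # Hypothetical sensitive column for a 12-record table.
    table = (["flu"] * 6 + ["asthma"] * 3 +
             ["gastritis", "gastric ulcer", "stomach cancer"])

    # One equivalence class holding the three stomach-disease records:
    # it satisfies 3-diversity, yet its distribution is far from the
    # table's (variational distance 0.75).
    stomach_class = ["gastritis", "gastric ulcer", "stomach cancer"]

    print(has_t_closeness(table, [stomach_class], t=0.2))  # prints False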