Differential privacy

Differential privacy enables a data holder to share aggregate patterns of a group while limiting the information that is leaked about specific individuals.

For example, differentially private algorithms are used by some government agencies to publish demographic information or other statistical aggregates while ensuring confidentiality of survey responses, and by companies to collect information about user behavior while controlling what is visible even to internal analysts.

Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual's information was used in the computation.

Differential privacy is often discussed in the context of identifying individuals whose information may be in a database.

The 2006 article by Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith[3] introduced the concept of ε-differential privacy, a mathematical definition for the privacy loss associated with any data release drawn from a statistical database.

(Here, the term statistical database means a set of data that are collected under the pledge of confidentiality for the purpose of producing statistics that, by their production, do not compromise the privacy of those individuals who provided the data.)[4]

That is, the statistical functions run on the database should not be substantially affected by the removal, addition, or change of any individual in the data.[5]

The key insight of differential privacy is that as the query is made on the data of fewer and fewer people, more noise needs to be added to the query result to produce the same amount of privacy.
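As an illustrative sketch (not drawn from the cited sources; the function name and the bounds on the data are assumptions), consider releasing the average of values known to lie in [0, 1] with the Laplace mechanism discussed later in this article: the average over n records has sensitivity 1/n, so the noise scale 1/(nε) grows as the group shrinks.

```python
import numpy as np

def dp_mean(values, epsilon, lower=0.0, upper=1.0, rng=None):
    """Differentially private mean of values assumed to lie in [lower, upper].

    Changing one record changes the mean by at most (upper - lower) / n,
    so the Laplace noise scale is (upper - lower) / (n * epsilon):
    smaller groups receive proportionally more noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.clip(np.asarray(values, dtype=float), lower, upper)
    n = len(values)
    sensitivity = (upper - lower) / n
    return values.mean() + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
small = rng.random(10)      # 10 people
large = rng.random(10_000)  # 10,000 people
print(dp_mean(small, epsilon=0.5, rng=rng))  # noisy: noise scale 0.2
print(dp_mean(large, epsilon=0.5, rng=rng))  # nearly exact: noise scale 0.0002
```

With ε = 0.5 the noise scale is 0.2 for ten people but only 0.0002 for ten thousand, which is why statistics about small subgroups are much harder to report accurately under the same privacy guarantee.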

Let ε be a positive real number and let A be a randomized algorithm that takes a dataset as input (representing the actions of the trusted party holding the data). The algorithm A is said to provide ε-differential privacy if, for all datasets D₁ and D₂ that differ in the data of a single individual, and for all subsets S of the possible outputs of A, Pr[A(D₁) ∈ S] ≤ exp(ε) · Pr[A(D₂) ∈ S], where the probability is taken over the randomness used by the algorithm.
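The guarantee is often stated in a slightly more general (ε, δ) form, of which the case δ = 0 is the "pure" ε-differential privacy just defined:

```latex
% (epsilon, delta)-differential privacy: for all datasets D_1, D_2 differing in
% one individual's data and for all sets S of possible outputs,
\Pr\bigl[\mathcal{A}(D_1) \in S\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[\mathcal{A}(D_2) \in S\bigr] + \delta
```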

Differential privacy offers strong and robust guarantees that facilitate modular design and analysis of differentially private mechanisms due to its composability, robustness to post-processing, and graceful degradation in the presence of correlated data.

The definition gives a strong guarantee that presence or absence of an individual will not affect the final output of the algorithm significantly.

For example, suppose a malicious user (often termed an adversary) wants to find out whether a particular individual, say Chandler, has diabetes. One way to do so is to compare the results of two aggregate queries that differ only in whether Chandler's record is included; differential privacy defeats this because the randomized answers to the two queries are statistically nearly indistinguishable.
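The following minimal sketch (the table, names, and query are hypothetical, chosen only to illustrate the point) shows such a differencing attack on exact counts, and how the Laplace mechanism introduced below blunts it by randomizing each answer:

```python
import numpy as np

# Hypothetical medical table: (name, has_diabetes)
records = [("Monica", 0), ("Ross", 1), ("Rachel", 0), ("Joey", 1), ("Chandler", 1)]

def count_diabetes(rows):
    return sum(flag for _, flag in rows)

# Differencing attack on exact answers: two counting queries that differ only
# in whether Chandler's row is included reveal his status exactly.
with_chandler = count_diabetes(records)
without_chandler = count_diabetes([r for r in records if r[0] != "Chandler"])
print(with_chandler - without_chandler)  # 1 -> Chandler has diabetes

# With the Laplace mechanism (a counting query has sensitivity 1), each answer
# is randomized, so the difference between the two answers is no longer conclusive.
rng = np.random.default_rng(0)

def noisy_count(rows, epsilon):
    return count_diabetes(rows) + rng.laplace(scale=1.0 / epsilon)

print(noisy_count(records, 0.5)
      - noisy_count([r for r in records if r[0] != "Chandler"], 0.5))
```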

Another important property for modular use of differential privacy is robustness to post-processing: applying any transformation that does not access the sensitive data to the output of a differentially private mechanism cannot weaken its privacy guarantee.[3]

The property of composition permits modular construction and analysis of differentially private mechanisms[3] and motivates the concept of the privacy loss budget.

If all the components of a complex mechanism that access sensitive data are separately differentially private, then so is their combination, followed by arbitrary post-processing.
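A minimal sketch of how these properties are typically used in practice (the helper names and numbers are illustrative assumptions, not a standard API): under basic sequential composition the ε values of mechanisms that touch the sensitive data add up against a total budget, while post-processing of already-released values costs nothing.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_release(true_value, sensitivity, epsilon):
    """One epsilon-differentially private release of a numeric query result."""
    return true_value + rng.laplace(scale=sensitivity / epsilon)

data = [23, 45, 31, 52, 38]   # hypothetical sensitive ages, assumed to lie in [0, 120]
total_budget = 1.0            # total allowable privacy loss (epsilon)

# Two releases computed directly from the sensitive data: by basic sequential
# composition their epsilon values add, so the budget is split between them.
noisy_count = laplace_release(len(data), sensitivity=1, epsilon=0.5)
noisy_sum = laplace_release(sum(data), sensitivity=120, epsilon=0.5)
spent = 0.5 + 0.5             # equals total_budget; no further queries allowed

# Post-processing: computing with the released values alone (no further access
# to `data`) does not consume any additional privacy budget.
noisy_mean = noisy_sum / max(noisy_count, 1.0)
print(spent, noisy_count, noisy_sum, noisy_mean)
```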

Some differentially private mechanisms, like the Laplace mechanism described below, rely on adding controlled noise to the output of the function that we want to compute.

The noise is calibrated to the sensitivity of the function: the maximum amount, measured in the L1 norm, by which the function's output can change when the data of a single individual is added, removed, or changed. This notion can be generalized to other metric spaces (measures of distance), and must be for certain differentially private algorithms to work, including those that add noise from the Gaussian distribution (which requires the L2 norm) instead of the Laplace distribution.

There are techniques (described below) for constructing a differentially private algorithm for real-valued functions, with noise parameters that vary depending on the function's sensitivity.[3]
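As a concrete illustration, here is a minimal sketch of the Laplace mechanism for a real-valued query, assuming its L1 sensitivity is known; the function and parameter names are illustrative rather than taken from the original paper:

```python
import numpy as np

def laplace_mechanism(query, dataset, sensitivity, epsilon, rng=None):
    """Release query(dataset) plus noise drawn from Laplace(0, sensitivity / epsilon).

    `sensitivity` must bound, over all pairs of datasets differing in one
    individual's data, the largest possible change in query(dataset).
    """
    rng = np.random.default_rng() if rng is None else rng
    true_answer = query(dataset)
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query has sensitivity 1; with epsilon = 0.5 the noise
# scale is 1 / 0.5 = 2.
ages = [23, 45, 31, 52, 38]
print(laplace_mechanism(lambda d: sum(1 for a in d if a >= 40), ages,
                        sensitivity=1, epsilon=0.5,
                        rng=np.random.default_rng(0)))
```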

A simple example of such a mechanism, especially developed in the social sciences,[12] is to ask a person to answer the question "Do you own the attribute A?" according to the following procedure: flip a coin; if it comes up heads, flip it again (ignoring the outcome) and answer honestly; if it comes up tails, flip it again and answer "Yes" if heads and "No" if tails.

The confidentiality arises from the deniability of each individual response, yet the aggregated responses remain informative: people without the attribute answer "Yes" with probability one quarter and people with the attribute answer "Yes" with probability three quarters, so the expected proportion of "Yes" answers is 1/4 + p/2, where p is the true proportion of people with the attribute. Hence it is possible to estimate p. In particular, if the attribute A is synonymous with illegal behavior, then answering "Yes" is not incriminating, insofar as the person has a probability of a "Yes" response, whatever it may be.
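A minimal simulation of this coin-flipping scheme (randomized response) and of the estimator for p; the population size and true proportion used here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_response(truth: bool) -> bool:
    """First flip: heads -> answer honestly; tails -> answer with a second flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

# Simulate a survey in which the true proportion of attribute A is p = 0.3.
true_answers = rng.random(100_000) < 0.3
responses = np.array([randomized_response(t) for t in true_answers])

# P(yes) = 1/4 + p/2, so p can be estimated as 2 * (observed yes-rate - 1/4).
p_hat = 2 * (responses.mean() - 0.25)
print(p_hat)  # close to 0.3

# Each individual answer is ln(3)-differentially private, since
# P(yes | has A) / P(yes | lacks A) = (3/4) / (1/4) = 3 = e^epsilon.
```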

In 1977, Tore Dalenius formalized the mathematics of cell suppression.[15]

Dalenius was a Swedish statistician who contributed to statistical privacy through his 1977 paper, which articulated a key goal for statistical databases: a database should not reveal information about an individual that is not otherwise accessible.[4]

In 1979, Dorothy Denning, Peter J. Denning, and Mayer D. Schwartz formalized the concept of a tracker, an adversary that could learn the confidential contents of a statistical database by creating a series of targeted queries and remembering the results.

In 2003, Kobbi Nissim and Irit Dinur demonstrated that it is impossible to publish answers to arbitrary queries on a private statistical database without revealing some amount of private information, and that the entire information content of the database can be revealed by publishing the results of a surprisingly small number of random queries, far fewer than was implied by previous work.[18]

The general phenomenon is known as the Fundamental Law of Information Recovery, and its key insight, namely that in the most general case privacy cannot be protected without injecting some amount of noise, led to the development of differential privacy.

In 2006, Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith published an article[3] formalizing the amount of noise that needed to be added and proposing a generalized mechanism for doing so.

This paper also created the first formal definition of differential privacy.[20]

Since then, subsequent research has shown that there are many ways to produce very accurate statistics from a database while still ensuring high levels of privacy.[1]

To date there are over 12 real-world deployments of differential privacy.

There are several public purpose considerations regarding differential privacy that are important to consider, especially for policymakers and policy-focused audiences interested in the social opportunities and risks of the technology.[30]

Because differential privacy techniques are implemented on real computers, they are vulnerable to various attacks that cannot be compensated for solely in the mathematics of the techniques themselves.

In addition to standard defects of software artifacts that can be identified using testing or fuzzing, implementations of differentially private mechanisms may suffer from vulnerabilities of their own, such as leakage of private information through floating-point arithmetic or through timing side channels.

Figure: An informal definition of differential privacy.
Figure: A formal definition of ε-differential privacy, comparing a dataset D₁ without a given individual's private data and a dataset D₂ with it; this is "pure ε-differential privacy", meaning δ = 0.
Figure: The Laplace mechanism offering 0.5-differential privacy for a function with sensitivity 1.