Synthetic data

In many sensitive applications, datasets theoretically exist but cannot be released to the general public;[2] synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation.

A science article's abstract, quoted below, describes software that generates synthetic data for testing fraud detection systems.

[5] At the same time, synthetic data together with the testing approach can give the ability to model real data.

Scientific modelling of physical systems, which allows one to run simulations in which data points that have not been observed in actual reality can be estimated, computed, or generated, has a long history that runs concurrent with the history of physics itself.
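As a minimal illustration of this simulation idea, one can generate synthetic data points from an idealized physical model for inputs that were never actually measured. The example below (a hypothetical sketch, not drawn from the article's sources) uses the no-drag projectile range formula to synthesize (angle, range) pairs:

```python
import math

# Assumed constants for this toy example.
G = 9.81   # gravitational acceleration, m/s^2
V0 = 20.0  # hypothetical launch speed, m/s

def projectile_range(angle_deg):
    """Ideal (no-drag) horizontal range of a projectile launched at angle_deg."""
    theta = math.radians(angle_deg)
    return V0 * V0 * math.sin(2 * theta) / G

# Synthetic datapoints for launch angles that were never physically measured.
synthetic_points = [(a, projectile_range(a)) for a in range(5, 90, 5)]
```

Any number of such points can be generated at angles of interest, which is exactly the appeal of simulation: the model, not a measurement campaign, supplies the data.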

In the context of privacy-preserving statistical analysis, the idea of original fully synthetic data was proposed by Rubin in 1993.

He then released samples that did not include any actual long-form records, thereby preserving the anonymity of the households.

[10]: 173 In 1994, Fienberg came up with the idea of critical refinement, in which he used a parametric posterior predictive distribution (instead of a Bayesian bootstrap) to do the sampling.
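The parametric idea can be sketched with a deliberately simple toy model, assuming normal data with known variance and a flat prior on the mean; this is an illustration of posterior predictive sampling in general, not Fienberg's original procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical observed data: normal with unknown mean, known variance.
data = rng.normal(5.0, 1.0, size=50)
n = data.size
sigma2 = 1.0  # observation variance, assumed known

# With a flat prior on the mean, its posterior is N(xbar, sigma2 / n).
xbar = data.mean()

def posterior_predictive_sample(size, rng):
    """Parametric posterior predictive draw: first sample the mean from its
    posterior, then sample observations given that mean (no resampling of
    the observed records, unlike a Bayesian bootstrap)."""
    mu = rng.normal(xbar, np.sqrt(sigma2 / n), size=size)
    return rng.normal(mu, np.sqrt(sigma2))

synthetic = posterior_predictive_sample(1000, rng)
```

Because the synthetic records are drawn from the fitted parametric model rather than resampled from the originals, no actual observation is ever released.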

[7] Later, other important contributors to the development of synthetic data generation were Trivellore Raghunathan, Jerry Reiter, Donald Rubin, John M. Abowd, and Jim Woodcock.

[7] Researchers test the framework on synthetic data, which is "the only source of ground truth on which they can objectively assess the performance of their algorithms".

The next step is to generate more synthetic data from the fitted synthesizer or from this linear regression equation.
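A minimal sketch of this step, assuming a toy dataset and a least-squares line as the synthesizer (the data, coefficients, and noise level are illustrative, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "real" dataset: y depends linearly on x plus noise.
x_real = rng.uniform(0, 10, size=100)
y_real = 2.5 * x_real + 1.0 + rng.normal(0, 0.5, size=100)

# Fit the line (the "synthesizer"): least-squares slope and intercept.
slope, intercept = np.polyfit(x_real, y_real, deg=1)

# Match the synthetic noise level to the residual spread of the real data.
residual_std = np.std(y_real - (slope * x_real + intercept))

def synthesize(n, rng):
    """Draw n new synthetic (x, y) points from the fitted line model."""
    x = rng.uniform(0, 10, size=n)
    y = slope * x + intercept + rng.normal(0, residual_std, size=n)
    return x, y

x_syn, y_syn = synthesize(1000, rng)
```

The synthesizer can emit arbitrarily many points, and none of them coincide with the original records, which is what makes the approach attractive for privacy-sensitive release.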

Specific algorithms and generators are designed to create realistic data,[14] which then assists in teaching a system how to react to certain situations or criteria.

Using synthetic data reduces confidentiality and privacy issues since it holds no personal information and cannot be traced back to any individual.

Advances in generative adversarial networks (GANs) led to the natural idea that one can produce synthetic data and then use it for training.

[21] The cover features a mapping of over 18,000 synthetically generated data points prompted from ChatGPT on the categories of knowledge.