Model collapse

Model collapse[note 1] is a phenomenon in which machine learning models gradually degrade due to errors arising from uncurated training on the outputs of another model, including prior versions of the model itself.

Shumailov et al.[9] coined the term and described two stages of degradation: early model collapse and late model collapse.

In early model collapse, the model begins losing information about the tails of the distribution – mostly affecting minority data.

Later work highlighted that early model collapse is hard to notice, since overall performance may appear to improve, while the model loses performance on minority data.[13]
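One way to picture early collapse is to treat each generation's "model" as simply the empirical distribution of the previous generation's data, so that training on model output amounts to bootstrap resampling. The following sketch is an illustration in that spirit rather than an experiment from the cited work; all settings are arbitrary.

```python
import numpy as np

# Toy picture of early collapse: the "model" at each generation is the
# empirical distribution of the previous generation's data, so producing the
# next generation's data is bootstrap resampling. The number of distinct
# surviving values shrinks, and the extreme tail can only contract.
rng = np.random.default_rng(5)

N = 10_000
data = rng.standard_normal(N)          # original data, standard normal

for generation in range(1, 21):
    data = rng.choice(data, size=N, replace=True)   # next generation's data
    if generation % 5 == 0:
        print(f"generation {generation:2d}: "
              f"distinct values = {np.unique(data).size:5d}, "
              f"max |x| = {np.abs(data).max():.2f}")
```

Because sparse tail values have few copies, they are the first to stop being resampled, while the bulk of the distribution looks largely unchanged.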

In late model collapse, the model loses a significant proportion of its performance, confusing concepts and losing most of its variance.[9]

Importantly, this happens even in the simplest of models, where not all of the error sources are present.

In more complex models the errors often compound, leading to faster collapse.

Some researchers and commentators on model collapse warn that the phenomenon could fundamentally threaten future generative AI development: as AI-generated data is shared on the Internet, it will inevitably end up in future training datasets, which are often crawled from the Internet.

If training on "slop" (large quantities of unlabeled synthetic data) inevitably leads to model collapse, this could therefore pose a difficult problem.

However, other researchers have recently disagreed with this argument, showing that if synthetic data accumulates alongside human-generated data, model collapse is avoided.[17]

The researchers argue that data accumulating over time is a more realistic description of reality than deleting all existing data every year, and that the real-world impact of model collapse may not be as catastrophic as feared.[18]

An alternative branch of the literature investigates the use of machine learning detectors and watermarking to identify model-generated data and filter it out.[19][20]

In 2024,[9] a first attempt was made at illustrating collapse for the simplest possible model: a one-dimensional normal distribution fitted using unbiased estimators of mean and variance, computed on samples from the previous generation.

To make this more precise, suppose that the original data follow a normal distribution, $X_j^0 \sim \mathcal{N}(\mu, \sigma^2)$, and that $M_i$ samples $X_j^i$ are available at generation $i$. The next-generation model is then estimated using the sample mean and the unbiased sample variance,

$$\mu_{i+1} = \frac{1}{M_i} \sum_j X_j^i, \qquad \sigma_{i+1}^2 = \frac{1}{M_i - 1} \sum_j \left(X_j^i - \mu_{i+1}\right)^2,$$

leading to a conditionally normal next-generation model,

$$X_j^{i+1} \mid \mu_{i+1}, \sigma_{i+1} \sim \mathcal{N}\!\left(\mu_{i+1}, \sigma_{i+1}^2\right).$$
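The iterative re-fitting described above can be sketched directly in a few lines of code; the sample size, generation count, and seed below are illustrative choices rather than values from the source.

```python
import numpy as np

# Minimal simulation of the iterative re-fitting described above: each
# generation draws M samples from the previous generation's fitted Gaussian,
# then re-estimates the mean and (unbiased) variance from those samples.
rng = np.random.default_rng(0)

MU0, SIGMA0 = 0.0, 1.0     # parameters of the original data distribution
M = 100                    # samples drawn at every generation
N_GENERATIONS = 200

mu, sigma = MU0, SIGMA0
for n in range(1, N_GENERATIONS + 1):
    samples = rng.normal(mu, sigma, size=M)   # data produced by generation n-1
    mu = samples.mean()                       # sample mean
    sigma = samples.std(ddof=1)               # square root of unbiased variance
    if n in (1, 10, 50, 200):
        print(f"generation {n:3d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```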

To continue the analysis, instead of writing the probability density function at each generation, the samples can be constructed explicitly in terms of independent random variables using Cochran's theorem. Conditional on generation $i$, the sample mean and sample variance are independent, with

$$\mu_{i+1} \sim \mathcal{N}\!\left(\mu_i, \frac{\sigma_i^2}{M_i}\right), \qquad (M_i - 1)\,\frac{\sigma_{i+1}^2}{\sigma_i^2} \sim \chi^2_{M_i - 1},$$

so each sample can be written as

$$X_j^{i+1} = \mu_i + \frac{\sigma_i}{\sqrt{M_i}}\, Z^{i+1} + \sigma_i \sqrt{\frac{S^{i+1}}{M_i - 1}}\, Z_j^{i+1},$$

where $Z^{i+1}$ and $Z_j^{i+1}$ are standard normal and $S^{i+1} \sim \chi^2_{M_i - 1}$, all mutually independent. The auxiliary variables of later generations are not independent of the earlier samples, but when a sample $X_j^{i+1}$ is considered on its own, the formula above provides all the information about the full distribution.
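As a sketch of this construction, the chain of fitted parameters can be sampled from independent normal and chi-squared variables alone, without drawing a data set at each generation; the settings below are again illustrative.

```python
import numpy as np

# The same parameter chain sampled via Cochran's theorem: conditional on
# generation i, the fitted mean is normal with variance sigma_i^2 / M, and
# (M - 1) * sigma_{i+1}^2 / sigma_i^2 is chi-squared with M - 1 degrees of
# freedom, independent of the mean. No per-generation data set is needed.
rng = np.random.default_rng(1)

MU0, SIGMA0 = 0.0, 1.0
M = 100
N_GENERATIONS = 200

mu, var = MU0, SIGMA0 ** 2
for n in range(N_GENERATIONS):
    mu = rng.normal(mu, np.sqrt(var / M))        # draw mu_{i+1} given generation i
    var = var * rng.chisquare(M - 1) / (M - 1)   # draw sigma_{i+1}^2 given generation i

print(f"after {N_GENERATIONS} generations: mu={mu:+.3f}, sigma={np.sqrt(var):.3f}")
```

In distribution this matches the direct re-fitting loop above, but it makes the random-walk structure of the fitted parameters explicit.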

To analyse the model collapse, we can first calculate the mean and variance of samples at generation $n$. Because each $\sigma_{i+1}^2$ is an unbiased estimate of $\sigma_i^2$, the construction above gives

$$\mathbb{E}\left[X_j^n\right] = \mu, \qquad \operatorname{Var}\left(X_j^n\right) = \sigma^2 \left(1 + \frac{1}{M_0} + \frac{1}{M_1} + \cdots + \frac{1}{M_{n-1}}\right).$$

The expected deviation of the fitted parameters $(\mu_n, \sigma_n)$ from the original $(\mu, \sigma)$ can also be found exactly in closed form, but the mean and variance of the square root of a gamma distribution are expressed in terms of gamma functions, making the result quite clunky. If the sample size is held constant across generations, $M_i = M$, the variance of the fitted mean grows linearly with the number of generations,

$$\operatorname{Var}\left(\mu_n\right) = \mathbb{E}\left[\left(\mu_n - \mu\right)^2\right] = \frac{n\,\sigma^2}{M}.$$

This is the same scaling as for a one-dimensional Gaussian random walk.
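A quick Monte Carlo check of this scaling, under the same kind of illustrative settings as above (the number of runs is arbitrary):

```python
import numpy as np

# Monte Carlo check of the scaling above: the variance of the fitted mean
# after n generations should be close to n * sigma^2 / M when the sample
# size M is held constant.
rng = np.random.default_rng(2)

SIGMA0, M, N_GENERATIONS, RUNS = 1.0, 100, 50, 2000

final_means = []
for _ in range(RUNS):
    mu, sigma = 0.0, SIGMA0
    for _ in range(N_GENERATIONS):
        samples = rng.normal(mu, sigma, size=M)
        mu, sigma = samples.mean(), samples.std(ddof=1)
    final_means.append(mu)

print(f"empirical Var(mu_n):   {np.var(final_means):.3f}")
print(f"predicted n*sigma^2/M: {N_GENERATIONS * SIGMA0**2 / M:.3f}")
```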

Due to errors from re-sampling the approximated distribution, each generation ends up corresponding to a new step in a random walk of model parameters.

For a constant sample size at each generation, the average distance from the starting point diverges, and in order for the end-distribution approximation to be accurate, or for the distance to stay finite, the sampling rate $M_i$ needs to grow superlinearly with the generation number, so that the sum $\sum_i 1/M_i$ converges.
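Using the variance expression above, the effect of the sampling schedule can be illustrated numerically; the quadratic schedule below is one example of superlinear growth, not a prescription from the source.

```python
import numpy as np

# Predicted standard deviation of the fitted mean after n generations,
# sigma * sqrt(sum_i 1/M_i), for a constant sample-size schedule versus a
# superlinearly (here quadratically) growing one.
sigma, M = 1.0, 100

for n in (10, 100, 1_000, 10_000):
    i = np.arange(n, dtype=float)
    constant = sigma * np.sqrt(np.sum(1.0 / (M * np.ones(n))))
    quadratic = sigma * np.sqrt(np.sum(1.0 / (M * (i + 1.0) ** 2)))
    print(f"n={n:6d}  constant M: {constant:6.3f}   M_i = M*(i+1)^2: {quadratic:6.3f}")
```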

Overall, this only shows us how far on average one ends up from the original distribution, but the process can only "terminate" if the estimated variance at a certain generation becomes small enough, effectively turning the distribution into a delta function.

This is shown to occur for a general Gaussian model[14] in the subsection below.
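The eventual narrowing of the fitted distribution can also be seen in the simple one-dimensional model; the following sketch uses a deliberately small sample size so the effect appears within a few hundred generations (all settings are illustrative).

```python
import numpy as np

# Track the fitted standard deviation over many generations. Its square is an
# unbiased estimate of the previous generation's variance, so its expectation
# stays at sigma^2, but individual runs drift toward zero and the fitted model
# narrows toward a point mass. A small sample size makes the effect fast.
rng = np.random.default_rng(3)

M = 10
N_GENERATIONS = 500

mu, sigma = 0.0, 1.0
for n in range(1, N_GENERATIONS + 1):
    samples = rng.normal(mu, sigma, size=M)
    mu, sigma = samples.mean(), samples.std(ddof=1)
    if n % 100 == 0:
        print(f"generation {n:3d}: sigma = {sigma:.3e}")
```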

Empirical investigation has confirmed this theoretical analysis.[21]

Furthermore, in the case of a multidimensional model with fully synthetic data, exact collapse can be shown.[14][9]

In the case of a linear regression model,[22][23] scaling laws and bounds on learning can be obtained.

In the case of a linear softmax classifier for next token prediction,[24] exact bounds on learning with even a partially synthetic dataset can be obtained.

In the context of large language models, research has found that training LLMs on predecessor-generated text (that is, on synthetic data produced by previous models) causes a consistent decrease in the lexical, syntactic, and semantic diversity of the model outputs through successive iterations, an effect that is especially pronounced for tasks demanding high levels of creativity.

Model collapse in generative models is reduced when data accumulates.
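The contrast between discarding and accumulating data can be sketched in the same one-dimensional Gaussian toy model used above; this is an illustrative simulation in the spirit of that argument, not code from the cited studies, and all settings are arbitrary.

```python
import numpy as np

# Compare two data regimes in the toy Gaussian model from above:
#  - "replace": each generation is fit only to samples from the previous fit
#  - "accumulate": each generation is fit to all data seen so far, including
#    the original (human-generated) samples
rng = np.random.default_rng(4)

M, N_GENERATIONS = 50, 300
original = rng.normal(0.0, 1.0, size=M)

# Replace: discard old data every generation.
mu_r, sigma_r = original.mean(), original.std(ddof=1)
for _ in range(N_GENERATIONS):
    synthetic = rng.normal(mu_r, sigma_r, size=M)
    mu_r, sigma_r = synthetic.mean(), synthetic.std(ddof=1)

# Accumulate: keep adding synthetic data to a growing pool.
pool = original.copy()
mu_a, sigma_a = pool.mean(), pool.std(ddof=1)
for _ in range(N_GENERATIONS):
    synthetic = rng.normal(mu_a, sigma_a, size=M)
    pool = np.concatenate([pool, synthetic])
    mu_a, sigma_a = pool.mean(), pool.std(ddof=1)

print(f"replace:    mu={mu_r:+.3f}, sigma={sigma_r:.3f}")
print(f"accumulate: mu={mu_a:+.3f}, sigma={sigma_a:.3f}")
```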