In general, a deep learning model can be characterized by four parameters: model size, training dataset size, training cost, and the post-training error rate (e.g., the test set error rate).
In comparison, most other kinds of neural networks, such as transformer models, always use all their parameters during inference.
Larger training datasets are typically preferred, as they provide a richer and more diverse source of information from which the model can learn.
This can lead to improved generalization performance when the model is applied to new, unseen data.
The performance of a neural network model is evaluated based on its ability to accurately predict the output given some input data.
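For a classification task, for example, this is commonly summarized as the test-set error rate; a minimal sketch with made-up labels and predictions:

```python
import numpy as np

# Hypothetical ground-truth labels and model predictions on a held-out test set.
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 1])

# Post-training error rate: the fraction of test examples predicted incorrectly.
error_rate = np.mean(y_pred != y_true)
print(f"test error rate: {error_rate:.2f}")  # 2 of 6 wrong -> 0.33
```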
The 2017 paper[2] is a common reference point for neural scaling laws fitted by statistical analysis on experimental data.
Previous works before the 2000s, as cited in the paper, were either theoretical or orders of magnitude smaller in scale.
They also found that for a given architecture, the number of parameters necessary to reach the lowest levels of loss, given a fixed dataset size, grows like $N \propto D^{\beta}$ for some exponent $\beta$.
Among the settings studied was speech recognition with two hybrid architectures (LSTMs complemented by either CNNs or an attention decoder), each fitted with its own power-law exponent.
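Such exponents are typically obtained by fitting a straight line to the measured losses on log-log axes. A minimal sketch of that procedure, using made-up dataset sizes and loss values rather than figures from any paper:

```python
import numpy as np

# Hypothetical measurements: training set sizes D and the corresponding test losses L.
D = np.array([1e6, 3e6, 1e7, 3e7, 1e8])
L = np.array([4.1, 3.5, 2.9, 2.5, 2.1])

# Assuming a pure power law L(D) = c * D**(-alpha), a linear fit in
# log-log space recovers the scaling exponent alpha as the negative slope.
slope, intercept = np.polyfit(np.log(D), np.log(L), 1)
alpha, c = -slope, np.exp(intercept)
print(f"fitted power law: L(D) ~ {c:.2f} * D^(-{alpha:.3f})")
```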
The Chinchilla scaling law states that, for a given compute budget, to achieve the minimal pretraining loss for that budget, the number of model parameters ($N$) and the number of training tokens ($D$) should be scaled in approximately equal proportions.
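A back-of-the-envelope sketch of that allocation, assuming the common approximation of about 6ND FLOPs for training and the roughly 20-tokens-per-parameter ratio often quoted for Chinchilla-optimal models (both are approximations introduced here, not exact constants from the paper):

```python
import math

def chinchilla_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into model parameters N and training tokens D,
    assuming training cost C ~ 6*N*D and D ~ tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a 1e23-FLOP budget; both N and D grow as the square root of compute.
n, d = chinchilla_allocation(1e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```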
This discrepancy can primarily be attributed to the studies using different methods for measuring model size; Kaplan et al. counted only non-embedding parameters, which, when analyzed at smaller model sizes, leads to biased coefficients.
[15] Secondary effects also arise due to differences in hyperparameter tuning and learning rate schedules.
Usually, the goal is to make the scaling law exponent larger, so that the same loss can be reached with much less compute.
For instance, filtering data can make the scaling law exponent larger.[17]
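To see why a larger exponent matters, suppose the loss follows $L(C) = A\,C^{-\alpha}$ (an illustrative form with hypothetical constants). Halving the loss then requires multiplying compute by $2^{1/\alpha}$:

$$A\,(kC)^{-\alpha} = \tfrac{1}{2}\,A\,C^{-\alpha} \;\Longrightarrow\; k = 2^{1/\alpha},$$

so for $\alpha = 0.05$ the required factor is $2^{20} \approx 10^{6}$, while for $\alpha = 0.10$ it is only $2^{10} \approx 10^{3}$.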
Another strand of research studies how to deal with limited data, as, according to the Chinchilla scaling laws, the training dataset size for the largest language models already approaches what is available on the internet.
Subsequent studies discovered scaling laws in the overtraining regime, for dataset sizes up to 32x more than Chinchilla-optimal.
[22] A 2022 analysis[23] found that many scaling behaviors of artificial neural networks follow a smoothly broken power law functional form:

$$y = a + \left(b\,x^{-c_0}\right)\prod_{i=1}^{n}\left(1 + \left(\frac{x}{d_i}\right)^{1/f_i}\right)^{-c_i f_i}$$

where $y$ refers to the downstream (or upstream) performance evaluation metric of interest (e.g. prediction error, cross entropy, calibration error, AUROC, BLEU score percentage, F1 score, reward, Elo rating, solve rate, or FID score) in zero-shot, prompted, or fine-tuned settings, $x$ refers to the quantity being scaled (such as training compute, number of model parameters, or training dataset size), and the remaining parameters are fitted constants. On a log–log plot, when $a$ is subtracted out from the y-axis, this functional form looks like a series of linear segments connected by arcs; the $n$ transitions between the segments are called "breaks", hence the name broken neural scaling laws (BNSL).
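A direct transcription of this functional form into code, to make the role of each parameter concrete (the parameter values in the example call are arbitrary placeholders, not fitted constants from the analysis):

```python
import numpy as np

def bnsl(x, a, b, c0, c, d, f):
    """Smoothly broken power law:
    y = a + b * x^(-c0) * prod_i (1 + (x / d_i)^(1 / f_i))^(-c_i * f_i),
    where c, d, f each hold one entry per break."""
    y = b * x ** (-c0)
    for c_i, d_i, f_i in zip(c, d, f):
        y = y * (1.0 + (x / d_i) ** (1.0 / f_i)) ** (-c_i * f_i)
    return a + y

# One break: after subtracting a, the log-log slope changes from -c0 to
# -(c0 + c[0]) around x ~ d[0], with f[0] controlling how sharp the bend is.
x = np.logspace(0, 8, 9)
print(bnsl(x, a=0.1, b=2.0, c0=0.1, c=[0.3], d=[1e4], f=[1.0]))
```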
As an example, the Elo rating of AlphaGo improves steadily as it is allowed to spend more time on its Monte Carlo Tree Search per play.[25]: Fig 4
For AlphaGo Zero, increasing Elo by 120 requires either 2x model size and training, or 2x test-time search.[26]
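Read as a log-linear relationship (an extrapolation made here for illustration, not a law stated in the source), this tradeoff corresponds roughly to

$$\Delta\text{Elo} \approx 120\,\log_2(\text{test-time search multiplier}) + 120\,\log_2(\text{model size and training multiplier}),$$

so, for example, 8x more test-time search alone would be worth on the order of +360 Elo under this extrapolation.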
Similarly, AlphaCode, a language model for solving competition-level coding challenges, consistently improved (log-linearly) in performance with more search time.
[7] For Libratus (heads-up no-limit Texas hold 'em), Cicero (Diplomacy), and many other abstract games of partial information, inference-time search improves performance at a similar tradeoff ratio, for up to a 100,000x effective increase in training-time compute.
[28][29] One method for scaling up test-time compute is process-based supervision, where a model generates a step-by-step reasoning chain to answer a question, and an evaluator (either a human or another model) provides a reward score on some of the intermediate steps, not just the final answer.
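A schematic of that reward signal, with hypothetical generator and scorer interfaces (the function names below are illustrative, not from any particular library):

```python
from typing import Callable, List

def process_supervised_reward(
    question: str,
    generate_steps: Callable[[str], List[str]],     # produces a step-by-step reasoning chain
    score_step: Callable[[str, List[str]], float],  # scores the latest step given the chain so far
) -> float:
    """Reward that credits intermediate reasoning steps, not only the final answer."""
    steps = generate_steps(question)
    # Score each step in the context of the question and the steps produced so far.
    step_rewards = [score_step(question, steps[: i + 1]) for i in range(len(steps))]
    # One simple aggregation: the average step-level reward. Other aggregations
    # (e.g. the minimum over steps) are also used in practice.
    return sum(step_rewards) / max(len(step_rewards), 1)
```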
Let $L$ be the error probability of the finetuned model classifying the ImageNet test set.
Ghorbani, Behrooz et al.[32] studied scaling laws for neural machine translation (specifically, English as source, and German as target) in encoder-decoder Transformer models, trained until convergence on the same datasets (thus they did not fit scaling laws for computing cost or dataset size).
They found the Kaplan et al. (2020)[13] scaling law applied to machine translation:

$$L(N, D) = \left[\left(\frac{N_C}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_C}{D}\right]^{\alpha_D},$$

where $N$ is the number of model parameters, $D$ is the dataset size, and $N_C$, $D_C$, $\alpha_N$, $\alpha_D$ are fitted constants.
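Written out as code, with the fitted constants left as explicit arguments (the values in the example call are arbitrary placeholders, not fits reported for machine translation or language modeling):

```python
def kaplan_loss(n_params: float, n_tokens: float,
                n_c: float, d_c: float,
                alpha_n: float, alpha_d: float) -> float:
    """L(N, D) = [(N_c / N)**(alpha_n / alpha_d) + D_c / D]**alpha_d,
    where N_c, D_c, alpha_n and alpha_d are constants fitted per task and dataset."""
    return ((n_c / n_params) ** (alpha_n / alpha_d) + d_c / n_tokens) ** alpha_d

# Purely illustrative constants and sizes:
print(kaplan_loss(1e8, 1e9, n_c=1e13, d_c=1e13, alpha_n=0.07, alpha_d=0.1))
```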
Hernandez, Danny et al.[35] studied scaling laws for transfer learning in language models.
They trained a family of Transformers in three ways: pretraining on English before finetuning on Python; pretraining on a mix of English and Python before finetuning on Python; and training on Python from scratch. The idea is that pretraining on English should help the model achieve low loss on a test set of Python text.