A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population).
A statistical model represents, often in considerably idealized form, the data-generating process.[1]
When referring specifically to probabilities, the corresponding term is probabilistic model.
As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).
As an example, consider a pair of ordinary six-sided dice.
The first statistical assumption is this: for each of the dice, the probability of each face (1, 2, 3, 4, 5, and 6) coming up is 1/6. From that assumption, we can calculate the probability of both dice coming up 5: 1/6 × 1/6 = 1/36.
In the example above, with the first assumption, calculating the probability of an event is easy.
With some other examples, though, the calculation can be difficult, or even impractical (e.g. it might require millions of years of computation).
For an assumption to constitute a statistical model, such difficulty is acceptable: doing the calculation does not need to be practicable, just theoretically possible.
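As a concrete illustration, the following Python sketch (illustrative only, not part of the original text) enumerates the 36 equally likely outcomes under the first assumption and computes the probability of any event by counting favourable outcomes; the two example events are both dice coming up 5 and the faces summing to 7.

```python
from fractions import Fraction
from itertools import product

# Under the first assumption, each of the 36 ordered outcomes
# (face1, face2) is equally likely, with probability 1/6 * 1/6 = 1/36.
outcomes = list(product(range(1, 7), repeat=2))

def event_probability(event):
    """Probability of an event (a predicate on outcomes) under the
    uniform assumption: favourable outcomes / total outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

# Example events: both dice come up 5, and the faces sum to 7.
print(event_probability(lambda o: o == (5, 5)))       # 1/36
print(event_probability(lambda o: o[0] + o[1] == 7))  # 1/6
```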
In mathematical terms, a statistical model is a pair (S, 𝒫), where S is the set of possible observations, i.e. the sample space, and 𝒫 is a set of probability distributions on S.
Suppose, for instance, that children's heights are stochastically related to their ages. We could formalize that relationship in a linear regression model, like this: height_i = b_0 + b_1 age_i + ε_i, where b_0 is the intercept, b_1 is a parameter that age is multiplied by to obtain a prediction of height, ε_i is the error term, and i identifies the child.
Thus, a straight line (height_i = b_0 + b_1 age_i) cannot be admissible as a model of the data unless it exactly fits all the data points, i.e. all the data points lie perfectly on the line.
The error term, ε_i, must be included in the equation, so that the model is consistent with all the data points.
To do statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we might assume that the ε_i are i.i.d. Gaussian, with zero mean.
In this instance, the model would have 3 parameters: b_0, b_1, and the variance of the Gaussian distribution.
We can formally specify the model in the form (S, 𝒫) as follows: the sample space, S, of our model comprises the set of all possible pairs (age, height); each possible value of θ = (b_0, b_1, σ²) determines a distribution on S, and 𝒫 is the set of all those distributions.
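To see the model's 3 parameters at work, here is a hedged Python sketch: the numeric values b_0 = 50, b_1 = 6, and σ = 4 are invented for the illustration, not taken from the text. It simulates (age, height) pairs from height_i = b_0 + b_1 age_i + ε_i with i.i.d. zero-mean Gaussian ε_i, then recovers b_0, b_1, and the error variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed "true" parameters, chosen only for illustration.
b0_true, b1_true, sigma_true = 50.0, 6.0, 4.0

# Simulate n children: ages uniform, heights from the linear model
# height_i = b0 + b1 * age_i + eps_i, with eps_i ~ N(0, sigma^2) i.i.d.
n = 500
age = rng.uniform(2, 12, size=n)
height = b0_true + b1_true * age + rng.normal(0.0, sigma_true, size=n)

# Ordinary least squares estimates of b0 and b1.
X = np.column_stack([np.ones(n), age])
(b0_hat, b1_hat), *_ = np.linalg.lstsq(X, height, rcond=None)

# The third parameter: the variance of the Gaussian error term,
# estimated from the residuals of the fitted line.
residuals = height - (b0_hat + b1_hat * age)
sigma2_hat = residuals.var(ddof=2)  # ddof=2: two mean parameters fitted

print(b0_hat, b1_hat, sigma2_hat)
```

The variance estimate divides by n - 2 (via ddof=2) to account for the two fitted mean parameters; the maximum-likelihood estimate would instead divide by n.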
Choosing an appropriate statistical model to represent a given data-generating process is sometimes extremely difficult, and may require knowledge of both the process and relevant statistical analyses.
Relatedly, the statistician Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".
Suppose that we have a statistical model (S, 𝒫) with 𝒫 = {P_θ : θ ∈ Θ}. The model is said to be parametric if Θ has finite dimension; in notation, we write Θ ⊆ ℝ^k, where k is a positive integer called the dimension of the model. As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

𝒫 = { P_(μ,σ)(x) = 1/(σ√(2π)) · exp(-(x - μ)²/(2σ²)) : μ ∈ ℝ, σ > 0 }.

In this example, the dimension, k, equals 2.
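A minimal sketch of estimating the k = 2 parameters of this family by maximum likelihood (the sample here is simulated purely for illustration); the log-likelihood function mirrors the density formula above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=1000)  # illustrative sample

# Log-likelihood of the sample under P_(mu,sigma), matching the
# Gaussian density formula above.
def log_likelihood(mu, sigma, data):
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

# Maximum-likelihood estimates for the k = 2 parameters (mu, sigma).
mu_hat = x.mean()
sigma_hat = x.std(ddof=0)  # the MLE uses the biased (1/n) variance

print(mu_hat, sigma_hat, log_likelihood(mu_hat, sigma_hat, x))
```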
As another example, suppose that the data consist of points (x, y) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean): this leads to the same statistical model as was used in the example with children's heights. The dimension of that model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals.
Note that in the univariate Gaussian example above, θ = (μ, σ) is formally a single parameter with dimension 2, but it is often regarded as comprising 2 separate parameters: the mean and the standard deviation.
A statistical model is nonparametric if the parameter set Θ is infinite dimensional.
A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters.
Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".
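The contrast can be sketched in Python: a parametric Gaussian fit reduces the data to the finite-dimensional parameter (μ, σ), whereas a nonparametric kernel density estimate keeps, in effect, an infinite-dimensional object. The simulated sample and the Silverman bandwidth rule are assumptions of this sketch, not prescriptions from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=300)  # illustrative sample

# Parametric: the whole distribution is summarized by the
# finite-dimensional parameter (mu, sigma).
mu, sigma = data.mean(), data.std(ddof=0)

def gaussian_density(x):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# Nonparametric: a kernel density estimate has no fixed finite
# parameter vector; the unknown is effectively the entire density.
# Bandwidth from Silverman's rule of thumb (an assumed choice).
h = 1.06 * sigma * len(data) ** (-1 / 5)

def kde(x):
    u = (x - data[:, None]) / h
    return np.mean(np.exp(-u ** 2 / 2) / (np.sqrt(2 * np.pi) * h), axis=0)

grid = np.linspace(-4, 4, 9)
print(gaussian_density(grid))
print(kde(grid))
```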
Two statistical models are nested if the first can be transformed into the second by imposing constraints on its parameters; nested models may even have the same dimension. As an example where they have the same dimension, the set of positive-mean Gaussian distributions is nested within the set of all Gaussian distributions; they both have dimension 2.
Konishi & Kitagawa (2008, p. 75) state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling.
They are typically formulated as comparisons of several statistical models."
Common criteria for comparing models include the following: R², Bayes factor, Akaike information criterion, and the likelihood-ratio test together with its generalization, the relative likelihood.
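As a hedged sketch of two of these criteria applied to nested Gaussian models (the zero-mean model, k = 1, is nested within the free-mean model, k = 2; the data are simulated here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.3, 1.0, size=200)  # illustrative sample

def gaussian_loglik(mu, sigma, data):
    return np.sum(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                  - (data - mu) ** 2 / (2 * sigma ** 2))

# Model A (k = 2): Gaussian with free mean and standard deviation.
mu_a, sigma_a = x.mean(), x.std(ddof=0)
ll_a = gaussian_loglik(mu_a, sigma_a, x)

# Model B (k = 1): Gaussian with mean constrained to zero;
# B is nested within A.
sigma_b = np.sqrt(np.mean(x ** 2))  # MLE of sigma when mu = 0
ll_b = gaussian_loglik(0.0, sigma_b, x)

# Akaike information criterion: AIC = 2k - 2 * log-likelihood
# (lower is better).
aic_a = 2 * 2 - 2 * ll_a
aic_b = 2 * 1 - 2 * ll_b

# Likelihood-ratio statistic; for nested models it is compared
# against a chi-squared distribution with df = difference in k (= 1).
lr_stat = 2 * (ll_a - ll_b)

print(aic_a, aic_b, lr_stat)
```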
Another way of comparing two statistical models is through the notion of deficiency introduced by Lucien Le Cam.