So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for \(f_1(x_1)\), and then use a factor model for \(f_2(x_2)\).
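For instance, a model of this kind could be set up with the R package gam, whose lo() terms provide a locally weighted (loess) smooth; this is only a sketch, and the response y, covariates x1 and x2, and data frame dat are placeholders rather than anything defined in the text:

    library(gam)                       # Hastie and Tibshirani style GAM fitting in R
    dat$x2 <- factor(dat$x2)           # treat x2 as a factor (categorical covariate)
    fit <- gam(y ~ lo(x1) + x2,        # lo(): locally weighted (loess) smooth of x1
               family = gaussian, data = dat)
    summary(fit)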
Certain constructive proofs exist, but they tend to require highly complicated (i.e. fractal) functions, and thus are not suitable for modeling approaches.
Therefore, the generalized additive model[1] drops the outer sum, and demands instead that the function belong to a simpler class,
\[
f(x_1,\ldots,x_N) \;=\; \Phi\!\left(\sum_{i=1}^{N} f_i(x_i)\right),
\]
where \(\Phi\) is a smooth monotonic function.
If the smooth functions \(f_j\) are represented using smoothing splines,[6] then the degree of smoothness can be estimated as part of model fitting using generalized cross validation (GCV), or by restricted maximum likelihood (REML, sometimes known as 'GML'), which exploits the duality between spline smoothers and Gaussian random effects.
More recent methods have addressed this computational cost either by up-front reduction of the size of the basis used for smoothing (rank reduction)[8][9][10][11][12] or by finding sparse representations of the smooths using Markov random fields, which are amenable to the use of sparse matrix methods for computation.[13] These more computationally efficient methods use GCV (or AIC or similar), or REML, or take a fully Bayesian approach, for inference about the degree of smoothness of the model components.
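In R's mgcv package, for example, the smoothness selection criterion is chosen through the method argument of gam; the formula and data frame below are placeholders:

    library(mgcv)
    # GCV-based smoothness selection (mgcv's default criterion, "GCV.Cp")
    b1 <- gam(y ~ s(x1) + s(x2), data = dat, method = "GCV.Cp")
    # REML-based smoothness selection, exploiting the random-effect duality
    b2 <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML")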
An alternative approach with particular advantages in high dimensional settings is to use boosting, although this typically requires bootstrapping for uncertainty quantification.
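A boosted GAM of this kind can be fitted, for instance, with the R package mboost; the sketch below covers only fitting and tuning (not uncertainty quantification), and the variable names, base-learners and number of boosting iterations are illustrative choices, not anything prescribed above:

    library(mboost)
    # componentwise boosting with P-spline base-learners for each covariate
    fb <- gamboost(y ~ bbs(x1) + bbs(x2) + bbs(x3), data = dat,
                   control = boost_control(mstop = 200))
    # choose the number of boosting iterations by (bootstrap) cross-validation
    cvr <- cvrisk(fb)
    fb[mstop(cvr)]    # reduce the model to the selected stopping iteration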
The basis dimension used for each smooth is chosen to be sufficiently large that we expect it to overfit the data to hand (thereby avoiding bias from model over-simplification), but small enough to retain computational efficiency.
The smooth functions \(f_j\) are only identifiable to within an additive constant (a constant could be added to one of them and subtracted from the model intercept without changing the model predictions at all), so identifiability constraints have to be imposed on the smooth terms to remove this ambiguity.
In the Gaussian, identity-link case, for example, the model coefficients \(\beta\) are chosen to minimize the penalized least squares objective
\[
\|y - X\beta\|^2 + \sum_j \lambda_j \int f_j''(x_j)^2 \, dx_j ,
\]
where \(X\) is the model matrix whose columns evaluate the basis functions of the smooths, the integrated square second derivative penalties serve to penalize wiggliness (lack of smoothness) of the \(f_j\) during fitting, and the smoothing parameters \(\lambda_j\) control the trade-off between goodness of fit and smoothness. Because each \(f_j\) is linear in its basis coefficients, each penalty can be written as a quadratic form,
\[
\int f_j''(x_j)^2 \, dx_j \;=\; \beta_j^{\mathsf T} S_j \beta_j \;=\; \beta^{\mathsf T} \bar S_j \beta ,
\]
where \(S_j\) is a matrix of known coefficients computable from the penalty and the basis, \(\beta_j\) is the coefficient vector of the \(j\)th smooth, and \(\bar S_j\) is \(S_j\) padded with zeros so that the second equality holds and we can write the penalty in terms of the full coefficient vector \(\beta\).
However, if the smoothing parameters are selected appropriately, the (squared) smoothing bias introduced by penalization should be less than the reduction in variance that it produces, so that the net effect is a reduction in mean square estimation error relative to not penalizing.
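This is the usual bias-variance trade-off. In generic notation (not taken from this article), the pointwise mean square error of an estimate \(\hat f(x)\) of \(f(x)\) decomposes as
\[
\mathbb{E}\big[\{\hat f(x) - f(x)\}^2\big]
\;=\; \big\{\mathbb{E}[\hat f(x)] - f(x)\big\}^2 \;+\; \operatorname{Var}\big[\hat f(x)\big],
\]
so penalization helps whenever the squared bias it introduces is more than offset by the variance it removes.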
Exponentiating the negative smoothing penalty gives \(\exp\{-\beta^{\mathsf T} S_\lambda \beta/(2\phi)\}\), where \(S_\lambda = \sum_j \lambda_j \bar S_j\) (and \(\phi\) is the GLM scale parameter, introduced only for later convenience), and we can immediately recognize this as proportional to a multivariate normal prior on \(\beta\) with mean zero and precision matrix \(S_\lambda/\phi\).[22]
Finally, we may choose to maximize the Marginal Likelihood (REML) obtained by integrating the model coefficients, \(\beta\), out of the joint density of the data and the coefficients.
The preceding integral is usually analytically intractable but can be approximated to quite high accuracy using Laplace's method.
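In generic form (the notation \(h\), \(\hat\beta\), \(H\) and \(p\) below is illustrative, not taken from this article), Laplace's method replaces the integrand by a Gaussian approximation around its mode: if \(\hat\beta\) maximizes \(h(\beta)\) and \(H = -\nabla^2 h(\hat\beta)\), then
\[
\int e^{h(\beta)}\, d\beta \;\approx\; e^{h(\hat\beta)}\, (2\pi)^{p/2}\, |H|^{-1/2},
\]
where \(p\) is the dimension of \(\beta\). Applied to the marginal likelihood above, \(h\) is the log of the integrand and \(\hat\beta\) is the corresponding penalized likelihood estimate (posterior mode) of the coefficients.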
For example, optimizing a GCV or marginal likelihood criterion typically requires numerical optimization via a Newton or quasi-Newton method, with each trial value of the (log) smoothing parameter vector requiring a penalized IRLS iteration to evaluate the corresponding coefficient estimate \(\hat\beta\) alongside the other ingredients of the criterion being optimized.[26]
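The nested structure of this computation can be sketched as follows for the Gaussian, identity-link case, where the penalized IRLS step reduces to a single penalized least squares solve; everything here (the function name, the use of optim, a single smoothing parameter) is an illustrative simplification, not any particular package's implementation:

    # X: model matrix, S: penalty matrix, y: response (all assumed given)
    gcv_score <- function(log_lambda, X, y, S) {
      lambda <- exp(log_lambda)
      # "inner" step: penalized least squares for this trial smoothing parameter
      XtX  <- crossprod(X)
      beta <- solve(XtX + lambda * S, crossprod(X, y))
      mu   <- X %*% beta
      # effective degrees of freedom = trace of the influence ("hat") matrix
      edf <- sum(diag(solve(XtX + lambda * S, XtX)))
      n   <- length(y)
      n * sum((y - mu)^2) / (n - edf)^2     # GCV criterion
    }
    # "outer" step: numerical optimization over the log smoothing parameter
    opt <- optim(par = 0, fn = gcv_score, X = X, y = y, S = S, method = "BFGS")
    lambda_hat <- exp(opt$par)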
The INLA software implements a fully Bayesian approach based on Markov random field representations exploiting sparse matrix methods.
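For instance, a smooth of a single covariate can be represented in R-INLA via a random-walk (Markov random field) prior; this is only a sketch with placeholder names, and rw2 is just one of several latent models that can stand in for a spline-like smooth:

    library(INLA)   # R-INLA, distributed via https://www.r-inla.org rather than CRAN
    # inla.group() bins a continuous covariate; f(..., model = "rw2") then places a
    # second-order random-walk (Markov random field) smoothness prior on its effect
    fit <- inla(y ~ f(inla.group(x1), model = "rw2") +
                    f(inla.group(x2), model = "rw2"),
                family = "gaussian", data = dat)
    summary(fit)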
gam sets up bases and penalties for the smooth terms, estimates the model including its smoothing parameters and, in standard R fashion, returns a fitted model object, which can then be interrogated using various helper functions, such as summary, plot, predict, and AIC.
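A call of roughly the following form illustrates this (the response y, covariates x1, x2, x3 and data frame dat are placeholders; every argument other than the model formula and data is left at its default):

    library(mgcv)
    b <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat)   # defaults throughout
    summary(b)   # approximate significance of smooth terms, adjusted r-squared, etc.
    plot(b)      # estimated smooth functions with confidence bands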
For example, a Gaussian distribution and identity link have been assumed, and the smoothing parameter selection criterion was GCV. Also, the smooth terms were represented using 'penalized thin plate regression splines', and the basis dimension for each was set to 10 (implying a maximum of 9 degrees of freedom per smooth after identifiability constraints have been imposed).
The specification of distribution and link function uses the 'family' objects that are standard when fitting GLMs in R or S. Note that Gaussian random effects can also be added to the linear predictor.
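For example, a Poisson model with log link and a simple Gaussian random intercept for a grouping factor (here a placeholder variable subject) might be specified as:

    library(mgcv)
    # 'subject' must be a factor; bs = "re" gives an i.i.d. Gaussian random effect
    b2 <- gam(y ~ s(x1) + s(x2) + s(subject, bs = "re"),
              family = poisson(link = "log"), data = dat, method = "REML")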
Note that since GLMs and GAMs can be estimated using quasi-likelihood, it follows that details of the distribution of the residuals beyond the mean-variance relationship are of relatively minor importance.
One issue that is more common with GAMs than with other GLMs is a danger of falsely concluding that data are zero-inflated. The difficulty arises when data contain many zeroes that can be modelled by a Poisson or binomial with a very low expected value: the flexibility of the GAM structure will often allow representation of a very low mean over some region of covariate space. However, the distribution of standardized residuals will then fail to look anything like the approximate normality that introductory GLM classes teach us to expect, even if the model is perfectly correct.
Each extra penalty has its own smoothing parameter, and estimation then proceeds as before, but now with the possibility that terms will be completely penalized to zero.[28]
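In mgcv, for example, this kind of double penalty is requested with the select argument of gam; a minimal sketch with placeholder names:

    library(mgcv)
    # select = TRUE adds a second penalty to each smooth so that a term can be
    # shrunk to zero (i.e. selected out) if the data do not support it
    b3 <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat,
              method = "REML", select = TRUE)
    summary(b3)   # terms with effective degrees of freedom near zero have been removed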
In high-dimensional settings it may make more sense to attempt this task using the lasso or elastic net regularization.
Basing AIC on the marginal likelihood in which only the penalized effects are integrated out is possible (the number of unpenalized coefficients now gets added to the parameter count for the AIC penalty), but this version of the marginal likelihood suffers from the tendency to oversmooth that provided the original motivation for developing REML.[1][22] Naive versions of the conditional AIC have been shown to be much too likely to select larger models in some circumstances, a difficulty attributable to neglect of smoothing parameter uncertainty when computing the effective degrees of freedom;[29] however, correcting the effective degrees of freedom for this problem restores reasonable performance.
Cross-validation can be used to detect and/or reduce overfitting problems with GAMs (or other statistical methods),[30] and software often allows the level of penalization to be increased to force smoother fits.
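In mgcv, for instance, the gamma argument of gam inflates the effective degrees of freedom used in the GCV or UBRE score, which forces smoother fits; the value 1.4 shown below is only a commonly quoted illustrative choice, and the formula and data are placeholders:

    library(mgcv)
    # gamma > 1 makes each effective degree of freedom "cost" more in the
    # smoothness selection criterion, so the selected fits are smoother
    b4 <- gam(y ~ s(x1) + s(x2), data = dat, gamma = 1.4)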
Estimating very large numbers of smoothing parameters is also likely to be statistically challenging, and there are known tendencies for prediction error criteria (GCV, AIC etc.) to occasionally undersmooth quite badly, relative to likelihood-based (REML/ML) smoothness selection criteria.