[1][2][3] Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
A set of weak models, which would not produce satisfactory predictive results individually, is combined or averaged to produce a single high-performing, accurate, and low-variance model that fits the task as required.
Ensemble learning typically refers to bagging (bootstrap aggregating), boosting, or stacking/blending techniques to induce high variance among the base models.[5]
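As a rough illustration of these three strategies, the sketch below uses scikit-learn on a synthetic dataset; the particular base learners and parameters are illustrative assumptions, not prescriptions from the cited sources.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              StackingClassifier)

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bagging: many trees, each fit on a bootstrap resample of the training set.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            random_state=0)

# Boosting: shallow trees fit sequentially, each focusing on earlier errors.
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                              n_estimators=50, random_state=0)

# Stacking/blending: a meta-learner combines the base models' predictions.
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3)),
                ('lr', LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000))

for name, model in [('bagging', bagging), ('boosting', boosting),
                    ('stacking', stacking)]:
    print(name, model.fit(X_tr, y_tr).score(X_te, y_te))
```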
The broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.
By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or in anomaly detection.
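As a minimal sketch of the consensus-clustering idea, assuming a co-association approach (one of several possible; the synthetic data and parameters are illustrative), several independent k-means runs can be combined through the fraction of runs in which two points fall in the same cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
n, runs, k = len(X), 20, 3

# Co-association matrix: fraction of runs in which two points co-cluster.
co = np.zeros((n, n))
for seed in range(runs):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= runs

# Consensus partition: hierarchical clustering of the co-association distances.
dist = 1.0 - co
np.fill_diagonal(dist, 0.0)
consensus = fcluster(linkage(squareform(dist), method='average'),
                     t=k, criterion='maxclust')
print(consensus[:10])
```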
Empirically, ensembles tend to yield better results when there is significant diversity among the models.
[10] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb down the models in order to promote diversity.[14]
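One simple way to quantify that diversity is the pairwise disagreement rate between base models; the sketch below (scikit-learn, with an illustrative choice of strong but dissimilar learners on synthetic data) computes it and then combines the models by plain majority vote.

```python
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = [('rf', RandomForestClassifier(random_state=1)),
          ('svm', SVC()),
          ('knn', KNeighborsClassifier())]
preds = {name: m.fit(X_tr, y_tr).predict(X_te) for name, m in models}

# Pairwise disagreement: fraction of test points on which two models differ.
for a, b in combinations(preds, 2):
    print(a, b, np.mean(preds[a] != preds[b]))

# Majority vote over the three (already strong) heterogeneous learners.
vote = VotingClassifier(estimators=models, voting='hard').fit(X_tr, y_tr)
print('ensemble accuracy:', vote.score(X_te, y_te))
```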
Ensemble learning, including both regression and classification tasks, can be explained using a geometric framework.
[15] Within this framework, the output of each individual classifier or regressor for the entire dataset can be viewed as a point in a multi-dimensional space.
This theoretical framework shows that using the same number of independent component classifiers as there are class labels gives the highest accuracy.[18]
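A minimal numerical reading of this geometric picture (purely illustrative, with made-up prediction values): each model's predictions over the N examples of a dataset form a point in R^N, and simple averaging corresponds to taking the centroid of those points.

```python
import numpy as np

# Hypothetical predicted values of three models on the same five examples:
# each row is one model's "point" in R^5.
points = np.array([[0.9, 0.2, 0.7, 0.4, 0.8],   # model A
                   [0.8, 0.1, 0.6, 0.5, 0.9],   # model B
                   [0.7, 0.3, 0.8, 0.3, 0.7]])  # model C

# The simple-average ensemble is the centroid of the models' points.
centroid = points.mean(axis=0)

# Distance of each model, and of the ensemble, to a hypothetical target point.
target = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
for name, p in zip('ABC', points):
    print(name, np.linalg.norm(p - target))
print('ensemble', np.linalg.norm(centroid - target))
```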
The Naive Bayes classifier is a version of this that assumes the data are conditionally independent given the class, which makes the computation more feasible.
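Concretely, the conditional-independence assumption lets the class-conditional joint distribution factor into per-feature terms, so that classification reduces to

\[
\hat{y} \;=\; \arg\max_{c}\; P(c)\prod_{i=1}^{n} P(x_i \mid c),
\]

replacing one n-dimensional density estimate with n one-dimensional ones.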
The most common implementation of boosting is AdaBoost, but some newer algorithms are reported to achieve better results.
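A from-scratch sketch of the core AdaBoost weight update (binary labels, decision stumps via scikit-learn; a toy illustration under these assumptions, not a faithful reproduction of any particular implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=3)
y = 2 * y - 1                      # labels in {-1, +1}
w = np.full(len(X), 1 / len(X))    # uniform example weights

stumps, alphas = [], []
for _ in range(50):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)        # stump's vote weight
    w *= np.exp(-alpha * y * pred)               # upweight misclassified points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of stump votes.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print('training accuracy:', np.mean(np.sign(F) == y))
```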
The R packages ensembleBMA[23] and BMA[24] use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995).
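Under that BIC-based prior, each candidate model's posterior weight is, roughly, proportional to exp(-BIC/2). The following sketch illustrates the weighting on invented linear-regression candidates; it is an assumption-laden toy, not the packages' actual code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=4)
n = len(y)

# Candidate models: linear regressions using different feature subsets.
subsets = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3, 4]]
bics = []
for s in subsets:
    model = LinearRegression().fit(X[:, s], y)
    rss = np.sum((y - model.predict(X[:, s])) ** 2)
    k = len(s) + 1                                # coefficients + intercept
    bics.append(n * np.log(rss / n) + k * np.log(n))  # Gaussian-likelihood BIC

# BMA-style weights: posterior model probability ~ exp(-BIC/2), normalized.
bics = np.array(bics)
weights = np.exp(-0.5 * (bics - bics.min()))
weights /= weights.sum()
print(dict(zip(map(tuple, subsets), np.round(weights, 3))))
```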
Large-sample asymptotic theory establishes that if there is a best model, then BIC is strongly consistent as the sample size increases, i.e., it will almost certainly find that model, whereas AIC may not, because AIC can continue to place excessive posterior probability on models that are more complicated than necessary.
On the other hand, AIC and AICc are asymptotically "efficient" (i.e., they achieve minimum mean squared prediction error), while BIC is not.[28]
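For reference, with \(\hat{L}\) the maximized likelihood, \(k\) the number of estimated parameters, and \(n\) the sample size, the criteria being compared are

\[
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1}, \qquad
\mathrm{BIC} = k\ln n - 2\ln\hat{L},
\]

so BIC penalizes additional parameters more heavily than AIC whenever \(\ln n > 2\), i.e., for samples larger than about seven observations.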
Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and to popularizing the methodology.
[29] The availability of software, including other free open-source packages for R beyond those mentioned above, helped make the methods accessible to a wider audience.
This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model.
Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term.
Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.[49]
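A rough sketch of that approximation, assuming the candidate weightings are drawn from a Dirichlet distribution and scored by held-out log-loss (both choices, like the base models and data, are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=1500, n_features=20, random_state=5)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=5)

models = [LogisticRegression(max_iter=1000), GaussianNB(),
          DecisionTreeClassifier(max_depth=5)]
# Held-out class-probability predictions of each base model.
probs = np.stack([m.fit(X_tr, y_tr).predict_proba(X_val) for m in models])

rng = np.random.default_rng(5)
best_w, best_loss = None, np.inf
for _ in range(500):
    w = rng.dirichlet(np.ones(len(models)))       # random ensemble weighting
    loss = log_loss(y_val, np.tensordot(w, probs, axes=1))
    if loss < best_loss:
        best_w, best_loss = w, loss
print('selected weights:', np.round(best_w, 3),
      'validation log-loss:', round(best_loss, 4))
```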
Some of the applications of ensemble classifiers include the following. Land cover mapping is one of the major applications of Earth observation satellite sensors, using remote sensing and geospatial data to identify the materials and objects located on the surface of target areas.
Generally, the classes of target materials include roads, buildings, rivers, lakes, and vegetation.
[50] Various ensemble learning approaches based on artificial neural networks,[51] kernel principal component analysis (KPCA),[52] decision trees with boosting,[53] random forests[50][54] and automatic design of multiple classifier systems[55] have been proposed to efficiently identify land cover objects.
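A minimal sketch of the random-forest variant of this task, with purely hypothetical per-pixel band features standing in for real multispectral imagery:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
# Hypothetical per-pixel features: reflectance in a few spectral bands
# plus a vegetation index (blue, green, red, NIR, NDVI).
X = rng.random((5000, 5))
# Toy labelling rule: land-cover class determined by the NIR band value.
y = np.digitize(X[:, 3], [0.33, 0.66])     # 0=water, 1=vegetation, 2=built-up

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)
rf = RandomForestClassifier(n_estimators=200, random_state=6).fit(X_tr, y_tr)
print('importance of each band feature:', np.round(rf.feature_importances_, 3))
print('accuracy on held-out pixels:', rf.score(X_te, y_te))
```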
Change detection is widely used in fields such as urban growth, forest and vegetation dynamics, land use and disaster monitoring.
[56] The earliest applications of ensemble classifiers in change detection were designed using majority voting,[57] Bayesian model averaging,[58] and the maximum posterior probability.[60]
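The majority-voting variant is straightforward to sketch: given several binary change maps produced by different detectors over the same scene (simulated here with random data), a pixel is flagged as changed when most detectors agree.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical binary change maps from three different change detectors
# over the same 100x100 scene (1 = changed pixel, 0 = unchanged).
maps = rng.integers(0, 2, size=(3, 100, 100))

# Ensemble decision by per-pixel majority vote.
votes = maps.sum(axis=0)
consensus = (votes >= 2).astype(np.uint8)
print('changed pixels flagged by the ensemble:', int(consensus.sum()))
```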
One example is a Bayesian ensemble changepoint detection method called BEAST, with the software available as the Rbeast package for R, Python, and Matlab.
Ensemble learning successfully aids such monitoring systems in reducing their total error.
Because ensemble learning improves the robustness of the normal behavior modelling, it has been proposed as an efficient technique to detect such fraudulent cases and activities in banking and credit card systems.[77][78]
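As a toy sketch of that idea, one can average the anomaly scores of several independently seeded isolation forests trained on normal transactions; the data, features, and the choice of isolation forests are illustrative assumptions, not a reference implementation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(8)
# Hypothetical transaction features (amount, hour, merchant code, ...).
normal = rng.normal(0, 1, size=(2000, 4))
odd = rng.normal(4, 1, size=(20, 4))          # a few unusual transactions
X = np.vstack([normal, odd])

# Several isolation forests trained with different seeds; averaging their
# anomaly scores tends to stabilize the normal-behavior model.
scores = np.mean([IsolationForest(n_estimators=100, random_state=s)
                  .fit(normal).score_samples(X) for s in range(5)], axis=0)

flagged = np.argsort(scores)[:20]             # lowest scores = most anomalous
print('flagged transaction indices:', flagged)
```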
The accuracy of business failure prediction is a crucial issue in financial decision-making.
[79] Ensemble classifiers have been successfully applied in neuroscience, proteomics, and medical diagnosis, for example in the detection of neuro-cognitive disorders (e.g. Alzheimer's disease or myotonic dystrophy) based on MRI datasets,[80][81][82] and in cervical cytology classification.[83][84]