In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
More generally, the concept of a regression tree can be extended to any kind of object equipped with pairwise dissimilarities, such as categorical sequences.[1] Decision trees are among the most popular machine learning algorithms given their intelligibility and simplicity.
While less expressive, decision lists are arguably easier to understand than general decision trees due to their added sparsity[citation needed]; they also permit non-greedy learning methods[15] and the imposition of monotonic constraints.
Boosted ensembles of FDTs have recently been investigated as well, and they have shown performance comparable to that of other very efficient fuzzy classifiers.[24] Another type of decision tree addresses ordinal classification problems, where class values follow a natural order, such as those related to pricing or costs.
Typically, these problems are treated as multi-class classification tasks, disregarding the inherent ordering of the classes.[25] Algorithms for constructing decision trees usually work top-down, by choosing, at each step, the variable that best splits the set of items.
These metrics are applied to each candidate subset, and the resulting values are combined (e.g., averaged) to provide a measure of the quality of the split.
Depending on the underlying metric, the performance of various heuristic algorithms for decision tree learning may vary significantly.
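To make the top-down procedure concrete, the following is a minimal sketch of greedy split selection that scores each candidate attribute by the weighted average of a per-subset impurity. The helper names (gini, best_split) and the list-of-dicts data layout are illustrative assumptions, not drawn from the text above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, target, attributes):
    """Pick the attribute whose split gives the lowest weighted impurity."""
    n = len(rows)
    best_attr, best_score = None, float("inf")
    for attr in attributes:
        # Partition rows into candidate subsets, one per attribute value.
        subsets = {}
        for row in rows:
            subsets.setdefault(row[attr], []).append(row[target])
        # Combine per-subset impurities into one score via a weighted average.
        score = sum(len(s) / n * gini(s) for s in subsets.values())
        if score < best_score:
            best_attr, best_score = attr, score
    return best_attr, best_score
```

The same skeleton works with any of the metrics discussed below; only the impurity function changes.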
To combat this, one could use a more informative metric known as sensitivity, which takes into account the proportions of the values in the confusion matrix to give the actual true positive rate (TPR).
Depending on the situation and one's knowledge of the data and of decision trees, one may opt to use the positive estimate for a quick and easy solution. On the other hand, a more experienced user would likely prefer the TPR value for ranking the features, because it takes into account the proportions of the data and all the samples that should have been classified as positive.
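As a rough illustration of ranking candidate split features by TPR, the sketch below computes sensitivity from confusion-matrix counts. The candidate_splits dictionary and its (tp, fn) counts are hypothetical illustration data, not taken from the article.

```python
def true_positive_rate(tp, fn):
    """Sensitivity: correctly classified positives over all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical confusion-matrix counts (tp, fn) produced by splitting on each feature.
candidate_splits = {"feature_a": (45, 5), "feature_b": (30, 20)}
ranked = sorted(candidate_splits,
                key=lambda f: true_positive_rate(*candidate_splits[f]),
                reverse=True)
print(ranked)  # features ordered from highest to lowest TPR
```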
It reaches its minimum (zero) when all cases in the node fall into a single target category.
The Gini impurity is also an information-theoretic measure and corresponds to Tsallis entropy with deformation coefficient q = 2, which in physics is associated with the lack of information in out-of-equilibrium, non-extensive, dissipative and quantum systems.
In this sense, the Gini impurity is nothing but a variation of the usual entropy measure for decision trees.
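A short sketch of computing the Gini impurity of a node from its class labels, showing that a pure node reaches the minimum of zero; the function name and the example label lists are illustrative assumptions.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions in the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes"]))       # 0.0: a pure node hits the minimum
print(gini_impurity(["yes", "no", "yes", "no"]))  # 0.5: maximally mixed for two classes
```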
Entropy is defined as H(T) = −∑ᵢ pᵢ log₂ pᵢ, where the fractions p₁, p₂, … add up to 1 and represent the percentage of each class present in the child node that results from a split in the tree.
Information gain is used to decide which feature to split on at each step in building the tree.[28] Consider an example data set with four attributes: outlook (sunny, overcast, rainy), temperature (hot, mild, cool), humidity (high, normal), and windy (true, false), with a binary (yes or no) target variable, play, and 14 data points.
Splitting on windy, for example, gives two child nodes, one for windy = true and one for windy = false, each with its own entropy. To find the information of the split, we take the weighted average of these two numbers based on how many observations fell into which node.
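Assuming the usual counts for this 14-point data set (windy = false: 6 yes / 2 no; windy = true: 3 yes / 3 no; overall 9 yes / 5 no — an assumption to be checked against the actual table), a short sketch of the weighted-average calculation and the resulting information gain for splitting on windy:

```python
from math import log2

def entropy(counts):
    """Shannon entropy of a node given its per-class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

h_false = entropy([6, 2])   # windy = false child
h_true = entropy([3, 3])    # windy = true child
split_info = (8 / 14) * h_false + (6 / 14) * h_true  # weighted average of the children
info_gain = entropy([9, 5]) - split_info             # parent entropy minus split information
print(round(split_info, 3), round(info_gain, 3))     # approximately 0.892 and 0.048
```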
Introduced in CART,[7] variance reduction is often employed in cases where the target variable is continuous (regression tree), meaning that use of many other metrics would first require discretization before being applied.
Each of the above summands is indeed a variance estimate, though written in a form that does not directly refer to the mean. By replacing the squared difference (yᵢ − yⱼ)² with a dissimilarity dᵢⱼ between two objects i and j, the variance reduction criterion applies to any kind of object for which pairwise dissimilarities can be computed.
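Below is a minimal sketch of a variance-reduction criterion in its common form (parent variance minus size-weighted child variances), with the variance written via pairwise squared differences so that the dissimilarity function can be swapped out; the function names and example values are illustrative assumptions.

```python
def pairwise_variance(values, dissim=lambda a, b: (a - b) ** 2):
    """Variance estimate written via pairwise dissimilarities (no mean needed)."""
    n = len(values)
    if n == 0:
        return 0.0
    return sum(dissim(a, b) for a in values for b in values) / (2 * n * n)

def variance_reduction(parent, left, right, dissim=lambda a, b: (a - b) ** 2):
    """Parent variance minus the size-weighted variances of the two children."""
    n = len(parent)
    return (pairwise_variance(parent, dissim)
            - len(left) / n * pairwise_variance(left, dissim)
            - len(right) / n * pairwise_variance(right, dissim))

# For numeric targets this matches the usual regression-tree criterion; passing a
# different dissim extends it to any objects with pairwise dissimilarities.
print(variance_reduction([1.0, 2.0, 10.0, 11.0], [1.0, 2.0], [10.0, 11.0]))
```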
Consider an example data set with three attributes: savings (low, medium, high), assets (low, medium, high), and income (numerical value), together with a binary target variable, credit risk (good, bad), and 8 data points.
To build the tree, the "goodness" of all candidate splits for the root node needs to be calculated.
Compared to other metrics such as information gain, the measure of "goodness" will attempt to create a more balanced tree, leading to more consistent decision times. However, it sacrifices some priority for creating pure children, which can lead to additional splits that are not present with other metrics.
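A common formulation of a goodness-of-split measure of this kind is φ(s | t) = 2·P_L·P_R·∑ⱼ |P(j | t_L) − P(j | t_R)|, which rewards splits that produce both evenly sized and well-separated children; whether this matches the exact definition used elsewhere in the text should be checked. The sketch below implements that formula; the function name and the candidate split of the 8 credit-risk examples are hypothetical.

```python
from collections import Counter

def goodness_of_split(left_labels, right_labels):
    """2 * P_L * P_R * sum over classes of |P(class|left) - P(class|right)|."""
    n = len(left_labels) + len(right_labels)
    p_left, p_right = len(left_labels) / n, len(right_labels) / n
    left_counts, right_counts = Counter(left_labels), Counter(right_labels)
    classes = set(left_counts) | set(right_counts)
    class_term = sum(abs(left_counts[c] / len(left_labels)
                         - right_counts[c] / len(right_labels))
                     for c in classes)
    return 2 * p_left * p_right * class_term

# Hypothetical candidate split of the 8 credit-risk examples into two children.
print(goodness_of_split(["bad", "bad", "good"],
                        ["good", "good", "good", "bad", "good"]))
```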
In a decision graph, it is possible to use disjunctions (ORs) to join two or more paths together using minimum message length (MML).[45] The more general coding scheme results in better predictive accuracy and log-loss probabilistic scoring.[49] Alternatively, several trees can be constructed in parallel to reduce the expected number of tests until classification.