Decision tree pruning

Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical or redundant for classifying instances.

Pruning reduces the complexity of the final classifier, and hence can improve predictive accuracy by reducing overfitting.

A tree that is too large risks overfitting the training data and poorly generalizing to new samples.

A small tree might not capture important structural information about the sample space.

However, it is hard to tell when a tree-growing algorithm should stop, because it is impossible to know whether the addition of a single extra node will dramatically decrease error (the horizon effect).

There are many techniques for tree pruning that differ in the measurement that is used to optimize performance.

Pre-pruning methods are considered more efficient because they do not induce an entire tree: growth is halted by a stopping criterion, so trees remain small from the start.
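As a sketch of pre-pruning, the recursive grower below (an illustrative plain-Python toy, not any particular library's implementation; `best_split` is a naive exhaustive search added here for self-containment) stops splitting once a depth limit or minimum-sample threshold is reached, so a full tree is never induced:

```python
from collections import Counter

def best_split(X, y):
    # Naive split search: try every (feature, value) pair and keep the one
    # with the lowest total misclassification count after the split.
    best, best_err = (0, X[0][0]), float("inf")
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            l = [yi for x, yi in zip(X, y) if x[f] <= t]
            r = [yi for x, yi in zip(X, y) if x[f] > t]
            if not l or not r:
                continue
            err = (len(l) - Counter(l).most_common(1)[0][1]
                   + len(r) - Counter(r).most_common(1)[0][1])
            if err < best_err:
                best, best_err = (f, t), err
    return best

def grow(X, y, depth=0, max_depth=3, min_samples=5):
    # Pre-pruning: stop early if the node is pure, too deep, or too small,
    # returning a majority-class leaf instead of splitting further.
    if depth >= max_depth or len(y) < min_samples or len(set(y)) == 1:
        return {"label": Counter(y).most_common(1)[0][0]}
    f, t = best_split(X, y)
    li = [i for i, x in enumerate(X) if x[f] <= t]
    ri = [i for i, x in enumerate(X) if x[f] > t]
    if not li or not ri:  # no split separates the data: stop here too
        return {"label": Counter(y).most_common(1)[0][0]}
    return {"feature": f, "threshold": t,
            "left": grow([X[i] for i in li], [y[i] for i in li],
                         depth + 1, max_depth, min_samples),
            "right": grow([X[i] for i in ri], [y[i] for i in ri],
                          depth + 1, max_depth, min_samples)}
```

The stopping test runs before any split is attempted, which is exactly what distinguishes pre-pruning from post-pruning: sub-trees that would later be cut are never built at all.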

Pruning can not only significantly reduce the size of a tree but also improve its classification accuracy on unseen objects.

The procedures are differentiated by the direction in which they traverse the tree (top-down or bottom-up).

By pruning the tree at an inner node, an entire sub-tree (regardless of its relevance) may be dropped.

One representative of these procedures is pessimistic error pruning (PEP), which yields quite good results on unseen items.
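The core decision in PEP can be sketched as follows (a simplified reading of the idea; the 0.5 continuity correction per leaf is standard, but real implementations also consider alternatives such as replacing a node by its largest branch). A sub-tree is collapsed to a leaf when the leaf's pessimistic error estimate stays within one standard error of the sub-tree's:

```python
import math

def should_prune_pep(errors_as_leaf, subtree_errors, n_leaves, n):
    # Pessimistic error estimates: add a 0.5 continuity correction per leaf.
    e_subtree = subtree_errors + 0.5 * n_leaves
    e_leaf = errors_as_leaf + 0.5
    # Standard error of the subtree's (binomially modelled) error count.
    se = math.sqrt(e_subtree * (n - e_subtree) / n)
    # Prune when the single leaf is not significantly worse than the subtree.
    return e_leaf <= e_subtree + se
```

Because the comparison uses only training-set counts, PEP needs no separate pruning set: the continuity correction and the one-standard-error margin stand in for the optimism of the training error.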

While somewhat naive, reduced error pruning has the advantage of simplicity and speed.
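A minimal sketch of reduced error pruning over a dict-based tree (a hypothetical representation: leaves are `{"label": c}`, internal nodes hold `"feature"`, `"threshold"`, `"left"`, `"right"`): working bottom-up, each sub-tree is replaced by a majority-class leaf whenever that replacement does not increase error on a held-out validation set:

```python
from collections import Counter

def predict(node, x):
    # Walk from this node down to a leaf and return its class label.
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def errors(node, X, y):
    return sum(predict(node, x) != yi for x, yi in zip(X, y))

def leaf_labels(node):
    if "label" in node:
        return [node["label"]]
    return leaf_labels(node["left"]) + leaf_labels(node["right"])

def reduced_error_prune(node, X_val, y_val):
    # Bottom-up: prune the children first, then consider this node.
    # Nodes reached by no validation samples are left untouched.
    if "label" in node or not X_val:
        return node
    li = [i for i, x in enumerate(X_val) if x[node["feature"]] <= node["threshold"]]
    ri = [i for i, x in enumerate(X_val) if x[node["feature"]] > node["threshold"]]
    node["left"] = reduced_error_prune(node["left"],
                                       [X_val[i] for i in li], [y_val[i] for i in li])
    node["right"] = reduced_error_prune(node["right"],
                                        [X_val[i] for i in ri], [y_val[i] for i in ri])
    # Candidate leaf: majority vote over the leaf labels below this node
    # (unweighted here for brevity; per-leaf sample counts would be more faithful).
    leaf = {"label": Counter(leaf_labels(node)).most_common(1)[0][0]}
    # Prune whenever the single leaf does no worse on the validation set.
    if errors(leaf, X_val, y_val) <= errors(node, X_val, y_val):
        return leaf
    return node
```

Each node is visited once and evaluated only on the validation samples routed to it, which is what makes the method fast; its naivety lies in spending part of the data on validation rather than training.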

Pruning can also be applied as part of a compression scheme for a learning algorithm, removing redundant detail without compromising the model's performance.

[Figure: Before and after pruning]