ID3 algorithm

In decision tree learning, ID3 (Iterative Dichotomiser 3) is an algorithm invented by Ross Quinlan[1] used to generate a decision tree from a dataset.

ID3 is the precursor to the C4.5 algorithm, and is typically used in the machine learning and natural language processing domains.

The ID3 algorithm begins with the original set S as the root node. On each iteration, it iterates through every unused attribute of S and calculates the entropy (or the information gain) of that attribute. It then selects the attribute which has the smallest entropy (or, equivalently, the largest information gain). The set S is then split or partitioned by the selected attribute to produce subsets of the data.

The algorithm continues to recurse on each subset, considering only attributes never selected before.

Recursion on a subset may stop in one of these cases:

- Every element in the subset belongs to the same class, in which case the node is turned into a leaf node labelled with that class.
- There are no more attributes to select, but the examples still do not all belong to the same class. The node then becomes a leaf node labelled with the most common class among the examples in the subset.
- There are no examples in the subset, which happens when no example in the parent set matched a specific value of the selected attribute. The node then becomes a leaf node labelled with the most common class in the parent set.

Throughout the algorithm, the decision tree is constructed with each non-terminal node (internal node) representing the selected attribute on which the data was split, and terminal nodes (leaf nodes) representing the class label of the final subset of this branch.

ID3 uses a greedy strategy, selecting the locally best attribute on which to split the dataset at each iteration.
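The recursion described above is compact enough to sketch directly. Below is a minimal, illustrative Python implementation, assuming examples are given as dicts of attribute values plus a parallel list of class labels; the function and variable names are our own, not part of the algorithm's original presentation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from partitioning the examples on one attribute."""
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    remainder = sum(len(sub) / len(labels) * entropy(sub)
                    for sub in split.values())
    return entropy(labels) - remainder

def id3(rows, labels, attributes):
    """Recursively build a decision tree as nested dicts; leaves are class labels."""
    if len(set(labels)) == 1:            # all examples share one class
        return labels[0]
    if not attributes:                   # attributes exhausted: majority class
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: choose the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3(sub_rows, sub_labels, remaining)
    return tree
```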

To avoid overfitting, smaller decision trees should be preferred over larger ones.

ID3 is trained on a dataset to produce a decision tree, which is stored in memory.

At runtime, this decision tree is used to classify new test cases (feature vectors) by traversing the decision tree using the features of the datum to arrive at a leaf node.
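A sketch of that classification step, assuming the nested-dict tree produced by the id3 sketch above; it also assumes every attribute value seen at prediction time appeared in training.

```python
def classify(tree, example):
    """Walk the tree from the root, following the branch matching each tested
    attribute's value in the example, until reaching a leaf (class label)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))            # attribute tested at this node
        tree = tree[attr][example[attr]]   # descend along the matching branch
    return tree                            # leaf: the predicted class label

# Hypothetical toy usage with the id3 sketch above:
# tree = id3(rows, labels, ["outlook", "windy"])
# classify(tree, {"outlook": "sunny", "windy": False})
```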

Entropy H(S) is a measure of the amount of uncertainty in the (data) set S:

H(S) = \sum_{x \in X} -p(x) \log_2 p(x)

where S is the current dataset for which entropy is being calculated, X is the set of classes in S, and p(x) is the proportion of the number of elements in class x to the number of elements in S.

Entropy in information theory measures how much information is expected to be gained upon measuring a random variable; as such, it can also be used to quantify how much is unknown about the distribution of a quantity's values.

A constant quantity has zero entropy, as its distribution is perfectly known.

Therefore, the greater the entropy at a node, the less information is known about the classification of data at this stage of the tree; and therefore, the greater the potential to improve the classification here.

As such, ID3 is a greedy heuristic performing a best-first search for locally optimal entropy values.
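A small worked example of the entropy formula above; the 9-positive/5-negative class split is an illustrative toy dataset, not data from this article.

```python
import math

def entropy(proportions):
    """H(S) = sum over classes of -p(x) * log2(p(x)).
    Terms with p equal to 0 or 1 contribute nothing, so they are skipped."""
    return -sum(p * math.log2(p) for p in proportions if 0 < p < 1)

print(entropy([9/14, 5/14]))  # ~0.940 bits: the class is quite uncertain
print(entropy([0.5, 0.5]))    # 1.0 bit: maximum uncertainty for two classes
print(entropy([1.0]))         # 0: a constant quantity has zero entropy
```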

Information gain IG(S, A) is the measure of the difference in entropy from before to after the set S is split on attribute A, i.e. how much uncertainty in S was reduced by splitting on A:

IG(S, A) = H(S) - \sum_{t \in T} p(t) H(t)

where H(S) is the entropy of set S, T is the set of subsets created from splitting S by attribute A, p(t) is the proportion of the number of elements in t to the number of elements in S, and H(t) is the entropy of subset t. In ID3, information gain can be calculated (instead of entropy) for each remaining attribute.

The attribute with the largest information gain is used to split the set S on that iteration.
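As a worked illustration (with made-up counts, not data from this article): suppose 14 examples (9 positive, 5 negative) and a hypothetical binary attribute that splits them into subsets of 8 (6 positive, 2 negative) and 6 (3 positive, 3 negative).

```python
import math

def H(pos, neg):
    """Entropy of a two-class subset given its positive/negative counts."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

# IG(S, A) = H(S) - sum over subsets t of p(t) * H(t)
gain = H(9, 5) - (8 / 14) * H(6, 2) - (6 / 14) * H(3, 3)
print(round(gain, 3))  # ~0.048 bits of uncertainty removed by this split
```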

[Figure: A potential ID3-generated decision tree. Attributes are arranged as nodes by their ability to classify examples; values of attributes are represented by branches.]

[Figure: An ID3-generated decision tree used to determine whether a particular nucleotide pair within a pre-mRNA sequence corresponds to an mRNA splice site. This tree has been shown to have a 95% correct prediction rate.[2]]