However, in the context of decision trees, the term is sometimes used synonymously with mutual information, which is the conditional expected value of the Kullback–Leibler divergence of the univariate probability distribution of one variable from the conditional distribution of this variable given the other one.
In machine learning, this concept can be used to define a preferred sequence of attributes to investigate in order to most rapidly narrow down the state of the target variable X.
Such a sequence (which depends on the outcome of the investigation of previous attributes at each stage) is called a decision tree, and when applied in the area of machine learning is known as decision tree learning.
In general terms, the expected information gain is the reduction in entropy H achieved by learning the state of an attribute a: IG(T, a) = H(T) − H(T | a), where H(T | a) is the conditional entropy of T given the value of attribute a. This is intuitively plausible when interpreting entropy H as a measure of uncertainty of a random variable: learning the value of a can only reduce our uncertainty about T on average, so the information gain is non-negative, and it is zero exactly when T is independent of a.
The information gain for an attribute a is defined in terms of Shannon entropy H(·) as follows. For a value v taken by attribute a, let S_a(v) = { x ∈ T : x_a = v } denote the set of training inputs of T for which attribute a is equal to v. The collection of subsets { S_a(v) : v ∈ values(a) } defines a partition of the training set T into mutually exclusive and all-inclusive subsets, inducing a categorical probability distribution P_a(v) = |S_a(v)| / |T| on the values v of attribute a. In this representation, the information gain of T given a is the difference between the unconditional Shannon entropy of T and the expected entropy of T conditioned on a, where the expectation is taken with respect to the induced distribution on the values of a:

IG(T, a) = H(T) − Σ_v P_a(v) · H(S_a(v)) = H(T) − H(T | a).
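As a rough sketch of this definition, the following Python snippet computes the information gain of one categorical attribute over a labeled training set. The function names and the toy data are illustrative assumptions, not part of any particular library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """IG(T, a) = H(T) - sum over v of P_a(v) * H(S_a(v)) for a single attribute a."""
    total = len(labels)
    partition = {}
    for v, y in zip(values, labels):          # group the labels by attribute value v
        partition.setdefault(v, []).append(y)
    conditional = sum((len(subset) / total) * entropy(subset)
                      for subset in partition.values())
    return entropy(labels) - conditional

# Hypothetical toy data: a binary attribute and binary class labels.
a = ["yes", "yes", "no", "no", "yes", "no"]
T = ["C", "C", "NC", "NC", "C", "NC"]
print(information_gain(a, T))  # 1.0 bit: the attribute separates the classes perfectly
```

On this toy data the attribute separates the two classes perfectly, so the gain equals the full entropy of the labels (1 bit).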
Basically, entropy is the measure of impurity or uncertainty in a group of observations.[1] A group containing an even mixture of classes is very impure and has high entropy, corresponding to higher disorder and lower information value; as the group becomes more homogeneous, the entropy decreases and the information value increases.
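For instance, a group split evenly between two classes has entropy −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit, the maximum possible for two classes, whereas a perfectly homogeneous group has entropy −(1 · log2 1) = 0 bits.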
We can use information gain to determine how good the splitting of nodes in a decision tree is. In terms of entropy, information gain is defined as:

Gain = (entropy of the parent node) − (average entropy of the child nodes)   (i)

To understand this idea, let's start with an example in which we create a simple dataset and want to see whether gene mutations are related to patients with cancer. A sample labeled C has been confirmed to be cancerous, while NC means it is non-cancerous; the parent node t therefore holds a mixed collection of C and NC samples. Using this data, a decision tree can be created, with information gain used to determine the candidate splits for each node.
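As an illustration of equation (i) with made-up numbers: if the parent node has an entropy of 1 bit and a candidate split produces child nodes whose weighted average entropy is 0.6 bits, then the information gain of that split is 1 − 0.6 = 0.4 bits.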
A set of data that still contains a mixture of C and NC samples is good for learning which mutation attributes to use for splitting a node. However, once the samples reaching a node are homogeneous (the low-entropy situation described above), that node's data no longer provide useful information for further splitting.
Moving on, the entropy at the parent node t and at the left and right child nodes of the above decision tree is computed using the formulae:

H(t) = −[pC,t log2(pC,t) + pNC,t log2(pNC,t)]
H(tL) = −[pC,L log2(pC,L) + pNC,L log2(pNC,L)][1]
H(tR) = −[pC,R log2(pC,R) + pNC,R log2(pNC,R)][1]

where:
pC,t = n(t, C) / n(t) and pNC,t = n(t, NC) / n(t) are the probabilities of selecting a class 'C' or 'NC' sample at the parent node t,
pC,L = n(tL, C) / n(tL) is the probability of selecting a class 'C' sample at the left child node,
pNC,L = n(tL, NC) / n(tL) is the probability of selecting a class 'NC' sample at the left child node,
pC,R = n(tR, C) / n(tR) is the probability of selecting a class 'C' sample at the right child node,
pNC,R = n(tR, NC) / n(tR) is the probability of selecting a class 'NC' sample at the right child node,
n(tL), n(tL, C), and n(tL, NC) are the total number of samples, 'C' samples, and 'NC' samples at the left child node respectively, and
n(tR), n(tR, C), and n(tR, NC) are the total number of samples, 'C' samples, and 'NC' samples at the right child node respectively.
Following this, the average entropy of the child nodes due to the split s at node t of the above decision tree is computed as:

H(s, t) = PL H(tL) + PR H(tR)

where PL = n(tL) / n(t) is the proportion of samples sent to the left child and PR = n(tR) / n(t) is the proportion of samples sent to the right child.
Thus, by definition from equation (i):

(Information gain) = H(t) − H(s, t)

After all these steps, the gain of the candidate split s can be computed. Using the same set of formulae with the other three mutations leads to a table of the candidate splits, ranked by their information gain. The mutation that provides the most useful information is Mutation 3, so it is used to split the root node of the decision tree.
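Equation (i) translates directly into a short routine. The sketch below uses hypothetical sample counts rather than the values from the original table, which is not reproduced here; the function names are likewise illustrative.

```python
from math import log2

def node_entropy(n_c, n_nc):
    """Entropy of a node containing n_c class-C and n_nc class-NC samples."""
    total = n_c + n_nc
    h = 0.0
    for n in (n_c, n_nc):
        if n:  # treat 0 * log2(0) as 0
            p = n / total
            h -= p * log2(p)
    return h

def split_gain(parent, left, right):
    """Equation (i): gain = H(t) - [P_L * H(t_L) + P_R * H(t_R)].

    Each argument is a (num_C, num_NC) pair of sample counts."""
    n_t = sum(parent)
    p_l = sum(left) / n_t
    p_r = sum(right) / n_t
    return node_entropy(*parent) - (p_l * node_entropy(*left) + p_r * node_entropy(*right))

# Hypothetical candidate split: 4 C and 3 NC samples at the parent node,
# with the mutation sending (3 C, 0 NC) to the left and (1 C, 3 NC) to the right.
print(split_gain(parent=(4, 3), left=(3, 0), right=(1, 3)))
```

Applying the same routine to each candidate mutation in turn would produce the ranking of splits described above.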
The root can then be split, with all the samples passed through and assigned to the appropriate child nodes. Because those child nodes may still contain a mixture of C and NC samples, the tree can be split again at the child nodes to possibly achieve something even more accurate.
The best course of action is to test the tree on other samples that are not part of the original set.
Information gain is the basic criterion used to decide whether a feature should be used to split a node.[1] Although information gain is usually a good measure for deciding the relevance of an attribute, it is not perfect.
A notable problem occurs when information gain is applied to attributes that can take on a large number of distinct values.
For example, suppose that one is building a decision tree for some data describing the customers of a business.
Information gain is often used to decide which of the attributes are the most relevant, so they can be tested near the root of the tree.
One of the input attributes might be an identifier such as the customer's membership or credit card number. This attribute has a high mutual information, because it uniquely identifies each customer, but we do not want to include it in the decision tree: deciding how to treat a customer based on such a number is unlikely to generalize to customers we have not seen before.
This issue can also occur when the samples being tested have multiple attributes with many distinct values. In that case, the information gain of each of these attributes can be much higher than that of attributes with only a few distinct values.
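This effect can be demonstrated with a small, self-contained sketch; the customer data below are invented purely for illustration. An identifier-like attribute that takes a distinct value for every sample attains the maximum possible information gain even though it has no predictive value for new customers.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())

# Made-up customer data: the class label is whether the customer churned.
churned     = ["yes", "no", "no", "yes", "no", "no"]
region      = ["north", "south", "south", "north", "north", "south"]
customer_id = ["c1", "c2", "c3", "c4", "c5", "c6"]  # unique per customer

print(information_gain(region, churned))       # some gain, but less than the identifier
print(information_gain(customer_id, churned))  # equals H(churned), the maximum possible
```

Because every customer_id group contains exactly one sample, each induced subset is trivially pure, so the conditional entropy is zero and the gain equals the full entropy of the labels.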