https://kdd.org/exploration_files/V15-01-01-Freitas.pdf

Background on Decision Trees

A Decision Tree is a supervised learning algorithm that is mostly used for classification problems but can also be used for regression. It works with both categorical and continuous input and output variables.

A Decision Tree is a tree-like model of decisions in which each internal node represents a feature (or attribute), each branch represents a decision rule on that feature, and each leaf represents an outcome.

Feature Selection: The decision of which feature to split on at each step in the tree is based on a metric. Two common metrics are Gini Impurity and Information Gain (which involves entropy).
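
As a quick illustration of the first of those metrics, here is a minimal Python sketch of Gini Impurity computed from the class counts at a node (the function name and the example counts are purely illustrative, not taken from any particular library):

    def gini_impurity(class_counts):
        """Gini impurity of a node, given how many samples fall in each class."""
        total = sum(class_counts)
        proportions = [count / total for count in class_counts]
        # Probability of mislabeling a random sample if labels were drawn
        # according to the class proportions at this node.
        return 1.0 - sum(p ** 2 for p in proportions)

    print(gini_impurity([50, 50]))   # 0.5 -> maximally mixed binary node
    print(gini_impurity([100, 0]))   # 0.0 -> pure node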

Entropy: Entropy is a measure of uncertainty, randomness, or impurity. In the context of Decision Trees, entropy provides a measure of the impurity of the input set.

In an ideal case, a decision tree would be able to split in such a way that it perfectly separates the classes in the output variable. This would mean that each leaf node contains data points belonging to only one class, making the leaves pure.

For a binary classification problem, the entropy of a node can be calculated as: Entropy = -p*log2(p) - q*log2(q), where p and q are the proportions of the data that belong to each of the two classes (so p + q = 1).

If a node is pure (all data points belong to one class), the entropy is 0, because there's no uncertainty or randomness left. If the data is evenly split between classes, the entropy is 1, indicating maximum uncertainty or randomness.
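
As a small Python sketch of that formula (the helper name is just illustrative), both boundary cases above fall straight out of the code:

    import math

    def binary_entropy(p):
        """Entropy of a binary node, where p is the proportion of one class."""
        q = 1.0 - p
        # Treat 0 * log2(0) as 0 so pure nodes don't raise a math error.
        return sum(-x * math.log2(x) for x in (p, q) if x > 0)

    print(binary_entropy(1.0))   # 0.0 -> pure node, no uncertainty
    print(binary_entropy(0.5))   # 1.0 -> evenly split, maximum uncertainty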

Information Gain: Information Gain measures the reduction in entropy achieved because of the split. The feature providing the highest Information Gain is selected as the splitting attribute.

Here's how you compute it: Information Gain = Entropy(Parent) - Weighted Sum of Entropy(Children), where each child's entropy is weighted by the fraction of the parent's samples that end up in that child.

The algorithm will calculate the Information Gain for every feature and choose the one with the largest Information Gain as the splitting feature at each node.
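
For concreteness, here is one way that calculation might look in Python (an illustrative sketch, not any particular library's implementation; the labels below are a made-up binary toy set):

    import math

    def entropy(labels):
        """Entropy of a list of binary (0/1) labels."""
        p = sum(labels) / len(labels)
        return sum(-x * math.log2(x) for x in (p, 1 - p) if x > 0)

    def information_gain(parent_labels, child_groups):
        """Entropy(Parent) minus the weighted sum of the child entropies."""
        n = len(parent_labels)
        weighted_children = sum(
            len(group) / n * entropy(group) for group in child_groups
        )
        return entropy(parent_labels) - weighted_children

    parent = [0, 0, 1, 1]
    # A split that separates the classes perfectly recovers the full parent entropy:
    print(information_gain(parent, [[0, 0], [1, 1]]))   # 1.0
    # A split that leaves both children as mixed as the parent gains nothing:
    print(information_gain(parent, [[0, 1], [0, 1]]))   # 0.0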

This process is repeated recursively until the tree is fully grown. The recursion can be stopped early by imposing conditions such as a maximum tree depth or a minimum number of samples per node, to help prevent overfitting.
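
In practice these stopping conditions are usually exposed as hyperparameters. For example, if you're using scikit-learn (an assumption here, not something stated above), DecisionTreeClassifier lets you select entropy as the split criterion and cap the tree's growth:

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Use information gain (entropy) for splits and limit growth to curb overfitting.
    tree = DecisionTreeClassifier(
        criterion="entropy",   # split on information gain rather than the default Gini
        max_depth=3,           # stop splitting below this depth
        min_samples_leaf=5,    # every leaf must keep at least 5 training samples
        random_state=0,
    )
    tree.fit(X, y)
    print(tree.get_depth(), tree.score(X, y))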

In Decision Trees, we want to place the features that give us the most "information" near the top of the tree, thus creating a shorter and more efficient tree. The Information Gain metric helps us determine which features give us the most information by choosing the ones that minimize the entropy (randomness) after the split.

In other words, we are looking for features that create child nodes that are as pure as possible, meaning they contain data points mostly from one class. These features have higher Information Gain because they help us make more confident predictions.

In terms of the parent and child entropy, we ideally want high entropy at the parent node and low entropy at the child nodes. Here's why:

  1. High entropy at the parent node means that the current (unsplit) data is quite mixed between the different classes.
  2. Low entropy at the child nodes means that, after splitting on a certain feature, the resulting groups (child nodes) are mostly pure, i.e., dominated by a single class.

By maximizing the Information Gain, which is Entropy(Parent) - Weighted Sum of Entropy(Children), we are essentially trying to find the feature that gives us child nodes with as low entropy as possible. This, in turn, means we're finding groups that are easier to make a decision for, because they contain data points that mostly belong to a single class. This is the intuition behind the math.
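
To make that selection step concrete, a toy version might loop over candidate features, compute the information gain of splitting on each one, and keep the best. The dataset and feature names below are made up purely for illustration:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy of a list of class labels."""
        total = len(labels)
        return sum(-c / total * math.log2(c / total) for c in Counter(labels).values())

    def info_gain(labels, feature_values):
        """Gain from splitting the labels by the values of a single feature."""
        groups = {}
        for value, label in zip(feature_values, labels):
            groups.setdefault(value, []).append(label)
        weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - weighted

    labels = ["yes", "yes", "no", "no", "yes", "no"]
    features = {
        "windy":   [0, 0, 1, 1, 0, 1],   # tracks the labels exactly -> high gain
        "weekend": [1, 0, 1, 0, 1, 0],   # unrelated to the labels   -> low gain
    }

    gains = {name: info_gain(labels, values) for name, values in features.items()}
    print(gains)                                    # windy: 1.0, weekend: ~0.08
    print("split on:", max(gains, key=gains.get))   # split on: windy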