Decision Tree (ID3)

Decision Tree, ID3, Entropy, Information Gain, Classification


A Decision Tree helps us make predictions by mapping out different choices in a tree-like structure. It works like a game of 20 questions, automatically figuring out the best questions to ask to split the data. The ID3 (Iterative Dichotomiser 3) version specifically uses the concepts of Entropy and Information Gain to mathematically decide which feature separates the data most cleanly at every step.

Entropy & Information Gain Formulas

Entropy(S) = -\frac{P}{P+N} \log_2\left(\frac{P}{P+N}\right) - \frac{N}{P+N} \log_2\left(\frac{N}{P+N}\right)

What do these variables mean?

  • P: Number of positive labels (e.g., 'Yes' or 'Up').
  • N: Number of negative labels (e.g., 'No' or 'Down').
  • log₂: Base-2 logarithm.
  • Information Gain: Entropy(Target) − Entropy(Feature).
  • Feature Entropy: The weighted sum of the entropies of its individual values: \sum_i \left[ \frac{p_i + n_i}{P + N} \right] \cdot \text{Entropy}(\text{value}_i)
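The entropy formula above can be sketched in a few lines of Python (the function name and the two-count signature are illustrative, not from the original):

```python
from math import log2

def entropy(p, n):
    """Entropy of a set containing p positive and n negative labels."""
    total = p + n
    if p == 0 or n == 0:
        return 0.0  # a pure set has zero entropy
    fp, fn = p / total, n / total
    return -fp * log2(fp) - fn * log2(fn)

entropy(9, 5)   # ≈ 0.940 (the classic 'Play Tennis' target split)
entropy(4, 4)   # exactly 1.0: a 50/50 split is maximum uncertainty
```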

How Does it Work?

1. Calculate the Entropy of your main Target Class based on the total Positives and Negatives.

2. For every feature column, list its unique values. Count the Positives (pᵢ) and Negatives (nᵢ) for each value, and calculate their individual entropies.

3. Calculate the total Feature Entropy by taking the weighted average of those individual value entropies.

4. Calculate the Information Gain for the feature by subtracting the Feature Entropy from the Target Class Entropy.

5. Pick the feature with the Highest Information Gain! This becomes your Root Node.

6. Draw branches for each unique value of that feature. If a branch leads to pure data (all Yes or all No), cap it with a Leaf Node.

7. If a branch has a mix of Yes and No (confusion), filter the dataset to only include the rows for that branch, remove the feature you just used, and repeat the whole process from Step 1!
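Steps 1 through 5 can be sketched in Python on a tiny made-up weather table (the dataset and the helper names are illustrative, not from the original):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Step 1: entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Steps 2-4: Gain = Entropy(target) - weighted entropy of the feature's value subsets."""
    total = len(labels)
    feature_entropy = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        feature_entropy += len(subset) / total * entropy(subset)
    return entropy(labels) - feature_entropy

# Toy weather data: does the team play?
rows = [
    {"Outlook": "Sunny", "Windy": "No"},    {"Outlook": "Sunny", "Windy": "Yes"},
    {"Outlook": "Overcast", "Windy": "No"}, {"Outlook": "Rain", "Windy": "No"},
    {"Outlook": "Rain", "Windy": "Yes"},
]
labels = ["No", "No", "Yes", "Yes", "No"]

# Step 5: the feature with the highest gain becomes the Root Node.
root = max(rows[0], key=lambda f: information_gain(rows, labels, f))
```

Recursing down each branch (Steps 6 and 7) just reapplies the same two functions to the filtered subset with the chosen feature removed.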

Important Rules & Conventions

  • Exam Trick 1 (Pure Node): If a target column's values are all the same (e.g., 5 Yes, 0 No), do not calculate. Entropy is exactly 0!
  • Exam Trick 2 (Equal Split): If a target column has an exact 50/50 split (e.g., 4 Yes, 4 No), do not calculate. Entropy is exactly 1!
  • When recursing down a branch due to confusion, physically cross out the column you just used. You never calculate Gain for a feature twice in the same path.

Advantages

  • Incredibly visual and easy to explain the decision-making process to non-technical people.
  • Does not require feature scaling or normalization (it doesn't care if a number is 10 or 10,000, only how it splits).
  • Handles non-linear relationships naturally; later variants such as C4.5 extend it to cope with missing data.

Disadvantages

  • × Highly prone to 'Overfitting'. If you let a tree grow too deep, it memorizes the training data perfectly but fails terribly on new data.
  • × Instability: A tiny change in the training dataset can cause a completely different feature to be chosen as the root, altering the entire tree.
  • × Building trees with continuous numeric data (like exact salaries) is computationally expensive because it has to test many split thresholds.
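The cost of continuous features comes from the number of candidate thresholds that must each be scored with a full information-gain calculation. A minimal illustration (the variable names and salary figures are hypothetical):

```python
salaries = [31000, 52000, 52000, 48000, 75000]

# Candidate thresholds: midpoints between adjacent distinct sorted values.
values = sorted(set(salaries))
thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]

# n distinct values yield up to n-1 thresholds, and each one requires
# its own information-gain evaluation, for every numeric feature.
```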

Summary

The ID3 Decision Tree algorithm is all about reducing uncertainty. By calculating Entropy and Information Gain, it systematically figures out which feature splits your dataset into the purest groups. While prone to overfitting without techniques like pruning, it remains one of the most intuitive models in machine learning.