Apriori Algorithm

Apriori, Association Rules, Market Basket Analysis, Support, Confidence, Frequent Itemsets


The Apriori algorithm is the foundation of recommendation engines (like 'Customers who bought this also bought...'). It is an unsupervised learning technique that searches through massive datasets to find hidden relationships between items. Instead of predicting a specific number or class, it generates 'Association Rules'. It operates on a very logical principle: if a combination of items is frequent, then all of its individual subsets must also be frequent. By aggressively filtering out infrequent items early on, it saves the computer from having to calculate every possible combination in the universe.

Support & Confidence

$$
\begin{aligned}
\text{Support}(X) &= \frac{\text{Count of } X}{\text{Total Transactions}} \\[0.5em]
\text{Confidence}(X \rightarrow Y) &= \frac{\text{Count of } (X \text{ and } Y)}{\text{Count of } X}
\end{aligned}
$$

What do these variables mean?

  • $Support$ — How popular an itemset is. If it appears in 3 out of 10 transactions, the support is 30%. You use this to filter out rare items.
  • $Confidence$ — How reliable a rule is: 'If a person buys X, how likely are they to buy Y?'
  • $X \rightarrow Y$ — This is an 'Association Rule'. $X$ is the 'If' (Antecedent), and $Y$ is the 'Then' (Consequent).
  • The Confidence Formula — It simply asks: 'Out of all the times $X$ happened, how many times did $Y$ happen alongside it?'
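Both formulas reduce to simple counting. As a minimal Python sketch (the basket data below is invented for illustration):

```python
# Minimal sketch: computing Support and Confidence from raw transactions.
# The example baskets are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Count(X and Y) / Count(X): how often Y appears when X does."""
    count_x = sum(1 for t in transactions if antecedent <= t)
    count_xy = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return count_xy / count_x

print(support({"bread"}, transactions))              # 3/4 = 0.75
print(confidence({"bread"}, {"milk"}, transactions)) # 2/3 ≈ 0.67
```

Note that `itemset <= t` is Python's subset test for sets, which matches the definition: a transaction "contains" an itemset only if it contains every item in it.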

How Does it Work?

1. Set Thresholds: Define your Minimum Support (e.g., must appear in 3 transactions) and Minimum Confidence (e.g., rule must be 80% accurate).

2. Find 1-Itemsets (L1): Count every individual item across all transactions. Drop any item that doesn't meet the Minimum Support.

3. Form Pairs (L2): Take the surviving items from L1 and pair them up. Count how often these pairs appear *together* in the same transaction. Drop pairs below the Minimum Support.

4. Form Triplets (L3) & Prune: Combine surviving pairs into triplets. *Crucial Step:* Before counting, check if every 2-item subset of the triplet survived L2. If even one pair failed, drop the triplet immediately! Then, count the remaining triplets and filter by Support.

5. Repeat: Keep increasing the size (L4, L5...) until you can't form any more valid combinations.

6. Generate Rules: Take your final frequent itemsets, create all possible 'If/Then' permutations, and calculate their Confidence. Keep only the rules that beat your Minimum Confidence threshold!
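The whole loop above fits in a short function. Here is a minimal Python sketch for small datasets (the function signature, variable names, and sample baskets are my own, not a standard library API):

```python
from itertools import combinations

def apriori(transactions, min_support_count=2, min_confidence=0.8):
    """Minimal Apriori sketch: frequent itemsets + rules, using raw counts."""
    transactions = [frozenset(t) for t in transactions]  # dedupe within rows

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step 2 -- L1: keep single items that meet the support threshold.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if count(frozenset([i])) >= min_support_count}
    all_frequent = set(frequent)

    # Steps 3-5 -- grow itemsets until no candidates survive.
    k = 2
    while frequent:
        # Candidate generation: union pairs of surviving (k-1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset must itself have survived the last round.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if count(c) >= min_support_count}
        all_frequent |= frequent
        k += 1

    # Step 6 -- rules: every X -> Y split of each frequent itemset.
    rules = []
    for itemset in all_frequent:
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = count(itemset) / count(antecedent)
                if conf >= min_confidence:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return all_frequent, rules

baskets = [{"bread", "milk"}, {"bread", "milk", "butter"},
           {"bread", "butter"}, {"bread", "milk"}]
itemsets, rules = apriori(baskets, min_support_count=2, min_confidence=0.75)
```

In this toy run, {milk, butter} only appears once, so it is dropped at the pair stage; the triplet {bread, milk, butter} is then pruned without ever being counted, exactly as step 4 describes.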

Important Rules & Conventions

  • Exam Trick 1: When calculating Confidence on an exam, you don't need percentages! Just use the raw frequency counts: $Confidence = \text{Count}(X \text{ and } Y) / \text{Count}(X)$. It is much faster and prevents rounding errors.
  • Exam Trick 2: Remove duplicates inside a single row first! If a transaction is listed as {apple, apple, banana}, treat it strictly as {apple, banana}. Apriori only cares *whether* an item is in the basket, not how many times it appears.
  • Exam Trick 3: The Pruning Rule saves your life. If {A, B} was dropped in step 2, then {A, B, C} is mathematically impossible to be frequent. Don't even bother counting it in step 3!
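Tricks 2 and 3 translate directly into code (a small Python illustration; the itemsets here are invented):

```python
from itertools import combinations

# Exam Trick 2: duplicates inside one transaction collapse away with a set.
row = ["apple", "apple", "banana"]
basket = set(row)  # {'apple', 'banana'}

# Exam Trick 3: the pruning rule. If any 2-item subset of a candidate
# triplet failed the support threshold, the triplet cannot be frequent.
frequent_pairs = {frozenset(p) for p in [("A", "C"), ("B", "C")]}  # {A, B} failed
candidate = frozenset({"A", "B", "C"})
is_viable = all(frozenset(s) in frequent_pairs for s in combinations(candidate, 2))
print(is_viable)  # False -- {A, B} is missing, so don't bother counting {A, B, C}
```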

Advantages

  • Extremely intuitive results. The final 'If A, then B' rules are incredibly easy to explain to non-technical business managers.
  • The 'Apriori Property' (pruning) drastically reduces the amount of math required compared to blindly testing every possible combination of items.
  • Requires absolutely no labeled training data (purely unsupervised).

Disadvantages

  • × Requires multiple scans of the database. If you have 10 million transactions, scanning the database over and over for L1, L2, L3... is computationally brutal.
  • × Generates an astronomical number of candidates in the early steps (like C2) if the dataset has a lot of unique items.
  • × If you set your minimum thresholds too low, the algorithm will bury you in thousands of useless or 'obvious' rules (e.g., 'If Bread, then Milk').

Summary

Apriori is a filtering algorithm built on the power of subsets. By systematically counting items and aggressively pruning combinations that fall below a support threshold, it distills thousands of transactions down to a few highly confident, actionable rules. It is the grandfather of modern market basket analysis.