Confusion Matrix: A Complete Solved Numerical Example
Scenario: Fraudulent Transaction Detection
The Objective: A bank's fraud detection model was evaluated on 200 recent transactions. The results are recorded below.
Core Mechanics
- Decoding the Acronyms: Read them backward! The second letter (P or N) is what your model guessed. The first letter (T or F) tells you if that guess was True (Right) or False (Wrong).
- The Accuracy Trap: High accuracy is dangerously misleading if your classes are imbalanced. If 99% of your data belongs to one class, a lazy model can reach 99% accuracy by just guessing that majority class every single time, while missing every single rare case you actually care about.
- Precision (Quality Control): Of everything your model claimed was Positive, how many actually were? It measures how much you can trust your model when it screams "Positive!"—too many false alarms destroy this score.
- Recall (The Dragnet): Of all the actual Positives hidden in the real data, how many did you catch? Recall is about your model's ability to hunt down every single target; missing real ones destroys this score.
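The Accuracy Trap above can be sketched in a few lines. This is a minimal illustration on made-up data (1,000 transactions with 10 frauds, an assumption chosen only to produce a 99% majority class), not the 200 transactions from the scenario.

```python
# Hypothetical imbalanced dataset: 1,000 transactions, only 10 fraudulent.
actual = [1] * 10 + [0] * 990          # 1 = fraud (positive), 0 = legitimate

# A "lazy" model that always predicts the majority class (legitimate).
predicted = [0] * len(actual)

# Accuracy: share of predictions that match the actual label.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Recall: share of actual frauds that the model flagged.
caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
recall = caught / sum(actual)

print(f"accuracy = {accuracy:.1%}")    # accuracy = 99.0%
print(f"recall   = {recall:.1%}")      # recall   = 0.0%
```

The model looks excellent on accuracy yet catches none of the frauds, which is why Precision and Recall are reported alongside it.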
Step 1: The Confusion Matrix
Here are the raw results from evaluating the model.
Positive Class = Fraudulent | Negative Class = Legitimate
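As a rough sketch of how the four cells of a confusion matrix are tallied, the snippet below counts TP, FP, FN, and TN from paired labels. The function name `confusion_counts` and the six toy labels are hypothetical; they are not the scenario's 200 transactions.

```python
def confusion_counts(actual, predicted, positive="fraud"):
    """Return (tp, fp, fn, tn) for a single positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # guessed Positive, guess was True
        elif p == positive:
            fp += 1      # guessed Positive, guess was False
        elif a == positive:
            fn += 1      # guessed Negative, guess was False
        else:
            tn += 1      # guessed Negative, guess was True
    return tp, fp, fn, tn

# Hypothetical toy labels, for illustration only.
actual    = ["fraud", "legit", "fraud", "legit", "legit", "fraud"]
predicted = ["fraud", "fraud", "legit", "legit", "legit", "fraud"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)  # 2 1 1 2
```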
Step 2: Step-by-Step Calculation Breakdown
Here is exactly how each metric was derived using the core confusion matrix formulas.
Accuracy
Out of all predictions, how many were correct?
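For reference, the standard definition of accuracy in terms of the four confusion matrix cells is:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```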
Precision
When the model predicted 'Yes', how often was it actually right?
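For reference, the standard definition of precision divides the correct positive predictions by all positive predictions:

```latex
\text{Precision} = \frac{TP}{TP + FP}
```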
Recall (Sensitivity)
Out of all the actual 'Yes' cases, how many did the model successfully find?
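For reference, the standard definition of recall divides the correct positive predictions by all actual positives:

```latex
\text{Recall} = \frac{TP}{TP + FN}
```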
F1 Score
The harmonic mean of Precision and Recall. It is high only when both are high, so it forces a balance between them.
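For reference, the standard F1 formula can be written either in terms of Precision and Recall or directly in terms of the confusion matrix cells; the two forms are algebraically equivalent:

```latex
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```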
Step 3: Evaluation Metrics Summary
Using the values above, we calculate the four primary performance metrics for this classification model.
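A minimal sketch of computing all four metrics from the matrix cells follows. The helper name `metrics` is hypothetical, and the counts passed to it are placeholders that merely sum to 200; they are not the scenario's actual matrix, so the printed values will not match the figures quoted for this model. The sketch also assumes the denominators are nonzero.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Placeholder counts (ASSUMED for illustration, not the source's values).
accuracy, precision, recall, f1 = metrics(tp=30, fp=10, fn=20, tn=140)

print(f"Accuracy:  {accuracy:.2%}")   # Accuracy:  85.00%
print(f"Precision: {precision:.2%}")  # Precision: 75.00%
print(f"Recall:    {recall:.2%}")     # Recall:    60.00%
print(f"F1 Score:  {f1:.2%}")         # F1 Score:  66.67%
```

Substituting the real TP, FP, FN, and TN from the matrix should reproduce the summary values.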
Final Takeaway
The positive class represents a Fraudulent transaction. A False Negative (missed fraud) is significantly more costly than a False Positive (a legitimate transaction flagged for review), so Recall deserves particular attention here. Notice the tension between the metrics! The model's Precision (72.7%) and Recall (80%) differ, and the F1 Score is not their simple average: adding them up and dividing by two gives about 76.4%, while the F1 Score, their harmonic mean, is 76.19%. Confusing the two is a classic exam trap!
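The difference between the simple average and the harmonic mean can be checked directly from the two percentages quoted above. This sketch uses the rounded Precision (72.7%), so its F1 comes out marginally below the quoted 76.19%, which presumably was computed from the unrounded value.

```python
precision = 0.727   # 72.7%, as quoted above (rounded)
recall = 0.80       # 80%

# Simple (arithmetic) average: the tempting but wrong shortcut.
arithmetic_mean = (precision + recall) / 2

# F1: the harmonic mean of Precision and Recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"arithmetic mean = {arithmetic_mean:.4f}")  # about 0.7635
print(f"F1 (harmonic)   = {f1:.4f}")               # about 0.7618
```

The harmonic mean never exceeds the arithmetic mean, and the gap widens as Precision and Recall move further apart.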