K-Means Clustering
K-Means, Unsupervised Learning, Clustering, Centroids, Euclidean Distance, Iterative
K-Means is a powerhouse of 'Unsupervised Learning'. Unlike previous algorithms where we had a target Class Label to predict, K-Means looks at raw, unlabeled data and discovers hidden groupings all by itself. It works by placing central anchor points (called 'Centroids') into the data, then iteratively shuffling them around until the data points are neatly separated into $k$ distinct clusters. It is heavily used in the real world for things like customer segmentation and image compression.
The Distance Formula
The straight-line (Euclidean) distance between a data point $(x_1, y_1)$ and a centroid $(x_2, y_2)$ is:

$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

What do these variables mean?
- $d$ — The Euclidean distance. This calculates the straight-line distance between a data point and a centroid.
- $(x_1, y_1)$ — The coordinates of the specific data point you are currently checking.
- $(x_2, y_2)$ — The coordinates of the cluster's centroid (center point).
- The Formula: It is just the Pythagorean theorem! You subtract the $x$'s, square them, subtract the $y$'s, square them, add them together, and take the square root to find the exact distance.
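As a quick sanity check, the formula can be written as a tiny Python function (the function name and the example coordinates here are just for illustration):

```python
import math

def euclidean_distance(point, centroid):
    """Straight-line distance between a 2-D point and a centroid."""
    x1, y1 = point
    x2, y2 = centroid
    # Subtract, square, add, square-root: the Pythagorean theorem.
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# A point at (3, 4) is exactly 5 units from a centroid at the origin.
print(euclidean_distance((3, 4), (0, 0)))  # 5.0
```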
How Does it Work?
Initialize $k$: Decide how many clusters $k$ you want to form.
Place Initial Centroids: Pick $k$ data points from your dataset to act as your starting cluster centers (Centroids).
Step 1 (Expectation): Calculate the Euclidean distance between every single data point and all available centroids.
Assign Clusters: Look at the distances for each point. Assign the point to the cluster of the centroid it is closest to.
Step 2 (Maximization): Now that points are grouped, calculate the *new* centroid for each cluster by finding the Mean (average) of all the $x$ and $y$ values of the points assigned to it.
Check for Convergence: Did any data points change their cluster assignment from the last iteration? If yes, go back to Step 1 with your new centroids. If no, you are done!
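The steps above can be sketched in plain Python. This is a minimal teaching sketch, not a production implementation; the function names and the small example dataset are made up for illustration:

```python
import math
import random

def euclidean(p, q):
    """Straight-line distance between two points of any dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, max_iters=100, seed=0):
    random.seed(seed)
    # Place initial centroids: pick k data points at random.
    centroids = random.sample(points, k)
    assignments = [None] * len(points)
    for _ in range(max_iters):
        # Step 1 (Expectation): assign each point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Check for convergence: stop when no point changed cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 2 (Maximization): move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments

# Two obvious blobs: one near (1, 1.5), one near (8.5, 8.5).
points = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
centroids, labels = k_means(points, k=2)
```

Note how the loop mirrors the steps exactly: assign, check for changes, recompute means, repeat.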
Important Rules & Conventions
- Exam Trick 1: Draw a massive table! List your data points down the left, make a column for 'Distance to $c_1$', 'Distance to $c_2$', etc., and a final column for the 'Assigned Cluster'. It prevents getting lost in the math.
- Exam Trick 2: If a cluster loses all of its points during an iteration (an empty cluster), standard exam protocol is to just leave that centroid exactly where it is for the next round.
- Remember: The algorithm doesn't know what the clusters 'mean' (e.g., it doesn't know it found 'Tall People' vs 'Short People'). It only knows they are mathematically close together.
Advantages
- ✓ Incredibly fast and computationally efficient, making it highly scalable to massive real-world datasets.
- ✓ Simple to understand and implement from scratch.
- ✓ Guaranteed to converge (to a local optimum) in a finite number of iterations.
Disadvantages
- × You have to manually guess or choose the value of $k$ before running the algorithm.
- × Extremely sensitive to your starting centroids. A bad random starting point can lead to terrible clusters and longer calculation times.
- × Highly sensitive to outliers. A single extreme outlier can pull a centroid way off course because K-Means relies on the mean (average).
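The outlier problem is easy to demonstrate with plain arithmetic. The numbers below are an invented toy example: four points sitting near 2.0, plus one extreme value:

```python
# A single extreme outlier drags the cluster mean far from the bulk of the data.
cluster = [2.0, 2.1, 1.9, 2.0]
mean_clean = sum(cluster) / len(cluster)

with_outlier = cluster + [100.0]
mean_outlier = sum(with_outlier) / len(with_outlier)

print(mean_clean)    # 2.0 — right in the middle of the points
print(mean_outlier)  # 21.6 — nowhere near any actual data point
```

Because the centroid is the mean, that one value of 100 drags it to 21.6, nowhere near any real point in the cluster.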
Summary
K-Means is the ultimate 'sorting hat' for unlabeled data. By simply measuring distances and recalculating averages in a repetitive loop, it effortlessly groups similar data points together. While it struggles with massive outliers, its sheer speed and simplicity make it the most popular clustering algorithm in data science.