K-Means Clustering

K-Means, Unsupervised Learning, Clustering, Centroids, Euclidean Distance, Iterative


K-Means is a powerhouse of 'Unsupervised Learning'. Unlike previous algorithms, where we had a target class label to predict, K-Means looks at raw, unlabeled data and discovers hidden groupings all by itself. It works by placing central anchor points (called 'Centroids') into the data and iteratively shuffling them around until the data points are neatly separated into $k$ distinct clusters. It is heavily used in the real world for tasks like customer segmentation and image compression.

The Distance Formula

$d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}$

What do these variables mean?

  • $d$: The Euclidean distance. This is the straight-line distance between a data point and a centroid.
  • $X_2, Y_2$: The coordinates of the specific data point you are currently checking.
  • $X_1, Y_1$: The coordinates of the cluster's centroid (center point).
  • The Formula: It is just the Pythagorean theorem! Subtract the $X$'s and square the result, do the same for the $Y$'s, add the two squares together, and take the square root to get the exact distance.
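The formula translates directly into code. A minimal sketch in Python (the function name is ours):

```python
import math

def euclidean_distance(point, centroid):
    """Straight-line distance between a data point and a centroid
    (the Pythagorean theorem from the formula above)."""
    (x2, y2), (x1, y1) = point, centroid
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance((3, 4), (0, 0)))  # → 5.0
```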

How Does it Work?

1. Initialize $k$: Decide how many clusters you want to form.

2. Place Initial Centroids: Pick $k$ data points from your dataset to act as your starting cluster centers (Centroids).

3. Step 1 (Expectation): Calculate the Euclidean distance between every single data point and all available centroids.

4. Assign Clusters: Look at the distances for each point. Assign the point to the cluster of the centroid it is closest to.

5. Step 2 (Maximization): Now that points are grouped, calculate the *new* centroid for each cluster by finding the mean (average) of all the $X$ and $Y$ values assigned to it.

6. Check for Convergence: Did any data points change their cluster assignment since the last iteration? If yes, go back to the Expectation step with your new centroids. If no, you are done!
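The steps above map almost line-for-line onto a short Python sketch (function and variable names are ours, and a real implementation would also handle random initialization and higher dimensions):

```python
def k_means(points, centroids, max_iters=100):
    """Plain K-Means loop: assign each point to its nearest centroid,
    then move each centroid to the mean of its points, repeating until
    assignments stop changing."""
    assignments = None
    for _ in range(max_iters):
        # Expectation: label each point with the index of its nearest centroid.
        new_assignments = []
        for x, y in points:
            dists = [((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for cx, cy in centroids]
            new_assignments.append(dists.index(min(dists)))
        # Convergence check: no point changed cluster, so we are done.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Maximization: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return assignments, centroids

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9)]
labels, centers = k_means(pts, centroids=[(1, 1), (8, 8)])
print(labels)   # → [0, 0, 1, 1]
print(centers)  # → [(1.25, 1.5), (8.5, 8.5)]
```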

Important Rules & Conventions

  • Exam Trick 1: Draw a massive table! List your data points down the left, and make a column for distance to $C_1$, distance to $C_2$, etc., and a final column for the 'Assigned Cluster'. It prevents getting lost in the math.
  • Exam Trick 2: If a cluster loses all of its points during an iteration (an empty cluster), standard exam protocol is to just leave that centroid exactly where it is for the next round.
  • Remember: The algorithm doesn't know what the clusters 'mean' (e.g., it doesn't know it found 'Tall People' vs 'Short People'). It only knows they are mathematically close together.
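The table from Exam Trick 1 is easy to generate as a sanity check. A quick sketch with made-up points and centroids:

```python
points = [(1, 1), (2, 1), (4, 3), (5, 4)]
centroids = [(1, 1), (5, 4)]  # C1 and C2

# Build one row per point: (point, distance to C1, distance to C2, cluster).
rows = []
for x, y in points:
    dists = [((x - cx) ** 2 + (y - cy) ** 2) ** 0.5 for cx, cy in centroids]
    cluster = "C" + str(dists.index(min(dists)) + 1)  # 1-based, like C1/C2
    rows.append(((x, y), round(dists[0], 2), round(dists[1], 2), cluster))

# Print it the way you would draw it on paper.
print(f"{'Point':<10}{'d to C1':>8}{'d to C2':>8}  Cluster")
for point, d1, d2, cluster in rows:
    print(f"{str(point):<10}{d1:>8}{d2:>8}  {cluster}")
```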

Advantages

  • Incredibly fast and computationally efficient, making it highly scalable to massive real-world datasets.
  • Simple to understand and implement from scratch.
  • Guaranteed to converge in a finite number of iterations.

Disadvantages

  • × You have to manually guess or choose the value of $k$ before running the algorithm.
  • × Extremely sensitive to your starting centroids. A bad random starting point can lead to terrible clusters and longer calculation times.
  • × Highly sensitive to outliers. A single extreme outlier can pull a centroid way off course because K-Means relies on the mean (average).
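The outlier problem is visible with plain averages. A quick numerical illustration (made-up numbers): one extreme value drags the mean far from where most of the cluster sits.

```python
cluster = [1.0, 2.0, 3.0, 2.0]            # a tight little cluster on one axis
mean = sum(cluster) / len(cluster)
print(mean)                                # → 2.0, right in the middle

cluster_with_outlier = cluster + [100.0]   # one extreme outlier joins in
mean2 = sum(cluster_with_outlier) / len(cluster_with_outlier)
print(mean2)                               # → 21.6, nowhere near the real cluster
```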

Summary

K-Means is the ultimate 'sorting hat' for unlabeled data. By simply measuring distances and recalculating averages in a repetitive loop, it effortlessly groups similar data points together. While it struggles with massive outliers, its sheer speed and simplicity make it the most popular clustering algorithm in data science.