K-Means Clustering

## At a Glance - **Definition:** An unsupervised algorithm that partitions points into $K$ groups by alternately assigning each point to its nearest centroid, then moving each centroid to the mean of its assigned points. - **Formula:** minimise $J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2$ - **Range / output:** returns a hard cluster label $c_i \in \{1, \dots, K\}$ for every point, plus the $K$ centroid coordinates. - **Assumptions:** clusters are roughly spherical and comparable in size; $K$ is chosen in advance; distance is Euclidean. - **Also known as:** Lloyd's algorithm; the modern variant is k-means++; the same procedure is called vector quantisation in signal processing. ## Why It Exists The number of ways to split $n$ points into $K$ groups grows as $\mathcal{O}(K^n)$ so brute force, trying every possible grouping, is hopeless even for small datasets. K-means is a cheap, iterative shortcut: instead of searching all partitions, it starts from a guess and repeatedly applies two steps that each lower the same objective, sliding downhill to a solution in a handful of passes. It gives up the guarantee of the *global* best partition in exchange for finding a *good* one quickly, sort of like stochastic gradient descent. ## Demo! <div class="ml-widget" data-algo="kmeans"></div> **Tips:** - When the assignment lines re-colour, that is the assignment step $c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2$ choosing the nearest centroid for each point. When the centroid markers jump, that is the update step $\mu_j = \frac{1}{|S_j|}\sum_{i \in S_j} x_i$ moving each centroid to the mean of the points that just chose it. - The inertia readout is the objective $J$ itself, and it falls at every step because each of the two motions can only decrease it. - Try setting $K$ different from the true cluster count. With too few centroids the algorithm is forced to merge genuine clusters. With too many it splits one blob across several centroids. - Try hitting Clear and drawing your own data. When the clusters aren't roughly circular and clearly separate K-Means really struggles. ## Formalization The inertia the demo reports, and the quantity every step decreases, is the within-cluster sum of squared distances: $ J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2 $ - **$x_i$** `[d]`: the $i$-th data point, a vector in $\mathbb{R}^d$ (the demo uses $d = 2$). - **$c_i$** `[scalar]`: the cluster index assigned to point $i$, an integer in $\{1, \dots, K\}$. - **$\mu_j$** `[d]`: the centroid of cluster $j$, the mean of all points currently assigned to it. - **$S_j$** `[set]`: the set of point indices assigned to cluster $j$, so $|S_j|$ is that cluster's size. The algorithm minimises $J$ by coordinate descent, alternating two steps until the assignments stop changing: $ \textbf{assign:} \quad c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 $ $ \textbf{update:} \quad \mu_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i $ Each step decreases $J$ (or leaves it unchanged), so the sequence converges. The update step is undefined when a cluster is empty ($|S_j| = 0$) — a real situation the implementation must handle explicitly rather than divide by zero. ## Building the Intuition (Deep Dive) The reason k-means converges at all is that both steps optimise the *same* objective, each holding one set of variables fixed. The assignment step fixes the centroids and chooses the cluster labels that minimise $J$; the update step fixes the labels and chooses the centroids that minimise $J$ (the mean is exactly the point that minimises summed squared distance to a set). Alternating two minimisations of one quantity is classic coordinate descent, and it can only ever drive $J$ downward or leave it flat, so the process must halt.[^1] The catch is that "downward" means to the *nearest* valley, not the *lowest* one globally. The objective $J$ is non-convex in the centroid positions: it has many local minima, and which one the algorithm reaches is decided entirely by where the centroids start. Initialise badly and the algorithm will converge, confidently, to a partition that is plainly wrong, because no single assign-or-update step can make the large jump needed to escape that configuration. The result is stable but suboptimal. The standard remedy is **k-means++** seeding. This spreads the initial centroids out: picking the first centroid at random, then choosing each subsequent one from the data with probability proportional to $D(x)^2$, the squared distance from the data point $x$ to the nearest centroid already chosen.[^2] Points far from existing centroids are therefore most likely to be picked next, so the initial set tends to land one centroid per dense region instead of clumping. The above widget seeds with k-means++ for exactly this reason, and it is the default initialiser in libraries such as scikit-learn. A counterintuitive consequence is worth noting: the danger of bad initialisation is *worst* when the clusters are tight and well separated, not when the data is messy. Tight, distinct clusters create deep, narrow valleys in $J$. Once two centroids are assigned to the same blob, the empty cluster elsewhere is too far away for any local step to repair, and the bad partition is locked in. When clusters overlap heavily there is little structure to get wrong in the first place, every partition has comparable inertia, and seeding barely matters. **Bad seeding is _more_ of a problem on data with tight distinct clusters**. Even though that's the regime where we would expect a clustering algorithm like K-Means to work best. ## Failure Modes & Gotchas - **Bad local minimum from poor initialisation:** the algorithm converges to a visibly wrong partition because the starting centroids were clumped. Symptom: re-running on the same data gives different final inertias. Mitigation: k-means++ seeding and/or running several initialisations and keeping the lowest-inertia result (scikit-learn's `n_init`). - **$K$ must be chosen in advance:** the algorithm cannot tell you how many clusters there are, and inertia always decreases as $K$ rises, so it cannot be used directly to pick $K$. Symptom: more clusters always looks "better" by inertia. Mitigation: the elbow method, silhouette score, or a model-based criterion. - **Sensitive to outliers:** because the objective function minimizes squared distances, a single extreme point disproportionately inflates $J$ and can heavily skew a centroid's position. Symptom: a centroid is pulled far from the main density of the data, sometimes "captured" by just one or two distant points, which forces genuine clusters elsewhere to merge. Mitigation: detect and remove outliers prior to clustering, or use an algorithm less sensitive to extreme values (e.g., k-medoids or DBSCAN). - **Assumes spherical, comparable-size clusters:** because assignment is by Euclidean distance to a centroid, k-means draws roughly spherical boundaries and is biased toward equal-size groups. Symptom: elongated, curved, or very unequal clusters are split or merged incorrectly. Mitigation: a density- or model-based method (e.g. DBSCAN, Gaussian mixtures) for non-spherical structure. - **Empty clusters:** a centroid can end up with no assigned points, leaving its update undefined. Symptom: a cluster vanishes or the mean becomes NaN. Mitigation: re-seed the empty centroid to a random point (as this implementation does) or to the worst-served point. - **Sensitive to feature scale:** Euclidean distance lets large-magnitude features dominate the objective. Symptom: clusters split along one feature only. Mitigation: standardise features before clustering. - **Curse of dimensionality:** Real data often lies on a low-dimensional manifold, but k-means calculates Euclidean distance across all features equally. Symptom: Distances become meaningless in high dimensions, resulting in unstable, arbitrary clusters or unexpectedly high inertia scores. Mitigation: Perform dimensionality reduction (e.g., via Principal Component Analysis (PCA)) before clustering. ## Implementation Details & Code - **Framework Equivalents:** `sklearn.cluster.KMeans` (defaults to k-means++ init and `n_init` multiple restarts); `scipy.cluster.vq.kmeans2`. - **Complexity:** Time $O(n K d)$ per iteration for $n$ points in $d$ dimensions; typically a small number of iterations to convergence. Space $O(n d + K d)$. - **Practical Heuristics:** standardise features first; use k-means++ initialisation; run several restarts and keep the lowest inertia; choose $K$ with the elbow or silhouette method rather than inertia alone. ```python import numpy as np def kmeans(X, k, iters=100, seed=0): rng = np.random.default_rng(seed) # k-means++ seeding centroids = [X[rng.integers(len(X))]] for _ in range(1, k): d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0) probs = d2 / d2.sum() centroids.append(X[rng.choice(len(X), p=probs)]) centroids = np.array(centroids) for _ in range(iters): # assign: nearest centroid by squared distance dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2) labels = np.argmin(dists, axis=1) # update: mean of each cluster (empty clusters keep their centroid) new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j] for j in range(k)]) if np.allclose(new, centroids): break centroids = new return labels, centroids ``` ## References [^1]: Lloyd, S. P. (1982) 'Least squares quantization in PCM', *IEEE Transactions on Information Theory*, 28(2), pp. 129–137. Available at: https://doi.org/10.1109/TIT.1982.1056489 [^2]: Arthur, D. and Vassilvitskii, S. (2007) 'k-means++: The Advantages of Careful Seeding', *Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07)*, pp. 1027–1035. Available at: https://dl.acm.org/doi/10.5555/1283383.1283494