## At a Glance
- **Definition:** An unsupervised algorithm that partitions points into $K$ groups by alternately assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points.
- **Formula:** minimise $J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2$
- **Range / output:** returns a hard cluster label $c_i \in \{1, \dots, K\}$ for every point, plus the $K$ centroid coordinates.
- **Assumptions:** clusters are roughly spherical and comparable in size; $K$ is chosen in advance; distance is Euclidean.
- **Also known as:** Lloyd's algorithm; the seeding variant is k-means++; the same procedure is called vector quantisation in signal processing.
---
## Why It Exists
Grouping unlabelled points into clusters seems to call for trying every possible partition and keeping the tightest one — but the number of ways to split $n$ points into $K$ groups is astronomically large, so brute force is hopeless even for small datasets. K-means exists as a cheap, iterative shortcut: instead of searching all partitions, it starts from a guess and repeatedly applies two simple steps that each provably lower the same objective, sliding downhill to a solution in a handful of passes. The trade is honest and worth stating up front — it gives up the guarantee of the *global* best partition in exchange for finding a *good* one quickly.
## Demo
<div class="ml-widget" data-algo="kmeans"></div>
**Sandbox Mechanics:**
- **Action:** The *Data clusters* slider regenerates the point cloud as that many Gaussian blobs; the *K* slider sets how many centroids the algorithm fits. *Step* runs one iteration, *Animate* runs them on a timer until convergence, *Clear* empties the canvas, and clicking or dragging on the canvas adds points by hand.
- **Observation:** Each centroid is drawn as a ringed marker; faint lines connect every point to the centroid it is currently assigned to, and points take their centroid's colour. On each step the centroids jump to new positions and the assignment lines visibly re-snap, until nothing moves and a "converged" badge appears. The status line reports the inertia and, for generated data, the true cluster count.
- **Mathematical Mapping:** The two visible motions are exactly the two terms of the algorithm. When the assignment lines re-colour, that is the assignment step $c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2$ choosing the nearest centroid for each point; when the ringed markers jump, that is the update step $\mu_j = \frac{1}{|S_j|}\sum_{i \in S_j} x_i$ moving each centroid to the mean of the points that just chose it. The inertia readout is the objective $J$ itself, and it falls at every step because each of the two motions can only decrease it — which is why the demo always slides to a stop rather than oscillating. Setting $K$ different from the true cluster count is instructive: with too few centroids the algorithm is forced to merge genuine clusters, and with too many it splits one blob across several centroids, and in both cases the assignment lines show the compromise directly.
## Formalization
The inertia the demo reports, and the quantity every step decreases, is the within-cluster sum of squared distances:
$ J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2 $
- **$x_i$** `[d]`: the $i$-th data point, a vector in $\mathbb{R}^d$ (the demo uses $d = 2$).
- **$c_i$** `[scalar]`: the cluster index assigned to point $i$, an integer in $\{1, \dots, K\}$.
- **$\mu_j$** `[d]`: the centroid of cluster $j$, the mean of all points currently assigned to it.
- **$S_j$** `[set]`: the set of point indices assigned to cluster $j$, so $|S_j|$ is that cluster's size.
The algorithm minimises $J$ by coordinate descent, alternating two steps until the assignments stop changing:
$ \textbf{assign:} \quad c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 $
$\textbf{update:} \quad \mu_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i $
Each step decreases $J$ (or leaves it unchanged), so the sequence converges. The update step is undefined when a cluster is empty ($|S_j| = 0$) — a real situation the implementation must handle explicitly rather than divide by zero.
## Building the Intuition (Deep Dive)
The reason k-means converges at all is that both steps optimise the *same* objective, each holding one set of variables fixed. The assignment step fixes the centroids and chooses the cluster labels that minimise $J$; the update step fixes the labels and chooses the centroids that minimise $J$ (the mean is exactly the point that minimises summed squared distance to a set). Alternating two minimisations of one bounded-below quantity is classic coordinate descent, and it can only ever drive $J$ downward or leave it flat, so the process must halt.[^1]
The catch is that "downward" means *to the nearest valley, not the lowest one*. The objective $J$ is non-convex in the centroid positions: it has many local minima, and which one the algorithm reaches is decided entirely by where the centroids start. Initialise badly — two centroids inside one true cluster and none in another — and the algorithm will converge, confidently, to a partition that is plainly wrong, because no single assign-or-update step can make the large jump needed to escape that configuration. The result is stable but suboptimal.
The standard remedy is **k-means++** seeding, which spreads the initial centroids out: it picks the first centroid uniformly at random, then chooses each subsequent one from the data with probability proportional to $D(x)^2$, the squared distance from $x$ to the nearest centroid already chosen.[^2] Points far from existing centroids are therefore likely to be picked next, so the initial set tends to land one centroid per dense region instead of clumping. This widget seeds with k-means++ for exactly this reason, and it is the default initialiser in libraries such as scikit-learn.
A counterintuitive consequence is worth internalising: the danger of bad initialisation is *worst* when the clusters are tight and well separated, not when the data is messy. Tight, distinct clusters create deep, narrow valleys in $J$ — once two centroids are committed to the same blob, the empty cluster elsewhere is too far away for any local step to repair, and the bad partition is locked in. When clusters overlap heavily there is little structure to get wrong in the first place, every partition has comparable inertia, and seeding barely matters. The seeding problem is therefore a property of *strong* cluster structure, which is precisely the regime where clustering is supposed to work well.
## Failure Modes & Gotchas
- **Bad local minimum from poor initialisation:** the algorithm converges to a visibly wrong partition because the starting centroids were clumped. Symptom: re-running on the same data gives different final inertias. Mitigation: k-means++ seeding and/or running several initialisations and keeping the lowest-inertia result (scikit-learn's `n_init`).
- **$K$ must be chosen in advance:** the algorithm cannot tell you how many clusters there are, and inertia always decreases as $K$ rises, so it cannot be used directly to pick $K$. Symptom: more clusters always looks "better" by inertia. Mitigation: the elbow method, silhouette score, or a model-based criterion.
- **Assumes spherical, comparable-size clusters:** because assignment is by Euclidean distance to a centroid, k-means draws roughly spherical boundaries and is biased toward equal-size groups. Symptom: elongated, curved, or very unequal clusters are split or merged incorrectly. Mitigation: a density- or model-based method (e.g. [[DBSCAN]], Gaussian mixtures) for non-spherical structure.
- **Empty clusters:** a centroid can end up with no assigned points, leaving its update undefined. Symptom: a cluster vanishes or the mean becomes NaN. Mitigation: re-seed the empty centroid to a random point (as this implementation does) or to the worst-served point.
- **Sensitive to feature scale:** Euclidean distance lets large-magnitude features dominate the objective. Symptom: clusters split along one feature only. Mitigation: standardise features before clustering.
## Implementation Details & Code
- **Framework Equivalents:** `sklearn.cluster.KMeans` (defaults to k-means++ init and `n_init` multiple restarts); `scipy.cluster.vq.kmeans2`.
- **Complexity:** Time $O(n K d)$ per iteration for $n$ points in $d$ dimensions; typically a small number of iterations to convergence. Space $O(n d + K d)$.
- **Practical Heuristics:** standardise features first; use k-means++ initialisation; run several restarts and keep the lowest inertia; choose $K$ with the elbow or silhouette method rather than inertia alone.
```python
import numpy as np
def kmeans(X, k, iters=100, seed=0):
rng = np.random.default_rng(seed)
# k-means++ seeding
centroids = [X[rng.integers(len(X))]]
for _ in range(1, k):
d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
probs = d2 / d2.sum()
centroids.append(X[rng.choice(len(X), p=probs)])
centroids = np.array(centroids)
for _ in range(iters):
# assign: nearest centroid by squared distance
dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
labels = np.argmin(dists, axis=1)
# update: mean of each cluster (empty clusters keep their centroid)
new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
else centroids[j] for j in range(k)])
if np.allclose(new, centroids):
break
centroids = new
return labels, centroids
```
## Related Concepts
<!-- AUTHOR: add wikilinks by hand to notes that exist in the vault. Suggested neighbours from Pass 1 (not yet authored): DBSCAN (density-based, finds non-spherical clusters without a preset K), K-Nearest Neighbors (shares the "k" name but is supervised — worth disambiguating). Uncomment once the target notes exist. -->
<!-- - [[DBSCAN]] -->
<!-- - [[K-Nearest Neighbors]] -->
---
## References
[^1]: Lloyd, S. P. (1982) 'Least squares quantization in PCM', *IEEE Transactions on Information Theory*, 28(2), pp. 129–137. Available at: https://doi.org/10.1109/TIT.1982.1056489
[^2]: Arthur, D. and Vassilvitskii, S. (2007) 'k-means++: The Advantages of Careful Seeding', *Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07)*, pp. 1027–1035. Available at: https://dl.acm.org/doi/10.5555/1283383.1283494
<!-- ==========================================================================
WIDGET TECHNICAL SPECIFICATION (Phase 3) — hidden from the published note.
### SPECIFICATION: kmeans
1. STATE SCHEMA (all coordinates in model space [-1,1]):
{
seed, rng, // seeded PRNG (MLMath.makeRng) — reproducible default
points: [[x,y], ...], // current dataset
k, dataClusters, // K centroids / number of true blobs
MAX_DATA_CLUSTERS = 6, PTS_PER_BLOB = 40, BLOB_SD = 0.11,
centroids: [[x,y], ...],
assignments: [<int>, ...], // parallel to points
converged, trueClusters, // trueClusters null once hand-drawn
animating, lastStepTime, stepIntervalMs = 600,
isDragging, lastDragTime, DRAG_THROTTLE_MS = 70,
}
2. MATH ENGINE ADDITIONS (publish.js Layer 2):
- Reused: makeRng, gaussian, gaussianCluster (seeded data spawn);
euclideanSq (assign + inertia); argmin (assign); mean (update).
- Added: seedPlusPlus(points, k, rng) → number[][] — k-means++ D²-weighted
seeding; pure, randomness via the rng argument only.
3. INTERACTION MAPPING:
- Controls (buildControls): Data-clusters slider (1..6, live re-spawn),
K slider (1..8, re-seeds on change), Animate/Pause, Step, Clear.
- Canvas pointerdown / pointermove (throttled) / pointerup / pointercancel:
add a point via this.eventToModel(e); sets trueClusters=null (hand-drawn).
4. RENDER LOOP (draw(ctx, W, H); onFrame paces Animate, suppressed on reduced-motion):
- clearRect; faint per-cluster assignment lines; data points coloured by
assignment; centroids as cluster-fill + white ring on a dark halo;
"converged" badge top-right. Status: inertia, true-cluster hint (greyed
Data-clusters label when hand-drawn).
=========================================================================== -->