K-Means

Given a set of $N$ data points $\left\{\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \ldots, \boldsymbol{x}_{N}\right\}$, partition the data points into $K$ clusters. We define $\boldsymbol{\mu}_k$ as the prototype (here also the mean) of cluster $k$, and minimize the sum of squared distances of each data point to its closest prototype $\boldsymbol{\mu}_k$:

$$ J=\sum_{n=1}^N \sum_{k=1}^K z_{n k}\left\|\boldsymbol{x}_n-\boldsymbol{\mu}_k\right\|^2 $$

where $\boldsymbol{z}_n$ is a one-hot vector with $z_{n k}=1$ if $k$ is the closest cluster to $\boldsymbol{x}_n$, and $z_{n k}=0$ otherwise.
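
As a concrete reference, here is a minimal NumPy sketch of the objective $J$ (the function name and array layout are illustrative choices, not taken from the text):

```python
import numpy as np

def kmeans_objective(X, mu, z):
    """J = sum over n, k of z_nk * ||x_n - mu_k||^2.

    X  : (N, D) data points
    mu : (K, D) cluster prototypes (means)
    z  : (N, K) one-hot assignment matrix
    """
    # Squared distance of every point to every prototype, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Only the assigned terms (z_nk = 1) contribute to J
    return float((z * sq_dists).sum())
```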

Optimization algorithm (expectation-maximization (EM) algorithm):
First, the means $\boldsymbol{\mu}_k \in \mathbb{R}^D$ are initialized randomly.
Then repeat until convergence ($\boldsymbol{\mu}_k$ and $z_{n k}$ do not change for any $n$ and $k$):
1. Expectation step: With the means fixed, assign every data point to its closest cluster (this minimizes $J$ with respect to the assignments, since $J$ is linear in $z_{n k}$):

$$ z_{n k}= \begin{cases}1 & \text { if } k=\arg \min _j\left\|\boldsymbol{x}_n-\boldsymbol{\mu}_j\right\|^2 \\ 0 & \text { otherwise }\end{cases} $$

2. Maximization step: With the assignments fixed, recompute each cluster mean:
$$ \frac{\partial J}{\partial \boldsymbol{\mu}_k}=0 \Rightarrow \boldsymbol{\mu}_k=\frac{\sum_n z_{n k} \boldsymbol{x}_n}{\sum_n z_{n k}} $$
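
Putting the two steps together, a minimal NumPy sketch of one K-means run (initializing the means with randomly chosen data points and keeping the old mean for empty clusters are my own choices, not from the text):

```python
import numpy as np

def kmeans(X, K, max_iter=100, rng=None):
    """Minimal K-means loop: random init, then alternate E- and M-steps."""
    rng = np.random.default_rng(rng)
    N, D = X.shape
    # Initialize the K means with K distinct randomly chosen data points
    mu = X[rng.choice(N, size=K, replace=False)]
    z = np.zeros((N, K))
    for _ in range(max_iter):
        # E-step: assign each point to its closest prototype (one-hot z)
        sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
        z_new = np.zeros_like(z)
        z_new[np.arange(N), sq_dists.argmin(axis=1)] = 1
        # M-step: each prototype becomes the mean of its assigned points
        counts = z_new.sum(axis=0)
        mu = np.where(counts[:, None] > 0,
                      (z_new.T @ X) / np.maximum(counts, 1)[:, None],
                      mu)  # keep the old mean if a cluster is left empty
        # Converged once the assignments stop changing
        if np.array_equal(z_new, z):
            break
        z = z_new
    return mu, z
```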

Every update step in the K-means algorithm either decreases the objective $J$ or leaves it unchanged, and since there are only finitely many possible assignments, the algorithm converges. However, it may converge to a local rather than the global minimum; in practice, perform multiple random restarts and keep the best minimum found (see the sketch below).
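
One possible way to follow the restart advice, reusing the sketches above on some toy data:

```python
import numpy as np

# Toy data; kmeans() and kmeans_objective() are the sketches above
X = np.random.default_rng(0).normal(size=(200, 2))

best_J, best = np.inf, None
for seed in range(10):                      # 10 random restarts
    mu, z = kmeans(X, K=3, rng=seed)
    J = kmeans_objective(X, mu, z)
    if J < best_J:                          # keep the run with the lowest J
        best_J, best = J, (mu, z)
mu_best, z_best = best
```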

Pros:

  • Finds cluster centers that minimize within-cluster variance (a good summary of the data).
  • Simple to implement, widespread applications.

Cons:

  • Assumes every cluster is spherical (isotropic, i.e. the same spread in all directions). The Gaussian Mixture Model generalizes K-means and addresses this issue.
  • Hard membership/assignment (i.e. 1 or 0 membership). Soft variants are available.
  • Prone to local minima.
  • Need to choose $K$ in advance. Heuristics can be used (see Clustering > Determining optimal number of clusters, and the sketch after this list).
  • Can be very slow: each iteration costs $\mathrm{O}(KND)$ for $N$ data points in $D$ dimensions.
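
The notes only point to another section for the choice of $K$, so as an assumed example, one common heuristic is the elbow method, sketched here with the functions from above:

```python
# Assumed elbow heuristic: run K-means for several K and look for the point
# where the objective J stops dropping sharply ("the knee of the curve").
Js = []
for K in range(1, 11):
    mu, z = kmeans(X, K=K, rng=0)
    Js.append(kmeans_objective(X, mu, z))
# J always decreases as K grows; the K after which the decrease flattens out
# is a reasonable choice for the number of clusters.
```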