Computer ScienceAI & Machine LearningMedium

K-Means Clustering

Also known as:Lloyd's AlgorithmK-Means Algorithm

K-means clustering is an unsupervised machine learning algorithm that partitions n data points into k clusters by iteratively assigning each point to the nearest cluster centroid and recalculating centroids until convergence. The objective is to minimise the within-cluster sum of squared distances (inertia). It is widely used for customer segmentation, image compression, anomaly detection, and data summarisation.

Key Formula

Inertia J = sum over clusters j of sum over points x in cluster Cⱼ of ||x - μⱼ||²

LaTeX: J = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} \|\mathbf{x}_i - \boldsymbol{\mu}_j\|^2

Symbol	Meaning	Unit
J	Within-cluster sum of squared distances (inertia)	squared distance units
k	Number of clusters	count
C_j	Set of points assigned to cluster j	dimensionless
\boldsymbol{\mu}_j	Centroid (mean) of cluster j	same as data
\mathbf{x}_i	Data point i	same as data

Worked Example

Problem

Given 6 one-dimensional points: {2, 4, 10, 12, 3, 11}. Run one iteration of K-means with k=2 and initial centroids μ₁=2, μ₂=10.

Solution

Step 1 — Assignment: Distances from μ₁=2: |2-2|=0, |4-2|=2, |10-2|=8, |12-2|=10, |3-2|=1, |11-2|=9. Distances from μ₂=10: |2-10|=8, |4-10|=6, |10-10|=0, |12-10|=2, |3-10|=7, |11-10|=1. Cluster 1 = {2, 4, 3}, Cluster 2 = {10, 12, 11}. Step 2 — Update centroids: μ₁ = (2+4+3)/3 = 3. μ₂ = (10+12+11)/3 = 11. Step 3 — New inertia: C₁: (2-3)²+(4-3)²+(3-3)² = 1+1+0 = 2. C₂: (10-11)²+(12-11)²+(11-11)² = 1+1+0 = 2. Total J = 4.

Answer

After one iteration: Cluster 1 = {2, 3, 4} with centroid 3; Cluster 2 = {10, 11, 12} with centroid 11; Inertia = 4

K-Means Algorithm Steps

Step	Action	Goal	Notes
1. Initialise	Choose k centroids randomly (or K-means++)	Starting positions	K-means++ reduces bad starts
2. Assign	Each point → nearest centroid	Form clusters	Euclidean distance default
3. Update	Recompute centroids as cluster means	Move centres	Inertia decreases
4. Repeat	Repeat steps 2–3 until convergence	Stable clusters	Typically < 100 iterations
5. Evaluate	Compute inertia or silhouette score	Quality check	Elbow method for choosing k

Interactive Tools

Scikit-learn K-Means

Open Tool

Naftali Harris K-Means Visualisation

Open Tool

Brilliant.org Clustering

Open Tool

Wikimedia Commons, CC BY-SA

Related Terms

Computer Science

Feature Engineering

Feature engineering is the process of using domain knowledge to select, transform, or create input variables (features) from raw data to improve the performance of machine learning models. It bridges raw data and predictive algorithms by producing representations that algorithms can learn from more effectively. Techniques include normalization, one-hot encoding, polynomial feature creation, and dimensionality reduction.

Computer Science

Natural Language Processing

Natural language processing (NLP) is a field of artificial intelligence focused on enabling computers to understand, interpret, and generate human language in a useful way. It combines computational linguistics with machine learning and deep learning to process text and speech data. Core tasks include tokenisation, named entity recognition, sentiment analysis, machine translation, and question answering.

Computer Science

Support Vector Machine

A support vector machine (SVM) is a supervised learning algorithm that finds the optimal hyperplane separating two classes by maximising the margin between the nearest data points of each class, called support vectors. For non-linearly separable data, the kernel trick implicitly maps inputs into a higher-dimensional space where a linear separator exists. SVMs are effective in high-dimensional spaces and are used for classification, regression (SVR), and outlier detection.

The term "k-means" was coined by James MacQueen in 1967, with "k" representing the number of clusters and "means" referring to the arithmetic mean used to compute centroids. An equivalent algorithm was independently proposed by Forgy in 1965.

clusteringunsupervised-learningk-meanscentroidsegmentation