
Clustering


2025-09-21 · 1 min read

#linear-algebra #math #vectors #clustering

This post is based on Jong-han Kim's Linear Algebra.

Clustering objective

$G_j \subset \{1, \dots, N\}$ is group $j$, for $j = 1, \dots, k$
$c_i$ is the group that $x_i$ is in: $i \in G_{c_i}$
group representatives: $n$-vectors $z_1, \dots, z_k$
clustering objective is

$$J^{\text{clust}} = \frac{1}{N} \sum^N_{i=1}\lVert x_i - z_{c_i}\rVert^2$$

$J^{\text{clust}}$ small means good clustering
goal: choose the clustering $c_i$ and representatives $z_j$ to minimize $J^{\text{clust}}$
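As a quick sanity check on the definition, here is the objective computed on a tiny made-up example (the points, representatives, and assignments below are assumptions for illustration, not data from the post):

```python
import numpy as np

# toy example: N = 4 points in R^2, k = 2 representatives
x = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
z = np.array([[0.5, 0.0], [10.5, 0.0]])   # representatives z_1, z_2
c = np.array([0, 0, 1, 1])                # c_i: group index of x_i

# J^clust = (1/N) * sum_i ||x_i - z_{c_i}||^2
J_clust = np.mean(np.sum((x - z[c]) ** 2, axis=1))
print(J_clust)  # each point is 0.5 from its representative -> 0.25
```

Here every point sits 0.5 away from its representative, so each squared distance is 0.25 and so is their mean.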

Partitioning the vectors given the representatives

suppose the representatives $z_1, \dots, z_k$ are given; how do we assign the vectors to groups, i.e., choose $c_1, \dots, c_N$?

$c_i$ only appears in the term $\lVert x_i - z_{c_i}\rVert^2$ in $J^{\text{clust}}$
to minimize over $c_i$, choose $c_i$ so $\lVert x_i - z_{c_i}\rVert^2 = \min_j \lVert x_i - z_j\rVert^2$
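This nearest-representative rule vectorizes nicely with broadcasting; a minimal sketch (the points and representatives are made up for illustration):

```python
import numpy as np

# toy data: assign each x_i to the nearest representative z_j
x = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])
z = np.array([[0.0, 0.0], [10.0, 10.0]])

# squared distances, shape (N, k): dist[i, j] = ||x_i - z_j||^2
dist = np.sum((x[:, np.newaxis] - z) ** 2, axis=2)
c = np.argmin(dist, axis=1)  # c_i = argmin_j ||x_i - z_j||^2
print(c)  # [0 0 1]
```

Note that `np.argmin` over the `j` axis implements exactly the $\min_j$ in the rule above.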

$k$-means algorithm

alternate between updating the partition, then the representatives
this is the famous $k$-means algorithm
the objective $J^{\text{clust}}$ decreases in each step

$$\begin{align*} &\text{given} \ x_1, \dots, x_N \in \mathbb{R}^n \ \text{and} \ z_1, \dots, z_k \in \mathbb{R}^n \\ &\text{repeat} \\ &\quad \text{Update partition: assign} \ i \ \text{to} \ G_j, \ j = \mathbf{argmin}_{j^\prime}\lVert x_i - z_{j^\prime}\rVert^2 \\ &\quad \text{Update centroids:} \ z_j = \frac{1}{\lvert G_j\rvert}\sum_{i \in G_j} x_i \\ &\text{until} \ z_1, \dots, z_k \ \text{stop changing} \end{align*}$$


Code

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# three Gaussian clusters in R^2
data = np.concatenate([
    np.random.randn(100, 2) + (10, 5),
    np.random.randn(100, 2) + (5, 5),
    np.random.randn(100, 2) + (-5, 5),
])


def kmeans(k, max_iter=10):
    # initialize representatives with the first k data points
    # (copy, so updating them does not overwrite the data)
    centres = data[:k].copy()

    for _ in range(max_iter):
        # squared distances, shape (k, N): dist[j, i] = ||x_i - z_j||^2
        dist = np.sum((data - centres[:, np.newaxis]) ** 2, axis=2)
        # update partition: c_i = argmin_j ||x_i - z_j||^2
        groups = np.argmin(dist, axis=0)

        old_centres = centres.copy()

        # update centroids: z_j = mean of the points in G_j
        for j in range(k):
            members = data[groups == j]
            if len(members) > 0:
                centres[j] = members.mean(axis=0)

        plt.figure(figsize=(7, 5))
        plt.title("$k$-means clustering")
        plt.scatter(data[:, 0], data[:, 1], c=groups)
        plt.scatter(old_centres[:, 0], old_centres[:, 1], c="red", label="centre")
        plt.legend()
        plt.show()

        # until z_1, ..., z_k stop changing
        if np.abs(centres - old_centres).sum() < 0.1:
            break


kmeans(3)