Clustering#

Clustering is a fundamental technique in unsupervised machine learning that aims to group similar data points together based on their characteristics. One popular clustering algorithm is k-means, which partitions data into k distinct clusters.

The k-means algorithm starts by randomly initializing k cluster centroids. It then iteratively assigns each data point to the nearest centroid and updates the centroids based on the mean of the assigned points. This process continues until convergence, where the centroids no longer change significantly or a maximum number of iterations is reached.

K-means clustering is widely used in various domains, such as customer segmentation, image compression, anomaly detection, and recommendation systems. It provides a simple and efficient way to discover patterns and structure within datasets.

import numpy as np
from sklearn.datasets import make_blobs, make_moons
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

The first step, as always, is to get us some data. We are going to use the scikit library to generate us blobs of data. The red dots depict the cluster centers.

k = 4
X, y, centers = make_blobs(n_samples=500, centers=k, return_centers=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
plt.scatter(X_train[:, 0], X_train[:, 1])
plt.plot(centers[:, 0], centers[:, 1], 'ro')
[<matplotlib.lines.Line2D at 0x7fcd6a0bde90>]
_images/9e46b4259e907007081194c53fac8e04678b988f2a74f3fc3102ef67bd52fc4f.png

Among all partitions \( C_0 \cup C_1 ... \cup C_k \), we find the one that minimizes:

\[ \min*{C_0 \cup C_1 ... \cup C_k} \sum*{i=1}^{k} \sum*{x \in C_i} ||x_i - \frac{1}{|C_i|} \sum*{x_j \in C_i} x_j ||^2 \]
  • \( k \) are the number of clusters

  • \( C_i \) is the i-th partition

  • \( | C_i | \) is the number of samples belonging to \( C_i \)

  • \( C_0 \cup C_1 ... \cup C_k \) is the union of all partitions

  • \( x \in C_i \) means x in partition \( C_i \)

It calculates the sum of the squared distances between each data point and the mean of its assigned cluster. The goal is to minimize this sum by finding the optimal partitioning of the data into clusters. For the algorithm below, you can safely ignore this equation.

def kmeans(X, k, max_iter=100):
    # Choose k random samples as center of clusters
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for i in range(max_iter):
        # Compute distance of each sample to each centroid
        dist = np.linalg.norm(X[:, None] - centroids, axis=2)
        # Assign each sample to the closest centroid
        labels = np.argmin(dist, axis=1)
        # Compute the new centroids by taking the mean of all samples assigned to each centroid
        new_centroids = np.array(
            [X[labels == i].mean(axis=0) for i in range(k)])
        if np.all(centroids == new_centroids):
            print(f'Converged after {i} iterations')
            break
        centroids = new_centroids
    return labels, centroids
labels, centroids = kmeans(X_train, k)
centroids = centroids[np.argsort(centroids[:, 0])]
plt.scatter(X_train[:, 0], X_train[:, 1], c=labels)
plt.plot(centers[:, 0], centers[:, 1], 'ro')
plt.plot(centroids[:, 0], centroids[:, 1], 'go')
Converged after 13 iterations
[<matplotlib.lines.Line2D at 0x7fcd67c4bb50>]
_images/6bf0bd0fba97f8148763864ae093b4f8d6879411a4723e75e61d7a4df832ecb2.png

Now that we have our clusters (green dots), we can use them for classification.

def classify(X, centroids):
    dist = np.linalg.norm(X[:, None] - centroids, axis=2)
    return np.argmin(dist, axis=1)


labels = classify(X_test, centroids)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test)
<matplotlib.collections.PathCollection at 0x7fcd67cc0950>
_images/cbfe544ed8f5864a5cee6fdb93bb9232a1d9cb641db38a0d61a32203180a67e9.png
plt.scatter(X_test[:, 0], X_test[:, 1], c=labels)
<matplotlib.collections.PathCollection at 0x7fcd69d53d90>
_images/d841a0ae031faa26495271842b71de709d0a2aaea43f344e8aa2c2d761ce9577.png

Disadvantages#

  1. You have to figure k out by yourself, although there are methods like the elbow method to help you with that.

  2. We generated a blob dataset, because the shape of the data matters for k-means. If the data were not spherical, k-means would be a poor choice.

  3. k-means cannot handle skewed data very well. By its nature, the centroids will always prefer bigger clusters.

Elbow Method#

We iterate over multiple different values of \(k\) and look for the sweet spot, where the sum of distances between the samples and their cluster’s centroid starts to become low. The name becomes obvious if we look at the shape of the resulting graph.

inertia = []
for k in range(1, 10):
    labels, centroids = kmeans(X_train, k)
    inertia.append(np.linalg.norm(X_train - centroids[labels]))

plt.plot(range(1, 10), inertia)
Converged after 1 iterations
Converged after 3 iterations
Converged after 4 iterations
Converged after 8 iterations
Converged after 6 iterations
Converged after 10 iterations
Converged after 8 iterations
Converged after 9 iterations
Converged after 13 iterations
[<matplotlib.lines.Line2D at 0x7fcd69d53ed0>]
_images/23636c183802fb87ba08039a05673d0c442d114321d86cfb7e672b00d7c2cb27.png

Non-spherical data#

k = 2
X, y = make_moons(n_samples=50)
labels, centroids = kmeans(X, k)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.plot(centroids[:, 0], centroids[:, 1], 'go')
Converged after 2 iterations
[<matplotlib.lines.Line2D at 0x7fcd69e1c610>]
_images/6c59a1ebc5a9c58e8ce0fcde855b670aaf49cab0c5a70b7f084680ddb0169282.png

Skewed data#

X, y, centers = make_blobs(n_samples=[25, 50, 200], cluster_std=[
                           0.1, 0.1, 0.2], centers=[(0.1, 0.5), (0.4, 0), (0.6, 0.6)], return_centers=True)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.plot(centers[:, 0], centers[:, 1], 'ro')
[<matplotlib.lines.Line2D at 0x7fcd67b5fa50>]
_images/39d3d13c383b08bd2d18a169ec8d831c19d5836d4adcbf974c45210af3ea4584.png
labels, centroids = kmeans(X, 3)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.plot(centroids[:, 0], centroids[:, 1], 'go')
Converged after 9 iterations
[<matplotlib.lines.Line2D at 0x7fcd69d84610>]
_images/9bf6352d50e2d3b895ad23b7dace7a4c4d56f667967399452ce2c9d8f04cb5bb.png

As you can see, k-means does not handle non-spherical data well and skewed data might also create a problem.

Nevertheless its simplicity and ease of use is often one of the first choices to cluster data. There are excellent libraries for this available.