Broadcasting in Python: K-means algorithm

Cost function: for a partition

C_{1}, \dots, C_{k}

n

observations in

k

clusters, the quality of the clustering is given by the following cost function:

\begin{array}{r} \sum_{j = 1}^{k} \sum_{i \in C_{j}} ‖ x_{i} - μ_{j} ‖^{2} where, μ_{j} = \frac{1}{| C_{j} |} \sum_{i \in C_{j}} x_{i} . \end{array}

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

k-means algorithm:

take
$k$ random centers
(a) when centers are fixed, update the partition by allocating each data sample to its nearest center
(b) when the partition is fixed, update the centers according to the formula:
$μ_{j} = \frac{1}{| C_{j} |} \sum_{i \in C_{j}} x_{i}$
iterate steps (a) and (b)

Python code:
To focus on brodcasting in the partition step (a), we run the algorithm for a fixed number of iterations (instead of checking convergence criteria).


























import numpy as np

class Kmeans:
    def __init__(self, k, num_iter=50):
        self.k = k
        self.num_iter = num_iter
        
    def partition_step(self):
        diff = self.data[:, None] - self.centroids[None]  # (n, k, d)
        distances = np.einsum('nkd,nkd->nk', diff, diff)  # (n, k)
        self.labels = np.argmin(distances, axis=1)  # (n,)

    def centers_step(self):
        self.centroids = np.stack([self.data[self.labels==i].mean(axis=0) \
                                   for i in range(self.k)])
    
    def fit(self, data):
        n, d = data.shape
        self.centroids = data[np.random.choice(n, self.k, \
                                               replace=False)] # (k, d)
        self.labels = np.empty(n)  # (n,)
        self.data = data
        for _ in range(self.num_iter):
            self.partition_step()
            self.centers_step()
        return self.centroids, self.labels

This code is 60x faster than a naive implementation with a for loop see kmeans_broadcast.ipynb

Applications
Below, we run k-means algorithm on all the colors of all pixels and then replace the original color (on the left) by its corresponding centers (code available in this repo):

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Broadcasting in Python: K-means algorithm

tags: public python machine learning

Read more

A* algorithm

Basics about probability distribution and Gaussians

Autodiff and Backpropagation

Transformers using Named Tensor Notation

tags: `public` `python` `machine learning`