# 2020 Machine Learning Homework 6
School: National Chiao Tung University
Name: Shao-Wei Chiu
###### tags: `NCTU`, `Machine Learning`, `Kernel K-means`, `Spectral Clustering`, `GitHub`
> Note: This report was written on `HackMD.io`; because of the limitations of the `.pdf` format, the `.gif` animations will not display. Please view it at [https://hackmd.io/@swchiu/BJwOjuc3U](https://hackmd.io/@swchiu/BJwOjuc3U), thanks.
## Kernel K-means
Kernel k-means is a variant of the k-means algorithm that maps the data into a higher-dimensional feature space.
The mapping function is called a `kernel`.
The k-means algorithm clusters data by similarity: after comparing how similar the data points are, the more similar points are assigned to the same group.
`K` is the number of groups (clusters).
For the regular K-means, we use the following formula to compare and cluster data.
$$arg\ min_{(C_1, \mu_1)...(C_k, \mu_k)} \sum^{k}_{i=1}\sum_{x_j\in C_i} ||x_j-\mu_i ||^2$$
How about kernel k-means?
We transform the data into the higher-dimensional feature space and run the same k-means algorithm there.
$$arg\ min_{(C_1, \mu^{\phi}_1)...(C_k, \mu^{\phi}_k)} \sum^{k}_{i=1}\sum_{\phi(x_j)\in C_i} ||\phi(x_j)-\mu^{\phi}_i ||^2$$
$$
\begin{aligned}
||\phi(x_j)-\mu^{\phi}_k ||^2 &= \phi^T(x_j)\phi(x_j)-2\phi^T(x_j)\frac{1}{|C_k|}\sum_{x_n\in C_k}\phi(x_n)+\frac{1}{|C_k|^2}\sum_{x_p \in C_k}\sum_{x_q\in C_k}\phi^T(x_p)\phi(x_q) \\
&=K(x_j,x_j)-\frac{2}{|C_k|}\sum_{x_n\in C_k}K(x_j, x_n) + \frac{1}{|C_k|^2}\sum_{x_p \in C_k}\sum_{x_q\in C_k}K(x_p, x_q)
\end{aligned}
$$
Instead of updating the centroids $\mu_k$ as in plain k-means, kernel k-means updates only the cluster assignments $C_k$, because the feature-space centroids are never computed explicitly.
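For contrast, here is a minimal sketch of one iteration of plain k-means (my own illustration, not part of the homework code; names are arbitrary):
```python
import numpy as np

def plain_kmeans_step(data, centers):
    # E-step: assign every point to its nearest center.
    dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = np.argmin(dist, axis=1)
    # M-step: move each center to the mean of its assigned points.
    new_centers = np.array([data[labels == k].mean(axis=0)
                            for k in range(centers.shape[0])])
    return labels, new_centers
```
In the kernel version below, the `new_centers` line disappears: only the assignment matrix is updated, and all distances are computed through the gram matrix.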
### The Work
* kernel function: $e^{-\gamma_1||S(x)-S(x')||^2}\times e^{-\gamma_2||C(x)-C(x')||^2}$
* Input data: Two 100×100 images
### Step 1
Prepare the image data for precomputing the gram matrix (`kernel`).
```python
import imageio
import numpy as np

def img_formater(img):
    # Flatten the H x W image into two n x d arrays:
    # spatial_data holds the (row, col) coordinate of each pixel,
    # color_data holds its RGB value.
    spatial_data = []
    color_data = []
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            spatial_data.append([i, j])
            color_data.append(img[i][j])
    return np.array(spatial_data), np.array(color_data, dtype=int)

# img_path points to the input image (set elsewhere).
img = imageio.imread(img_path)
spatial_data, color_data = img_formater(img)
```
### Step 2
Compute gram matrix
> At first, I implemented this part with a trivial double for-loop.
> But it turned out to be far too time-consuming: there are $10^4$ data points in the input, so we need to fill a $10^4\times 10^4$ gram matrix.
> I therefore replaced the for-loop implementation with matrix computation for efficiency.
$$euclidean^2 = ||u-v||^2 = (u-v)^T(u-v) = ||u||^2-2u^Tv+||v||^2$$
The formula above holds for vectors stored as $d\times1$ column matrices.
In this work the vectors are stored as $1\times d$ row matrices, so I apply the transposed version:
$$euclidean^2 = ||u-v||^2 = (u-v)(u-v)^T = ||u||^2-2uv^T+||v||^2$$
For performance, I stack all vectors into a single $n\times d$ matrix $D$ and compute the $n\times n$ gram matrix $G$ from it. Let $E$ be the matrix of pairwise squared Euclidean distances:
$$D=\begin{bmatrix}
V_1 \\
V_2 \\
... \\
V_{n-1} \\
V_n
\end{bmatrix}_{n\times d}, V_i=\begin{bmatrix}v_{i1} & v_{i2} &...& v_{id}\end{bmatrix}_{1\times d}\\ \\
\begin{aligned}
E&=\begin{bmatrix}
||V_1-V_1||^2 & ||V_1-V_2||^2 &...&||V_1-V_n||^2 \\
||V_2-V_1||^2 & ||V_2-V_2||^2 &...&||V_2-V_n||^2 \\
:&:&...&: \\
||V_n-V_1||^2 & ||V_n-V_2||^2 &...&||V_n-V_n||^2
\end{bmatrix} \\ \\
&=
\begin{bmatrix}
||V_1||^2-2V_1V_1^T+||V_1||^2 &...&||V_1||^2-2V_1V_n^T+||V_n||^2 \\
:&...&: \\
||V_n||^2-2V_nV_1^T+||V_1||^2 &...&||V_n||^2-2V_nV_n^T+||V_n||^2
\end{bmatrix} \\ \\
&=
\begin{bmatrix}
||V_1||^2 &...&||V_1||^2 \\
||V_2||^2 &...&||V_2||^2 \\
:&...&: \\
||V_n||^2 &...&||V_n||^2\\
\end{bmatrix}
- 2DD^T +
\begin{bmatrix}
||V_1||^2 &...&||V_1||^2 \\
||V_2||^2 &...&||V_2||^2 \\
:&...&: \\
||V_n||^2 &...&||V_n||^2\\
\end{bmatrix}^T \\ \\
&=
D^2
\begin{bmatrix}
1&1&...&1 \\
:&:&...&: \\
1&1&...&1 \\
\end{bmatrix}_{d\times n}
- 2DD^T +
\begin{bmatrix}
1&1&...&1 \\
:&:&...&: \\
1&1&...&1 \\
\end{bmatrix}_{n\times d}
(D^T)^2
\end{aligned}
$$
Here $D^2$ and $(D^T)^2$ denote element-wise squares. We can then apply the kernel element-wise to $E$ ( e.g. $G = e^{-\gamma E}$ ) without any for-loop; the final gram matrix is the element-wise product of the spatial and color kernels.
```python
import math
import scipy.spatial.distance

# Original trivial implementation
def rbf_img(u, v, g=0.0001):
    s_dis = scipy.spatial.distance.euclidean(u[0], v[0])
    c_dis = scipy.spatial.distance.euclidean(u[1], v[1])
    return math.exp(-1*g*s_dis**2 - g*c_dis**2)

def gram_matrix(data, path, kernel=rbf_img):
    gram = np.ones((len(data), len(data)))
    for i in range(len(data)):
        for j in range(i, len(data)):
            gram[i][j] = kernel(data[i], data[j])
            gram[j][i] = gram[i][j]
    return gram

#---------------------------------------------------------------------#
# New implementation using matrix computation
def euclidean(self, u, v):
    # This method is defined as kmeans.euclidean; it returns the
    # n x m matrix of pairwise squared Euclidean distances.
    return np.matmul(u**2, np.ones((u.shape[1], v.shape[0]))) \
           - 2*np.matmul(u, v.T) \
           + np.matmul(np.ones((u.shape[0], v.shape[1])), (v.T)**2)

def rbf(u, v, g=10**-4):
    return np.exp(-1*g*kmeans.euclidean(kmeans, u, v))

gram = rbf(spatial_data, spatial_data) \
       * rbf(color_data, color_data)
```
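As a quick sanity check (not part of the homework code), the vectorized squared-distance matrix can be compared against `scipy.spatial.distance.cdist` on a small random sample, assuming the `kmeans` class used above is already defined:
```python
check = np.random.rand(5, 3)                                        # 5 row vectors, d = 3
E_fast = kmeans.euclidean(kmeans, check, check)                     # matrix form derived above
E_ref = scipy.spatial.distance.cdist(check, check, 'sqeuclidean')   # reference pairwise distances
assert np.allclose(E_fast, E_ref)
```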
### Step 3
Run Kernel K-means; recall the formula:
$$
Let\ s_{jk} = ||\phi(x_j)-\mu^{\phi}_k ||^2
=K(x_j,x_j)-\frac{2}{|C_k|}\sum_{x_n\in C_k}K(x_j, x_n) + \frac{1}{|C_k|^2}\sum_{x_p \in C_k}\sum_{x_q\in C_k}K(x_p, x_q)
$$
To assign each point we compare $||\phi(x_j)-\mu^{\phi}_k ||^2$, the distance between the $j^{th}$ mapped data point and the center of the $k^{th}$ cluster $C_k$. For every $C_k$ the first term $K(x_j,x_j)$ is identical, so we can drop it from the comparison and keep working with the reduced quantity ( still denoted $s_{jk}$ ). I again use matrix computation in this part, and the derivation is explained below.
For each $x_j$ we have $k$ values, one per cluster ( denoted $s_{jk}$ ), and we go through all the data, which means there are $n\times k$ values in total: well suited to matrix computation!
Let $S_{n\times k}$ be this distance matrix ( `dis` in the code segment below ).
$$
\begin{aligned}
s_{jk} &= -\frac{2}{|C_k|}\sum_{x_n\in C_k}K(x_j, x_n) + \frac{1}{|C_k|^2}\sum_{x_p \in C_k}\sum_{x_q\in C_k}K(x_p, x_q) \\
&= -\frac{2}{|C_k|}
\begin{bmatrix}
K(x_j, x_1)&...&K(x_j, x_n)
\end{bmatrix}
C_{k}
+ \frac{1}{|C_k|^2}{C_k}^T{G} {C_{k}}
\end{aligned}
$$
$s_{jk}$, the element in the $j^{th}$ row and $k^{th}$ column of $S_{n\times k}$, is the distance between the $j^{th}$ mapped data point and the $k^{th}$ cluster; $C_k$ above is the $k^{th}$ column of the one-hot cluster-indicator matrix $C_{n\times k}$.
I want a single matrix computation instead of $n$ separate computations, one per data point, so I expand the above equation from $1\times1$ to $n\times k$.
$$
\begin{aligned}
S_{n\times k} &= -\frac{2}{|C|}
\begin{bmatrix}
K(x_1, x_1)&K(x_1, x_2)&...&K(x_1, x_n) \\
K(x_2, x_1)&K(x_2, x_2)&...&K(x_2, x_n) \\
:&:&...&: \\
K(x_n, x_1)&K(x_n, x_2)&...&K(x_n, x_n)
\end{bmatrix}_{n\times n}
C_{n\times k} \\
&+
\frac{1}{|C|^2}
\begin{bmatrix}
1&1&...&1 \\
:&:&...&: \\
1&1&...&1 \\
\end{bmatrix}_{n\times k}{C_{n\times k}}^T{G_{n\times n}}{C_{n\times k}} \\ \\
S_{n\times k} &= -\frac{2}{|C|}G_{n\times n}C_{n\times k} + \frac{1}{|C|^2}
\begin{bmatrix}
1&...&1 \\
:&...&: \\
1&...&1 \\
\end{bmatrix}_{n\times k}{C_{n\times k}}^T{G_{n\times n}}{C_{n\times k}}
\end{aligned}
$$
For the second term we only need the diagonal elements of ${C}^T G C$, so in the `python` code I multiply it element-wise with the identity matrix `np.eye()` to zero out the off-diagonal entries.
There are also several initialization methods, e.g. `k-means++`, which picks the initial centers by maximizing their mutual distances. I compare `k-means++` with the traditional random initialization in the [Result part](#Result).
```python
# Methods of the kmeans class (excerpted)
def __get_init(self, method):
    if method == 'kmeans++':
        return self.__kmeanspp
    elif method == 'default':
        return self.__traditional
    else:
        print('ERROR: \'{}\' is not a pre-defined initialize method'.format(method))
        return exit(0)

def __kmeanspp(self):
    # Pick the first center at random, then repeatedly pick the point
    # farthest from the already-chosen centers.
    n, d = self.data.shape
    centers = np.array([self.data[np.random.randint(n), :d]])
    for i in range(self.k-1):
        dist = self.euclidean(self.data, centers)
        dist = np.min(dist, axis=1)
        next_center = np.argmax(dist, axis=0)
        centers = np.vstack((centers, self.data[next_center, :]))
    return centers

def __traditional(self):
    # Pick k distinct data points uniformly at random.
    return np.array(self.data[np.random.choice(self.data.shape[0], size=self.k, replace=False), :])

def __kernel_trick(self, gram, ck):
    # ck is the n x k one-hot indicator matrix; c_count holds |C_k| per cluster.
    c_count = np.sum(ck, axis=0)
    dist = -2*np.matmul(gram, ck)/c_count + \
           np.matmul(np.ones(ck.shape), (np.matmul(ck.T, np.matmul(gram, ck)))*np.eye(ck.shape[1]))/(c_count**2)
    return dist

def run(self):
    # initial cluster
    centers = self.init()
    dist = self.euclidean(self.data, centers)
    ck = np.zeros((self.data.shape[0], self.k))
    ck[np.arange(dist.shape[0]), np.argmin(dist, axis=1)] = 1
    record = []
    record.append(ck)
    record_iter = 0
    for i in range(self.max_iter):
        # E-step: compute point-to-cluster distances
        if self.is_kernel:
            dist = self.__kernel_trick(self.data, ck)
        else:
            dist = self.euclidean(self.data, centers)
        # M-step: reassign points and (for plain k-means) recompute centers
        update_ck = np.zeros(dist.shape)
        update_ck[np.arange(dist.shape[0]), np.argmin(dist, axis=1)] = 1
        delta_ck = np.count_nonzero(np.abs(update_ck - ck))
        update_centers = np.matmul(update_ck.T, self.data)/np.sum(update_ck, axis=0, keepdims=True).T
        record.append(update_ck)
        if delta_ck == 0:
            self.converge -= 1
            if self.converge == 0:
                record_iter = i+1
                if self.keep_log == False:
                    break
        ck = update_ck
        centers = update_centers
    return record, record_iter
```
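For completeness, this is roughly how the class is driven; the constructor keywords (`init`, `is_kernel`, `max_iter`, `converge`) are assumptions inferred from the attributes used in `run()` and may differ from the actual signature:
```python
# Hypothetical invocation; adjust the keyword names to the real constructor.
km = kmeans(gram, k=4, init='kmeans++', is_kernel=True, max_iter=100, converge=1)
record, n_iter = km.run()      # record[i] is the n x k one-hot assignment after iteration i
final_ck = record[-1]
```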
### Step 4
Visualization
```python
from matplotlib import colors

k_visual = colors.to_rgba_array(['tab:blue', \
                                 'tab:orange', \
                                 'tab:green', \
                                 'tab:red', \
                                 'tab:purple', \
                                 'tab:brown'])

def visualizer(record, save_path, figsize=(100,100,4)):
    # Render each recorded one-hot assignment as a 100x100 RGBA frame
    # and save all frames as an animated GIF.
    gif = []
    for i in range(len(record)):
        c_id = np.argmax(record[i], axis=1)
        img = np.zeros(figsize, dtype=np.uint8)
        for j in range(c_id.shape[0]):
            m, n = (int(j/100), int(j%100))
            img[m][n] = 255*k_visual[c_id[j]]
        gif.append(img)
    imageio.mimsave(save_path, gif)

def merge_gifs(gifs, max_fram, id, method):
    # Stack several GIFs horizontally, frame by frame, into one GIF.
    gif = []
    for i in range(len(gifs)):
        gif.append(imageio.get_reader('output/'+gifs[i]))
    new_gif = imageio.get_writer('output/image'+str(id)+'_'+method+'.gif')
    if max_fram + 5 < 100:
        max_fram += 5
    for frame_number in range(max_fram):
        img = []
        for i in range(len(gif)):
            img.append(gif[i].get_next_data())
        new_image = np.hstack(img)
        new_gif.append_data(new_image)
    for i in range(len(gif)):
        gif[i].close()
    new_gif.close()
```
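A usage sketch (the file name is illustrative only):
```python
# record comes from kmeans.run(); each frame shows the assignment of one iteration.
visualizer(record, 'output/image1_k4.gif')
```
`merge_gifs` then stitches the per-`k` GIFs of one image, read from the `output/` folder, into a single side-by-side animation.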
### Screen Shot


### Result and Discussion
* image1 (k=2, 3, 4, 5, 6)
Original image:

$\gamma=10^{-6}$:

$\gamma=10^{-5}$:

$\gamma=10^{-4}$:

$\gamma=10^{-3}$:

$\gamma=10^{-2}$:

$\gamma=10^{-1}$:

* image2 (k=2, 3, 4, 5, 6)

$\gamma=10^{-6}$:

$\gamma=10^{-5}$:

$\gamma=10^{-4}$:

$\gamma=10^{-3}$:

$\gamma=10^{-2}$:

$\gamma=10^{-1}$:

Regarding the $\gamma$ parameter of the RBF kernel, I found that the lower $\gamma$ is, the larger its effect on the clustering result.
I think the reason is that a lower $\gamma$ implies a higher $\sigma$ of the Gaussian ( which, in this work, controls how fast the similarity between a point and a cluster center decays with distance ), but a $\gamma$ that is too low ( i.e., a $\sigma$ that is too high ) may cause underfitting.
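For reference, the connection between $\gamma$ and $\sigma$ is
$$e^{-\gamma||x-x'||^2} = e^{-\frac{||x-x'||^2}{2\sigma^2}}, \qquad \gamma = \frac{1}{2\sigma^2}$$
so $\gamma=10^{-6}$ corresponds to $\sigma\approx 707$, far larger than any spatial or color distance inside a $100\times100$ image, meaning almost every pair of pixels is treated as similar.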
* Traditional v.s. K-means++
* Both images use $\gamma=10^{-4}$:

Traditional:

K-means++:


Traditional:

K-means++:

At first I thought that k-means++ was merely faster than the traditional initialization, judging from the number of iterations ( shown in the [Screen Shot part](#Screen-Shot) ). But once the results for image2 were produced, we can see that the sky area and the rabbit are separated into different clusters when `k=6`!
So I believe that k-means++ is the more powerful initialization for this work.
## Spectral Clustering
Spectral clustering groups data through a similarity graph, which can be computed via a kernel function.
In the similarity graph, the more similar two data points are, the larger the weight of the edge between them. For example, if we compute similarity via a `Gaussian` ( i.e., `rbf` ) kernel, the edge weight indicates how close the two points are.
But how do we group the data using the similarity graph?
Backed by a rich body of theory, we can use the `graph Laplacian` matrix and the `Rayleigh quotient` to minimize a cut cost on the similarity graph.
The cut cost is defined as follows:
$$cut(A, B) = \sum_{i\in A, j\in B} w_{ij}$$
But minimizing this alone tends to produce clusters of very unbalanced sizes, so two variants of the cut cost are defined as follows ( $|A|$ is the number of vertices in $A$, and $vol(A)$ is the sum of their degrees ):
$$\begin{aligned}
Ratio\ cut(A, B) &= \sum_{i\in A, j\in B} w_{ij}(\frac{1}{|A|}+\frac{1}{|B|}) \\
Normalized\ cut(A, B) &= \sum_{i\in A, j\in B} w_{ij}(\frac{1}{vol(A)}+\frac{1}{vol(B)})
\end{aligned}$$
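A tiny numerical sketch of these costs on a made-up 4-node graph, split into $A=\{0,1\}$ and $B=\{2,3\}$:
```python
import numpy as np

W = np.array([[0. , 0.9, 0.1, 0. ],
              [0.9, 0. , 0.2, 0.1],
              [0.1, 0.2, 0. , 0.8],
              [0. , 0.1, 0.8, 0. ]])
A, B = [0, 1], [2, 3]

cut = W[np.ix_(A, B)].sum()                      # sum of edge weights crossing the partition
ratio_cut = cut * (1/len(A) + 1/len(B))          # penalize by cluster sizes
vol = lambda S: W[S, :].sum()                    # volume = total degree of the nodes in S
normalized_cut = cut * (1/vol(A) + 1/vol(B))     # penalize by cluster volumes
```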
The corresponding spectral clustering algorithms are named accordingly: one is called `unnormalized` and the other `normalized`.
The difference lies not only in the cut criterion but also in the `graph Laplacian` matrix ( denoted $L$ ).
$$\begin{aligned}
Unnormalized &: L = D-W \\
Normalized &: L_{sym} = D^{\frac{-1}{2}}(D-W)D^{\frac{-1}{2}}
\end{aligned}$$
Then we compute the eigenvectors and run the `k-means` algorithm on the matrix $H$. Let $T$ be the matrix whose columns are the eigenvectors corresponding to the first $k$ ( smallest ) eigenvalues, excluding the trivial first one:
$$\begin{aligned}
Unnormalized &: H = T \\
Normalized &: H = D^{\frac{-1}{2}}T
\end{aligned}$$
### The Work
* kernel function: $e^{-\gamma_1||S(x)-S(x')||^2}\times e^{-\gamma_2||C(x)-C(x')||^2}$
* Input data: Two 100×100 images
> Because computing the eigenvalues and eigenvectors of a $10^4\times 10^4$ similarity matrix is too time-consuming, I only show the results for `k = 3` and `k = 4`.
### Step 1
Prepare the image data and precompute the similarity matrix (`kernel`).
> This step is the same as [Step 1](#Step-1) and [Step 2](#Step-2) of `Kernel K-means`, so it is omitted here.
### Step 2
Compute the eigenvalues and eigenvectors, and keep the eigenvectors corresponding to the first `k` (smallest) eigenvalues.
```python
class spectral_clustering():
    def __init__(self, data_similarity, \
                 k=2, \
                 normalize=False, \
                 keep_log=False):
        self.k = k
        self.W = data_similarity
        # Degree matrix D: row sums of W placed on the diagonal.
        self.D = np.sum(data_similarity, axis=1, keepdims=True) * np.eye(data_similarity.shape[0])
        self.normalize = normalize
        self.keep_log = keep_log

    def __eig(self, A):
        if self.normalize:
            # Normalized case: eigendecompose D^(-1/2) L D^(-1/2),
            # then map the eigenvectors back with D^(-1/2) (H = D^(-1/2) T).
            sqrt_D = np.sqrt(self.D)
            neg_sqrt_D = np.linalg.inv(sqrt_D)
            N = np.matmul(np.matmul(neg_sqrt_D, A), neg_sqrt_D)
            eigenvals, eigenvecs = np.linalg.eig(N)
            eigenvecs = np.matmul(neg_sqrt_D, eigenvecs.real)
            return eigenvals, eigenvecs
        else:
            return np.linalg.eig(A)

    def __get_sorted_k_eigen(self, A, k):
        # Keep the eigenvectors belonging to the k smallest eigenvalues.
        eigenvalues, eigenvectors = self.__eig(A)
        sorted_idx = np.argsort(eigenvalues)
        sorted_eigenvalues = []
        sorted_eigenvectors = []
        for i in range(k):
            vector = eigenvectors[:, sorted_idx[i]]
            sorted_eigenvectors.append(vector[:, None])
            sorted_eigenvalues.append(eigenvalues[sorted_idx[i]])
        sorted_eigenvalues = np.array(sorted_eigenvalues)
        sorted_eigenvectors = np.concatenate(sorted_eigenvectors, axis=1)
        return sorted_eigenvalues, sorted_eigenvectors

    def run(self):
        self.L = self.D - self.W
        k_eigenvalues, k_eigenvectors = self.__get_sorted_k_eigen(self.L, self.k)
        km = kmeans(k_eigenvectors, k=self.k, keep_log=self.keep_log)
        return km.run(), k_eigenvalues, k_eigenvectors
```
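A usage sketch, assuming `gram` is the $10^4\times 10^4$ similarity matrix from Step 1 (the unpacking follows the `return` statement of `run()` above):
```python
sc = spectral_clustering(gram, k=4, normalize=True, keep_log=False)
(record, n_iter), eigenvalues, eigenvectors = sc.run()
clusters = record[-1]    # n x k one-hot assignment from k-means in the eigenspace
```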
### Step 3
Perform `K-means` algorithm.
> The `K-means` algorithm was discussed above; I just call the same python code I have already written.
```python
def run(self):
    self.L = self.D - self.W
    k_eigenvalues, k_eigenvectors = self.__get_sorted_k_eigen(self.L, self.k)
    km = kmeans(k_eigenvectors, k=self.k, keep_log=self.keep_log)
    return km.run(), k_eigenvalues, k_eigenvectors
```
### Step 4
Examine whether the data points within the same cluster have the same coordinates in the eigenspace of the graph Laplacian.
```python
import matplotlib.pyplot as plt

def reorder_by_cluster(c):
    # Return the data indices sorted so that members of the same cluster are contiguous.
    new_order = np.array([])
    for i in range(c.shape[1]):
        new_order = np.append(new_order, np.where(c[:,i]==1)[0])
    new_order = new_order.astype('int32')
    return new_order

def show_vectors_by_clusters(vectors, clusters, fig_path):
    # Plot, per cluster, the eigenspace coordinates (one line per data point).
    reorder_idx = reorder_by_cluster(clusters)
    num_cluster = np.sum(clusters, axis=0, dtype=int)
    iter_idx = 0
    for k in range(clusters.shape[1]):
        plt.subplot(1, clusters.shape[1], k+1)
        for j in range(num_cluster[k]):
            plt.plot(vectors[reorder_idx[iter_idx], :])
            plt.title('cluster '+str(k+1))
            iter_idx += 1
    plt.savefig(fig_path)
    plt.show(block = False)
    plt.pause(1)
    plt.close()
```
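Continuing the usage sketch from Step 2, the eigenspace coordinates can then be inspected per cluster (the figure path is illustrative):
```python
show_vectors_by_clusters(eigenvectors, clusters, 'output/image1_eigenspace.png')
```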
### Step 5
Visualize the result of spectral clustering.
> This step is the same as [Step 4](#Step-4) of `Kernel K-means`.
### Screen Shot

> As with `Kernel K-means`, the output log is too long, so I only upload part of it.
### Result and Discussion
* image1 ( k = 4, 5 )
- Normalized:


- eigenvectors:

- Unnormalized:


- eigenvectors:

* image2 ( k = 4, 5 )
- Normalized:


- eigenvectors:

- Unnormalized:


- eigenvectors:

According to the results shown above, we cannot say that the data points within the same cluster have exactly the same coordinates in the eigenspace of the `graph Laplacian` matrix $L$: the coordinates are different, but similar, and `k-means` groups these similar vectors into the same cluster.
Judging from these results, the `unnormalized` variant seems to perform better than the `normalized` one.