
2020 Machine Learning Homework 6

School: National Chiao Tung University
Name: Shao-Wei Chiu

tags: NCTU, Machine Learning, Kernel K-means, Spectral Clustering, GitHub

Note: This report was made on HackMD.io; due to the restrictions of the .pdf format, the .gif animations will not display. Please view it at https://hackmd.io/@swchiu/BJwOjuc3U, thanks.

Kernel K-means

Kernel k-means is a variant of the k-means algorithm that maps the data into a higher-dimensional space.
The mapping function is called a kernel.

The k-means algorithm compares the similarity of the data points and assigns the more similar points to the same group.
K means that there are k groups to cluster into.

For regular k-means, we use the following formula to compare and cluster the data.

$$\arg\min_{(C_1,\mu_1)\dots(C_k,\mu_k)} \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$$

How about kernel k-means?

We transform the data into a higher-dimensional space and run the same k-means algorithm on it.

$$\arg\min_{(C_1,\mu_1^\phi)\dots(C_k,\mu_k^\phi)} \sum_{i=1}^{k} \sum_{\phi(x_j) \in C_i} \|\phi(x_j) - \mu_i^\phi\|^2$$

$$\begin{aligned}\|\phi(x_j)-\mu_k^\phi\|^2 &= \phi^T(x_j)\phi(x_j) - \frac{2}{|C_k|}\sum_{x_n\in C_k}\phi^T(x_j)\phi(x_n) + \frac{1}{|C_k|^2}\sum_{x_p\in C_k}\sum_{x_q\in C_k}\phi^T(x_p)\phi(x_q)\\ &= K(x_j,x_j) - \frac{2}{|C_k|}\sum_{x_n\in C_k}K(x_j,x_n) + \frac{1}{|C_k|^2}\sum_{x_p\in C_k}\sum_{x_q\in C_k}K(x_p,x_q)\end{aligned}$$

Instead of updating $\mu_k$ as in the standard k-means algorithm, kernel k-means updates only $C_k$.

The Work

  • kernel function:
    $e^{-\gamma_1 \|S(x)-S(x')\|^2} \times e^{-\gamma_2 \|C(x)-C(x')\|^2}$, where $S(x)$ is the spatial coordinate and $C(x)$ is the color of pixel $x$
  • Input data: Two 100*100 images

Step 1

Prepare the image data for precomputing the Gram matrix (kernel).

import numpy as np
import imageio

def img_formater(img):
    # Flatten the image into per-pixel spatial coordinates and color values.
    spatial_data = []
    color_data = []
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            spatial_data.append([i, j])
            color_data.append(img[i][j])
    return np.array(spatial_data), np.array(color_data, dtype=int)

img = imageio.imread(img_path)
spatial_data, color_data = img_formater(img)
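
For a 100×100 RGB image (assuming three color channels and no alpha), the resulting shapes are:

# Expected shapes for a 100x100 RGB image (channel count is an assumption):
# spatial_data.shape == (10000, 2)
# color_data.shape   == (10000, 3)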

Step 2

Compute gram matrix

At first, I implemented this part with a trivial for-loop, but I found it too time-consuming: there are $10^4$ data points in our input, so we need to calculate a $10^4 \times 10^4$ Gram matrix.
Therefore I replaced the for-loop implementation with matrix computation for efficiency.

$$\text{euclidean}^2 = \|u-v\|^2 = (u-v)^T(u-v) = \|u\|^2 - 2u^Tv + \|v\|^2$$

The formula above is suitable for vectors stored as $d \times 1$ matrices.
In this work, however, the vectors are stored as $1 \times d$ matrices, so I applied the following revised formula.

$$\text{euclidean}^2 = \|u-v\|^2 = (u-v)(u-v)^T = \|u\|^2 - 2uv^T + \|v\|^2$$

For performance, I stack all the vectors into one matrix $D$, which is $n \times d$, to calculate the Gram matrix $G$, which is $n \times n$. Let $E$ be the matrix of pairwise squared Euclidean distances.

$$D = \begin{bmatrix}V_1\\V_2\\\vdots\\V_n\end{bmatrix}_{n\times d},\quad V_i = \begin{bmatrix}v_{i1}&v_{i2}&\dots&v_{id}\end{bmatrix}_{1\times d}$$

$$\begin{aligned}E &= \begin{bmatrix}\|V_1-V_1\|^2 & \|V_1-V_2\|^2 & \dots & \|V_1-V_n\|^2\\ \|V_2-V_1\|^2 & \|V_2-V_2\|^2 & \dots & \|V_2-V_n\|^2\\ \vdots & \vdots & \ddots & \vdots\\ \|V_n-V_1\|^2 & \|V_n-V_2\|^2 & \dots & \|V_n-V_n\|^2\end{bmatrix}\\
&= \begin{bmatrix}\|V_1\|^2-2V_1V_1^T+\|V_1\|^2 & \dots & \|V_1\|^2-2V_1V_n^T+\|V_n\|^2\\ \vdots & \ddots & \vdots\\ \|V_n\|^2-2V_nV_1^T+\|V_1\|^2 & \dots & \|V_n\|^2-2V_nV_n^T+\|V_n\|^2\end{bmatrix}\\
&= \begin{bmatrix}\|V_1\|^2 & \dots & \|V_1\|^2\\ \|V_2\|^2 & \dots & \|V_2\|^2\\ \vdots & \ddots & \vdots\\ \|V_n\|^2 & \dots & \|V_n\|^2\end{bmatrix} - 2DD^T + \begin{bmatrix}\|V_1\|^2 & \dots & \|V_1\|^2\\ \|V_2\|^2 & \dots & \|V_2\|^2\\ \vdots & \ddots & \vdots\\ \|V_n\|^2 & \dots & \|V_n\|^2\end{bmatrix}^T\\
&= D^2\begin{bmatrix}1 & \dots & 1\\ \vdots & \ddots & \vdots\\ 1 & \dots & 1\end{bmatrix}_{d\times n} - 2DD^T + \begin{bmatrix}1 & \dots & 1\\ \vdots & \ddots & \vdots\\ 1 & \dots & 1\end{bmatrix}_{n\times d}(D^T)^2\end{aligned}$$

where $D^2$ and $(D^T)^2$ denote element-wise squaring.

So we can calculate $G$ as $G = e^{-\gamma E}$ without any for-loop.

# Original trivial implementation
import math
import scipy.spatial

def rbf_img(u, v, g=0.0001):
    # u and v are (spatial, color) pairs for two pixels
    s_dis = scipy.spatial.distance.euclidean(u[0], v[0])
    c_dis = scipy.spatial.distance.euclidean(u[1], v[1])
    return math.exp(-1*g*s_dis**2 - g*c_dis**2)

def gram_matrix(data, path, kernel=rbf_img):
    gram = np.ones((len(data), len(data)))
    for i in range(len(data)):
        for j in range(i, len(data)):
            gram[i][j] = kernel(data[i], data[j])
            gram[j][i] = gram[i][j]
    return gram

#---------------------------------------------------------------------#

# New implementation using matrix computation
def euclidean(self, u, v):
    # This method is defined in kmeans.euclidean.
    # Pairwise squared Euclidean distances between the rows of u and v.
    return np.matmul(u**2, np.ones((u.shape[1], v.shape[0]))) \
        - 2*np.matmul(u, v.T) \
        + np.matmul(np.ones((u.shape[0], v.shape[1])), (v.T)**2)

def rbf(u, v, g=10**-4):
    return np.exp(-1*g*kmeans.euclidean(kmeans, u, v))

gram = rbf(spatial_data, spatial_data) \
    * rbf(color_data, color_data)
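
As a quick sanity check (a sketch, not part of the original pipeline; it assumes the kmeans class defined in Step 3 below), the vectorized distance should match SciPy's pairwise squared Euclidean distances on small random inputs:

# Optional sanity check: the matrix form should agree with scipy's pairwise
# squared Euclidean distances.
from scipy.spatial.distance import cdist

u = np.random.rand(5, 3)
v = np.random.rand(4, 3)
assert np.allclose(kmeans.euclidean(kmeans, u, v), cdist(u, v, 'sqeuclidean'))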

Step 3

Run kernel k-means; recall the formula:

Let
$$s_{jk} = \|\phi(x_j)-\mu_k^\phi\|^2 = K(x_j,x_j) - \frac{2}{|C_k|}\sum_{x_n\in C_k}K(x_j,x_n) + \frac{1}{|C_k|^2}\sum_{x_p\in C_k}\sum_{x_q\in C_k}K(x_p,x_q)$$
We compare the values of $\|\phi(x_j)-\mu_k^\phi\|^2$ to measure the distance between the $k$th cluster and the $j$th mapped data point, for every vector $x_j$ and every cluster $C_k$. The first term $K(x_j,x_j)$ is the same for every $C_k$, so we can ignore it directly. I still use matrix computation in this part, and I explain the derivation below.

For each $x_j$, we have $k$ values, one for each cluster (denoted by $s_{jk}$), and we go through all the data points, which means we have $n \times k$ values in total; this is well suited to matrix computation!
Let $S_{n\times k}$ be the distance matrix just mentioned (which is dist in the code segment).
$$s_{jk} = -\frac{2}{|C_k|}\sum_{x_n\in C_k}K(x_j,x_n) + \frac{1}{|C_k|^2}\sum_{x_p\in C_k}\sum_{x_q\in C_k}K(x_p,x_q) = -\frac{2}{|C_k|}\begin{bmatrix}K(x_j,x_1) & \dots & K(x_j,x_n)\end{bmatrix}C_k + \frac{1}{|C_k|^2}C_k^T G\, C_k$$

$s_{jk}$ is the element in the $j$th row and $k$th column of $S_{n\times k}$, i.e., the distance between the $j$th mapped data point and the $k$th cluster; here $C_k$ denotes the one-hot indicator column vector of cluster $k$ (the $k$th column of the indicator matrix, which is ck in the code).

I want to perform a single matrix computation instead of $n$ separate computations, one per data point, so I need to expand the above equation from $1 \times 1$ to $n \times k$.

$$S_{n\times k} = -\frac{2}{|C|}\begin{bmatrix}K(x_1,x_1) & K(x_1,x_2) & \dots & K(x_1,x_n)\\ K(x_2,x_1) & K(x_2,x_2) & \dots & K(x_2,x_n)\\ \vdots & \vdots & \ddots & \vdots\\ K(x_n,x_1) & K(x_n,x_2) & \dots & K(x_n,x_n)\end{bmatrix}_{n\times n} C_{n\times k} + \frac{1}{|C|^2}\begin{bmatrix}1 & \dots & 1\\ \vdots & \ddots & \vdots\\ 1 & \dots & 1\end{bmatrix}_{n\times k} C_{n\times k}^T\, G_{n\times n}\, C_{n\times k}$$
$$S_{n\times k} = -\frac{2}{|C|}G_{n\times n}\, C_{n\times k} + \frac{1}{|C|^2}\begin{bmatrix}1 & \dots & 1\\ \vdots & \ddots & \vdots\\ 1 & \dots & 1\end{bmatrix}_{n\times k} C_{n\times k}^T\, G_{n\times n}\, C_{n\times k}$$
For the second term, we only need the diagonal elements of $C^T G C$, so in the Python code I multiply it by an identity matrix (np.eye()) to zero out the off-diagonal entries.
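
To make this concrete, here is a toy-sized sketch of the trick (sizes and names are illustrative only; the divisions by $|C_k|$ and $|C_k|^2$ are omitted): only the diagonal of $C^T G C$ survives the multiplication by np.eye(), and multiplying by a ones matrix broadcasts those $k$ per-cluster values to every row.

# Toy illustration of the np.eye() trick (sizes are illustrative; cluster-size
# normalization is omitted here).
import numpy as np

n, k = 6, 3
G = np.random.rand(n, n); G = (G + G.T) / 2          # symmetric toy "gram" matrix
C = np.eye(k)[np.random.randint(k, size=n)]          # one-hot indicator matrix, n x k
quad = np.matmul(C.T, np.matmul(G, C)) * np.eye(k)   # keep only the diagonal entries c_k^T G c_k
term = np.matmul(np.ones(C.shape), quad)             # every row now repeats the k diagonal values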

There are many variants of the initialization method, e.g., k-means++, which picks the initial centers by maximizing their mutual distances. I compare k-means++ and the traditional random initialization in the Result part.

def __get_init(self, method):
    if method == 'kmeans++':
        return self.__kmeanspp
    elif method == 'default':
        return self.__traditional
    else:
        print('ERROR: \'{}\' is not a pre-defined initialize method'.format(method))
        return exit(0)

def __kmeanspp(self):
    n, d = self.data.shape
    centers = np.array([self.data[np.random.randint(n), :d]])
    for i in range(self.k-1):
        dist = self.euclidean(self.data, centers)
        dist = np.min(dist, axis=1)
        next_center = np.argmax(dist, axis=0)
        centers = np.vstack((centers, self.data[next_center, :]))
    return centers

def __traditional(self):
    return np.array(self.data[np.random.choice(self.data.shape[0], size=self.k, replace=False), :])

def __kernel_trick(self, gram, ck):
    # ck is the one-hot indicator matrix (n x k); c_count holds each cluster size |C_k|.
    c_count = np.sum(ck, axis=0)
    # dist[j][k] = -2/|C_k| * sum_{n in C_k} K(x_j, x_n) + 1/|C_k|^2 * sum_{p,q in C_k} K(x_p, x_q)
    dist = -2*np.matmul(gram, ck)/c_count + \
        np.matmul(np.ones(ck.shape), (np.matmul(ck.T, np.matmul(gram, ck)))*np.eye(ck.shape[1]))/(c_count**2)
    return dist

def run(self):
    #initial cluster
    centers = self.init()
    dist = self.euclidean(self.data, centers)
    ck = np.zeros((self.data.shape[0], self.k))
    ck[np.arange(dist.shape[0]), np.argmin(dist, axis=1)] = 1

    record = []
    record.append(ck)
    record_iter = 0
    for i in range(self.max_iter):
        #E-step
        if self.is_kernel:
            dist = self.__kernel_trick(self.data, ck)
        else:
            dist = self.euclidean(self.data, centers)

        #M-step
        update_ck = np.zeros(dist.shape)
        update_ck[np.arange(dist.shape[0]),np.argmin(dist, axis=1)] = 1
        delta_ck = np.count_nonzero(np.abs(update_ck - ck))
        update_centers = np.matmul(update_ck.T, self.data)/np.sum(update_ck, axis=0, keepdims=True).T

        record.append(update_ck)
        if delta_ck == 0:
            self.converge -= 1
            if self.converge == 0:
                record_iter = i+1            
                if self.keep_log == False:
                    break

        ck = update_ck
        centers = update_centers
    return record, record_iter

Step 4

Visualization

from matplotlib import colors

k_visual = colors.to_rgba_array(['tab:blue', 'tab:orange', 'tab:green',
                                 'tab:red', 'tab:purple', 'tab:brown'])

def visualizer(record, save_path, figsize=(100,100,4)):
    # Turn each recorded cluster assignment into a color-coded frame of a gif.
    gif = []
    for i in range(len(record)):
        c_id = np.argmax(record[i], axis=1)
        img = np.zeros(figsize, dtype=np.uint8)
        for j in range(c_id.shape[0]):
            m, n = (int(j/100), int(j%100))
            img[m][n] = 255*k_visual[c_id[j]]
        gif.append(img)
    imageio.mimsave(save_path, gif)
        
def merge_gifs(gifs, max_fram, id, method):
    gif = []
    for i in range(len(gifs)):
        gif.append(imageio.get_reader('output/'+gifs[i]))

    new_gif = imageio.get_writer('output/image'+str(id)+'_'+method+'.gif')
    
    if max_fram + 5 < 100:
        max_fram += 5
    for frame_number in range(max_fram):
        img = []
        for i in range(len(gif)):
            img.append(gif[i].get_next_data())
        new_image = np.hstack(img)
        new_gif.append_data(new_image)
    for i in range(len(gif)):
        gif[i].close()
    new_gif.close()
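
Putting the pieces together, a usage sketch might look like the following; the kmeans constructor keywords here are assumptions inferred from the attributes used in run(), and the output path is illustrative:

# Usage sketch (constructor keywords are assumptions, not the exact signature):
# run kernel k-means on the precomputed gram matrix and save the process as a gif.
km = kmeans(gram, k=4, init='kmeans++', is_kernel=True, keep_log=True)
record, record_iter = km.run()
visualizer(record, 'output/image1_k4_kmeanspp.gif')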

Screen Shot


Result and Discussion

  • image1 (k=2, 3, 4, 5, 6)

    Original image:

    $\gamma = 10^{-6}$:

    $\gamma = 10^{-5}$:

    $\gamma = 10^{-4}$:

    $\gamma = 10^{-3}$:

    $\gamma = 10^{-2}$:

    $\gamma = 10^{-1}$:

  • image2 (k=2, 3, 4, 5, 6)

    Original image:

    $\gamma = 10^{-6}$:

    $\gamma = 10^{-5}$:

    $\gamma = 10^{-4}$:

    $\gamma = 10^{-3}$:

    $\gamma = 10^{-2}$:

    $\gamma = 10^{-1}$:

    For the value of the $\gamma$ parameter in the RBF kernel, I found that the lower $\gamma$ is, the larger its effect on the clustering.
    The reason, I think, is that a lower $\gamma$ implies a higher $\sigma$ of the Gaussian distribution (which, in this work, models the distance between a point and the center of its cluster), but a $\gamma$ that is too low (i.e., a $\sigma$ that is too high) may cause underfitting.

  • Traditional v.s. K-means++

    • Both images use $\gamma = 10^{-4}$:

      Traditional:

      K-means++:


      Traditional:

      K-means++:

    At first, I thought that K-means++ was just faster than the traditional initialization, judging by the number of iterations (shown in the Screen Shot part). But when the result for image2 was produced, we could see that the sky area and the rabbit are clustered into different groups when k=6!
    So I believe that K-means++ is more powerful in this work.

Spectral Clustering

Spectral clustering groups data using a similarity graph, which can be computed via a kernel function.
In the similarity graph, the more similar two data points are, the larger their edge weight is. For example, if we compute similarity via a Gaussian (i.e., the RBF kernel), the edge weight expresses how alike the two points are.

But how do we group the data through the similarity graph?
Supported by a rich body of theory, we can use the graph Laplacian matrix and the Rayleigh quotient, and minimize a cut cost derived from the similarity graph.

The cut cost is defined as follows:

$$cut(A,B) = \sum_{i\in A,\, j\in B} w_{ij}$$

But minimizing this alone might result in unbalanced sizes of cluster $A$ and cluster $B$, so two variants of the cut cost are defined as follows:

$$\text{RatioCut}(A,B) = \sum_{i\in A,\, j\in B} w_{ij}\left(\frac{1}{|A|}+\frac{1}{|B|}\right)\qquad \text{Ncut}(A,B) = \sum_{i\in A,\, j\in B} w_{ij}\left(\frac{1}{vol(A)}+\frac{1}{vol(B)}\right)$$

The corresponding variants of spectral clustering also have different names: minimizing RatioCut leads to unnormalized spectral clustering, and minimizing Ncut leads to normalized spectral clustering.

The difference lies not only in the cut objective but also in the graph Laplacian matrix (denoted $L$):

$$\text{Unnormalized: } L = D - W\qquad \text{Normalized: } L = D^{-\frac{1}{2}}(D-W)D^{-\frac{1}{2}}$$

Then we compute the eigenvectors and perform the k-means algorithm on the matrix $H$. Let $T$ be the matrix whose columns are the eigenvectors corresponding to the first $k$ eigenvalues (except the first):

$$\text{Unnormalized: } H = T\qquad \text{Normalized: } H = D^{-\frac{1}{2}}T$$

The Work

  • kernel function:
    $e^{-\gamma_1 \|S(x)-S(x')\|^2} \times e^{-\gamma_2 \|C(x)-C(x')\|^2}$
  • Input data: Two 100*100 images

Because computing the eigenvalues and eigenvectors of a $10^4 \times 10^4$ similarity matrix is too time-consuming, I only present the results for k = 3 and k = 4.
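
One possible speed-up (not used in this report) would be to compute only the $k$ smallest eigenpairs of the symmetric Laplacian with scipy.sparse.linalg.eigsh instead of the full decomposition:

# Optional speed-up sketch (not part of the original implementation):
# eigsh returns only the k smallest eigenpairs of the symmetric matrix L.
from scipy.sparse.linalg import eigsh
eigenvalues, eigenvectors = eigsh(L, k=4, which='SM')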

Step 1

Prepare the image data for precomputing the similarity matrix (kernel), and compute the similarity matrix.

This step is the same as Steps 1 and 2 of Kernel K-means, so I omit it here.

Step 2

Compute eigenvalues and eigenvectors and sort the eigenvectors according to the first k eigenvalues.

class spectral_clustering():
    def __init__(self, data_similarity , \
            k=2, \
            normalize=False, \
            keep_log=False):
        self.k = k
        self.W = data_similarity
        self.D = np.sum(data_similarity, axis=1, keepdims=True) * np.eye(data_similarity.shape[0])
        self.normalize = normalize
        self.keep_log = keep_log

    def __eig(self, A):
        if self.normalize:
            # Normalized Laplacian: L_sym = D^(-1/2) (D - W) D^(-1/2), then
            # recover H = D^(-1/2) T from the eigenvectors T of L_sym.
            sqrt_D = np.sqrt(self.D)
            neg_sqrt_D = np.linalg.inv(sqrt_D)
            N = np.matmul(np.matmul(neg_sqrt_D, A), neg_sqrt_D)
            eigenvals, eigenvecs = np.linalg.eig(N)
            eigenvecs = np.matmul(neg_sqrt_D, eigenvecs.real)
            return eigenvals, eigenvecs
        else:
            return np.linalg.eig(A)

    def __get_sorted_k_eigen(self, A, k):
        eigenvalues, eigenvectors = self.__eig(A)

        sorted_idx = np.argsort(eigenvalues)
        sorted_eigenvalues = []
        sorted_eigenvectors = []
        for i in range(k):
            vector = eigenvectors[:, sorted_idx[i]]
            sorted_eigenvectors.append(vector[:, None])
            sorted_eigenvalues.append(eigenvalues[sorted_idx[i]])        
        sorted_eigenvalues = np.array(sorted_eigenvalues)
        sorted_eigenvectors = np.concatenate(sorted_eigenvectors, axis=1)
    
        return sorted_eigenvalues, sorted_eigenvectors       

    def run(self):
        self.L = self.D - self.W
        k_eigenvalues, k_eigenvectors = self.__get_sorted_k_eigen(self.L, self.k)
        km = kmeans(k_eigenvectors, k=self.k, keep_log=self.keep_log)
        return km.run(), k_eigenvalues, k_eigenvectors
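
A usage sketch, reusing the similarity matrix gram precomputed in Kernel K-means Step 2 (normalize=True switches to the normalized variant):

# Usage sketch: cluster with the precomputed similarity matrix 'gram'.
sc = spectral_clustering(gram, k=4, normalize=False, keep_log=True)
(record, record_iter), eigenvalues, eigenvectors = sc.run()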

Step 3

Perform K-means algorithm.

The K-means algorithm was discussed above; I simply call the same Python code I wrote.

def run(self):
    self.L = self.D - self.W
    k_eigenvalues, k_eigenvectors = self.__get_sorted_k_eigen(self.L, self.k)
    km = kmeans(k_eigenvectors, k=self.k, keep_log=self.keep_log)
    return km.run(), k_eigenvalues, k_eigenvectors

Step 4

Examine whether the data points within the same cluster have the same coordinates in the eigenspace of the graph Laplacian.

import matplotlib.pyplot as plt

def reorder_by_cluster(c):
    # Reorder the data indices so that points in the same cluster are contiguous.
    new_order = np.array([])
    for i in range(c.shape[1]):
        new_order = np.append(new_order, np.where(c[:,i]==1)[0])
    new_order = new_order.astype('int32')
    return new_order

def show_vectors_by_clusters(vectors, clusters, fig_path):
    reorder_idx = reorder_by_cluster(clusters)
    num_cluster = np.sum(clusters, axis=0, dtype=int)

    iter_idx = 0
    for k in range(clusters.shape[1]):
        plt.subplot(1, clusters.shape[1], k+1)
        for j in range(num_cluster[k]):
            plt.plot(vectors[reorder_idx[iter_idx], :])
            plt.title('cluster '+str(k+1))
            iter_idx += 1
    plt.savefig(fig_path)
    plt.show(block = False)
    plt.pause(1)
    plt.close()
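
A usage sketch for this examination (eigenvectors and record come from the spectral clustering run above; the figure path is illustrative):

# Plot each point's k eigenvector coordinates, grouped by its final cluster.
show_vectors_by_clusters(eigenvectors, record[-1], 'output/image1_eigenspace.png')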

Step 5

Visualize the result of spectral clustering.

This step is also the same as Step 4 of Kernel K-means.

Screen Shot

The output log is too long, as with Kernel K-means, so I only upload part of the output.

Result and Discussion

  • image1 ( k = 4, 5 )

    • Normalized:

    • eigenvectors:

    • Unnormalized:

    • eigenvectors:

  • image2 ( k = 4, 5 )

    • Normalized:

    • eigenvectors:

    • Unnormalized:

    • eigenvectors:

According to the results shown above, we cannot say that the data points within the same cluster have exactly the same coordinates in the eigenspace of the graph Laplacian matrix $L$; they are different, but similar, and K-means groups the similar vectors into the same cluster.

Based on these results, the unnormalized version seems to perform better than the normalized one.