
2020 Machine Learning Homework 7

School: National Chiao Tung University
Name: Shao-Wei ( Willy ) Chiu

tags: NCTU, Machine Learning, Embedding, PCA, LDA, SNE, t-SNE, GitHub

Note: This report was made on HackMD.io; due to the restrictions of the .pdf format, the .gif animations will not display. Please view it at https://hackmd.io/@swchiu/SJQehejA8, thanks.

PCA / LDA

When a dataset's feature space grows to very high dimensions, some features become useless. We prefer to extract the more important features of the dataset.

Let
$$x_i = \begin{bmatrix} x_{i1} & x_{i2} & \dots & x_{id} \end{bmatrix},\qquad X_{n\times d} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

PCA (PCA.py)

PCA is a method that finds a projection matrix $W_{d\times k}$ which projects the data $X_{n\times d}$ into $Z_{n\times k}$, i.e.,
$$Z_{n\times k} = X_{n\times d}\,W_{d\times k},$$
where $W_{d\times k}$ is chosen so that $Z_{n\times k}$ has maximal variance, which also means that $Z_{n\times k}$ has minimal squared reconstruction error with respect to $X_{n\times d}$:
$$\begin{aligned}
\arg\max_W\ \sigma^2(Z) &= \frac{1}{n}\sum_i (z_i-\bar{Z})^T(z_i-\bar{Z})\\
&= \frac{1}{n}\sum_i (x_iW-\bar{x}W)^T(x_iW-\bar{x}W)\\
&= \frac{1}{n}\sum_i \big((x_i-\bar{x})W\big)^T\big((x_i-\bar{x})W\big)\\
&= W^T\Big[\frac{1}{n}\sum_i (x_i-\bar{x})^T(x_i-\bar{x})\Big]W\\
&= W^T S W \quad(\text{diagonal values})
\end{aligned}$$

By the Rayleigh quotient, the eigenvectors corresponding to the $k$ largest eigenvalues of $S$ form our $W$ (i.e., we solve $Sv=\lambda v$).
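
For a single projection direction $w$, this is the standard Rayleigh quotient fact (spelled out here for completeness):
$$\max_{w\neq 0}\frac{w^T S w}{w^T w}=\lambda_{\max}(S),$$
with the maximum attained at the corresponding eigenvector; repeating the argument under orthogonality constraints yields the $k$ leading eigenvectors as the columns of $W$.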

Look at the $S_{d\times d}$ matrix: solving its eigen-problem is time-consuming when $d$ is very large, so we can apply a little trick.

Let
$$\begin{aligned}
\phi(X) &= X - \bar{X},\\
C_{n\times n} &= \frac{1}{n}(X-\bar{X})(X-\bar{X})^T = \frac{1}{n}\phi(X)\phi(X)^T,\\
S_{d\times d} &= \frac{1}{n}(X-\bar{X})^T(X-\bar{X}) = \frac{1}{n}\phi(X)^T\phi(X).
\end{aligned}$$
Then, solving
$$\begin{aligned}
Cv &= \lambda v\\
\phi(X)^T C v &= \lambda\, \phi(X)^T v\\
\phi(X)^T \tfrac{1}{n}\phi(X)\phi(X)^T v &= \lambda\, \phi(X)^T v\\
\big(\tfrac{1}{n}\phi(X)^T\phi(X)\big)\,\phi(X)^T v &= \lambda\, \phi(X)^T v\\
S\,\big(\phi(X)^T v\big) &= \lambda\,\big(\phi(X)^T v\big)\\
SW &= \lambda W
\end{aligned}$$

So, we can solve the eigen-problem in the $n\times n$ dimension and recover $W=\phi(X)^T V=(X-\bar{X})^T V$ (note that the code below also rescales each column of $W$ by $1/\sqrt{\lambda}$).

Then, we can project the data as $XW$ and $X_{test}W$.

def __get_covariance(self, datas):
        n = datas.shape[0]
        if self.is_kernel:
            ...
        else:
            # Use the n x n trick described above: C = (1/n) * phi(X) phi(X)^T
            S = datas - np.sum(datas, axis=0) / n
            S = np.matmul(S, S.T) / n
            return S
...
def run(self):
        S = self.__get_covariance(self.datas)
        e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
        mean_datas = self.datas - np.sum(
            self.datas, axis=0) / self.datas.shape[0]
        if self.is_kernel:
            ...
        else:
            W = np.matmul(mean_datas.T, e_vecs)
            W /= np.sqrt(e_vals)
            pca_space = np.matmul(self.datas, W)
        return pca_space, W
...
pca = PCA(train_faces, k=25)
train_space, W_train = pca.run()
test_space = np.matmul(test_faces, W_train)
...

Kernel PCA (PCA.py)

First, map the data into the feature space,
$$X_{n\times d} \rightarrow \phi(X)_{n\times m},$$
and then compute the covariance as PCA does:
$$\begin{aligned}
\mathrm{cov}(\phi(X)) &= \frac{1}{n}\big(\phi(X)-\mathbf{1}_n\phi(X)\big)^T\big(\phi(X)-\mathbf{1}_n\phi(X)\big),\quad\big(\text{let }\Psi(X)=\phi(X)-\mathbf{1}_n\phi(X)\big)\\
&= \frac{1}{n}\Psi(X)^T\Psi(X) = S
\end{aligned}$$
where $\mathbf{1}_n$ is an $n\times n$ matrix whose elements are all $\frac{1}{n}$ (so $\mathbf{1}_n\phi(X)$ is the row-wise mean of $\phi(X)$ replicated $n$ times).

We want to solve $SW=\lambda W$, but $\Psi(X)$ is not explicit.

Assume $X$ is $d\times n$; then
$$\begin{aligned}
Sw &= \lambda w\\
\frac{1}{n}XX^Tw &= \lambda w\\
\frac{1}{n\lambda}\sum_i x_i(x_i^Tw) &= w\\
\sum_i\Big[\frac{1}{n\lambda}(x_i^Tw)\Big]x_i &= w\\
\sum_i \alpha_i x_i &= w\\
XA &= w
\end{aligned}$$

We can observe that the eigenvector is a linear combination of $X$, so we can rewrite the eigen-problem as follows (here $X$ is $n\times d$ again, so $w=\Psi(X)^TA$):
$$\begin{aligned}
\frac{1}{n}\Psi(X)^T\Psi(X)\,w &= \lambda w\\
\frac{1}{n}\Psi(X)^T\Psi(X)\,\Psi(X)^TA &= \lambda\,\Psi(X)^TA\\
\frac{1}{n}\Psi(X)\Psi(X)^T\Psi(X)\Psi(X)^TA &= \lambda\,\Psi(X)\Psi(X)^TA\\
\frac{1}{n}K_cK_cA &= \lambda K_cA\\
\frac{1}{n}K_cA &= \lambda A
\end{aligned}$$

What is $K_c$?
$$\begin{aligned}
\text{let } K &= \phi(X)\phi(X)^T\\
K_c = \Psi(X)\Psi(X)^T &= \big(\phi(X)-\mathbf{1}_n\phi(X)\big)\big(\phi(X)-\mathbf{1}_n\phi(X)\big)^T\\
&= \phi(X)\phi(X)^T-\mathbf{1}_n\phi(X)\phi(X)^T-\phi(X)\phi(X)^T\mathbf{1}_n+\mathbf{1}_n\phi(X)\phi(X)^T\mathbf{1}_n\\
&= K-\mathbf{1}_nK-K\mathbf{1}_n+\mathbf{1}_nK\mathbf{1}_n
\end{aligned}$$
where $\mathbf{1}_n$ is an $n\times n$ matrix whose elements are all $\frac{1}{n}$.

def __get_covariance(self, datas):
        n = datas.shape[0]
        if self.is_kernel:
            # Here `datas` is the kernel matrix K; return the centered kernel
            # K_c = K - 1_n K - K 1_n + 1_n K 1_n
            N1 = np.ones((n, n)) / n
            S = (datas - np.matmul(N1, datas) - np.matmul(datas, N1) +
                 np.matmul(np.matmul(N1, datas), N1))
            return S
        else:
            ...

How do we project the data?
$$\phi(X)W = \phi(X)\Psi(X)^TA = \phi(X)\big(\phi(X)-\mathbf{1}_n\phi(X)\big)^TA = (K-\mathbf{1}_nK)A$$

def run(self):
        S = self.__get_covariance(self.datas)
        e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
        mean_datas = self.datas - np.sum(
            self.datas, axis=0) / self.datas.shape[0]
        if self.is_kernel:
            W = e_vecs
            W /= np.sqrt(e_vals)
            N1 = np.ones(self.datas.shape) / self.datas.shape[0]
            pca_space = np.matmul((self.datas - np.matmul(N1, self.datas)), W)
        else:
            ...
            
...

K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()

How about projecting new data $X_{test}$?
$$\phi(X_{test})W = \phi(X_{test})\Psi(X)^TA = \phi(X_{test})\big(\phi(X)-\mathbf{1}_n\phi(X)\big)^TA = (K_{test\_train}-\mathbf{1}_{m\times n}K)A$$
where $\mathbf{1}_{m\times n}$ is an $m\times n$ matrix whose elements are all $\frac{1}{n}$.

K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()
NM1 = np.ones(K_test_train.shape) / K_train.shape[0]
### Project testing data at next line
test_space = np.matmul(K_test_train - np.matmul(NM1, K_train), alpha)

LDA (LDA.py)

Unlike PCA, which projects data into the eigen-feature space, LDA projects data so as to maximize the distance between different labels/clusters and minimize the distance within each label/cluster.

We solve for the projection matrix $W$ from the eigen-problem below:
$$S_W^{-1}S_B\,w = \lambda w,\qquad\text{where}\qquad
S_W=\sum_{j=1}^{k}S_j,\quad S_j=\sum_{i\in C_j}(x_i-m_j)(x_i-m_j)^T,\quad m_j=\frac{1}{n_j}\sum_{i\in C_j}x_i,$$
$$S_B=\sum_{j=1}^{k}S_{B_j}=\sum_{j=1}^{k}n_j(m_j-m)(m_j-m)^T.$$

But $S_W$ may be singular when $n<d$ (it is a sum of at most $n$ rank-one terms, so $\mathrm{rank}(S_W)\le n-k<d$), so we can use its pseudo-inverse instead.
In our work, $n$ is indeed smaller than $d$, so I compute the pseudo-inverse directly.

import numpy as np


class LDA():
    def __init__(self, datas, labels, k, is_kernel=False):
        self.datas = datas
        self.n = datas.shape[0]
        self.labels = labels
        self.k = k
        self.is_kernel = is_kernel

    def __count_labels(self):
        C = np.zeros((self.n, len(np.unique(self.labels))))
        for idx, j in enumerate(np.unique(self.labels)):
            C[self.labels == j, idx] = 1
        return C

    def __get_Sb_Sw(self, C):
        Mj = np.matmul(self.datas.T, C) / np.sum(C, axis=0)
        M = np.sum(self.datas.T, axis=1) / self.datas.shape[0]

        B = Mj - M[:, None]
        Sb = np.matmul(B * np.sum(C, axis=0), B.T)

        W = self.datas.T - np.matmul(Mj, C.T)
        Sw = np.zeros(Sb.shape)
        for group in np.unique(self.labels):
            w = W[:, self.labels == group]
            Sw += (np.matmul(w, w.T) / w.shape[1])
        return Sb, Sw

    def __get_sorted_eigen(self, A, k):
        eigenvalues, eigenvectors = np.linalg.eigh(A)
        sorted_idx = np.flip(np.argsort(eigenvalues))
        sorted_eigenvalues = []
        sorted_eigenvectors = []
        for i in range(k):
            vector = eigenvectors[:, sorted_idx[i]]
            sorted_eigenvectors.append(vector[:, None])
            sorted_eigenvalues.append(eigenvalues[sorted_idx[i]])
        sorted_eigenvalues = np.array(sorted_eigenvalues)
        sorted_eigenvectors = np.concatenate(sorted_eigenvectors, axis=1)

        return sorted_eigenvalues, sorted_eigenvectors

    def run(self):
        C = self.__count_labels()
        Sb, Sw = self.__get_Sb_Sw(C)
        #try:
        #    Sw_inv = np.linalg.inv(Sw)
        #except:
        #    Sw_inv = np.linalg.pinv(Sw)
        Sw_inv = np.linalg.pinv(Sw)
        obj_matrix = np.matmul(Sw_inv, Sb)
        value, vector = self.__get_sorted_eigen(obj_matrix, self.k)
        return np.matmul(self.datas, vector), vector
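
The corresponding driver code is not listed above; a minimal usage sketch consistent with the PCA section (reusing `train_faces`, `test_faces`, and `train_labels` from there; whether main.py does exactly this is my assumption):

lda = LDA(train_faces, train_labels, k=25)
train_space, W_lda = lda.run()
# project testing data with the same eigenvectors
test_space = np.matmul(test_faces, W_lda)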

Kernel LDA (LDA.py)

Referenced from Wikipedia.

But I could not map the testing data into the training space correctly before the deadline, so my face-recognition accuracy with kernel LDA is low.
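
For reference, a minimal sketch of the multi-class kernel Fisher discriminant eigen-problem as formulated on Wikipedia; this is not the code I actually used, and the function name and details are my own:

import numpy as np


def kernel_lda_sketch(K, labels, k):
    # K: n x n kernel matrix of the training data; labels: length-n array
    n = K.shape[0]
    labels = np.asarray(labels)
    m_star = np.mean(K, axis=1)      # (M_*)_i = (1/n) * sum_t k(x_i, x_t)
    M = np.zeros((n, n))
    N = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        n_c = len(idx)
        K_c = K[:, idx]              # n x n_c block for class c
        m_c = np.mean(K_c, axis=1)   # (M_c)_i = (1/n_c) * sum_{t in c} k(x_i, x_t)
        M += n_c * np.outer(m_c - m_star, m_c - m_star)
        N += K_c @ (np.eye(n_c) - np.ones((n_c, n_c)) / n_c) @ K_c.T
    # Leading eigenvectors of pinv(N) @ M; pinv guards against a singular N
    e_vals, e_vecs = np.linalg.eig(np.linalg.pinv(N) @ M)
    order = np.argsort(-e_vals.real)[:k]
    A = e_vecs[:, order].real
    return np.matmul(K, A), A


# New data would then be mapped with the cross kernel:
#   test_space = np.matmul(K_test_train, A)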

K-Nearest Neighbor

Each data point is assigned to the class that has the largest number of members among its k nearest neighbors.

import numpy as np


class KNN():
    def __init__(self, train_datas, test_datas, labels, k=5):
        self.train_datas = train_datas
        self.test_datas = test_datas
        self.labels = labels
        self.k = k

    def __eucidiance(self, U, V):
        return np.matmul(U**2, np.ones(
            (U.shape[1], V.shape[0]))) - 2 * np.matmul(U, V.T) + np.matmul(
                np.ones((U.shape[0], V.shape[1])), (V.T)**2)

    def run(self):
        dist = self.__eucidiance(self.test_datas, self.train_datas)
        closet = np.argsort(dist, axis=1)

        y = []
        for i in range(self.test_datas.shape[0]):
            for j in range(self.k):
                y.append(self.labels[closet[i][j]])
        y = np.array(y)
        y = y.reshape(self.test_datas.shape[0], self.k)

        count = []
        for i in range(len(self.labels)):
            count.append(np.count_nonzero(y == self.labels[i], axis=1))
        count = np.vstack(count)
        y = np.argmax(count, axis=0)
        predict = [None] * self.test_datas.shape[0]
        for i in range(self.test_datas.shape[0]):
            predict[i] = self.labels[y[i]]

        return predict
        
...

knn = KNN(train_space, test_space, train_labels, k=5)
predict = knn.run()
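
The accuracies reported below are simply the fraction of correctly predicted subjects; a sketch of how this could be counted (assuming test_labels comes from read_faces on the testing directory, which is my assumption):

correct = sum(p == t for p, t in zip(predict, test_labels))
print('Face-recognition accuracy: %d/%d = %.2f%%' %
      (correct, len(test_labels), 100. * correct / len(test_labels)))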

Kernel ( in util.py )

I use three different kernels: linear, polynomial, and RBF.

def euclidean(u, v):
    return np.matmul(u**2, np.ones(
        (u.shape[1], v.shape[0]))) - 2 * np.matmul(u, v.T) + np.matmul(
            np.ones((u.shape[0], v.shape[1])), (v.T)**2)


def rbf(u, v, g=10**-11):
    return np.exp(-1 * g * euclidean(u, v))


def linear(u, v):
    return np.matmul(u, v.T)


def polynomial(u, v, g=0.7, coef0=10, d=5):
    return ((g * np.matmul(u, v.T)) + coef0)**d

Other Methods in My Implementation ( in util.py )

The read_faces method reads the dataset and resizes each image to 100×100.
The show_faces method concatenates the face matrices along columns and rows for visualization.

import os

import numpy as np
import PIL.Image
import matplotlib.pyplot as plt


def read_faces(dir_path):
    faces = []
    labels = []
    for filename in os.listdir(dir_path):
        with PIL.Image.open(dir_path + filename) as im:
            im = im.resize((100, 100), PIL.Image.BILINEAR)
            faces.append([np.array(im).reshape(1, -1)])
            labels.append(filename.split('.', 1)[0])
    faces = np.concatenate(faces, axis=0)
    faces = faces.reshape(faces.shape[0], faces.shape[2])
    faces = faces.astype('int64')
    return faces, labels


def show_faces(faces, filename=None, col=5):
    #f = faces.reshape(-1,231,195)
    f = faces.reshape(-1, 100, 100)
    n = f.shape[0]
    all_faces = []
    for i in range(int(n / col)):
        all_faces.append([np.concatenate(f[col * i:col * (i + 1)], axis=1)])

    all_faces = np.concatenate(all_faces[:], axis=1)
    all_faces = all_faces.reshape(all_faces.shape[1], all_faces.shape[2])

    plt.figure(figsize=(1.5 * col, 1.5 * n / col))
    plt.title(filename)
    plt.imshow(all_faces, cmap='gray')
    plt.subplots_adjust(left=0.05, right=0.95, top=0.85, bottom=0.15)

    if (filename):
        plt.savefig('./output/' + filename)
    else:
        plt.show()

Result and Discussion

  1. The first 25 eigenfaces and fisherfaces

  2. Reconstruction of 10 random faces


  3. and 4. Face recognition via PCA/KPCA and LDA/KLDA

    (base) swchiu@gpuserv1:~/Embedding$ python3 main.py
    PCA
    Face-reconition accuracy: 27/30 = 90.00%

    Kernel PCA
    (rbf) Face-reconition accuracy: 27/30 = 90.00%
    (polynomial) Face-reconition accuracy: 25/30 = 83.33%
    (linear) Face-reconition accuracy: 27/30 = 90.00%

    LDA
    Face-reconition accuracy: 28/30 = 93.33%

    Kernel LDA
    (rbf) Face-reconition accuracy: 24/30 = 80.00%
    (polynomial) Face-reconition accuracy: 18/30 = 60.00%
    (linear) Face-reconition accuracy: 18/30 = 60.00%


    LDA performs better than PCA here. I think the reason is that LDA, by design, separates different labels/clusters as widely as possible, whereas PCA only maps the data so as to preserve the maximum variance.
    In this case we classify with KNN, which is based on Euclidean distance, and LDA separates the classes widely in exactly that Euclidean sense.

    RBF and linear look similar in KPCA, and RBF performs best in KLDA.
    Polynomial performs worst in both KPCA and KLDA.

    But the KLDA results did not meet my expectation. I think the problem is that I map the testing data in a wrong way. I tried to derive the formula on my own, but could not make sense of it before the deadline.

t-SNE / Symmetric SNE

The code is referenced from https://lvdmaaten.github.io/tsne/code/tsne_python.zip and modified from it.

Modify t-SNE into symmetric SNE

t-SNE uses a different distribution on the reduced data (i.e., the Student t-distribution, which has heavier tails than the Gaussian).

Because t-SNE uses this heavier-tailed distribution, the crowding problem is relieved.

We can see that points at moderate distances are placed farther apart in the t-SNE embedding than in the symmetric SNE one.
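
As a small numerical comparison (my own numbers, just for illustration): at a squared embedding distance of $9$,
$$\exp(-9)\approx 1.2\times10^{-4}\qquad\text{vs.}\qquad (1+9)^{-1}=0.1,$$
so the Student-t kernel keeps a much larger similarity for moderately distant pairs, which is exactly what relieves the crowding.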

I will show the crowding problem that symmetric SNE faces in the results section below.

The concept of SNE is to preserve the probability distribution of pairwise similarities after the dimensionality reduction.

Look at the definitions of the pairwise probabilities in symmetric SNE and t-SNE:
$$\text{symmetric SNE:}\quad p_{ij}=\frac{\exp(-\lVert x_i-x_j\rVert^2/2\sigma^2)}{\sum_{k\neq l}\exp(-\lVert x_l-x_k\rVert^2/2\sigma^2)},\qquad q_{ij}=\frac{\exp(-\lVert y_i-y_j\rVert^2)}{\sum_{k\neq l}\exp(-\lVert y_l-y_k\rVert^2)}$$
$$\text{t-SNE:}\quad p_{ij}=\frac{\exp(-\lVert x_i-x_j\rVert^2/2\sigma^2)}{\sum_{k\neq l}\exp(-\lVert x_l-x_k\rVert^2/2\sigma^2)},\qquad q_{ij}=\frac{(1+\lVert y_i-y_j\rVert^2)^{-1}}{\sum_{k\neq l}(1+\lVert y_l-y_k\rVert^2)^{-1}}$$

We can see that the difference lies in the $q_{ij}$ part: t-SNE uses a different distribution to describe it.

The gradient descent derivation is also different due to the different $q_{ij}$:
$$C = KL(P\,\|\,Q) = \sum_i\sum_{j\neq i} p_{ij}\log\frac{p_{ij}}{q_{ij}}$$
$$\text{symmetric SNE:}\quad \frac{\partial C}{\partial y_i} = 2\sum_j (p_{ij}-q_{ij})(y_i-y_j)$$
$$\text{t-SNE:}\quad \frac{\partial C}{\partial y_i} = 4\sum_j (p_{ij}-q_{ij})(y_i-y_j)\big(1+\lVert y_i-y_j\rVert^2\big)^{-1}$$

So we only need to find the specific code segments related to these parts and modify them into the symmetric SNE form.

Looking at $C$, we can observe that when points are close in the high-dimensional space, they cannot end up spread far apart in the low-dimensional space either (that would be penalized heavily).
But if points are far apart in the high-dimensional space, they may still end up close in the low-dimensional space!
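
A small worked example of this asymmetry in the cost (my own numbers): a pair that is similar in high dimension but placed far apart in the embedding is penalized far more heavily than the reverse:
$$p_{ij}=0.8,\ q_{ij}=0.1:\ \ p_{ij}\log\frac{p_{ij}}{q_{ij}}\approx 1.66,\qquad p_{ij}=0.1,\ q_{ij}=0.8:\ \ p_{ij}\log\frac{p_{ij}}{q_{ij}}\approx -0.21.$$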

The first part is shown below (lines 14 to 15):

...
# Compute pairwise affinities
sum_Y = np.sum(np.square(Y), 1)
num = -2. * np.dot(Y, Y.T)
num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
num[range(n), range(n)] = 0.
Q = num / np.sum(num)
Q = np.maximum(Q, 1e-12)

if sym_sne:
    sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
    num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
    # different between t-sne region
    num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
    num_ssne = np.exp(-1 * num_ssne)
    # different between t-sne region
    num_ssne[range(n), range(n)] = 0.
    Q_ssne = num_ssne / np.sum(num_ssne)
    Q_ssne = np.maximum(Q_ssne, 1e-12)
...

And the second part is lines 9 to 10:

...
dY[i, :] = np.sum(
    np.tile(PQ[:, i] * num[:, i], (no_dims, 1)).T * (Y[i, :] - Y), 0)
if sym_sne:
    # different between t-sne
    dY_ssne[i, :] = np.sum(
        np.tile(PQ_ssne[:, i], (no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)
...

I modified the code so that both t-SNE and symmetric SNE are performed by calling the tsne() method only once, so I duplicate some of the t-SNE variables to run symmetric SNE at the same time.
I also store Y into a list every ten iterations for visualization.
In addition, I implemented some utility methods for visualization (e.g., make_gif(...), show_similarity(...)).

import numpy as np
import matplotlib.pyplot as plt
# Camera is assumed to come from the celluloid package
from celluloid import Camera


def make_gif(record, labels, method, perplexity):
    camera = Camera(plt.figure())
    plt.title(method + ' with Perplexity=' + str(perplexity))
    for i in range(len(record)):
        img = plt.scatter(record[i][:, 0], record[i][:, 1], 20, labels)
        camera.snap()
    anim = camera.animate(interval=5, repeat_delay=20)
    anim.save(
        'output/' + method + '_' + str(perplexity) + '.gif', writer='pillow')
    plt.scatter(record[-1][:, 0], record[-1][:, 1], 20, labels)
    plt.savefig('output/' + method + '_' + str(perplexity) + '.png')


def show_similarity(S, labels, title, filename, perplexity):
    n = len(S)
    sort_idx = np.concatenate(
        [np.where(labels == l)[0] for l in np.unique(labels)])
    plt.figure(figsize=(10 * n, 7.5))
    S = [np.log(p[:, sort_idx][sort_idx, :]) for p in S]
    all_min = min([np.min(p) for p in S])
    all_max = max([np.max(p) for p in S])
    for i in range(n):
        plt.subplot(1, n, i + 1)
        plt.title(title[i])
        im = plt.imshow(S[i], cmap='gray', vmin=all_min, vmax=all_max)
        plt.colorbar(im)
    plt.savefig('output/' + filename + '_' + str(perplexity) + '.png')


def gradient_descend(iter, Y, iY, dY, gains, min_gain, momentum, eta):
    gains = (gains + 0.2) * ((dY > 0.) != (iY > 0.)) + (gains * 0.8) * (
        (dY > 0.) == (iY > 0.))
    gains[gains < min_gain] = min_gain
    iY = momentum * iY - eta * (gains * dY)
    Y = Y + iY
    Y = Y - np.tile(np.mean(Y, 0), (Y.shape[0], 1))

    return Y, iY, gains


def tsne(
        X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0,
        sym_sne=False):
    """
        Runs t-SNE on the dataset in the NxD array X to reduce its
        dimensionality to no_dims dimensions. The syntaxis of the function is
        `Y = tsne.tsne(X, no_dims, perplexity), where X is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array X should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # Initialize variables
    (n, d) = X.shape
    max_iter = 1000
    initial_momentum = 0.5
    final_momentum = 0.8
    eta = 500
    min_gain = 0.01

    record = []
    Y = np.random.randn(n, no_dims)
    dY = np.zeros((n, no_dims))
    iY = np.zeros((n, no_dims))
    gains = np.ones((n, no_dims))

    if sym_sne:
        record_ssne = []
        Y_ssne = Y.copy()
        dY_ssne = np.zeros((n, no_dims))
        iY_ssne = np.zeros((n, no_dims))
        gains_ssne = np.ones((n, no_dims))
        Q_ssne = np.zeros((n, n))

    # Compute P-values (x2p comes from the original tsne.py; it binary-searches
    # each sigma so that the conditional distribution reaches the given perplexity)
    P = x2p(X, 1e-5, perplexity)
    P = P + np.transpose(P)
    P = P / np.sum(P)
    P = P * 4.  # early exaggeration
    P = np.maximum(P, 1e-12)

    # Run iterations
    for iter in range(max_iter):

        # Compute pairwise affinities
        sum_Y = np.sum(np.square(Y), 1)
        num = -2. * np.dot(Y, Y.T)
        num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
        num[range(n), range(n)] = 0.
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)

        if sym_sne:
            sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
            num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
            # different between t-sne region
            num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
            num_ssne = np.exp(-1 * num_ssne)
            # different between t-sne region
            num_ssne[range(n), range(n)] = 0.
            Q_ssne = num_ssne / np.sum(num_ssne)
            Q_ssne = np.maximum(Q_ssne, 1e-12)

        # Compute gradient
        PQ = P - Q
        if sym_sne:
            PQ_ssne = P - Q_ssne
        for i in range(n):
            dY[i, :] = np.sum(
                np.tile(PQ[:, i] * num[:, i], (no_dims, 1)).T * (Y[i, :] - Y),
                0)
            if sym_sne:
                # different between t-sne
                dY_ssne[i, :] = np.sum(
                    np.tile(PQ_ssne[:, i],
                            (no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)

        # Perform the update
        if iter < 20:
            momentum = initial_momentum
        else:
            momentum = final_momentum

        Y, iY, gains = gradient_descend(iter, Y, iY, dY, gains, min_gain,
                                        momentum, eta)
        if sym_sne:
            Y_ssne, iY_ssne, gains_ssne = gradient_descend(
                iter, Y_ssne, iY_ssne, dY_ssne, gains_ssne, min_gain, momentum,
                eta)

        # Compute current value of cost function
        if (iter + 1) % 10 == 0:
            C = np.sum(P * np.log(P / Q))
            record.append(Y)
            if sym_sne:
                C_ssne = np.sum(P * np.log(P / Q_ssne))
                record_ssne.append(Y_ssne)
                print("Iter %4d: tsne error = %f, sym-sne error = %f" %
                      (iter + 1, C, C_ssne))
            else:
                print("Iter %4d: tsne error = %f" % (iter + 1, C))

        # Stop lying about P-values
        if iter == 100:
            P = P / 4.

    # Return solution
    if not sym_sne:
        return record, P, Q
    else:
        return (record, record_ssne), P, (Q, Q_ssne)
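
My main script is not shown here; a usage sketch under the assumption that the MNIST files shipped with the original tsne_python package (mnist2500_X.txt and mnist2500_labels.txt) are loaded this way:

X = np.loadtxt("mnist2500_X.txt")
labels = np.loadtxt("mnist2500_labels.txt")
perplexity = 30.0
(rec_tsne, rec_ssne), P, (Q, Q_ssne) = tsne(
    X, no_dims=2, initial_dims=50, perplexity=perplexity, sym_sne=True)
make_gif(rec_tsne, labels, 'tsne', perplexity)
make_gif(rec_ssne, labels, 'ssne', perplexity)
show_similarity([P, Q, Q_ssne], labels,
                ['P (high-dim)', 'Q (t-SNE)', 'Q (symmetric SNE)'],
                'similarity', perplexity)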

Result and Discussion

  1. Visualize the embedding of both t-SNE and symmetric SNE

    • t-SNE with perplexity=64
    • symmetric SNE with perplexity=64

  2. Visualize the distribution of pairwise similarities in both high-dimensional space and low-dimensional space, based on both t-SNE and symmetric SNE
    ( P, Q with t-SNE, sym-SNE, respectively )

    We can see that the t-SNE similarity map looks lighter; this is because the t-distribution has heavier tails, so pairs that are far apart still receive a relatively higher probability than they would under the Gaussian.

  3. Play with different perplexity values

    We can see that when the perplexity becomes large, the t-SNE result becomes more similar to the symmetric SNE one. I think this is related to entropy: a large perplexity implies a large entropy, so the original distribution is closer to uniform; conversely, a small perplexity implies a small entropy, and the original distribution is closer to a delta distribution (see the relation at the end of this section).

    Also, t-SNE has a hard time telling points apart when the distribution is nearly uniform, because the high-dimensional data all look similar and the result becomes crowded.

    For the small-perplexity case, I observe that t-SNE seems to split some labels into several clusters.
    I think this is because, with a small perplexity, each point only attends to a few neighbors, so the high-dimensional data appear more dissimilar to each other.
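
    For reference, the relation behind this argument: x2p chooses each bandwidth $\sigma_i$ by binary search so that the conditional distribution $p_{j|i}$ reaches a fixed perplexity,
    $$\mathrm{Perp}(P_i)=2^{H(P_i)},\qquad H(P_i)=-\sum_j p_{j|i}\log_2 p_{j|i},$$
    so a larger perplexity directly corresponds to a higher entropy, i.e., a flatter, more uniform $P_i$.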