School: National Chiao Tung University
Name: Shao-Wei (Willy) Chiu
Tags: NCTU, Machine Learning, Embedding, PCA, LDA, SNE, t-SNE, GitHub
Note: This report was made on HackMD.io. Due to export restrictions, the .gif animations may not display here; please view the report at https://hackmd.io/@swchiu/SJQehejA8. Thanks.
When a dataset's feature space grows to very high dimensions, some of the features become redundant or uninformative. We would rather extract the more important features of the dataset. Let $X \in \mathbb{R}^{n\times d}$ denote the data matrix, with $n$ samples and $d$ features.
(PCA.py)

PCA is a method to find a projection matrix $W$ that maps the data onto the directions of maximum variance:

$$\max_{W}\ \operatorname{tr}\!\left(W^\top S\,W\right)\quad\text{s.t.}\quad W^\top W = I,$$

where $S = \frac{1}{n}X_c^\top X_c$ is the covariance matrix and $X_c = X-\bar{X}$ is the centered data. By the Rayleigh Quotient, the optimum is attained by the first $k$ largest eigenvectors of $S$.

Look at the size of $S$: it is $d\times d$, and for $100\times100$ face images $d = 10000$, so this eigen-problem is expensive. Let $S' = \frac{1}{n}X_c X_c^\top \in \mathbb{R}^{n\times n}$. Then, solving $S'u = \lambda u$ yields the same non-zero eigenvalues, and $w = X_c^\top u$ (rescaled by $1/\sqrt{\lambda}$) is the corresponding eigenvector of $S$. So, we could solve the eigen-problem in $\mathbb{R}^{n\times n}$ instead of $\mathbb{R}^{d\times d}$. Then, we could project the data like $Z = XW$.
    def __get_covariance(self, datas):
        n = datas.shape[0]
        if self.is_kernel:
            ...
        else:
            # Trick: build the n x n matrix X_c X_c^T / n instead of
            # the d x d covariance; the d-dim eigenvectors are recovered in run()
            S = datas - np.sum(datas, axis=0) / n
            S = np.matmul(S, S.T) / n
            return S
    ...
    def run(self):
        S = self.__get_covariance(self.datas)
        e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
        mean_datas = self.datas - np.sum(
            self.datas, axis=0) / self.datas.shape[0]
        if self.is_kernel:
            ...
        else:
            # Recover the d-dimensional eigenvectors from the n x n solution
            W = np.matmul(mean_datas.T, e_vecs)
            W /= np.sqrt(e_vals)
            pca_space = np.matmul(self.datas, W)
        return pca_space, W
...
pca = PCA(train_faces, k=25)
train_space, W_train = pca.run()
test_space = np.matmul(test_faces, W_train)
...
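As a quick sanity check of the $n\times n$ eigen-trick described above, here is a minimal sketch on synthetic data (my own illustration, not part of the assignment code): the non-zero eigenvalues of $\frac{1}{n}X_cX_c^\top$ should coincide with the top eigenvalues of the usual $d\times d$ covariance.

```python
import numpy as np

# Synthetic check of the n x n eigen-trick: X_c X_c^T / n and X_c^T X_c / n
# share their non-zero eigenvalues, so the top-k spectra should match.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))            # n = 30 samples, d = 200 features
X_c = X - X.mean(axis=0)

cov_d = X_c.T @ X_c / X.shape[0]          # the usual d x d covariance
cov_n = X_c @ X_c.T / X.shape[0]          # the n x n trick matrix

top_d = np.sort(np.linalg.eigvalsh(cov_d))[-10:]
top_n = np.sort(np.linalg.eigvalsh(cov_n))[-10:]
print(np.allclose(top_d, top_n))          # True
```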
(PCA.py)

At first, map the data into a feature space $\phi(\cdot)$, and then compute the covariance just as PCA does:

$$S = \frac{1}{n}\sum_{i=1}^{n}\bigl(\phi(x_i)-\bar{\phi}\bigr)\bigl(\phi(x_i)-\bar{\phi}\bigr)^\top.$$

We want to solve $S\,w = \lambda w$. Assume $w = \sum_{i=1}^{n}\alpha_i\bigl(\phi(x_i)-\bar{\phi}\bigr)$; we could observe that the eigenvector is a linear combination of the (centered) mapped data (here $\alpha$ is the coefficient vector we actually solve for). Substituting this back, the problem becomes an eigen-problem of the centered kernel matrix:

$$K_c\,\alpha = n\lambda\,\alpha,\qquad K_c = K - \mathbf{1}_n K - K\mathbf{1}_n + \mathbf{1}_n K\mathbf{1}_n,$$

where $K_{ij} = \phi(x_i)^\top\phi(x_j)$ and $\mathbf{1}_n$ is the $n\times n$ matrix whose entries are all $1/n$. What `__get_covariance` computes in the kernel case is exactly this centered kernel $K_c$.
    def __get_covariance(self, datas):
        n = datas.shape[0]
        if self.is_kernel:
            # Center the kernel matrix in feature space:
            # K_c = K - 1_n K - K 1_n + 1_n K 1_n
            N1 = np.ones((n, n)) / n
            S = (datas - np.matmul(N1, datas) - np.matmul(datas, N1) +
                 np.matmul(np.matmul(N1, datas), N1))
            return S
        else:
            ...
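As a quick sanity check of this centering, here is a minimal sketch with a linear kernel on random data (not the face dataset): with $k(u,v) = u^\top v$, the centered kernel $K_c$ must equal $X_cX_c^\top$.

```python
import numpy as np

# With a linear kernel, centering K in feature space is the same as
# centering X explicitly and forming the Gram matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = X @ X.T                                   # linear kernel matrix
N1 = np.ones((20, 20)) / 20
K_c = K - N1 @ K - K @ N1 + N1 @ K @ N1       # feature-space centering
X_c = X - X.mean(axis=0)                      # explicit centering
print(np.allclose(K_c, X_c @ X_c.T))          # True
```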
How to project data?
    def run(self):
        S = self.__get_covariance(self.datas)
        e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
        mean_datas = self.datas - np.sum(
            self.datas, axis=0) / self.datas.shape[0]
        if self.is_kernel:
            # Rescale alpha so the feature-space eigenvectors have unit norm
            W = e_vecs
            W /= np.sqrt(e_vals)
            N1 = np.ones(self.datas.shape) / self.datas.shape[0]
            pca_space = np.matmul((self.datas - np.matmul(N1, self.datas)), W)
        else:
            ...
...
K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()
How about projecting new data
K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()
NM1 = np.ones(K_test_train.shape) / K_train.shape[0]
# Project the testing data onto the training coefficients (alpha)
test_space = np.matmul(K_test_train - np.matmul(NM1, K_train), alpha)
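As an optional cross-check (my own addition, not part of the assignment code), the same precomputed kernels could be fed to scikit-learn's KernelPCA; its coordinates should agree with the projection above up to the sign and scaling of each component.

```python
# Hypothetical cross-check with scikit-learn; components may differ from
# ours by sign and per-column scaling.
from sklearn.decomposition import KernelPCA

skpca = KernelPCA(n_components=25, kernel='precomputed')
sk_train_space = skpca.fit_transform(K_train)      # uses kernel(train, train)
sk_test_space = skpca.transform(K_test_train)      # uses kernel(test, train)
print(sk_train_space.shape, sk_test_space.shape)
```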
(LDA.py)

Unlike PCA, which projects data onto the eigen-feature space, LDA projects data to maximize the distance between different labels/clusters while minimizing the distance within each label/cluster. The projection matrix $W$ is found by solving

$$\max_{W}\ \frac{\left|W^\top S_B\,W\right|}{\left|W^\top S_W\,W\right|},$$

which leads to the generalized eigen-problem $S_W^{-1}S_B\,w = \lambda w$, where $S_B$ is the between-class scatter and $S_W$ is the within-class scatter. But $S_W$ may be singular. In our work, we therefore use the pseudo-inverse $S_W^{+}$ instead of $S_W^{-1}$.
import numpy as np


class LDA():
    def __init__(self, datas, labels, k, is_kernel=False):
        self.datas = datas
        self.n = datas.shape[0]
        self.labels = labels
        self.k = k
        self.is_kernel = is_kernel

    def __count_labels(self):
        # One-hot label matrix: C[i, j] = 1 if sample i belongs to class j
        C = np.zeros((self.n, len(np.unique(self.labels))))
        for idx, j in enumerate(np.unique(self.labels)):
            C[self.labels == j, idx] = 1
        return C

    def __get_Sb_Sw(self, C):
        # Class means (d x c) and overall mean (d,)
        Mj = np.matmul(self.datas.T, C) / np.sum(C, axis=0)
        M = np.sum(self.datas.T, axis=1) / self.datas.shape[0]
        # Between-class scatter
        B = Mj - M[:, None]
        Sb = np.matmul(B * np.sum(C, axis=0), B.T)
        # Within-class scatter (averaged within each group)
        W = self.datas.T - np.matmul(Mj, C.T)
        Sw = np.zeros(Sb.shape)
        for group in np.unique(self.labels):
            w = W[:, self.labels == group]
            Sw += (np.matmul(w, w.T) / w.shape[1])
        return Sb, Sw

    def __get_sorted_eigen(self, A, k):
        eigenvalues, eigenvectors = np.linalg.eigh(A)
        sorted_idx = np.flip(np.argsort(eigenvalues))
        sorted_eigenvalues = []
        sorted_eigenvectors = []
        for i in range(k):
            vector = eigenvectors[:, sorted_idx[i]]
            sorted_eigenvectors.append(vector[:, None])
            sorted_eigenvalues.append(eigenvalues[sorted_idx[i]])
        sorted_eigenvalues = np.array(sorted_eigenvalues)
        sorted_eigenvectors = np.concatenate(sorted_eigenvectors, axis=1)
        return sorted_eigenvalues, sorted_eigenvectors

    def run(self):
        C = self.__count_labels()
        Sb, Sw = self.__get_Sb_Sw(C)
        # Sw may be singular, so use the pseudo-inverse
        Sw_inv = np.linalg.pinv(Sw)
        obj_matrix = np.matmul(Sw_inv, Sb)
        value, vector = self.__get_sorted_eigen(obj_matrix, self.k)
        return np.matmul(self.datas, vector), vector
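A hypothetical usage sketch, mirroring the PCA example above (it assumes train_labels has been converted to a NumPy array so that the boolean masks inside LDA work):

```python
...
lda = LDA(train_faces, np.array(train_labels), k=25)
train_space, W_train = lda.run()
# Project the testing data with the same projection matrix
test_space = np.matmul(test_faces, W_train)
...
```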
(LDA.py)

Reference: Wikipedia. But I could not map the testing data into the training space correctly before the deadline, so my face-recognition accuracy here is low…
Each test sample is assigned to the cluster/label that has the largest number of members among its k nearest neighbors.
import numpy as np


class KNN():
    def __init__(self, train_datas, test_datas, labels, k=5):
        self.train_datas = train_datas
        self.test_datas = test_datas
        self.labels = labels
        self.k = k

    def __euclidean(self, U, V):
        # Pairwise squared Euclidean distances between rows of U and rows of V
        return np.matmul(U**2, np.ones(
            (U.shape[1], V.shape[0]))) - 2 * np.matmul(U, V.T) + np.matmul(
                np.ones((U.shape[0], V.shape[1])), (V.T)**2)

    def run(self):
        dist = self.__euclidean(self.test_datas, self.train_datas)
        closest = np.argsort(dist, axis=1)
        # Collect the labels of the k nearest training samples
        y = []
        for i in range(self.test_datas.shape[0]):
            for j in range(self.k):
                y.append(self.labels[closest[i][j]])
        y = np.array(y)
        y = y.reshape(self.test_datas.shape[0], self.k)
        # Majority vote among the k neighbors
        count = []
        for i in range(len(self.labels)):
            count.append(np.count_nonzero(y == self.labels[i], axis=1))
        count = np.vstack(count)
        y = np.argmax(count, axis=0)
        predict = [None] * self.test_datas.shape[0]
        for i in range(self.test_datas.shape[0]):
            predict[i] = self.labels[y[i]]
        return predict
...
knn = KNN(train_space, test_space, train_labels, k=5)
predict = knn.run()
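The accuracy numbers reported below could then be computed with a few lines like the following (a sketch; `test_labels` is assumed to come from `read_faces` on the testing folder, since main.py itself is not shown in this report):

```python
correct = sum(p == t for p, t in zip(predict, test_labels))
print('Face-recognition accuracy: %d/%d = %.2f%%' %
      (correct, len(test_labels), 100.0 * correct / len(test_labels)))
```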
(util.py)

I use three different kernels: linear, polynomial, and RBF.
import numpy as np


def euclidean(u, v):
    # Pairwise squared Euclidean distances between rows of u and rows of v
    return np.matmul(u**2, np.ones(
        (u.shape[1], v.shape[0]))) - 2 * np.matmul(u, v.T) + np.matmul(
            np.ones((u.shape[0], v.shape[1])), (v.T)**2)


def rbf(u, v, g=10**-11):
    return np.exp(-1 * g * euclidean(u, v))


def linear(u, v):
    return np.matmul(u, v.T)


def polynomial(u, v, g=0.7, coef0=10, d=5):
    return ((g * np.matmul(u, v.T)) + coef0)**d
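A quick check of these helpers on random data (illustrative only): an RBF kernel matrix should be symmetric with ones on its diagonal.

```python
import numpy as np

u = np.random.randn(4, 3)
K = rbf(u, u)
print(K.shape)                        # (4, 4)
print(np.allclose(np.diag(K), 1.0))   # rbf(x, x) = exp(0) = 1
print(np.allclose(K, K.T))            # kernel matrices are symmetric
```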
(util.py)

The read_faces method reads the dataset and resizes each image to 100*100. The show_faces method concatenates the face matrices into a grid of columns and rows for visualization.
import os

import numpy as np
import PIL.Image
import matplotlib.pyplot as plt


def read_faces(dir_path):
    faces = []
    labels = []
    for filename in os.listdir(dir_path):
        with PIL.Image.open(dir_path + filename) as im:
            im = im.resize((100, 100), PIL.Image.BILINEAR)
            faces.append([np.array(im).reshape(1, -1)])
            labels.append(filename.split('.', 1)[0])
    faces = np.concatenate(faces, axis=0)
    faces = faces.reshape(faces.shape[0], faces.shape[2])
    faces = faces.astype('int64')
    return faces, labels


def show_faces(faces, filename=None, col=5):
    # f = faces.reshape(-1, 231, 195)
    f = faces.reshape(-1, 100, 100)
    n = f.shape[0]
    all_faces = []
    for i in range(int(n / col)):
        all_faces.append([np.concatenate(f[col * i:col * (i + 1)], axis=1)])
    all_faces = np.concatenate(all_faces[:], axis=1)
    all_faces = all_faces.reshape(all_faces.shape[1], all_faces.shape[2])
    plt.figure(figsize=(1.5 * col, 1.5 * n / col))
    plt.title(filename)
    plt.imshow(all_faces, cmap='gray')
    plt.subplots_adjust(left=0.05, right=0.95, top=0.85, bottom=0.15)
    if (filename):
        plt.savefig('./output/' + filename)
    else:
        plt.show()
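A hypothetical usage sketch (the directory names here are my assumption, not taken from the report):

```python
# Hypothetical dataset directories; show_faces saves into ./output/
train_faces, train_labels = read_faces('./Training/')
test_faces, test_labels = read_faces('./Testing/')
# Show the first 25 training faces in a 5 x 5 grid
show_faces(train_faces[:25], filename='first_25_faces.png', col=5)
```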
The first 25 eigenfaces and fisherfaces:

Reconstruction of 10 random faces:

Face recognition via PCA/KPCA and LDA/KLDA:
(base) swchiu@gpuserv1:~/Embedding$ python3 main.py
PCA
Face-reconition accuracy: 27/30 = 90.00%
Kernel PCA
(rbf) Face-reconition accuracy: 27/30 = 90.00%
(polynomial) Face-reconition accuracy: 25/30 = 83.33%
(linear) Face-reconition accuracy: 27/30 = 90.00%
LDA
Face-reconition accuracy: 28/30 = 93.33%
Kernel LDA
(rbf) Face-reconition accuracy: 24/30 = 80.00%
(polynomial) Face-reconition accuracy: 18/30 = 60.00%
(linear) Face-reconition accuracy: 18/30 = 60.00%
LDA performs better than PCA. I think the reason is that LDA, by design, spreads the different classes apart, while PCA only maps the data so as to preserve the maximum variance. In this case we use KNN, which is based on Euclidean distance, to classify the data, and LDA separates the classes widely in Euclidean distance.
RBF and linear kernels look similar in KPCA, and RBF performs best in KLDA. The polynomial kernel performs worst in both KPCA and KLDA.
But the KLDA results did not meet my expectation. I think the problem is that I map the testing data in the wrong way. I tried to derive the formula on my own, but could not make sense of it before the deadline.
Referenced from https://lvdmaaten.github.io/tsne/code/tsne_python.zip and modified from it.
t-SNE uses a different distribution on the reduced (low-dimensional) data, i.e., the Student t-distribution, which has heavier tails than the Gaussian. Because t-SNE uses this heavier-tailed distribution, the crowding problem is relieved: to keep the same similarity, points that are moderately close in the high-dimensional space are placed farther apart in the low-dimensional space. The crowding problem that symmetric SNE faces is shown in the result part below.
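A tiny numeric illustration of this heavy-tail effect (comparing only the unnormalized similarities used by symmetric SNE and t-SNE, my own example):

```python
import numpy as np

# Unnormalized low-dimensional similarities: Gaussian exp(-d^2) for
# symmetric SNE versus Student t 1 / (1 + d^2) for t-SNE.
for d in [0.5, 1.0, 2.0, 4.0]:
    print('d = %.1f   gaussian = %.4f   student-t = %.4f' %
          (d, np.exp(-d**2), 1.0 / (1.0 + d**2)))
# The Student t similarity decays much more slowly, so moderately distant
# points keep a noticeable similarity and get pushed farther apart in the
# embedding, which relieves crowding.
```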
SNE's concept is that it wants to preserve the probabilistic distribution of pairwise similarities after the reduction. Look at the definitions of the low-dimensional probabilities in symmetric SNE and t-SNE:

$$q_{ij}^{\text{sym-SNE}} = \frac{\exp\!\left(-\lVert y_i-y_j\rVert^2\right)}{\sum_{k\neq l}\exp\!\left(-\lVert y_k-y_l\rVert^2\right)},\qquad q_{ij}^{\text{t-SNE}} = \frac{\left(1+\lVert y_i-y_j\rVert^2\right)^{-1}}{\sum_{k\neq l}\left(1+\lVert y_k-y_l\rVert^2\right)^{-1}}.$$

We could see that the only difference is that t-SNE uses a different (Student t) distribution to describe the low-dimensional similarities. The gradient is also different because of the different $q_{ij}$:

$$\frac{\partial C}{\partial y_i}\Big|_{\text{sym-SNE}} = 4\sum_{j}\left(p_{ij}-q_{ij}\right)\left(y_i-y_j\right),\qquad \frac{\partial C}{\partial y_i}\Big|_{\text{t-SNE}} = 4\sum_{j}\left(p_{ij}-q_{ij}\right)\left(y_i-y_j\right)\left(1+\lVert y_i-y_j\rVert^2\right)^{-1}.$$

So, we only need to find the specific code segments related to these two parts and modify them into the symmetric SNE form.

Looking at the two distributions, we could observe that when the data are close in the high-dimensional space, they will not be sparse in the low-dimensional space either. But if the data are sparse (far apart) in the high-dimensional space, they might end up close in the low-dimensional space; this is the crowding problem.
The first part is shown below (lines 14 to 15):
...
        # Compute pairwise affinities (Student t kernel for t-SNE)
        sum_Y = np.sum(np.square(Y), 1)
        num = -2. * np.dot(Y, Y.T)
        num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
        num[range(n), range(n)] = 0.
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)
        if sym_sne:
            sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
            num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
            # difference from t-SNE: Gaussian similarities instead of Student t
            num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
            num_ssne = np.exp(-1 * num_ssne)
            # end of the difference from t-SNE
            num_ssne[range(n), range(n)] = 0.
            Q_ssne = num_ssne / np.sum(num_ssne)
            Q_ssne = np.maximum(Q_ssne, 1e-12)
...
And the second part is at lines 9 to 10:
...
            dY[i, :] = np.sum(
                np.tile(PQ[:, i] * num[:, i], (no_dims, 1)).T * (Y[i, :] - Y),
                0)
            if sym_sne:
                # difference from t-SNE: no (1 + ||y_i - y_j||^2)^-1 factor
                dY_ssne[i, :] = np.sum(
                    np.tile(PQ_ssne[:, i],
                            (no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)
...
I modified the code so that it performs both t-SNE and symmetric SNE with a single call to the tsne() method, so I duplicate some of the t-SNE variables to run symmetric SNE at the same time. I also store Y into a list every ten iterations for visualization. In addition, I implemented some useful methods for visualization (e.g., make_gif(...), show_similarity(...)).
import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera


def make_gif(record, labels, method, perplexity):
    camera = Camera(plt.figure())
    plt.title(method + ' with Perplexity=' + str(perplexity))
    for i in range(len(record)):
        img = plt.scatter(record[i][:, 0], record[i][:, 1], 20, labels)
        camera.snap()
    anim = camera.animate(interval=5, repeat_delay=20)
    anim.save(
        'output/' + method + '_' + str(perplexity) + '.gif', writer='pillow')
    plt.scatter(record[-1][:, 0], record[-1][:, 1], 20, labels)
    plt.savefig('output/' + method + '_' + str(perplexity) + '.png')


def show_similarity(S, labels, title, filename, perplexity):
    n = len(S)
    # Group samples by label so that the block structure is visible
    sort_idx = np.concatenate(
        [np.where(labels == l)[0] for l in np.unique(labels)])
    plt.figure(figsize=(10 * n, 7.5))
    S = [np.log(p[:, sort_idx][sort_idx, :]) for p in S]
    all_min = min([np.min(p) for p in S])
    all_max = max([np.max(p) for p in S])
    for i in range(n):
        plt.subplot(1, n, i + 1)
        plt.title(title[i])
        im = plt.imshow(S[i], cmap='gray', vmin=all_min, vmax=all_max)
        plt.colorbar(im)
    plt.savefig('output/' + filename + '_' + str(perplexity) + '.png')


def gradient_descend(iter, Y, iY, dY, gains, min_gain, momentum, eta):
    gains = (gains + 0.2) * ((dY > 0.) != (iY > 0.)) + (gains * 0.8) * (
        (dY > 0.) == (iY > 0.))
    gains[gains < min_gain] = min_gain
    iY = momentum * iY - eta * (gains * dY)
    Y = Y + iY
    Y = Y - np.tile(np.mean(Y, 0), (Y.shape[0], 1))
    return Y, iY, gains


def tsne(
        X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0,
        sym_sne=False):
    """
    Runs t-SNE on the dataset in the NxD array X to reduce its
    dimensionality to no_dims dimensions. The syntax of the function is
    `Y = tsne.tsne(X, no_dims, perplexity)`, where X is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array X should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # Initialize variables
    (n, d) = X.shape
    max_iter = 1000
    initial_momentum = 0.5
    final_momentum = 0.8
    eta = 500
    min_gain = 0.01
    record = []
    Y = np.random.randn(n, no_dims)
    dY = np.zeros((n, no_dims))
    iY = np.zeros((n, no_dims))
    gains = np.ones((n, no_dims))
    if sym_sne:
        # Duplicate state so symmetric SNE runs alongside t-SNE
        record_ssne = []
        Y_ssne = Y.copy()
        dY_ssne = np.zeros((n, no_dims))
        iY_ssne = np.zeros((n, no_dims))
        gains_ssne = np.ones((n, no_dims))
        Q_ssne = np.zeros((n, n))

    # Compute P-values (x2p comes from the original tsne.py, unchanged)
    P = x2p(X, 1e-5, perplexity)
    P = P + np.transpose(P)
    P = P / np.sum(P)
    P = P * 4.  # early exaggeration
    P = np.maximum(P, 1e-12)

    # Run iterations
    for iter in range(max_iter):

        # Compute pairwise affinities (Student t kernel for t-SNE)
        sum_Y = np.sum(np.square(Y), 1)
        num = -2. * np.dot(Y, Y.T)
        num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
        num[range(n), range(n)] = 0.
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)
        if sym_sne:
            sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
            num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
            # difference from t-SNE: Gaussian similarities instead of Student t
            num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
            num_ssne = np.exp(-1 * num_ssne)
            # end of the difference from t-SNE
            num_ssne[range(n), range(n)] = 0.
            Q_ssne = num_ssne / np.sum(num_ssne)
            Q_ssne = np.maximum(Q_ssne, 1e-12)

        # Compute gradient
        PQ = P - Q
        if sym_sne:
            PQ_ssne = P - Q_ssne
        for i in range(n):
            dY[i, :] = np.sum(
                np.tile(PQ[:, i] * num[:, i], (no_dims, 1)).T * (Y[i, :] - Y),
                0)
            if sym_sne:
                # difference from t-SNE: no (1 + ||y_i - y_j||^2)^-1 factor
                dY_ssne[i, :] = np.sum(
                    np.tile(PQ_ssne[:, i],
                            (no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)

        # Perform the update
        if iter < 20:
            momentum = initial_momentum
        else:
            momentum = final_momentum
        Y, iY, gains = gradient_descend(iter, Y, iY, dY, gains, min_gain,
                                        momentum, eta)
        if sym_sne:
            Y_ssne, iY_ssne, gains_ssne = gradient_descend(
                iter, Y_ssne, iY_ssne, dY_ssne, gains_ssne, min_gain, momentum,
                eta)

        # Compute current value of cost function
        if (iter + 1) % 10 == 0:
            C = np.sum(P * np.log(P / Q))
            record.append(Y)
            if sym_sne:
                C_ssne = np.sum(P * np.log(P / Q_ssne))
                record_ssne.append(Y_ssne)
                print("Iter %4d: tsne error = %f, sym-sne error = %f" %
                      (iter + 1, C, C_ssne))
            else:
                print("Iter %4d: tsne error = %f" % (iter + 1, C))

        # Stop lying about P-values
        if iter == 100:
            P = P / 4.

    # Return solution
    if not sym_sne:
        return record, P, Q
    else:
        return (record, record_ssne), P, (Q, Q_ssne)
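A hypothetical driver putting these pieces together (the MNIST sample files are the ones shipped with the original tsne_python.zip, and the perplexity value is just an example):

```python
# Assumed input files from the original tsne_python.zip demo
X = np.loadtxt("mnist2500_X.txt")
labels = np.loadtxt("mnist2500_labels.txt")

(rec_tsne, rec_ssne), P, (Q, Q_ssne) = tsne(X, 2, 50, 64.0, sym_sne=True)
make_gif(rec_tsne, labels, 't-sne', 64.0)
make_gif(rec_ssne, labels, 'sym-sne', 64.0)
show_similarity([P, Q, Q_ssne], labels,
                ['High-dim P', 'Low-dim Q (t-SNE)', 'Low-dim Q (sym-SNE)'],
                'similarity', 64.0)
```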
Visualize the embedding of both t-SNE and symmetric SNE:

t-SNE with perplexity=64

symmetric SNE with perplexity=64

Visualize the distribution of pairwise similarities in both the high-dimensional space and the low-dimensional space, based on both t-SNE and symmetric SNE (P, and Q of t-SNE and symmetric SNE, respectively):

We could see that t-SNE's Q is lighter. This is because the Student t-distribution has heavier tails: when the distance is small, its probability is lower than the Gaussian's.
Play with different perplexity values:

We could see that when the perplexity becomes large, t-SNE becomes more similar to symmetric SNE. I think this is related to entropy: a large perplexity implies a large entropy, so the original distribution is closer to uniform. Conversely, a small perplexity implies a small entropy, and the original distribution is closer to a delta distribution.
And t-SNE has a hard time telling points apart when the distribution is close to uniform, because the high-dimensional data all look similar and the result becomes crowded.

For the small-perplexity case, I observe that t-SNE seems to split some labels into several groups. I think this is because the high-dimensional data become more dissimilar, since each point only considers very few neighbors.