# 2020 Machine Learning Homework 7
School: National Chiao Tung University
Name: Shao-Wei ( Willy ) Chiu
###### tags: `NCTU`, `Machine Learning`, `Embedding`, `PCA`, `LDA`, `SNE`, `t-SNE`, `GitHub`
> Note: This report was made on `Hackmd.io`; due to the restrictions of the `.pdf` format, the `.gif` animations will not display. Please view it at [https://hackmd.io/@swchiu/SJQehejA8](https://hackmd.io/@swchiu/SJQehejA8), thanks.
## PCA / LDA
When a dataset's feature space grows to very high dimensions, some features become redundant. We would rather extract the more important features of the dataset.
Let
$$
x_i = \begin{bmatrix}x_{i1} & x_{i2} & ... & x_{id} \end{bmatrix} \\
X_{n\times d} = \begin{bmatrix}x_1 \\x_2 \\... \\x_n \end{bmatrix}
$$
### PCA (`PCA.py`)
`PCA` is a method to find a projection matrix $W_{d\times k}$ which projects the data $X_{n\times d}$ into $Z_{n\times k}$, i.e.,
$$
Z_{n\times k} = X_{n\times d}W_{d\times k}
$$
such that $Z_{n\times k}$ has maximal variance, which also implies that $Z_{n\times k}$ has minimal squared reconstruction error with respect to $X_{n\times d}$:
$$
\begin{aligned}
\underset{W}{argmax}(\sigma^2(Z)) &= \frac{1}{n}\sum(z_i-\overline z)^T(z_i-\overline z) \\
&= \frac{1}{n}\sum(x_iW-\overline xW)^T(x_iW-\overline xW) \\
&= \frac{1}{n}\sum((x_i-\overline x)W)^T((x_i-\overline x)W) \\
&= W^T\Big(\frac{1}{n}\sum(x_i-\overline x)^T(x_i-\overline x)\Big)W \\
&= W^TSW\ (\ its\ diagonal\ entries\ are\ the\ projected\ variances\ )
\end{aligned}
$$
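To see why maximizing the projected variance also minimizes the squared reconstruction error, note that for a $W$ with orthonormal columns the total variance splits into the variance kept by the projection plus the reconstruction error, and the left-hand side is a constant of the data:
$$
\frac{1}{n}\sum||x_i-\overline x||^2
= \frac{1}{n}\sum||(x_i-\overline x)W||^2
+ \frac{1}{n}\sum||(x_i-\overline x)-(x_i-\overline x)WW^T||^2
$$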
By the `Rayleigh quotient`, the eigenvectors of $S$ corresponding to its `k` largest eigenvalues form our $W$ ( i.e., we solve $Sv = \lambda v$ ).
However, solving the eigen-problem of the $d\times d$ matrix $S$ is time-consuming when $d$ is very large, so we use a little trick.
Let
$$
\begin{aligned}
\phi(X) &= X-\overline X \\
C_{n\times n} &= \frac{1}{n}(X-\overline X)(X-\overline X)^T = \frac{1}{n}\phi(X)\phi(X)^T\\
S_{d\times d} &= \frac{1}{n}(X-\overline X)^T(X-\overline X) = \frac{1}{n}\phi(X)^T\phi(X)
\end{aligned}
$$
Then, solving
$$
\begin{aligned}
Cv &= \lambda v \\
\phi(X)^TCv &= \lambda \phi(X)^Tv \\
\phi(X)^T\frac{1}{n}\phi(X)\phi(X)^Tv &= \lambda \phi(X)^Tv \\
(\frac{1}{n}\phi(X)^T\phi(X))\phi(X)^Tv &= \lambda \phi(X)^Tv \\
S(\phi(X)^Tv) &= \lambda (\phi(X)^Tv) \\
SW &= \lambda W
\end{aligned}
$$
So we can solve an eigen-problem of size $n\times n$ instead and recover $W = \phi(X)^TV = (X-\overline X)^TV$.
Then we can project data as $XW$ and $X_{test}W$.
```python
def __get_covariance(self, datas):
n = datas.shape[0]
if self.is_kernel:
...
else:
# This part use a little trick
S = datas - np.sum(datas, axis=0) / n
S = np.matmul(S, S.T) / n
return S
...
def run(self):
S = self.__get_covariance(self.datas)
e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
mean_datas = self.datas - np.sum(
self.datas, axis=0) / self.datas.shape[0]
if self.is_kernel:
...
else:
W = np.matmul(mean_datas.T, e_vecs)
W /= np.sqrt(e_vals)
pca_space = np.matmul(self.datas, W)
return pca_space, W
```
```python
...
pca = PCA(train_faces, k=25)
train_space, W_train = pca.run()
test_space = np.matmul(test_faces, W_train)
...
```
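The `__get_sorted_eigen` helper used above is elided; a minimal sketch of what it has to do (essentially the same as the version shown later inside the `LDA` class) is:
```python
def __get_sorted_eigen(self, A, k):
    # A is symmetric here, so eigh applies; keep the k largest eigen-pairs
    e_vals, e_vecs = np.linalg.eigh(A)
    idx = np.flip(np.argsort(e_vals))[:k]
    return e_vals[idx], e_vecs[:, idx]
```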
### kernel PCA (`PCA.py`)
First, map the data into a feature space,
$$
X_{n\times d} \rightarrow \phi(X)_{n\times m}
$$
and then calculate the covariance as `PCA` does:
$$
\begin{aligned}
cov(\phi(X)) &= \frac{1}{n}(\phi(X)-\frac{1}{n}\sum\phi(X))^T(\phi(X)-\frac{1}{n}\sum\phi(X)), (\ let\ \Psi(X) = \phi(X)-\frac{1}{n}\sum\phi(X))\\
&= \frac{1}{n}\Psi(X)^T\Psi(X) \\
&= S
\end{aligned}
$$
We want to solve $SW=\lambda W$, but $\Psi(X)$ is not explicit.
==Assume $X$ is $d\times n$==,
$$
\begin{aligned}
Sw &= \lambda w \\
\frac{1}{n}XX^Tw &= \lambda w \\
\frac{1}{n\lambda} \sum x_i(x_i^Tw) &= w \\
\sum[\frac{1}{n\lambda}(x_i^Tw)]x_i &= w \\
\sum \alpha_i x_i&= w \\
XA &= w
\end{aligned}
$$
We can observe that the eigenvector is a linear combination of the rows of $X$, so we can rewrite the eigen-problem as follows:
==( here $X$ is $n\times d$ again )==
$$
\begin{aligned}
\frac{1}{n}\Psi(X)^T\Psi(X)w &= \lambda w \\
\frac{1}{n}\Psi(X)^T\Psi(X)\Psi(X)^TA &= \lambda \Psi(X)^TA \\
\frac{1}{n}\Psi(X)\Psi(X)^T\Psi(X)\Psi(X)^TA &= \lambda \Psi(X)\Psi(X)^TA \\
\frac{1}{n}K^cK^cA &= \lambda K^cA \\
\frac{1}{n}K^cA &= \lambda A
\end{aligned}
$$
What is $K^c$ ?
$$
\begin{aligned}
let\ K &= \phi(X)\phi(X)^T \\
K^c &= \Psi(X)\Psi(X)^T \\
&= (\phi(X)-\frac{1}{n}\sum\phi(X))(\phi(X)-\frac{1}{n}\sum\phi(X))^T \\
&= \phi(X)\phi(X)^T-(\frac{1}{n}\sum\phi(X))\phi(X)^T-\phi(X)(\frac{1}{n}\sum\phi(X)^T)+\frac{1}{n^2}\sum\phi(X)\phi(X)^T \\
&= K-K1_n-1_nK+1_nK1_n,\ (\ 1_n\ is\ a\ n\times n\ matrix\ whose\ elements\ are\ all\ \frac{1}{n}\ )
\end{aligned}
$$
```python
def __get_covariance(self, datas):
n = datas.shape[0]
if self.is_kernel:
N1 = np.ones((n, n)) / n
S = (datas - np.matmul(N1, datas) - np.matmul(datas, N1) +
np.matmul(np.matmul(N1, datas), N1))
return S
else:
...
```
How to project data?
$$
\begin{aligned}
\phi(X)W &= \phi(X)\Psi(X)^TA \\
&= \phi(X)(\phi(X)-\frac{1}{n}\sum\phi(X))^TA \\
&= (\phi(X)\phi(X)^T-\phi(X)(\frac{1}{n}\sum\phi(X)^T))A \\
&= (K-1_nK)A
\end{aligned}
$$
```python
def run(self):
S = self.__get_covariance(self.datas)
e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
mean_datas = self.datas - np.sum(
self.datas, axis=0) / self.datas.shape[0]
if self.is_kernel:
W = e_vecs
W /= np.sqrt(e_vals)
N1 = np.ones(self.datas.shape) / self.datas.shape[0]
pca_space = np.matmul((self.datas - np.matmul(N1, self.datas)), W)
else:
...
...
K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()
```
How about projecting new data $X_{test}$ ?
$$
\begin{aligned}
\phi(X_{test})W &= \phi(X_{test})\Psi(X)^TA \\
&= \phi(X_{test})(\phi(X)-\frac{1}{n}\sum\phi(X))^TA \\
&= (\phi(X_{test})\phi(X)^T-\phi(X_{test})(\frac{1}{n}\sum\phi(X)^T))A \\
&= (K_{test\_train}-1_{m\times n}K)A,\ (\ 1_{m\times n}\ is\ an\ m\times n\ matrix\ whose\ elements\ are\ all\ \frac{1}{n}\ )
\end{aligned}
$$
``` python
K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()
NM1 = np.ones(K_test_train.shape) / K_train.shape[0]
### Project testing data at next line
test_space = np.matmul(K_test_train - np.matmul(NM1, K_train), alpha)
```
### LDA (`LDA.py`)
Unlike `PCA`, which projects data onto the eigen-feature space, `LDA` projects data so as to maximize the distance between different labels/clusters and minimize the distance within each label/cluster.
We solve for the projection matrix $W$ via the eigen-problem below:
$$
S_W^{-1}S_Bw=\lambda w, \ where\\
S_W = \sum_{j=1}^k S_j,\ S_j = \sum_{i \in C_j} (x_i-m_j)(x_i-m_j)^T,\ m_j = \frac{1}{n_j}\sum_{i \in C_j}x_i \\
S_B = \sum_{j=1}^k S_{B_{j}} = \sum_{j=1}^k n_j(m_j-m)(m_j-m)^T
$$
But $S_W$ might be singular when $n < d$, so we can use its pseudo-inverse instead.
In our work, $n$ is indeed smaller than $d$, so I compute the pseudo-inverse directly.
```python
import numpy as np

class LDA():
def __init__(self, datas, labels, k, is_kernel=False):
self.datas = datas
self.n = datas.shape[0]
self.labels = labels
self.k = k
self.is_kernel = is_kernel
def __count_labels(self):
C = np.zeros((self.n, len(np.unique(self.labels))))
for idx, j in enumerate(np.unique(self.labels)):
C[self.labels == j, idx] = 1
return C
def __get_Sb_Sw(self, C):
Mj = np.matmul(self.datas.T, C) / np.sum(C, axis=0)
M = np.sum(self.datas.T, axis=1) / self.datas.shape[0]
B = Mj - M[:, None]
Sb = np.matmul(B * np.sum(C, axis=0), B.T)
W = self.datas.T - np.matmul(Mj, C.T)
Sw = np.zeros(Sb.shape)
for group in np.unique(self.labels):
w = W[:, self.labels == group]
Sw += (np.matmul(w, w.T) / w.shape[1])
return Sb, Sw
def __get_sorted_eigen(self, A, k):
eigenvalues, eigenvectors = np.linalg.eigh(A)
sorted_idx = np.flip(np.argsort(eigenvalues))
sorted_eigenvalues = []
sorted_eigenvectors = []
for i in range(k):
vector = eigenvectors[:, sorted_idx[i]]
sorted_eigenvectors.append(vector[:, None])
sorted_eigenvalues.append(eigenvalues[sorted_idx[i]])
sorted_eigenvalues = np.array(sorted_eigenvalues)
sorted_eigenvectors = np.concatenate(sorted_eigenvectors, axis=1)
return sorted_eigenvalues, sorted_eigenvectors
def run(self):
C = self.__count_labels()
Sb, Sw = self.__get_Sb_Sw(C)
#try:
# Sw_inv = np.linalg.inv(Sw)
#except:
# Sw_inv = np.linalg.pinv(Sw)
Sw_inv = np.linalg.pinv(Sw)
obj_matrix = np.matmul(Sw_inv, Sb)
value, vector = self.__get_sorted_eigen(obj_matrix, self.k)
return np.matmul(self.datas, vector), vector
```
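For completeness, a usage sketch analogous to the `PCA` one above (assuming the same `train_faces`, `test_faces`, and `train_labels` variables):
```python
lda = LDA(train_faces, train_labels, k=25)
train_space, W_train = lda.run()
# project the testing data with the same fisherface basis
test_space = np.matmul(test_faces, W_train)
```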
### Kernel LDA (`LDA.py`)
Referenced from [wikipedia](https://en.wikipedia.org/wiki/Kernel_Fisher_discriminant_analysis).
> But I could not map the testing data into the training space correctly before the deadline, so my face-recognition accuracy with kernel LDA is low......
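For reference, a minimal multi-class sketch following the Wikipedia formulation (this is *not* the code I submitted; `K_train`, `K_test_train`, and `train_labels` are assumed to be the same variables as in the KPCA snippets above):
```python
import numpy as np

def kernel_lda(K_train, train_labels, k):
    n = K_train.shape[0]
    M = np.zeros((n, n))                     # between-class matrix
    N = np.zeros((n, n))                     # within-class matrix
    M_star = np.sum(K_train, axis=1) / n     # mean kernel column over all samples
    for c in np.unique(train_labels):
        idx = np.where(np.asarray(train_labels) == c)[0]
        l_j = len(idx)
        K_j = K_train[:, idx]                # n x l_j kernel block of class j
        M_j = np.sum(K_j, axis=1) / l_j      # class-wise mean kernel column
        diff = (M_j - M_star)[:, None]
        M += l_j * np.matmul(diff, diff.T)
        center = np.eye(l_j) - np.ones((l_j, l_j)) / l_j
        N += K_j @ center @ K_j.T
    # N is usually singular, so take the pseudo-inverse as in the plain LDA code
    e_vals, e_vecs = np.linalg.eig(np.linalg.pinv(N) @ M)
    order = np.flip(np.argsort(e_vals.real))[:k]
    return e_vecs[:, order].real             # alpha, one column per component

# Projection then only needs kernel values against the training set:
# train_space = np.matmul(K_train, alpha)
# test_space  = np.matmul(K_test_train, alpha)
```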
### K-Nearest Neighbor
Each test point is assigned to the label that has the largest number of members among its k nearest training neighbors.
```python
import numpy as np
class KNN():
def __init__(self, train_datas, test_datas, labels, k=5):
self.train_datas = train_datas
self.test_datas = test_datas
self.labels = labels
self.k = k
def __eucidiance(self, U, V):
return np.matmul(U**2, np.ones(
(U.shape[1], V.shape[0]))) - 2 * np.matmul(U, V.T) + np.matmul(
np.ones((U.shape[0], V.shape[1])), (V.T)**2)
def run(self):
dist = self.__eucidiance(self.test_datas, self.train_datas)
closet = np.argsort(dist, axis=1)
y = []
for i in range(self.test_datas.shape[0]):
for j in range(self.k):
y.append(self.labels[closet[i][j]])
y = np.array(y)
y = y.reshape(self.test_datas.shape[0], self.k)
count = []
for i in range(len(self.labels)):
count.append(np.count_nonzero(y == self.labels[i], axis=1))
count = np.vstack(count)
y = np.argmax(count, axis=0)
predict = [None] * self.test_datas.shape[0]
for i in range(self.test_datas.shape[0]):
predict[i] = self.labels[y[i]]
return predict
...
knn = KNN(train_space, test_space, train_labels, k=5)
predict = knn.run()
```
### Kernel ( in `util.py` )
I use three different kernels: linear, polynomial, and RBF.
```python
import numpy as np

# Note: euclidean(u, v) returns the matrix of pairwise *squared* Euclidean distances
def euclidean(u, v):
return np.matmul(u**2, np.ones(
(u.shape[1], v.shape[0]))) - 2 * np.matmul(u, v.T) + np.matmul(
np.ones((u.shape[0], v.shape[1])), (v.T)**2)
def rbf(u, v, g=10**-11):
return np.exp(-1 * g * euclidean(u, v))
def linear(u, v):
return np.matmul(u, v.T)
def polynomial(u, v, g=0.7, coef0=10, d=5):
return ((g * np.matmul(u, v.T)) + coef0)**d
```
### Other Methods in My Implementation ( in `util.py` )
The `read_faces` method reads the dataset and resizes each image to `100*100`.
The `show_faces` method tiles the face images into a grid of columns and rows for visualization.
```python
import os
import PIL.Image
import numpy as np
import matplotlib.pyplot as plt

def read_faces(dir_path):
faces = []
labels = []
for filename in os.listdir(dir_path):
with PIL.Image.open(dir_path + filename) as im:
im = im.resize((100, 100), PIL.Image.BILINEAR)
faces.append([np.array(im).reshape(1, -1)])
labels.append(filename.split('.', 1)[0])
faces = np.concatenate(faces, axis=0)
faces = faces.reshape(faces.shape[0], faces.shape[2])
faces = faces.astype('int64')
return faces, labels
def show_faces(faces, filename=None, col=5):
#f = faces.reshape(-1,231,195)
f = faces.reshape(-1, 100, 100)
n = f.shape[0]
all_faces = []
for i in range(int(n / col)):
all_faces.append([np.concatenate(f[col * i:col * (i + 1)], axis=1)])
all_faces = np.concatenate(all_faces[:], axis=1)
all_faces = all_faces.reshape(all_faces.shape[1], all_faces.shape[2])
plt.figure(figsize=(1.5 * col, 1.5 * n / col))
plt.title(filename)
plt.imshow(all_faces, cmap='gray')
plt.subplots_adjust(left=0.05, right=0.95, top=0.85, bottom=0.15)
if (filename):
plt.savefig('./output/' + filename)
else:
plt.show()
```
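A usage sketch (the directory paths here are hypothetical; the actual dataset layout is not shown in this report):
```python
train_faces, train_labels = read_faces('./data/Training/')  # hypothetical path
test_faces, test_labels = read_faces('./data/Testing/')     # hypothetical path
show_faces(train_faces[:25], filename='training_faces.png')
```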
### Result and Discussion
1. The first 25 eigenfaces and fisherfaces


2. Reconstruct 10 random faces



3. & 4. Face recognition via PCA/KPCA and LDA/KLDA
``` terminal
(base) swchiu@gpuserv1:~/Embedding$ python3 main.py
PCA
Face-reconition accuracy: 27/30 = 90.00%
Kernel PCA
(rbf) Face-reconition accuracy: 27/30 = 90.00%
(polynomial) Face-reconition accuracy: 25/30 = 83.33%
(linear) Face-reconition accuracy: 27/30 = 90.00%
LDA
Face-reconition accuracy: 28/30 = 93.33%
Kernel LDA
(rbf) Face-reconition accuracy: 24/30 = 80.00%
(polynomial) Face-reconition accuracy: 18/30 = 60.00%
(linear) Face-reconition accuracy: 18/30 = 60.00%
```
`LDA` performs better than `PCA`. I think the reason is that `LDA` by design separates the classes more widely, while `PCA` only maps the data so as to preserve the maximum variance.
In this case we use `KNN`, which is based on Euclidean distance, to classify the data, and `LDA` separates the classes widely in Euclidean distance.
`RBF` and `linear` look similar in `KPCA`, and `RBF` performs best in `KLDA`.
`Polynomial` performs worst in both `KPCA` and `KLDA`.
However, the `KLDA` results did not meet my expectations. I think the problem is that I mapped the testing data in the wrong way; I tried to derive the formula on my own, but could not make sense of it before the deadline.
## t-SNE / Symmetric SNE
Referenced from [https://lvdmaaten.github.io/tsne/code/tsne_python.zip](https://lvdmaaten.github.io/tsne/code/tsne_python.zip) and modified from it.
### Modify t-SNE into symmetric SNE
`t-SNE` uses a different distribution on the low-dimensional data ( i.e., the `Student t-distribution`, which has heavier tails than the `Gaussian` ).
Because `t-SNE` uses this heavier-tailed distribution, the crowding problem is alleviated.

> We can see that for data that are closer together, `t-SNE` keeps a relatively larger distance in the embedding.
I will show the crowding problem that `symmetric SNE` suffers from in the results section below.
`SNE`'s concept is to preserve the probabilistic distribution of pairwise similarities after the reduction.
Look at the definitions of the probabilities in `symmetric SNE` and `t-SNE`:
$$
\begin{aligned}
symmetric\ SNE: \\
p_{ij} &= \frac{\exp(-||x_i - x_j||^2/(2\sigma^2))}{\sum_{k \neq l} \exp(-||x_l - x_k||^2/(2\sigma^2))} \\
q_{ij} &= \frac{\exp(-||y_i - y_j||^2)}{\sum_{k \neq l} \exp(-||y_l - y_k||^2)} \\
t-SNE: \\
p_{ij} &= \frac{\exp(-||x_i - x_j||^2/(2\sigma^2))}{\sum_{k \neq l} \exp(-||x_l - x_k||^2/(2\sigma^2))} \\
q_{ij} &= \frac{(1+||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1+||y_l - y_k||^2)^{-1}}
\end{aligned}
$$
We can see that the difference lies in the $q_{ij}$ part: `t-SNE` uses a different distribution to describe it.
The gradient derivation also differs because of the different $q_{ij}$:
$$
\begin{aligned}
C &= KL(P||Q)=\sum_{i}\sum_{j \neq i}p_{ij}\log\frac{p_{ij}}{q_{ij}}\\
symmetric\ SNE: \\ \\
\frac{\partial C}{\partial y_i} &= 2\sum_{j}(p_{ij}-q_{ij})(y_i-y_j) \\ \\
t-SNE: \\ \\
\frac{\partial C}{\partial y_i} &= 4\sum_{j}(p_{ij}-q_{ij})(y_i-y_j)(1+||y_i - y_j||^2)^{-1}
\end{aligned}
$$
So we only need to find the specific code segments related to those parts and modify them into the `symmetric SNE` form.
> Looking at $C$, we can observe that when points are close in the higher dimension, they will not end up sparse (far apart) in the lower dimension either, since a large $p_{ij}$ paired with a small $q_{ij}$ is heavily penalized.
> But if points are sparse in the higher dimension, they might end up close in the lower dimension!
> [name=Shao-Wei Chiu]
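As a toy numeric check of this asymmetry (not part of the homework code): a pair that is close in high dimension but placed far away in low dimension contributes a large term $p\log\frac{p}{q}$, while the reverse situation costs almost nothing.
```python
import numpy as np

def kl_term(p, q):
    # contribution of a single pair to KL(P||Q)
    return p * np.log(p / q)

print(kl_term(0.5, 0.01))   # close in high-D, far in low-D  -> ~1.96 (heavily penalized)
print(kl_term(0.01, 0.5))   # far in high-D, close in low-D  -> ~-0.04 (almost free)
```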
The first part of the modification is shown below (`line 14 to line 15`):
```python=1
...
# Compute pairwise affinities
sum_Y = np.sum(np.square(Y), 1)
num = -2. * np.dot(Y, Y.T)
num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
num[range(n), range(n)] = 0.
Q = num / np.sum(num)
Q = np.maximum(Q, 1e-12)
if sym_sne:
sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
# different between t-sne region
num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
num_ssne = np.exp(-1 * num_ssne)
# different between t-sne region
num_ssne[range(n), range(n)] = 0.
Q_ssne = num_ssne / np.sum(num_ssne)
Q_ssne = np.maximum(Q_ssne, 1e-12)
...
```
And the second part is at `line 9 to line 10`:
```python=
...
dY[i, :] = np.sum(
np.tile(PQ[:, i] * num[:, i], (no_dims, 1)).T * (Y[i, :] - Y),
0)
if sym_sne:
# different between t-sne
dY_ssne[i, :] = np.sum(
np.tile(PQ_ssne[:, i],
(no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)
...
```
I modified the code so that calling the `tsne()` method alone performs both `t-SNE` and `symmetric SNE`; to do so, I duplicate some of the `t-SNE` variables to run `symmetric SNE` at the same time.
I also store `Y` into a list every ten iterations for visualization.
In addition, I implemented some useful methods for visualization ( e.g., `make_gif(...)`, `show_similarity(...)` ).
```python
import numpy as np
import matplotlib.pyplot as plt
from celluloid import Camera  # Camera/snap/animate come from the celluloid package

def make_gif(record, labels, method, perplexity):
camera = Camera(plt.figure())
plt.title(method + ' with Perplexity=' + str(perplexity))
for i in range(len(record)):
img = plt.scatter(record[i][:, 0], record[i][:, 1], 20, labels)
camera.snap()
anim = camera.animate(interval=5, repeat_delay=20)
anim.save(
'output/' + method + '_' + str(perplexity) + '.gif', writer='pillow')
plt.scatter(record[-1][:, 0], record[-1][:, 1], 20, labels)
plt.savefig('output/' + method + '_' + str(perplexity) + '.png')
def show_similarity(S, labels, title, filename, perplexity):
n = len(S)
sort_idx = np.concatenate(
[np.where(labels == l)[0] for l in np.unique(labels)])
plt.figure(figsize=(10 * n, 7.5))
S = [np.log(p[:, sort_idx][sort_idx, :]) for p in S]
all_min = min([np.min(p) for p in S])
all_max = max([np.max(p) for p in S])
for i in range(n):
plt.subplot(1, n, i + 1)
plt.title(title[i])
im = plt.imshow(S[i], cmap='gray', vmin=all_min, vmax=all_max)
plt.colorbar(im)
plt.savefig('output/' + filename + '_' + str(perplexity) + '.png')
```
```python
import numpy as np

# x2p(...) used below comes from the reference implementation linked above;
# it computes the high-dimensional affinities P for the given perplexity.

def gradient_descend(iter, Y, iY, dY, gains, min_gain, momentum, eta):
gains = (gains + 0.2) * ((dY > 0.) != (iY > 0.)) + (gains * 0.8) * (
(dY > 0.) == (iY > 0.))
gains[gains < min_gain] = min_gain
iY = momentum * iY - eta * (gains * dY)
Y = Y + iY
Y = Y - np.tile(np.mean(Y, 0), (Y.shape[0], 1))
return Y, iY, gains
def tsne(
X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0,
sym_sne=False):
"""
Runs t-SNE on the dataset in the NxD array X to reduce its
dimensionality to no_dims dimensions. The syntaxis of the function is
`Y = tsne.tsne(X, no_dims, perplexity), where X is an NxD NumPy array.
"""
# Check inputs
if isinstance(no_dims, float):
print("Error: array X should have type float.")
return -1
if round(no_dims) != no_dims:
print("Error: number of dimensions should be an integer.")
return -1
# Initialize variables
(n, d) = X.shape
max_iter = 1000
initial_momentum = 0.5
final_momentum = 0.8
eta = 500
min_gain = 0.01
record = []
Y = np.random.randn(n, no_dims)
dY = np.zeros((n, no_dims))
iY = np.zeros((n, no_dims))
gains = np.ones((n, no_dims))
if sym_sne:
record_ssne = []
Y_ssne = Y.copy()
dY_ssne = np.zeros((n, no_dims))
iY_ssne = np.zeros((n, no_dims))
gains_ssne = np.ones((n, no_dims))
Q_ssne = np.zeros((n, n))
# Compute P-values
P = x2p(X, 1e-5, perplexity)
P = P + np.transpose(P)
P = P / np.sum(P)
P = P * 4. # early exaggeration
P = np.maximum(P, 1e-12)
# Run iterations
for iter in range(max_iter):
# Compute pairwise affinities
sum_Y = np.sum(np.square(Y), 1)
num = -2. * np.dot(Y, Y.T)
num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
num[range(n), range(n)] = 0.
Q = num / np.sum(num)
Q = np.maximum(Q, 1e-12)
if sym_sne:
sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
# different between t-sne region
num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
num_ssne = np.exp(-1 * num_ssne)
# different between t-sne region
num_ssne[range(n), range(n)] = 0.
Q_ssne = num_ssne / np.sum(num_ssne)
Q_ssne = np.maximum(Q_ssne, 1e-12)
# Compute gradient
PQ = P - Q
if sym_sne:
PQ_ssne = P - Q_ssne
for i in range(n):
dY[i, :] = np.sum(
np.tile(PQ[:, i] * num[:, i], (no_dims, 1)).T * (Y[i, :] - Y),
0)
if sym_sne:
# different between t-sne
dY_ssne[i, :] = np.sum(
np.tile(PQ_ssne[:, i],
(no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)
# Perform the update
if iter < 20:
momentum = initial_momentum
else:
momentum = final_momentum
Y, iY, gains = gradient_descend(iter, Y, iY, dY, gains, min_gain,
momentum, eta)
if sym_sne:
Y_ssne, iY_ssne, gains_ssne = gradient_descend(
iter, Y_ssne, iY_ssne, dY_ssne, gains_ssne, min_gain, momentum,
eta)
# Compute current value of cost function
if (iter + 1) % 10 == 0:
C = np.sum(P * np.log(P / Q))
record.append(Y)
if sym_sne:
C_ssne = np.sum(P * np.log(P / Q_ssne))
record_ssne.append(Y_ssne)
print("Iter %4d: tsne error = %f, sym-sne error = %f" %
(iter + 1, C, C_ssne))
else:
print("Iter %4d: tsne error = %f" % (iter + 1, C))
# Stop lying about P-values
if iter == 100:
P = P / 4.
# Return solution
if not sym_sne:
return record, P, Q
else:
return (record, record_ssne), P, (Q, Q_ssne)
```
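Putting it together, a usage sketch (the data files here are the MNIST subset that ships with the reference implementation; treat the exact file names as an assumption):
```python
import numpy as np

X = np.loadtxt("mnist2500_X.txt")            # assumed file names from the reference zip
labels = np.loadtxt("mnist2500_labels.txt")
perplexity = 64.0

# run t-SNE and symmetric SNE side by side, then render the outputs
(record, record_ssne), P, (Q, Q_ssne) = tsne(
    X, no_dims=2, initial_dims=50, perplexity=perplexity, sym_sne=True)
make_gif(record, labels, 'tsne', perplexity)
make_gif(record_ssne, labels, 'ssne', perplexity)
show_similarity([P, Q, Q_ssne], labels,
                ['P', 'Q (t-SNE)', 'Q (symmetric SNE)'],
                'similarity', perplexity)
```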
### Result and Discussion
1. Visualize the embedding of both `t-SNE` and `symmetric SNE`
* t-sne with perplexity=64

* sym-sne with perplexity=64

2. Visualize the distribution of pairwise similarities in both high-dimensional space and low-dimensional space, based on both `t-SNE` and `symmetric SNE`
( P, Q with `t-SNE`, `sym-SNE`, respectively )

> We can see that `t-SNE`'s map looks lighter; this is because the t-distribution is wider, so for small distances its probability is lower.
3. Play with different perplexity values

We can see that as the `perplexity` becomes larger, `t-SNE` becomes more similar to `symmetric SNE`; I think this is related to entropy.
A large `perplexity` implies a large entropy, so the original distribution is more uniform.
Conversely, a small `perplexity` implies a small entropy, and the original distribution is more like a delta distribution.
`t-SNE` has a hard time telling points apart under such a uniform distribution, because the high-dimensional data all look similar, so the result becomes crowded.
For the small-`perplexity` case, I observe that `t-SNE` seems to split some labels apart.
I think this is because the high-dimensional data become more dissimilar.
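For reference, the perplexity used above is defined from the Shannon entropy of each conditional distribution $P_i$ (as in the original t-SNE paper), which is why a larger perplexity corresponds to a flatter, more uniform $P_i$:
$$
Perp(P_i) = 2^{H(P_i)}, \quad H(P_i) = -\sum_j p_{j|i}\log_2 p_{j|i}
$$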