# 2020 Machine Learning Homework 7

School: National Chiao Tung University
Name: Shao-Wei ( Willy ) Chiu

###### tags: `NCTU`, `Machine Learning`, `Embedding`, `PCA`, `LDA`, `SNE`, `t-SNE`, `GitHub`

> Note: This report was made on `Hackmd.io`. Restricted by the `.pdf` format, the `.gif` animations will not display; please view the report at [https://hackmd.io/@swchiu/SJQehejA8](https://hackmd.io/@swchiu/SJQehejA8), thanks.

## PCA / LDA

When the feature space of a dataset grows to a very high dimension, some features become useless. We prefer to extract the more important features of the dataset. Let

$$
x_i = \begin{bmatrix}x_{i1} & x_{i2} & ... & x_{id} \end{bmatrix} \\
X_{n\times d} = \begin{bmatrix}x_1 \\ x_2 \\ ... \\ x_n \end{bmatrix}
$$

### PCA (`PCA.py`)

`PCA` is a method to find a projection matrix $W_{d\times k}$ that projects the data $X_{n\times d}$ into $Z_{n\times k}$, i.e.,

$$
Z_{n\times k} = X_{n\times d}W_{d\times k}
$$

and $W_{d\times k}$ is chosen so that $Z_{n\times k}$ has the maximal variance ( i.e., $\underset{W}{argmax}\ \sigma^2(Z)$ ), which also implies that $Z_{n\times k}$ has the minimal squared error with respect to $X_{n\times d}$:

$$
\begin{aligned}
\sigma^2(Z) &= \frac{1}{n}\sum(z_i-\overline Z)^T(z_i-\overline Z) \\
&= \frac{1}{n}\sum(x_iW-\overline xW)^T(x_iW-\overline xW) \\
&= \frac{1}{n}\sum((x_i-\overline x)W)^T((x_i-\overline x)W) \\
&= W^T\frac{1}{n}\sum((x_i-\overline x)^T(x_i-\overline x))W \\
&= W^TSW
\end{aligned}
$$

For $k > 1$, $W^TSW$ is a $k\times k$ matrix and we maximize its diagonal values ( the per-component variances ). By the `Rayleigh Quotient`, the eigenvectors of $S$ with the `k` largest eigenvalues form our $W$ ( i.e., we solve $Sv = \lambda v$ ).

Look at the $S_{d\times d}$ matrix: solving its eigen-problem is time-consuming when $d$ is very large, so we can do a little trick. Let

$$
\begin{aligned}
\phi(X) &= X-\overline X \\
C_{n\times n} &= \frac{1}{n}(X-\overline X)(X-\overline X)^T = \frac{1}{n}\phi(X)\phi(X)^T\\
S_{d\times d} &= \frac{1}{n}(X-\overline X)^T(X-\overline X) = \frac{1}{n}\phi(X)^T\phi(X)
\end{aligned}
$$

Then, solve

$$
\begin{aligned}
Cv &= \lambda v \\
\phi(X)^TCv &= \lambda \phi(X)^Tv \\
\phi(X)^T\frac{1}{n}\phi(X)\phi(X)^Tv &= \lambda \phi(X)^Tv \\
(\frac{1}{n}\phi(X)^T\phi(X))\phi(X)^Tv &= \lambda \phi(X)^Tv \\
S(\phi(X)^Tv) &= \lambda (\phi(X)^Tv) \\
SW &= \lambda W
\end{aligned}
$$

So we can solve the eigen-problem in the $n\times n$ dimension instead to get $W = \phi(X)^TV = (X-\overline X)^TV$. Then we can project the data as $XW$ and $X_{test}W$.

```python
def __get_covariance(self, datas):
    n = datas.shape[0]
    if self.is_kernel:
        ...
    else:
        # This part uses the little trick above: build the n x n matrix C
        S = datas - np.sum(datas, axis=0) / n
        S = np.matmul(S, S.T) / n
    return S

...

def run(self):
    S = self.__get_covariance(self.datas)
    e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
    mean_datas = self.datas - np.sum(
        self.datas, axis=0) / self.datas.shape[0]
    if self.is_kernel:
        ...
    else:
        W = np.matmul(mean_datas.T, e_vecs)
        W /= np.sqrt(e_vals)
    pca_space = np.matmul(self.datas, W)
    return pca_space, W
```

```python
...
pca = PCA(train_faces, k=25)
train_space, W_train = pca.run()
test_space = np.matmul(test_faces, W_train)
...
```
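The results section below also shows reconstructed faces. As a minimal sketch of one way to obtain such a reconstruction ( my own addition, not the actual homework code ), assuming `train_space` and `W_train` from the snippet above and the `show_faces` helper from `util.py` shown later: since `run()` rescales `W` by the eigenvalues, I use the pseudo-inverse rather than a plain transpose.

```python
# Hypothetical sketch: least-squares reconstruction of ten training faces from
# the 25-dimensional PCA space. W_train is not orthonormal here (run() divides
# it by sqrt(e_vals)), so use the pseudo-inverse of W_train instead of W_train.T.
reconstructed = np.matmul(train_space, np.linalg.pinv(W_train))
show_faces(reconstructed[:10], filename='pca_reconstruction.png')
```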
### Kernel PCA (`PCA.py`)

First, map the data into a feature space:

$$
X_{n\times d} \rightarrow \phi(X)_{n\times m}
$$

and then compute the covariance as `PCA` does:

$$
\begin{aligned}
cov(\phi(X)) &= \frac{1}{n}(\phi(X)-\frac{1}{n}\sum\phi(X))^T(\phi(X)-\frac{1}{n}\sum\phi(X)),\ (\ let\ \Psi(X) = \phi(X)-\frac{1}{n}\sum\phi(X)\ )\\
&= \frac{1}{n}\Psi(X)^T\Psi(X) \\
&= S
\end{aligned}
$$

We want to solve $SW=\lambda W$, but $\Psi(X)$ is not explicit.
==Assume $X$ is $d\times n$==,

$$
\begin{aligned}
Sw &= \lambda w \\
\frac{1}{n}XX^Tw &= \lambda w \\
\frac{1}{n\lambda} \sum x_i(x_i^Tw) &= w \\
\sum[\frac{1}{n\lambda}(x_i^Tw)]x_i &= w \\
\sum \alpha_i x_i &= w \\
XA &= w
\end{aligned}
$$

We can observe that the eigenvector is a linear combination of the data points, so we can rewrite the eigen-problem as follows: ==( here $X$ is $n\times d$ again )==

$$
\begin{aligned}
\frac{1}{n}\Psi(X)^T\Psi(X)w &= \lambda w \\
\frac{1}{n}\Psi(X)^T\Psi(X)\Psi(X)^TA &= \lambda \Psi(X)^TA \\
\frac{1}{n}\Psi(X)\Psi(X)^T\Psi(X)\Psi(X)^TA &= \lambda \Psi(X)\Psi(X)^TA \\
\frac{1}{n}K^cK^cA &= \lambda K^cA \\
\frac{1}{n}K^cA &= \lambda A
\end{aligned}
$$

What is $K^c$ ?

$$
\begin{aligned}
let\ K &= \phi(X)\phi(X)^T \\
K^c &= \Psi(X)\Psi(X)^T \\
&= (\phi(X)-\frac{1}{n}\sum\phi(X))(\phi(X)-\frac{1}{n}\sum\phi(X))^T \\
&= \phi(X)\phi(X)^T-(\frac{1}{n}\sum\phi(X))\phi(X)^T-\phi(X)(\frac{1}{n}\sum\phi(X))^T+\frac{1}{n^2}\sum\sum\phi(X)\phi(X)^T \\
&= K-1_nK-K1_n+1_nK1_n,\ (\ 1_n\ is\ an\ n\times n\ matrix\ whose\ elements\ are\ all\ \frac{1}{n}\ )
\end{aligned}
$$

```python
def __get_covariance(self, datas):
    n = datas.shape[0]
    if self.is_kernel:
        N1 = np.ones((n, n)) / n
        S = (datas - np.matmul(N1, datas) - np.matmul(datas, N1) +
             np.matmul(np.matmul(N1, datas), N1))
        return S
    else:
        ...
```

How do we project the data?

$$
\begin{aligned}
\phi(X)W &= \phi(X)\Psi(X)^TA \\
&= \phi(X)(\phi(X)-\frac{1}{n}\sum\phi(X))^TA \\
&= (\phi(X)\phi(X)^T-\phi(X)(\frac{1}{n}\sum\phi(X)^T))A \\
&= (K-1_nK)A
\end{aligned}
$$

```python
def run(self):
    S = self.__get_covariance(self.datas)
    e_vals, e_vecs = self.__get_sorted_eigen(S, self.k)
    mean_datas = self.datas - np.sum(
        self.datas, axis=0) / self.datas.shape[0]
    if self.is_kernel:
        W = e_vecs
        W /= np.sqrt(e_vals)
        N1 = np.ones(self.datas.shape) / self.datas.shape[0]
        pca_space = np.matmul((self.datas - np.matmul(N1, self.datas)), W)
    else:
        ...

...
K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()
```

How about projecting new data $X_{test}$ ?

$$
\begin{aligned}
\phi(X_{test})W &= \phi(X_{test})\Psi(X)^TA \\
&= \phi(X_{test})(\phi(X)-\frac{1}{n}\sum\phi(X))^TA \\
&= (\phi(X_{test})\phi(X)^T-\phi(X_{test})(\frac{1}{n}\sum\phi(X)^T))A \\
&= (K_{test\_train}-1_{m\times n}K)A,\ (\ 1_{m\times n}\ is\ an\ m\times n\ matrix\ whose\ elements\ are\ all\ \frac{1}{n}\ )
\end{aligned}
$$

```python
K_train = kernel(train_faces, train_faces)
K_test_train = kernel(test_faces, train_faces)
kpca = PCA(K_train, k=25, is_kernel=True)
train_space, alpha = kpca.run()

NM1 = np.ones(K_test_train.shape) / K_train.shape[0]
### Project the testing data in the next line
test_space = np.matmul(K_test_train - np.matmul(NM1, K_train), alpha)
```

### LDA (`LDA.py`)

Unlike `PCA`, which projects the data into the eigen-feature space, `LDA` projects the data so as to maximize the distance between different labels/clusters and minimize the distance within labels/clusters. We solve for the projection matrix $W$ from the eigen-problem below:

$$
S_W^{-1}S_Bw=\lambda w, \ where\\
S_W = \sum_{j=1}^k S_j,\ S_j = \sum_{i \in C_j} (x_i-m_j)(x_i-m_j)^T,\ m_j = \frac{1}{n_j}\sum_{i \in C_j}x_i \\
S_B = \sum_{j=1}^k S_{B_{j}} = \sum_{j=1}^k n_j(m_j-m)(m_j-m)^T
$$

But $S_W$ might be singular when $n < d$, so we can use its pseudo-inverse instead. In our work, $n$ is indeed smaller than $d$, so I compute the pseudo-inverse directly.
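To see why the pseudo-inverse is a safe substitute here, a tiny NumPy sketch with a toy rank-deficient matrix ( my own illustration, not the face data ): `np.linalg.pinv` works through the SVD and inverts only the non-zero singular values, so a singular $S_W$ still yields a usable $S_W^{-1}S_B$.

```python
import numpy as np

# Toy, rank-deficient "within-class scatter" (hypothetical 3x3 matrix, rank 2).
Sw = np.array([[2., 1., 0.],
               [1., 2., 0.],
               [0., 0., 0.]])

# np.linalg.inv(Sw) would raise LinAlgError here; pinv goes through the SVD
# and only inverts the non-zero singular values:
U, s, Vt = np.linalg.svd(Sw)
s_inv = np.zeros_like(s)
s_inv[s > 1e-10] = 1. / s[s > 1e-10]
Sw_pinv_manual = np.matmul(Vt.T * s_inv, U.T)

assert np.allclose(Sw_pinv_manual, np.linalg.pinv(Sw))
```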
```python
class LDA():
    def __init__(self, datas, labels, k, is_kernel=False):
        self.datas = datas
        self.n = datas.shape[0]
        self.labels = labels
        self.k = k
        self.is_kernel = is_kernel

    def __count_labels(self):
        C = np.zeros((self.n, len(np.unique(self.labels))))
        for idx, j in enumerate(np.unique(self.labels)):
            C[self.labels == j, idx] = 1
        return C

    def __get_Sb_Sw(self, C):
        Mj = np.matmul(self.datas.T, C) / np.sum(C, axis=0)
        M = np.sum(self.datas.T, axis=1) / self.datas.shape[0]
        B = Mj - M[:, None]
        Sb = np.matmul(B * np.sum(C, axis=0), B.T)
        W = self.datas.T - np.matmul(Mj, C.T)
        Sw = np.zeros(Sb.shape)
        for group in np.unique(self.labels):
            w = W[:, self.labels == group]
            Sw += (np.matmul(w, w.T) / w.shape[1])
        return Sb, Sw

    def __get_sorted_eigen(self, A, k):
        eigenvalues, eigenvectors = np.linalg.eigh(A)
        sorted_idx = np.flip(np.argsort(eigenvalues))
        sorted_eigenvalues = []
        sorted_eigenvectors = []
        for i in range(k):
            vector = eigenvectors[:, sorted_idx[i]]
            sorted_eigenvectors.append(vector[:, None])
            sorted_eigenvalues.append(eigenvalues[sorted_idx[i]])
        sorted_eigenvalues = np.array(sorted_eigenvalues)
        sorted_eigenvectors = np.concatenate(sorted_eigenvectors, axis=1)
        return sorted_eigenvalues, sorted_eigenvectors

    def run(self):
        C = self.__count_labels()
        Sb, Sw = self.__get_Sb_Sw(C)
        #try:
        #    Sw_inv = np.linalg.inv(Sw)
        #except:
        #    Sw_inv = np.linalg.pinv(Sw)
        Sw_inv = np.linalg.pinv(Sw)
        obj_matrix = np.matmul(Sw_inv, Sb)
        value, vector = self.__get_sorted_eigen(obj_matrix, self.k)
        return np.matmul(self.datas, vector), vector
```

### Kernel LDA (`LDA.py`)

Reference: [Wikipedia](https://en.wikipedia.org/wiki/Kernel_Fisher_discriminant_analysis).

> However, I could not map the testing data into the training space correctly before the deadline, so my face-recognition accuracy with kernel LDA is low......

### K-Nearest Neighbor

Each test point is assigned to the class that has the largest number of members among its k nearest neighbors.

```python
import numpy as np


class KNN():
    def __init__(self, train_datas, test_datas, labels, k=5):
        self.train_datas = train_datas
        self.test_datas = test_datas
        self.labels = labels
        self.k = k

    def __eucidiance(self, U, V):
        return np.matmul(U**2, np.ones(
            (U.shape[1], V.shape[0]))) - 2 * np.matmul(U, V.T) + np.matmul(
                np.ones((U.shape[0], V.shape[1])), (V.T)**2)

    def run(self):
        dist = self.__eucidiance(self.test_datas, self.train_datas)
        closet = np.argsort(dist, axis=1)
        y = []
        for i in range(self.test_datas.shape[0]):
            for j in range(self.k):
                y.append(self.labels[closet[i][j]])
        y = np.array(y)
        y = y.reshape(self.test_datas.shape[0], self.k)
        count = []
        for i in range(len(self.labels)):
            count.append(np.count_nonzero(y == self.labels[i], axis=1))
        count = np.vstack(count)
        y = np.argmax(count, axis=0)
        predict = [None] * self.test_datas.shape[0]
        for i in range(self.test_datas.shape[0]):
            predict[i] = self.labels[y[i]]
        return predict

...
knn = KNN(train_space, test_space, train_labels, k=5)
predict = knn.run()
```

### Kernel ( in `util.py` )

I use three different kernels: linear, polynomial, and RBF.

```python
def euclidean(u, v):
    return np.matmul(u**2, np.ones(
        (u.shape[1], v.shape[0]))) - 2 * np.matmul(u, v.T) + np.matmul(
            np.ones((u.shape[0], v.shape[1])), (v.T)**2)


def rbf(u, v, g=10**-11):
    return np.exp(-1 * g * euclidean(u, v))


def linear(u, v):
    return np.matmul(u, v.T)


def polynomial(u, v, g=0.7, coef0=10, d=5):
    return ((g * np.matmul(u, v.T)) + coef0)**d
```
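These kernels plug into the same `KPCA` + `KNN` pipeline shown earlier. A hedged sketch of how the kernel comparison in the results could be driven ( this only loops over calls already listed above; `test_labels` is a hypothetical variable returned by `read_faces` for the testing directory, and the real `main.py` may be organized differently ):

```python
# Hypothetical driver loop; the actual main.py may differ.
for kernel_fn in (linear, polynomial, rbf):
    K_train = kernel_fn(train_faces, train_faces)
    K_test_train = kernel_fn(test_faces, train_faces)

    kpca = PCA(K_train, k=25, is_kernel=True)
    train_space, alpha = kpca.run()

    NM1 = np.ones(K_test_train.shape) / K_train.shape[0]
    test_space = np.matmul(K_test_train - np.matmul(NM1, K_train), alpha)

    predict = KNN(train_space, test_space, train_labels, k=5).run()
    acc = np.mean([p == t for p, t in zip(predict, test_labels)])
    print('(%s) accuracy: %.2f%%' % (kernel_fn.__name__, 100 * acc))
```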
### Other Methods in My Implementation ( in `util.py` )

The `read_faces` method reads the dataset and resizes each image to `100*100`.
The `show_faces` method concatenates the face matrices along columns and rows for visualization.

```python
def read_faces(dir_path):
    faces = []
    labels = []
    for filename in os.listdir(dir_path):
        with PIL.Image.open(dir_path + filename) as im:
            im = im.resize((100, 100), PIL.Image.BILINEAR)
            faces.append([np.array(im).reshape(1, -1)])
            labels.append(filename.split('.', 1)[0])
    faces = np.concatenate(faces, axis=0)
    faces = faces.reshape(faces.shape[0], faces.shape[2])
    faces = faces.astype('int64')
    return faces, labels


def show_faces(faces, filename=None, col=5):
    #f = faces.reshape(-1,231,195)
    f = faces.reshape(-1, 100, 100)
    n = f.shape[0]
    all_faces = []
    for i in range(int(n / col)):
        all_faces.append([np.concatenate(f[col * i:col * (i + 1)], axis=1)])
    all_faces = np.concatenate(all_faces[:], axis=1)
    all_faces = all_faces.reshape(all_faces.shape[1], all_faces.shape[2])
    plt.figure(figsize=(1.5 * col, 1.5 * n / col))
    plt.title(filename)
    plt.imshow(all_faces, cmap='gray')
    plt.subplots_adjust(left=0.05, right=0.95, top=0.85, bottom=0.15)
    if (filename):
        plt.savefig('./output/' + filename)
    else:
        plt.show()
```

### Result and Discussion

1. The first 25 eigenfaces and fisherfaces

![](https://i.imgur.com/hCHyk2l.png)
![](https://i.imgur.com/BnUTdNY.png)

2. Reconstruct 10 random faces

![](https://i.imgur.com/gr2aiNC.png)
![](https://i.imgur.com/Q6Zdq69.png)
![](https://i.imgur.com/bigt9DH.png)

3. and 4. Face recognition via PCA/KPCA and LDA/KLDA

```terminal
(base) swchiu@gpuserv1:~/Embedding$ python3 main.py
PCA
Face-reconition accuracy: 27/30 = 90.00%
Kernel PCA
(rbf) Face-reconition accuracy: 27/30 = 90.00%
(polynomial) Face-reconition accuracy: 25/30 = 83.33%
(linear) Face-reconition accuracy: 27/30 = 90.00%
LDA
Face-reconition accuracy: 28/30 = 93.33%
Kernel LDA
(rbf) Face-reconition accuracy: 24/30 = 80.00%
(polynomial) Face-reconition accuracy: 18/30 = 60.00%
(linear) Face-reconition accuracy: 18/30 = 60.00%
```

`LDA` performs better than `PCA`. I think the reason is that `LDA` is designed to separate the classes as widely as possible, while `PCA` only maps the data to preserve the maximum variance. In this case we use `KNN`, which is based on Euclidean distance, to classify the data, and `LDA` separates the classes widely in Euclidean distance.

`RBF` and `linear` look similar in `KPCA`, and `RBF` performs best in `KLDA`. `Polynomial` performs worst in both `KPCA` and `KLDA`. However, the `KLDA` results did not meet my expectation. I think the problem is that I map the testing data into the training space in a wrong way. I tried to derive the formula on my own, but could not make sense of it before the deadline.

## t-SNE / Symmetric SNE

The code refers to [https://lvdmaaten.github.io/tsne/code/tsne_python.zip](https://lvdmaaten.github.io/tsne/code/tsne_python.zip) and is modified from it.

### Modify t-SNE into symmetric SNE

`t-SNE` uses a different distribution on the reduced ( low-dimensional ) data, i.e., the `Student t-distribution`, which has heavier tails than the `Gaussian`. Because `t-SNE` uses this heavier-tailed distribution, the crowding problem is relieved.

![](https://i.imgur.com/alTK0t4.png)

> We can see that for pairs with the same similarity, the Student t-distribution corresponds to a larger distance than the Gaussian, so `t-SNE` places such points farther apart in the embedding.

I will show the crowding problem that `symmetric SNE` faces in the result part below. The concept of `SNE` is to preserve the pairwise probability distribution after the reduction.
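To make the heavier-tail comparison above concrete, here is a small `matplotlib` sketch of my own ( it is not the code that generated the figure above ) comparing the two low-dimensional similarity kernels that are defined formally next:

```python
import numpy as np
import matplotlib.pyplot as plt

d = np.linspace(0., 5., 200)      # pairwise distance in the low-dimensional space
gaussian = np.exp(-d ** 2)        # symmetric SNE: exp(-||y_i - y_j||^2)
student_t = 1. / (1. + d ** 2)    # t-SNE: (1 + ||y_i - y_j||^2)^(-1)

plt.plot(d, gaussian, label='Gaussian (symmetric SNE)')
plt.plot(d, student_t, label='Student t, dof=1 (t-SNE)')
plt.xlabel('distance in low-dimensional space')
plt.ylabel('unnormalized similarity')
plt.legend()
plt.show()
```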
Look at the definitions of the pairwise probabilities in `symmetric SNE` and `t-SNE`:

$$
\begin{aligned}
symmetric\ SNE: \\
p_{ij} &= \frac{exp(-||x_i - x_j||^2/(2\sigma^2))}{\sum_{k \neq l} exp(-||x_l - x_k||^2/(2\sigma^2))} \\
q_{ij} &= \frac{exp(-||y_i - y_j||^2)}{\sum_{k \neq l} exp(-||y_l - y_k||^2)} \\
t-SNE: \\
p_{ij} &= \frac{exp(-||x_i - x_j||^2/(2\sigma^2))}{\sum_{k \neq l} exp(-||x_l - x_k||^2/(2\sigma^2))} \\
q_{ij} &= \frac{(1+||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1+||y_l - y_k||^2)^{-1}}
\end{aligned}
$$

We can see that the difference is in the $q_{ij}$ part: `t-SNE` uses a different distribution to describe the low-dimensional similarities. The gradient is also different because of the different $q_{ij}$:

$$
\begin{aligned}
C &= KL(P||Q)=\sum_{i}\sum_{j \neq i}p_{ij}\log\frac{p_{ij}}{q_{ij}}\\
symmetric\ SNE: \\
\frac{\delta C}{\delta y_i} &= 4\sum_{j}(p_{ij}-q_{ij})(y_i-y_j) \\
t-SNE: \\
\frac{\delta C}{\delta y_i} &= 4\sum_{j}(p_{ij}-q_{ij})(y_i-y_j)(1+||y_i - y_j||^2)^{-1}
\end{aligned}
$$

So we only need to find the specific code segments related to these parts and modify them into the `symmetric SNE` form.

> Looking at $C$, we can observe that points which are close ( similar ) in the higher dimension will not be placed far apart in the lower dimension, because that is heavily penalized.
> But points which are far apart ( dissimilar ) in the higher dimension might still end up close in the lower dimension, since that costs very little!
> [name=Shao-Wei Chiu]

The first part is shown below ( `line 14 to line 15` ):

```python=1
...
# Compute pairwise affinities
sum_Y = np.sum(np.square(Y), 1)
num = -2. * np.dot(Y, Y.T)
num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
num[range(n), range(n)] = 0.
Q = num / np.sum(num)
Q = np.maximum(Q, 1e-12)

if sym_sne:
    sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
    num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
    # different between t-sne region
    num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
    num_ssne = np.exp(-1 * num_ssne)
    # different between t-sne region
    num_ssne[range(n), range(n)] = 0.
    Q_ssne = num_ssne / np.sum(num_ssne)
    Q_ssne = np.maximum(Q_ssne, 1e-12)
...
```

And the second part is at `line 9 to line 10`:

```python=
...
dY[i, :] = np.sum(
    np.tile(PQ[:, i] * num[:, i],
            (no_dims, 1)).T * (Y[i, :] - Y), 0)

if sym_sne:

    # different between t-sne
    dY_ssne[i, :] = np.sum(
        np.tile(PQ_ssne[:, i], (no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)
...
```

I modified the code so that both `t-SNE` and `symmetric SNE` are performed by a single call to the `tsne()` method, so I copy some of the `t-SNE` parameters to run `symmetric SNE` at the same time. I also store `Y` into a list every ten iterations for visualization, and I implemented some useful methods for visualization ( e.g., `make_gif(...)`, `show_similarity(...)` ), too.
```python
def make_gif(record, labels, method, perplexity):
    camera = Camera(plt.figure())
    plt.title(method + ' with Perplexity=' + str(perplexity))
    for i in range(len(record)):
        img = plt.scatter(record[i][:, 0], record[i][:, 1], 20, labels)
        camera.snap()
    anim = camera.animate(interval=5, repeat_delay=20)
    anim.save(
        'output/' + method + '_' + str(perplexity) + '.gif', writer='pillow')
    plt.scatter(record[-1][:, 0], record[-1][:, 1], 20, labels)
    plt.savefig('output/' + method + '_' + str(perplexity) + '.png')


def show_similarity(S, labels, title, filename, perplexity):
    n = len(S)
    sort_idx = np.concatenate(
        [np.where(labels == l)[0] for l in np.unique(labels)])
    plt.figure(figsize=(10 * n, 7.5))
    S = [np.log(p[:, sort_idx][sort_idx, :]) for p in S]
    all_min = min([np.min(p) for p in S])
    all_max = max([np.max(p) for p in S])
    for i in range(n):
        plt.subplot(1, n, i + 1)
        plt.title(title[i])
        im = plt.imshow(S[i], cmap='gray', vmin=all_min, vmax=all_max)
        plt.colorbar(im)
    plt.savefig('output/' + filename + '_' + str(perplexity) + '.png')
```

```python
def gradient_descend(iter, Y, iY, dY, gains, min_gain, momentum, eta):
    gains = (gains + 0.2) * ((dY > 0.) != (iY > 0.)) + (gains * 0.8) * (
        (dY > 0.) == (iY > 0.))
    gains[gains < min_gain] = min_gain
    iY = momentum * iY - eta * (gains * dY)
    Y = Y + iY
    Y = Y - np.tile(np.mean(Y, 0), (Y.shape[0], 1))
    return Y, iY, gains


def tsne(
        X=np.array([]), no_dims=2, initial_dims=50, perplexity=30.0,
        sym_sne=False):
    """
        Runs t-SNE on the dataset in the NxD array X to reduce its
        dimensionality to no_dims dimensions. The syntax of the function is
        `Y = tsne.tsne(X, no_dims, perplexity)`, where X is an NxD NumPy array.
    """

    # Check inputs
    if isinstance(no_dims, float):
        print("Error: array X should have type float.")
        return -1
    if round(no_dims) != no_dims:
        print("Error: number of dimensions should be an integer.")
        return -1

    # Initialize variables
    (n, d) = X.shape
    max_iter = 1000
    initial_momentum = 0.5
    final_momentum = 0.8
    eta = 500
    min_gain = 0.01
    record = []
    Y = np.random.randn(n, no_dims)
    dY = np.zeros((n, no_dims))
    iY = np.zeros((n, no_dims))
    gains = np.ones((n, no_dims))
    if sym_sne:
        record_ssne = []
        Y_ssne = Y.copy()
        dY_ssne = np.zeros((n, no_dims))
        iY_ssne = np.zeros((n, no_dims))
        gains_ssne = np.ones((n, no_dims))
        Q_ssne = np.zeros((n, n))

    # Compute P-values
    P = x2p(X, 1e-5, perplexity)
    P = P + np.transpose(P)
    P = P / np.sum(P)
    P = P * 4.  # early exaggeration
    P = np.maximum(P, 1e-12)

    # Run iterations
    for iter in range(max_iter):

        # Compute pairwise affinities
        sum_Y = np.sum(np.square(Y), 1)
        num = -2. * np.dot(Y, Y.T)
        num = 1. / (1. + np.add(np.add(num, sum_Y).T, sum_Y))
        num[range(n), range(n)] = 0.
        Q = num / np.sum(num)
        Q = np.maximum(Q, 1e-12)

        if sym_sne:
            sum_Y_ssne = np.sum(np.square(Y_ssne), 1)
            num_ssne = -2. * np.dot(Y_ssne, Y_ssne.T)
            # different between t-sne region
            num_ssne = np.add(np.add(num_ssne, sum_Y_ssne).T, sum_Y_ssne)
            num_ssne = np.exp(-1 * num_ssne)
            # different between t-sne region
            num_ssne[range(n), range(n)] = 0.
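            # Note (added for clarity): from here on the symmetric-SNE branch
            # mirrors the t-SNE branch above -- the Gaussian affinities are
            # normalized into Q_ssne and clipped, exactly as Q is built from
            # the Student-t kernel `num`.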
            Q_ssne = num_ssne / np.sum(num_ssne)
            Q_ssne = np.maximum(Q_ssne, 1e-12)

        # Compute gradient
        PQ = P - Q
        if sym_sne:
            PQ_ssne = P - Q_ssne
        for i in range(n):
            dY[i, :] = np.sum(
                np.tile(PQ[:, i] * num[:, i],
                        (no_dims, 1)).T * (Y[i, :] - Y), 0)
            if sym_sne:
                # different between t-sne
                dY_ssne[i, :] = np.sum(
                    np.tile(PQ_ssne[:, i],
                            (no_dims, 1)).T * (Y_ssne[i, :] - Y_ssne), 0)

        # Perform the update
        if iter < 20:
            momentum = initial_momentum
        else:
            momentum = final_momentum
        Y, iY, gains = gradient_descend(iter, Y, iY, dY, gains, min_gain,
                                        momentum, eta)
        if sym_sne:
            Y_ssne, iY_ssne, gains_ssne = gradient_descend(
                iter, Y_ssne, iY_ssne, dY_ssne, gains_ssne, min_gain,
                momentum, eta)

        # Compute current value of cost function
        if (iter + 1) % 10 == 0:
            C = np.sum(P * np.log(P / Q))
            record.append(Y)
            if sym_sne:
                C_ssne = np.sum(P * np.log(P / Q_ssne))
                record_ssne.append(Y_ssne)
                print("Iter %4d: tsne error = %f, sym-sne error = %f" %
                      (iter + 1, C, C_ssne))
            else:
                print("Iter %4d: tsne error = %f" % (iter + 1, C))

        # Stop lying about P-values
        if iter == 100:
            P = P / 4.

    # Return solution
    if not sym_sne:
        return record, P, Q
    else:
        return (record, record_ssne), P, (Q, Q_ssne)
```

### Result and Discussion

1. Visualize the embedding of both `t-SNE` and `symmetric SNE`
    * t-SNE with perplexity=64
    ![t-SNE with perplexity=64](https://i.imgur.com/kX2A7nj.gif)
    * symmetric SNE with perplexity=64
    ![symmetric SNE with perplexity=64](https://i.imgur.com/G8B0Rdh.gif)

2. Visualize the distribution of pairwise similarities in both the high-dimensional and the low-dimensional space, based on both `t-SNE` and `symmetric SNE` ( the panels show $P$, then $Q$ of `t-SNE` and of `symmetric SNE`, respectively )

![](https://i.imgur.com/d9EWBvA.jpg)

> We can see that the `t-SNE` similarity map is lighter. This is because the t-distribution has heavier tails: for small distances its ( normalized ) probability is lower, so the similarity mass is spread more evenly over all pairs.

3. Play with different perplexity values

![](https://i.imgur.com/z8kdfSv.jpg)

We can see that when the `perplexity` becomes large, `t-SNE` behaves more similarly to `symmetric SNE`. I think this is related to entropy: a large `perplexity` implies a large entropy, so the original ( high-dimensional ) distribution is closer to uniform; conversely, a small `perplexity` implies a small entropy, and the original distribution is closer to a delta distribution. It is hard for `t-SNE` to tell points apart under a near-uniform distribution, because the high-dimensional points all look similar, so the result becomes crowded. For the small `perplexity` case, I observe that `t-SNE` seems to split some labels into several groups; I think this is because the high-dimensional points become more dissimilar.
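Finally, for completeness, a hedged sketch of how these experiments could be driven ( the actual driver script is not shown in this report; `mnist2500_X.txt` / `mnist2500_labels.txt` are the sample data shipped with the reference implementation, `x2p` also comes from it, and the perplexity values here are only illustrative ):

```python
import numpy as np

# Hypothetical driver; the real script used for the report may differ.
# tsne, make_gif and show_similarity are the functions defined above.
X = np.loadtxt('mnist2500_X.txt')
labels = np.loadtxt('mnist2500_labels.txt')

for perplexity in (8.0, 16.0, 32.0, 64.0):
    (record, record_ssne), P, (Q, Q_ssne) = tsne(
        X, no_dims=2, initial_dims=50, perplexity=perplexity, sym_sne=True)

    # .gif / .png of the embedding recorded every ten iterations
    make_gif(record, labels, 'tsne', perplexity)
    make_gif(record_ssne, labels, 'ssne', perplexity)

    # pairwise similarities in the high- and low-dimensional spaces
    show_similarity([P, Q, Q_ssne], labels,
                    ['P (high dim)', 'Q (t-SNE)', 'Q (symmetric SNE)'],
                    'similarity', perplexity)
```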