<style>
H2 { color: #E87765; font-weight: bold; }
H3 { color: #89A85A; font-weight: bold; }
li { color: #4EA5C1; font-weight: bold; }
li li, p { color: black; font-weight: normal !important; }
table, th { color: black; font-weight: normal; font-size: 3px !important; }
</style>

# hw5

## Gaussian Process

### Code

* **Task1**
    * Rational quadratic kernel
        * [$k_{RQ}(x, x')$](https://www.cs.cmu.edu/~epxing/Class/10708-15/notes/10708_scribe_lecture21.pdf) $= (1 + {\Vert x-x'\Vert^2 \over 2 \alpha l^2})^{-\alpha}$
        * ![](https://i.imgur.com/mwLXTmJ.png)
    * Marginal likelihood
        * $p(y) = \int p(y|f)p(f)df = N(y|0, C)$
        * The observations also carry noise $\epsilon \sim N(\cdot|0, \beta^{-1})$.
        * Thus $C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{nm}$.
        * $\delta_{nm} = 1$ if $n = m$ and 0 otherwise, so in matrix form $\beta^{-1} \delta_{nm}$ becomes $\beta^{-1} I$.
        * ![](https://i.imgur.com/ayYSEEC.png)
    * Goal: the conditional distribution $p(y^*|y) \sim N(\mu(x^*), \sigma^2(x^*))$
        * $\mu(x^*) = k(x, x^*)^T C^{-1} y$
        * $\sigma^2(x^*) = k^* - k(x, x^*)^T C^{-1} k(x, x^*)$
        * $k^* = k(x^*, x^*) + \beta^{-1}$
        * ![](https://i.imgur.com/iumu0qw.png)
    * Visualization
        * Use the z-score $z = {x-\mu \over \sigma}$ to mark the 95% confidence interval, i.e. $z = \pm 1.96$.
        * ![](https://i.imgur.com/PFGE9NS.png)
* **Task2: Optimize the kernel**
    * Marginal likelihood $p(y|\theta) \sim N(y|0, C_\theta)$
    * Negative marginal log likelihood $J = -\ln p(y|\theta) = {1 \over 2}[N \ln(2\pi) + y^T C_\theta^{-1} y + \ln |C_\theta|]$
    * Minimizing J turns maximizing the marginal likelihood into a minimization problem.
    * Call scipy.optimize.minimize to find the optimal $\theta$
        * $\theta_{new}$ = `minimize(objective_function, theta)` (the optimum is stored in the returned result's `x` field)
        * ![](https://i.imgur.com/oaHO2ml.png)
    * Now we can run the Gaussian process again with the updated hyperparameters $\theta_{new}$.

### Experiments

* **Part1**
    * l = 1.0, α = 1.0
    * ![](https://i.imgur.com/9DVtJvr.png)
* **Part2**
    * l = 2.9705239253258946, α = 1026.4899617040603
    * ![](https://i.imgur.com/G0HRWag.png)

### Observations and Discussion

* Comparison between Part1 & Part2
    * Optimization makes the 95% interval smaller, so the prediction is more precise.
    * Optimization makes the curve smoother rather than overfitting.
* Observation of the confidence interval
    * The confidence interval (variance) is quite large in regions without training data.
        * This is because the prediction is based on the training data.
    * In contrast, a small confidence interval usually comes from the presence of nearby training data.

## SVM on MNIST

### Code

* **Task1**
    * Wrap the libsvm functions in svm()
    ```python
    if __name__ == '__main__':
        # Task 1
        svm('linear')
        svm('polynomial')
        svm('RBF')
    ```
    * Refer to the libsvm [documentation](https://www.csie.ntu.edu.tw/~cjlin/libsvm/) for the meanings of the parameters
    ```
    -t kernel_type : set type of kernel function (default 2)
        0 -- linear: u'*v
        1 -- polynomial: (gamma*u'*v + coef0)^degree
        2 -- radial basis function: exp(-gamma*|u-v|^2)
    ```
    * svm(): self-defined wrapper function (the assumed data/import setup is sketched after this list)
    ```python
    kernel = {'linear': 0, 'polynomial': 1, 'RBF': 2}

    def svm(k):
        print(f'kernel_type : {k}')
        start = time.time()
        param = svm_parameter(f'-t {kernel[k]}')          # -t selects the kernel type
        prob = svm_problem(Y_train, X_train)
        model = svm_train(prob, param)
        _, p_acc, _ = svm_predict(Y_test, X_test, model)  # test-set accuracy
        end = time.time()
        print("Time: %0.2f seconds." % (end - start))
        print()
    ```
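
The snippets above assume that `X_train`, `Y_train`, `X_test`, `Y_test` are already in memory and that `time` and the libsvm Python bindings are imported. A minimal sketch of that setup, assuming the data come as CSV files; the file names here are placeholders, not necessarily the actual homework file names:

```python
import time

import numpy as np
# Assumes libsvm was installed with `pip install libsvm`;
# older setups expose the same functions via `from svmutil import *`.
from libsvm.svmutil import svm_problem, svm_parameter, svm_train, svm_predict

# Placeholder file names: each row of X is assumed to be a flattened 28x28
# MNIST image and each row of Y the corresponding digit label.
X_train = np.loadtxt('X_train.csv', delimiter=',').tolist()
Y_train = np.loadtxt('Y_train.csv', delimiter=',').astype(int).tolist()
X_test = np.loadtxt('X_test.csv', delimiter=',').tolist()
Y_test = np.loadtxt('Y_test.csv', delimiter=',').astype(int).tolist()
```
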
* **Task2**
    * In soft-margin SVC, the penalty parameter C can also be tuned.
    ```python
    def grid_search_on_c(arg, max_acc):
        # `arg` is an option-string template (e.g. containing '-v 5'), so
        # svm_train runs cross-validation and returns the CV accuracy
        # instead of a model.
        best_c = 1e-1
        for c in [1e-2, 1e-1, 1e0, 1e1, 1e2]:
            param = svm_parameter(arg.format(c))
            prob = svm_problem(Y_train, X_train)
            p_acc = svm_train(prob, param)
            if p_acc > max_acc:
                max_acc = p_acc
                best_c = c
        return max_acc, best_c
    ```
    * grid_search()
        * Idea
            * The [guide](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) notes that a complete grid search can be time-consuming, so it recommends starting with a coarse grid.
            * After identifying a "better" region on the grid, a finer grid search can be conducted on that region.
        * Implementation
            * Nested loops try the candidate values for each parameter and keep the combination with the best cross-validation accuracy.
            * The parameters for the different kernel types are as follows:
            ```
            -c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
            -d degree : set degree in kernel function (default 3)
            -g gamma : set gamma in kernel function (default 1/k)
            -r coef0 : set coef0 in kernel function (default 0)
            -v n : n-fold cross-validation mode; svm_train then returns the cross-validation accuracy.
            ```
            * Linear kernel: -c C
            * Polynomial kernel $(\gamma x_i^T x_j + r)^d, \gamma > 0$: -c C, -g $\gamma$, -r $r$, -d $d$
            * RBF kernel $\exp(-\gamma \Vert x_i-x_j\Vert^2)$: -c C, -g $\gamma$
        * Code
            * ![](https://i.imgur.com/Ktu4XKE.png)
        * Note
            * For the polynomial kernel, I first tried $\gamma$ on [1e-2, 1e-1, 1e0, 1e1, 1e2], but the accuracy was low at the beginning (about 20%) and the process was extremely time-consuming.
            * I then switched to the range [1e0, 1e1] and reached about 98% accuracy. The same was done for degree and coef0 at the same time.
* **Task3**
    * Parameters
        * svm_parameter
            ```
            -t 4 -- precomputed kernel (kernel values in training_set_file)
            ```
        * svm_problem
            ```
            isKernel=True for precomputed kernel
            ```
    * Code
        * Kernel functions ([formula](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)); see the sketch after this list:
            * Linear kernel: $x_i^T x_j$
            * RBF kernel: $\exp(-\gamma \Vert x_i-x_j\Vert^2)$
            * ![](https://i.imgur.com/nuZ57dE.png)
        * Do a grid search on the new model as in Task2
            * ![](https://i.imgur.com/1xaeFeg.png)
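
For reference, here is a minimal, self-contained sketch of the Task3 precomputed linear + RBF kernel. The screenshots above show the actual implementation; the helper names, the `from libsvm.svmutil import ...` path, and the use of `scipy.spatial.distance.cdist` are my assumptions, and the defaults `gamma=0.01, c=0.1` simply mirror the linear+RBF parameters reported in Task3 below.

```python
import numpy as np
from scipy.spatial.distance import cdist
from libsvm.svmutil import svm_problem, svm_parameter, svm_train, svm_predict


def linear_rbf_kernel(Xa, Xb, gamma):
    """Linear kernel plus RBF kernel; a sum of valid kernels is still a valid kernel."""
    Xa, Xb = np.asarray(Xa, dtype=float), np.asarray(Xb, dtype=float)
    linear = Xa @ Xb.T
    rbf = np.exp(-gamma * cdist(Xa, Xb, 'sqeuclidean'))
    return linear + rbf


def svm_precomputed(X_train, Y_train, X_test, Y_test, gamma=0.01, c=0.1):
    # libsvm's precomputed-kernel format: every row starts with a 1-based ID,
    # followed by K(x, x_1), ..., K(x, x_N) against the N training points.
    K_train = linear_rbf_kernel(X_train, X_train, gamma)
    K_test = linear_rbf_kernel(X_test, X_train, gamma)
    train_rows = [[i + 1] + row.tolist() for i, row in enumerate(K_train)]
    test_rows = [[i + 1] + row.tolist() for i, row in enumerate(K_test)]

    prob = svm_problem(Y_train, train_rows, isKernel=True)  # -t 4 requires isKernel=True
    param = svm_parameter(f'-t 4 -c {c}')
    model = svm_train(prob, param)
    _, p_acc, _ = svm_predict(Y_test, test_rows, model)
    return p_acc
```

The same coarse-to-fine grid search as in Task2 can then loop over `gamma` and `c`, calling a function like this (or a variant with `-v n` for cross-validation) at each grid point.
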
### Experiments

* **Task1**
    * <img src='https://i.imgur.com/BDdJCFD.png' width=300>
* **Task2**
    * <img src='https://i.imgur.com/AuIWedc.png' width=300>
    * <img src='https://i.imgur.com//vLCQ2K3.png' width=300>
    * <img src='https://i.imgur.com/D1MACPo.png' width=300>
* **Task3**
    * <img src='https://i.imgur.com/6g8bhnC.png' width=300>

### Observations and Discussion

* **Task1**
    * Let's have a summary first.

        | |linear|polynomial|RBF|
        |----|------|----------|---|
        |Test-Acc(%)|100|34.34|96.88|
        |Time(s)|6.6|55.99|14.18|

    * Comparison
        * The linear kernel has the best accuracy and the shortest runtime.
        * The RBF kernel also has good accuracy and an acceptable runtime.
        * The polynomial kernel has the worst accuracy and runtime.
    * Discussion
        * The data seems linearly separable, so the linear kernel is good enough for classification.
        * The [guide](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) suggests that after searching the (C, γ) space, the RBF kernel performs at least as well as the linear kernel.
        * However, before tuning the hyperparameters, RBF may perform worse than the linear kernel.
* **Task2**
    * Let's have a summary first.

        | |linear|polynomial|RBF|
        |----|------|----------|---|
        |CV-Acc(%)|97.12|98.26|98.44|
        |Test-Acc(%)|99.2|100|100|
        |params|-c 0.01|-c 100 -g 10 -d 2 -r 10|-g 0.01 -c 100|
        |Time(s)|57.34|475.81|1042.24|

    * About the ==runtime of the polynomial and RBF kernels==
        * Since the polynomial and RBF kernels are too time-consuming, I reduced the grid range to [1e0, 1e1] and [1e-3, 1e-2, 1e-1] respectively after trying a larger range (i.e., [1e-2, 1e-1, 1e0, 1e1, 1e2]).
        * The polynomial and RBF kernels require more parameters, so it is reasonable that they are slow when the parameters are not picked properly.
    * About ==linear vs. RBF==
        * After tuning, RBF performs better than either the linear or the polynomial kernel, but it is time-consuming.
        * Thus, if the data is linearly separable, the linear kernel is good enough and the model also runs faster. Another reason is that when the number of features is large, mapping the data to a high-dimensional space does not improve performance much.
        * However, to pursue the best accuracy, RBF is still a solid and more general choice.
* **Task3**
    * Let's have a summary first.

        | |linear|linear+RBF|RBF|
        |----|------|----------|---|
        |CV-Acc(%)|97.12|97.14|98.44|
        |Test-Acc(%)|99.2|99.26|100|
        |params|-c 0.01|-c 0.1 -g 0.01|-g 0.01 -c 100|
        |param range|same C range for all 3 models|g = [1e-3, 1e-2, 1e-1, 1e0, 1e1]|g = [1e-3, 1e-2, 1e-1]|
        |Time(s)|57.34|652.47|1042.24|

    * Observation
        * The accuracy of the combined model lies between linear and RBF.
        * The runtime of the combined model also lies between linear and RBF.
    * Discussion
        * The combined model looks like a compromise, useful when one cares about runtime but does not need the very best accuracy.
        * In this case, since the linear kernel is already good enough, I think the other two models are not really necessary.

###### tags: `Machine Learning`