# 2020 Machine Learning Homework 5
School: National Chiao Tung University
Name: Shao-Wei ( Willy ) Chiu
###### tags: `NCTU`, `Machine Learning`, `Gaussian Process`, `SVM`, `GitHub`
## Gaussian Process
* Training data:
* $\begin{bmatrix}
x_1 & y_1 \\
. & . \\
. & . \\
. & . \\
x_{n} & y_{n}
\end{bmatrix} = \begin{bmatrix}
X & Y
\end{bmatrix}$
* Kernel function:
* $k(x_n,x_m)=\sigma^2{(1+\frac{{\| x_n-x_m\|}^2}{2\alpha\ell^2})}^{-\alpha}$
There is a function $f$ could transfer each $x_i$ into coressponding $y_i$ ( i.e., $f(x_i) = y_i$ ).
Assume that $y_i = f(x_i) + \epsilon, where \ \epsilon \sim N(0, \beta)$ and $f \sim N(0, K_n)$.
(i.e., $Y\sim N(f, \beta)$)
On estimate the $x_*$ point, we have formula $\begin{bmatrix}Y\\ y_*\end{bmatrix} \sim N(\begin{bmatrix}Y\\ y_*\end{bmatrix}|0, K_{n+1})$
After the [derivation of probability](https://www.csie.ntu.edu.tw/~cjlin/mlgroup/tutorials/gpr.pdf), we get the
* $\mu(x_*) = k(x, x_*)^T(K_n+\beta I)^{-1}Y$
* $cov(x_*) = k(x_*, x_*)-k(x, x_*)^T(K_n+\beta I)^{-1}k(x, x_*)$
> This form is almost the same as the formula mentioned by Prof. Chiu in the class.
> The tiny difference is that it take the $\beta$ out of the matrix $K$, but the course silde takes it into the matrix.
> [name=Willy Chiu]
Thus, we could apply the $x_*$ to describe our model.
We notice that the kernel method is decided by some kernel parameters ( e.g., $\sigma$, $\alpha$, $\ell$ ), so we need to find the parameters which could have the maximum likelihood.
In my practice, I choose the random value of all parameter between 0 and 10, and call the `scipy.optimize.minimize` to optimize it.
The relative formula is shown below,
$argmax(ln\ p(y\ |\ \theta)) = -\frac{1}{2}ln\ |C_{\theta}|-\frac{1}{2}y^TC_{\theta}^{-1}-\frac{N}{2}ln\ (2\pi)$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \propto-ln\ |C_{\theta}|-y^TC_{\theta}^{-1}y$
$\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ =argmin(\ ln\ |C_{\theta}|+y^TC_{\theta}^{-1}y\ )$
### Step1
Use `get_K` to compute the covarince matrix, and `get_k_test` use to compute the $k(x, x_*)$ matrix.
```python
def rq_kernel(xn, xm, length_scale, scale_mixture, amplitude):
delta = abs(xn-xm)
return amplitude * (1 + delta**2/(2*scale_mixture*(length_scale**2)))**(-scale_mixture)
def get_K(X, length_scale, scale_mixture, amplitude, beta):
n = len(X)
K = np.zeros((n, n), dtype=np.float32)
for i in range(0, n):
for j in range(0, n):
K[i, j] = rq_kernel(X[i], X[j], length_scale, scale_mixture, amplitude)
if i==j:
K[i, j] += 1/beta
return K
def get_k_test(test, train, length_scale, scale_mixture, amplitude):
n = len(train)
K = np.zeros((n, 1), dtype=np.float32)
for i in range(0, n):
K[i, 0] = rq_kernel(train[i], test, length_scale, scale_mixture, amplitude)
return K
```
### Step2
Initial the value of parameters.
```python
#given assumption
beta = 5.0
#initial kernel parameter
length_scale = rd.uniform(1, 10.0)
scale_mixture = rd.uniform(1, 10.0)
amplitude = rd.uniform(1, 10.0)
K = get_K(train_x, length_scale, scale_mixture, amplitude, beta)
```
### Step3
Add the $x_*$ for the range of [-60, 60], and apply the formula derived above.
```python
test_x = np.arange(-60, 60, 0.5)
test_y = np.zeros((len(test_x), 1))
test_var = np.zeros((len(test_x), 1))
for i in range(len(test_x)):
k = get_k_test(test_x[i], train_x, length_scale, scale_mixture, amplitude)
test_y[i] = np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), train_y)
k_new = rq_kernel(test_x[i], test_x[i], length_scale, scale_mixture, amplitude) + 1/beta
test_var[i] = k_new - np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), k)
```
### Step4
Use `scipy.optimize.minimize` to find the optimized parameter and re-do the gaussian process.
> I found that if using random value of kernel parameters as its initial value, the result after optmizing might be bad for some extremly initial value.
> [name=Willy Chiu]
```python
#optimize the kernel parameters
def fun(x, args):
X, Y, beta = args
K = get_K(X, x[0], x[1], x[2], beta)
v = np.log(np.linalg.det(K))+np.matmul(np.matmul(np.transpose(train_y), np.linalg.inv(K)), train_y)
return v
args = [train_x, train_y, beta]
cons = ({'type': 'ineq', 'fun': lambda x: x[0] - 0.1},
{'type': 'ineq', 'fun': lambda x: x[1] - 0.1},
{'type': 'ineq', 'fun': lambda x: x[2] - 0.1})
x0 = np.array((length_scale, scale_mixture, amplitude))
res = minimize(fun, x0, args=[train_x, train_y, beta], method='SLSQP', constraints=cons)
length_scale, scale_mixture, amplitude = res.x
# re-try Gaussian Process with optimized parameter again
K = get_K(train_x, length_scale, scale_mixture, amplitude, beta)
test_x = np.arange(-60, 60, 0.5)
test_y = np.zeros((len(test_x), 1))
test_var = np.zeros((len(test_x), 1))
for i in range(len(test_x)):
k = get_k_test(test_x[i], train_x, length_scale, scale_mixture, amplitude)
test_y[i] = np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), train_y)
k_new = rq_kernel(test_x[i], test_x[i], length_scale, scale_mixture, amplitude) + 1/beta
test_var[i] = k_new - np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), k)
```
* Result:

### Referrence
* https://www.csie.ntu.edu.tw/~cjlin/mlgroup/tutorials/gpr.pdf
## Support Vector Machines
* Request:
1. Use different kernel and compare them
2. Use C-SVM and grid search for the best parameters
3. Use precomputed kernel
* linear + RBF
* `libsvm` is available
### Step1
Compute the user-defined kernel and convert the given training data in `libsvm` format, for example:
```python
#given training data
1,2,3,4,5
2,3,4,5,6
#given training label
5
1
#libsvm-format training data
5 1:1 2:2 3:3 4:4 5:5
1 1:2 2:3 3:4 4:5 5:6
#if using precomputed kernel (i,e., request(3)),
#libsvm-format training data for index i is as below:
label_i 0:i 1:k(xi, x1) 2:k(xi, x2) 3:k(xi,x3) ...
```
For the precomputed kernel case, `isKernel` must be set to `True`.
> Referrencing the `libsvm/svm.cpp` file, and use the default `RBF` kernel in `libsvm/svm.cpp` for constructing user-defined `linear_RBF kernel` in the same way.( i.e., Transform c-style code to python-style. )
> [name=Willy Chiu]
```python
def gen_libsvm_format_data(x_data, y_data, output, isKernel=False):
x_df = pd.read_csv(x_data, header=None, dtype=str)
y_df = pd.read_csv(y_data, header=None, dtype=str)
for i in range(len(x_df.columns)):
x_df[i] = str(i+1) + ':' + x_df[i].astype(str)
if (isKernel):
k_df = np.arange(1,len(y_df)+1)
k_df = pd.DataFrame(k_df)
k_df = '0:' + k_df[0].astype(str)
x_df = pd.concat([k_df, x_df], axis = 1)
format_df = pd.concat([y_df, x_df], axis = 1)
format_df.to_csv(output, sep=' ', index=False, header=None)
def linear_RBF_kernel(u, v, g):
linear_x = np.matmul(u, v.T)
rbf_x = np.sum(u**2, axis=1)[:,None] + np.sum(v**2, axis=1)[None,:] - 2*linear_x
rbf_x = np.abs(rbf_x) * (-g)
rbf_x = np.exp(rbf_x)
return linear_x + rbf_x
```
```python
#preparing the precomputed kernel
print('preparing the precomputed kernel...')
x_train = np.genfromtxt('dataset/X_train.csv', delimiter=',')
x_test = np.genfromtxt('dataset/X_test.csv', delimiter=',')
x_train_precomputed = linear_RBF_kernel(x_train, x_train, 0.03125)
x_test_precomputed = linear_RBF_kernel(x_test, x_test, 0.03125)
np.savetxt('dataset/X_train_precomputed.csv', x_train_precomputed, fmt='%f', delimiter=',')
np.savetxt('dataset/X_test_precomputed.csv', x_test_precomputed, fmt='%f', delimiter=',')
#convert to libsvm-format
print('converting to libsvm-format...')
gen_libsvm_format_data('dataset/X_train_precomputed.csv', 'dataset/Y_train.csv', 'train_precomputed.csv', isKernel=True)
gen_libsvm_format_data('dataset/X_test_precomputed.csv', 'dataset/Y_test.csv', 'test_precomputed.csv', isKernel=True)
gen_libsvm_format_data('dataset/X_train.csv', 'dataset/Y_train.csv', 'train.csv')
gen_libsvm_format_data('dataset/X_test.csv', 'dataset/Y_test.csv', 'test.csv')
#open training data
print('opening training data...')
y_train, x_train = svmutil.svm_read_problem('train.csv')
prob = svmutil.svm_problem(y_train, x_train)
y1, x_train_precomputed = svmutil.svm_read_problem('train_precomputed.csv')
prob_precomputed = svmutil.svm_problem(y1, x_train_precomputed, isKernel=True)
```
### Step2
Grid search for each kernel ( include the precomputed kernel ) and find the best parameters of them.
>`libsvm/tools/grid.py` provide the API to grid-search for `RBF` kernel. I tried to revise it into `./grid.py`, so that it could find the best parameters of kernels which could be taken in the `libsvm` ( except sigmoid `kernel` ).
>And the `./grid.py` take a log scale to the searching range such as `libsvm/tools/grid.py` did.
> [name=Willy Chiu]
```python
# grid search for the best parameters
print('grid searching...')
grid_search = {
'linear' : '-t 0 -log2c -5,15,2',
'polynomial' : '-t 1 -log2c -5,15,2 -log2g -5,15,2 -log2r -3,5,2 -d 4',
'RBF' : '-t 2 -log2c -5,15,2 -log2g -5,15,2',
}
grid_search_precomputed = {
'linear+RBF': '-t 4 -log2c -5,15,2'
}
best_param = []
kernel_tab = []
kernel = []
for kernel_type, opts in grid_search.items():
rst, tab, col = grid.find_parameters(prob, opts)
kernel.append(kernel_type)
kernel_tab.append(pd.DataFrame(tab, columns=col))
best_param.append(rst)
for kernel_type, opts in grid_search_precomputed.items():
rst, tab, col = grid.find_parameters(prob_precomputed, opts)
kernel.append(kernel_type)
kernel_tab.append(pd.DataFrame(tab, columns=col))
best_param.append(rst)
for i in range(len(kernel_tab)):
kernel_tab[i].to_csv('best param/'+kernel[i]+'.csv')
print(kernel[i] + ':', end=' ')
print(best_param[i])
```
* Output
```
...
[ 2821/ 2860] Cross Validation Accuracy = 96.34%
[ 2822/ 2860] Cross Validation Accuracy = 98.2%
[ 2823/ 2860] Cross Validation Accuracy = 97.7%
...
linear: [0.03125, 97.2]
polynomial: [0.125, 8192, 32, 2, 98.4]
RBF: [2048, 0.03125, 98.74]
linear+RBF: [0.03125, 97.06]
```
### Step3
According the parameters form `Step2`, doing prediction.
> For saving time, I store the model after the first training of a new options and use the best parameters from grid-search result.
> [name=Willy Chiu]
```python
'''
linear kernel: Best c=0.03125, rate=97.2%
polynomial kernel: Best c=0.125, g=8192, r=32, d=2, rate=98.4%
RBF kernel: Best c=2048, g=0.03125, rate=98.74%
linear+RBF: Best c=0.03125, rate=97.06
'''
# predict
print('training and predicting...')
options = {
'linear' : '-q -t 0 -c 0.03125',
'polynomial' : '-q -t 1 -c 0.0125 -g 8192 -r 32 -d 2',
'RBF' : '-q -t 2 -c 2048 -g 0.03125',
}
options_precomputed = {
'linear+RBF' : '-q -t 4 -c 0.03125',
}
m = {}
p_label = {}
p_acc = {}
p_val = {}
train_time = {}
test_time = {}
train_flag = True
y_test, x_test = svmutil.svm_read_problem('test.csv')
y1, x_test_precomputed = svmutil.svm_read_problem('test_precomputed.csv')
for kernel_type, opts in options.items():
print('\tkernel type: {0}\n\t'.format(kernel_type), end='')
tic()
if (train_flag):
m[kernel_type] = svmutil.svm_train(prob, opts)
svmutil.svm_save_model('model/'+kernel_type+'.model', m[kernel_type])
else:
m[kernel_type] = svmutil.svm_load_model('model/'+kernel_type+'.model')
train_time[kernel_type] = toc()
tic()
p_label[kernel_type], p_acc[kernel_type], p_val[kernel_type] = \
svmutil.svm_predict(y_test, x_test, m[kernel_type])
test_time[kernel_type] = toc()
print('\tresult(acc, mse, scc): {0}\n'.format(
p_acc[kernel_type]))
for kernel_type, opts in options_precomputed.items():
print('\tkernel type: {0}\n\t'.format(kernel_type), end='')
tic()
if (train_flag):
m[kernel_type] = svmutil.svm_train(prob_precomputed, opts)
svmutil.svm_save_model('model/'+kernel_type+'.model', m[kernel_type])
else:
m[kernel_type] = svmutil.svm_load_model('model/'+kernel_type+'.model')
train_time[kernel_type] = toc()
tic()
p_label[kernel_type], p_acc[kernel_type], p_val[kernel_type] = \
svmutil.svm_predict(y1, x_test_precomputed, m[kernel_type])
test_time[kernel_type] = toc()
print('\tresult(acc, mse, scc): {0}\n'.format(
p_acc[kernel_type]))
```
* Output
```
predicting...
kernel type: linear
Accuracy = 96% (2400/2500) (classification)
result(acc, mse, scc): (96.0, 0.1308, 0.9357489703153389)
kernel type: polynomial
Accuracy = 97.68% (2442/2500) (classification)
result(acc, mse, scc): (97.68, 0.0648, 0.9677862381298715)
kernel type: RBF
Accuracy = 98.52% (2463/2500) (classification)
result(acc, mse, scc): (98.52, 0.05, 0.9750879341085731)
kernel type: linear+RBF
Accuracy = 21.36% (534/2500) (classification)
result(acc, mse, scc): (21.36, 2.9296, 0.023906139154160982)
```
### Step4
Besides the ACC rate and MSE, I also record the training time, predict time and number of support vectors for each kernel.
```python
# compare result
print('visualizing...')
kernel = list(options.keys()) + list(options_precomputed.keys())
p1 = plt.bar(kernel, [test_time[i] for i in kernel], label='test_time', alpha=0.4)
p2 = plt.bar(kernel, [train_time[i] for i in kernel], label='train_time', alpha=0.4, bottom=[test_time[kk] for kk in kernel])
plt.ylabel('time (s)')
plt.xlabel('kernel type')
plt.legend((p1[0], p2[0]), ('test_time', 'train_time'))
plt.savefig('result/times.png')
plt.show()
fig, ytick1 = plt.subplots()
ytick2 = ytick1.twinx()
ytick2.bar(kernel,[m[i].get_nr_sv() for i in kernel], label='# SV', alpha=0.4)
ytick2.set_ylabel('number of SV')
ytick2.legend(loc='lower right')
ytick1.plot(kernel,[p_acc[i][0] for i in kernel], 'b', label='ACC')
ytick1.plot(kernel,[100*p_acc[i][1] for i in kernel], 'r', label='MSE*100')
ytick1.plot(kernel, [100 for i in kernel], 'gray', linestyle='--')
ytick1.set_ylabel('Accuracy rate (%)')
ytick1.legend(loc='upper left')
ytick1.set_xlabel('kernel type')
plt.savefig('result/results.png')
plt.show()
```
### Analysis
In general, we would get higher ACC rate when the value of `C` is smaller. This reason of this circustance is we would get the smaller `slack` when setting larger value of `C`, so that overfitting would be occurred.
* Best parameter of each kernel
* Linear
| c | options | rate |
| ------- | ----------------------- | ----- |
| 0.03125 | -q -t 0 -v 5 -c 0.03125 | 97.2 |
| 0.125 | -q -t 0 -v 5 -c 0.125 | 96.98 |
| 0.5 | -q -t 0 -v 5 -c 0.5 | 96.22 |
| 2.0 | -q -t 0 -v 5 -c 2.0 | 96.28 |
| 8.0 | -q -t 0 -v 5 -c 8.0 | 96.32 |
| 32.0 | -q -t 0 -v 5 -c 32.0 | 96.2 |
| 128.0 | -q -t 0 -v 5 -c 128.0 | 96.34 |
| 512.0 | -q -t 0 -v 5 -c 512.0 | 96.34 |
| 2048.0 | -q -t 0 -v 5 -c 2048.0 | 96.16 |
| 8192.0 | -q -t 0 -v 5 -c 8192.0 | 96.16 |
| 32768.0 | -q -t 0 -v 5 -c 32768.0 | 96.48 |
* Polynomial
| c | g | r | d | options | rate |
| ------- | ------- | ---- | --- | --------------------------------------------- | ----- |
| 0.125 | 8192.0 | 32.0 | 2 | -q -t 1 -v 5 -c 0.125 -g 8192.0 -r 32.0 -d 2 | 98.4 |
| 0.125 | 512.0 | 32.0 | 2 | -q -t 1 -v 5 -c 0.125 -g 512.0 -r 32.0 -d 2 | 98.32 |
| 32768.0 | 2.0 | 2.0 | 2 | -q -t 1 -v 5 -c 32768.0 -g 2.0 -r 2.0 -d 2 | 98.32 |
|...|
|0.03125|0.03125|8.0|1|-q -t 1 -v 5 -c 0.03125 -g 0.03125 -r 8.0 -d 1|95.44
|0.03125|0.03125|32.0|1|-q -t 1 -v 5 -c 0.03125 -g 0.03125 -r 32.0 -d 1|95.40
|0.03125|0.03125|0.5|1|-q -t 1 -v 5 -c 0.03125 -g 0.03125 -r 0.5 -d 1|95.36
* RBF
| c | g | options | rate |
| ------ | ------- | --------------------------------- | ----------------- |
| 2048.0 | 0.03125 | -q -t 2 -v 5 -c 2048.0 -g 0.03125 | 98.74 |
| 2.0 | 0.03125 | -q -t 2 -v 5 -c 2.0 -g 0.03125 | 98.62 |
| 8.0 | 0.03125 | -q -t 2 -v 5 -c 8.0 -g 0.03125 | 98.6 |
|...|
|32768.0|2048.0|-q -t 2 -v 5 -c 32768.0 -g 2048.0|20.0
|32768.0|8192.0|-q -t 2 -v 5 -c 32768.0 -g 8192.0|20.0
|32768.0|32768.0|-q -t 2 -v 5 -c 32768.0 -g 32768.0|20.0
Kernel function: $e^{-\gamma\|x, x_* \|^2}$
According to the experiment result, when the $\gamma$ becoming larger, the `ACC rate` is much better than smaller $\gamma$.
> Because the RBF kernel's formula is similar to Gaussion distribution ( i.e., they are in direct proportion ), and the $\gamma$ is equal to $\frac{1}{2\sigma^2}$, when $\gamma$ is larger hints that $\sigma^2$ is smaller. And when $\sigma^2$ is smaller hints that the PDF of Gaussion distribution is sharper, so the overfitting may occurred.
> [name=Willy Chiu]
* Linear + RBF
| c | options | rate |
| ------- | ----------------------- | ----- |
| 0.03125 | -q -t 4 -v 5 -c 0.03125 | 97.18 |
| 0.125 | -q -t 4 -v 5 -c 0.125 | 97.0 |
| 32768.0 | -q -t 4 -v 5 -c 32768.0 | 96.66 |
| 8.0 | -q -t 4 -v 5 -c 8.0 | 96.64 |
| 128.0 | -q -t 4 -v 5 -c 128.0 | 96.64 |
| 512.0 | -q -t 4 -v 5 -c 512.0 | 96.62 |
| 2.0 | -q -t 4 -v 5 -c 2.0 | 96.56 |
| 8192.0 | -q -t 4 -v 5 -c 8192.0 | 96.5 |
| 0.5 | -q -t 4 -v 5 -c 0.5 | 96.46 |
| 32.0 | -q -t 4 -v 5 -c 32.0 | 96.42 |
| 2048.0 | -q -t 4 -v 5 -c 2048.0 | 96.42 |
>Linear+RBF kernel seems to work well while doing cross validation, but i got such a trashlike result while predicting.
>At the first, I thought it might be wrong in the precomputed format and I tried to replace the default RBF kernel function in `libsvm` into linear+RBF function and re-run th e RBF kernel ( It is linear+RBF kernel now ) for checking whether the predicting result is bad or not.
>But it still bad..., I don't even know why. I guess the reason is that RBF is none-linear, so when it combine to a linear kernel, the kernel funciton works badly.
>[name=Willy Chiu]
* Executing result of each kernel
* RBF's trainging and predicting are the most time consuming. But it has the best ACC rate.


### Referrence
* [https://www.csie.ntu.edu.tw/~cjlin/libsvm/](https://www.csie.ntu.edu.tw/~cjlin/libsvm/)
* [https://github.com/cjlin1/libsvm](https://github.com/cjlin1/libsvm)