2020 Machine Learning Homework 5

School: National Chiao Tung University
Name: Shao-Wei ( Willy ) Chiu

tags: `NCTU`, `Machine Learning`, `Gaussian Process`, `SVM`, `GitHub`

Gaussian Process

Training data:
- $[\begin{matrix} x_{1} & y_{1} \\ . & . \\ . & . \\ . & . \\ x_{n} & y_{n} \end{matrix}] = [\begin{matrix} X & Y \end{matrix}]$
Kernel function:
- $k (x_{n}, x_{m}) = σ^{2} {(1 + \frac{{‖ x_{n} - x_{m} ‖}^{2}}{2 α ℓ^{2}})}^{- α}$

There is a function

f

could transfer each

x_{i}

into coressponding

y_{i}

( i.e.,

f (x_{i}) = y_{i}

).
Assume that

y_{i} = f (x_{i}) + ϵ, w h e r e ϵ \sim N (0, β)

and

f \sim N (0, K_{n})

.
(i.e.,

Y \sim N (f, β)

)
On estimate the

x_{*}

point, we have formula

[\begin{matrix} Y \\ y_{*} \end{matrix}] \sim N ([\begin{matrix} Y \\ y_{*} \end{matrix}] | 0, K_{n + 1})

After the derivation of probability, we get the

$μ (x_{*}) = k (x, x_{*})^{T} (K_{n} + β I)^{- 1} Y$
$c o v (x_{*}) = k (x_{*}, x_{*}) - k (x, x_{*})^{T} (K_{n} + β I)^{- 1} k (x, x_{*})$

This form is almost the same as the formula mentioned by Prof. Chiu in the class.
The tiny difference is that it take the
$β$ out of the matrix
$K$ , but the course silde takes it into the matrix.
Willy Chiu

Thus, we could apply the

x_{*}

to describe our model.

We notice that the kernel method is decided by some kernel parameters ( e.g.,

σ

α

ℓ

), so we need to find the parameters which could have the maximum likelihood.

In my practice, I choose the random value of all parameter between 0 and 10, and call the scipy.optimize.minimize to optimize it.
The relative formula is shown below,

a r g m a x (l n p (y | θ)) = - \frac{1}{2} l n | C_{θ} | - \frac{1}{2} y^{T} C_{θ}^{- 1} - \frac{N}{2} l n (2 π)

\propto - l n | C_{θ} | - y^{T} C_{θ}^{- 1} y

= a r g m i n (l n | C_{θ} | + y^{T} C_{θ}^{- 1} y)

Step1

Use get_K to compute the covarince matrix, and get_k_test use to compute the

k (x, x_{*})

matrix.

def rq_kernel(xn, xm, length_scale, scale_mixture, amplitude):
    delta = abs(xn-xm)
    return amplitude * (1 + delta**2/(2*scale_mixture*(length_scale**2)))**(-scale_mixture)

def get_K(X, length_scale, scale_mixture, amplitude, beta):
    n = len(X)
    K = np.zeros((n, n), dtype=np.float32)
    for i in range(0, n):
        for j in range(0, n):
            K[i, j] = rq_kernel(X[i], X[j], length_scale, scale_mixture, amplitude)
            if i==j:
                K[i, j] += 1/beta
    return K

def get_k_test(test, train, length_scale, scale_mixture, amplitude):
    n = len(train)
    K = np.zeros((n, 1), dtype=np.float32)
    for i in range(0, n):
        K[i, 0] = rq_kernel(train[i], test, length_scale, scale_mixture, amplitude)
    return K

Step2

Initial the value of parameters.

#given assumption
beta = 5.0

#initial kernel parameter
length_scale = rd.uniform(1, 10.0)
scale_mixture = rd.uniform(1, 10.0)
amplitude = rd.uniform(1, 10.0)
K = get_K(train_x, length_scale, scale_mixture, amplitude, beta)

Step3

Add the

x_{*}

for the range of [-60, 60], and apply the formula derived above.

test_x = np.arange(-60, 60, 0.5)

test_y = np.zeros((len(test_x), 1))
test_var = np.zeros((len(test_x), 1))
for i in range(len(test_x)):
    k = get_k_test(test_x[i], train_x, length_scale, scale_mixture, amplitude)
    test_y[i] = np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), train_y)
    k_new = rq_kernel(test_x[i], test_x[i], length_scale, scale_mixture, amplitude) + 1/beta
    test_var[i] = k_new - np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), k)

Step4

Use scipy.optimize.minimize to find the optimized parameter and re-do the gaussian process.

I found that if using random value of kernel parameters as its initial value, the result after optmizing might be bad for some extremly initial value.
Willy Chiu

#optimize the kernel parameters
def fun(x, args):
    X, Y, beta = args
    K = get_K(X, x[0], x[1], x[2], beta)
    v = np.log(np.linalg.det(K))+np.matmul(np.matmul(np.transpose(train_y), np.linalg.inv(K)), train_y)
    return v

args = [train_x, train_y, beta]
cons = ({'type': 'ineq', 'fun': lambda x: x[0] - 0.1},
        {'type': 'ineq', 'fun': lambda x: x[1] - 0.1},
        {'type': 'ineq', 'fun': lambda x: x[2] - 0.1})

x0 = np.array((length_scale, scale_mixture, amplitude))
res = minimize(fun, x0, args=[train_x, train_y, beta], method='SLSQP', constraints=cons)
length_scale, scale_mixture, amplitude = res.x

# re-try Gaussian Process with optimized parameter again
K = get_K(train_x, length_scale, scale_mixture, amplitude, beta)
test_x = np.arange(-60, 60, 0.5)

test_y = np.zeros((len(test_x), 1))
test_var = np.zeros((len(test_x), 1))
for i in range(len(test_x)):
    k = get_k_test(test_x[i], train_x, length_scale, scale_mixture, amplitude)
    test_y[i] = np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), train_y)
    k_new = rq_kernel(test_x[i], test_x[i], length_scale, scale_mixture, amplitude) + 1/beta
    test_var[i] = k_new - np.matmul(np.matmul(np.transpose(k), np.linalg.inv(K)), k)

Result:
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Referrence

https://www.csie.ntu.edu.tw/~cjlin/mlgroup/tutorials/gpr.pdf

Support Vector Machines

Request:
1. Use different kernel and compare them
2. Use C-SVM and grid search for the best parameters
3. Use precomputed kernel
  - linear + RBF
libsvm is available

Step1

Compute the user-defined kernel and convert the given training data in libsvm format, for example:

#given training data
1,2,3,4,5
2,3,4,5,6

#given training label
5
1

#libsvm-format training data
5 1:1 2:2 3:3 4:4 5:5
1 1:2 2:3 3:4 4:5 5:6

#if using precomputed kernel (i,e., request(3)), 
#libsvm-format training data for index i is as below:
label_i 0:i 1:k(xi, x1) 2:k(xi, x2) 3:k(xi,x3) ...

For the precomputed kernel case, isKernel must be set to True.

Referrencing the libsvm/svm.cpp file, and use the default RBF kernel in libsvm/svm.cpp for constructing user-defined linear_RBF kernel in the same way.( i.e., Transform c-style code to python-style. )
Willy Chiu

def gen_libsvm_format_data(x_data, y_data, output, isKernel=False):
    x_df = pd.read_csv(x_data, header=None, dtype=str)
    y_df = pd.read_csv(y_data, header=None, dtype=str)

    for i in range(len(x_df.columns)):
        x_df[i] = str(i+1) + ':' + x_df[i].astype(str)

    if (isKernel):
        k_df = np.arange(1,len(y_df)+1)
        k_df = pd.DataFrame(k_df)
        k_df = '0:' + k_df[0].astype(str)
        x_df = pd.concat([k_df, x_df], axis = 1)

    format_df = pd.concat([y_df, x_df], axis = 1)
    format_df.to_csv(output, sep=' ', index=False, header=None)

def linear_RBF_kernel(u, v, g):
    linear_x = np.matmul(u, v.T)
    rbf_x = np.sum(u**2, axis=1)[:,None] + np.sum(v**2, axis=1)[None,:] - 2*linear_x
    rbf_x = np.abs(rbf_x) * (-g)
    rbf_x = np.exp(rbf_x)
    return linear_x + rbf_x

#preparing the precomputed kernel
print('preparing the precomputed kernel...')
x_train = np.genfromtxt('dataset/X_train.csv', delimiter=',')
x_test = np.genfromtxt('dataset/X_test.csv', delimiter=',')
x_train_precomputed = linear_RBF_kernel(x_train, x_train, 0.03125)
x_test_precomputed = linear_RBF_kernel(x_test, x_test, 0.03125)
np.savetxt('dataset/X_train_precomputed.csv', x_train_precomputed, fmt='%f', delimiter=',')
np.savetxt('dataset/X_test_precomputed.csv', x_test_precomputed, fmt='%f', delimiter=',')

#convert to libsvm-format
print('converting to libsvm-format...')
gen_libsvm_format_data('dataset/X_train_precomputed.csv', 'dataset/Y_train.csv', 'train_precomputed.csv', isKernel=True)
gen_libsvm_format_data('dataset/X_test_precomputed.csv', 'dataset/Y_test.csv', 'test_precomputed.csv', isKernel=True)
gen_libsvm_format_data('dataset/X_train.csv', 'dataset/Y_train.csv', 'train.csv')
gen_libsvm_format_data('dataset/X_test.csv', 'dataset/Y_test.csv', 'test.csv')

#open training data
print('opening training data...')
y_train, x_train = svmutil.svm_read_problem('train.csv')
prob = svmutil.svm_problem(y_train, x_train)
y1, x_train_precomputed = svmutil.svm_read_problem('train_precomputed.csv')
prob_precomputed = svmutil.svm_problem(y1, x_train_precomputed, isKernel=True)

Step2

Grid search for each kernel ( include the precomputed kernel ) and find the best parameters of them.

libsvm/tools/grid.py provide the API to grid-search for RBF kernel. I tried to revise it into ./grid.py, so that it could find the best parameters of kernels which could be taken in the libsvm ( except sigmoid kernel ).
And the ./grid.py take a log scale to the searching range such as libsvm/tools/grid.py did.
Willy Chiu

# grid search for the best parameters
print('grid searching...')
grid_search = {
        'linear' : '-t 0 -log2c -5,15,2',
        'polynomial' : '-t 1 -log2c -5,15,2 -log2g -5,15,2 -log2r -3,5,2 -d 4',
        'RBF' : '-t 2 -log2c -5,15,2 -log2g -5,15,2',
        }
grid_search_precomputed = {
        'linear+RBF': '-t 4 -log2c -5,15,2'
        }
best_param = []
kernel_tab = []
kernel = []

for kernel_type, opts in grid_search.items():
    rst, tab, col = grid.find_parameters(prob, opts)
    kernel.append(kernel_type)
    kernel_tab.append(pd.DataFrame(tab, columns=col))
    best_param.append(rst)

for kernel_type, opts in grid_search_precomputed.items():
    rst, tab, col = grid.find_parameters(prob_precomputed, opts)
    kernel.append(kernel_type)
    kernel_tab.append(pd.DataFrame(tab, columns=col))
    best_param.append(rst)

for i in range(len(kernel_tab)):
    kernel_tab[i].to_csv('best param/'+kernel[i]+'.csv')
    print(kernel[i] + ':', end=' ')
    print(best_param[i])

Output

...
[ 2821/ 2860] Cross Validation Accuracy = 96.34%
[ 2822/ 2860] Cross Validation Accuracy = 98.2%
[ 2823/ 2860] Cross Validation Accuracy = 97.7%
...
linear: [0.03125, 97.2]
polynomial: [0.125, 8192, 32, 2, 98.4]
RBF: [2048, 0.03125, 98.74]
linear+RBF: [0.03125, 97.06]

Step3

According the parameters form Step2, doing prediction.

For saving time, I store the model after the first training of a new options and use the best parameters from grid-search result.
Willy Chiu


'''
linear kernel:      Best c=0.03125, rate=97.2%
polynomial kernel:  Best c=0.125, g=8192, r=32, d=2, rate=98.4%
RBF kernel:         Best c=2048, g=0.03125, rate=98.74%
linear+RBF:         Best c=0.03125, rate=97.06
'''
# predict
print('training and predicting...')
options = {
    'linear' : '-q -t 0 -c 0.03125',
    'polynomial' : '-q -t 1 -c 0.0125 -g 8192 -r 32 -d 2',
    'RBF' : '-q -t 2 -c 2048 -g 0.03125',
}
options_precomputed = {
    'linear+RBF' : '-q -t 4 -c 0.03125',
}

m = {}
p_label = {}
p_acc = {}
p_val = {}
train_time = {}
test_time = {}
train_flag = True

y_test, x_test = svmutil.svm_read_problem('test.csv')
y1, x_test_precomputed = svmutil.svm_read_problem('test_precomputed.csv')

for kernel_type, opts in options.items():
    print('\tkernel type: {0}\n\t'.format(kernel_type), end='')
    tic()
    if (train_flag):
        m[kernel_type] = svmutil.svm_train(prob, opts)
        svmutil.svm_save_model('model/'+kernel_type+'.model', m[kernel_type])
    else:
        m[kernel_type] = svmutil.svm_load_model('model/'+kernel_type+'.model')
    train_time[kernel_type] = toc()
    tic()
    p_label[kernel_type], p_acc[kernel_type], p_val[kernel_type] = \
    svmutil.svm_predict(y_test, x_test, m[kernel_type])
    test_time[kernel_type] = toc()
    print('\tresult(acc, mse, scc): {0}\n'.format(
        p_acc[kernel_type]))

for kernel_type, opts in options_precomputed.items():
    print('\tkernel type: {0}\n\t'.format(kernel_type), end='')
    tic()
    if (train_flag):
        m[kernel_type] = svmutil.svm_train(prob_precomputed, opts)
        svmutil.svm_save_model('model/'+kernel_type+'.model', m[kernel_type])
    else:
        m[kernel_type] = svmutil.svm_load_model('model/'+kernel_type+'.model')
    train_time[kernel_type] = toc()
    tic()
    p_label[kernel_type], p_acc[kernel_type], p_val[kernel_type] = \
    svmutil.svm_predict(y1, x_test_precomputed, m[kernel_type])
    test_time[kernel_type] = toc()
    print('\tresult(acc, mse, scc): {0}\n'.format(
        p_acc[kernel_type]))

Output

predicting...
    kernel type: linear
    Accuracy = 96% (2400/2500) (classification)
    result(acc, mse, scc): (96.0, 0.1308, 0.9357489703153389)

    kernel type: polynomial
    Accuracy = 97.68% (2442/2500) (classification)
    result(acc, mse, scc): (97.68, 0.0648, 0.9677862381298715)

    kernel type: RBF
    Accuracy = 98.52% (2463/2500) (classification)
    result(acc, mse, scc): (98.52, 0.05, 0.9750879341085731)

    kernel type: linear+RBF
    Accuracy = 21.36% (534/2500) (classification)
    result(acc, mse, scc): (21.36, 2.9296, 0.023906139154160982)

Step4

Besides the ACC rate and MSE, I also record the training time, predict time and number of support vectors for each kernel.

# compare result
print('visualizing...')
kernel = list(options.keys()) + list(options_precomputed.keys()) 
p1 = plt.bar(kernel, [test_time[i] for i in kernel], label='test_time', alpha=0.4)
p2 = plt.bar(kernel, [train_time[i] for i in kernel], label='train_time', alpha=0.4, bottom=[test_time[kk] for kk in kernel])
plt.ylabel('time (s)')
plt.xlabel('kernel type')
plt.legend((p1[0], p2[0]), ('test_time', 'train_time'))
plt.savefig('result/times.png')
plt.show()

fig, ytick1 = plt.subplots()
ytick2 = ytick1.twinx()

ytick2.bar(kernel,[m[i].get_nr_sv() for i in kernel], label='# SV', alpha=0.4)
ytick2.set_ylabel('number of SV')
ytick2.legend(loc='lower right')

ytick1.plot(kernel,[p_acc[i][0] for i in kernel], 'b', label='ACC')
ytick1.plot(kernel,[100*p_acc[i][1] for i in kernel], 'r', label='MSE*100')
ytick1.plot(kernel, [100 for i in kernel], 'gray', linestyle='--')
ytick1.set_ylabel('Accuracy rate (%)')
ytick1.legend(loc='upper left')
ytick1.set_xlabel('kernel type')
plt.savefig('result/results.png')
plt.show()

Analysis

In general, we would get higher ACC rate when the value of C is smaller. This reason of this circustance is we would get the smaller slack when setting larger value of C, so that overfitting would be occurred.

Best parameter of each kernel

Linear

c	options	rate
0.03125	-q -t 0 -v 5 -c 0.03125	97.2
0.125	-q -t 0 -v 5 -c 0.125	96.98
0.5	-q -t 0 -v 5 -c 0.5	96.22
2.0	-q -t 0 -v 5 -c 2.0	96.28
8.0	-q -t 0 -v 5 -c 8.0	96.32
32.0	-q -t 0 -v 5 -c 32.0	96.2
128.0	-q -t 0 -v 5 -c 128.0	96.34
512.0	-q -t 0 -v 5 -c 512.0	96.34
2048.0	-q -t 0 -v 5 -c 2048.0	96.16
8192.0	-q -t 0 -v 5 -c 8192.0	96.16
32768.0	-q -t 0 -v 5 -c 32768.0	96.48

Polynomial

c	g	r	d	options	rate
0.125	8192.0	32.0	2	-q -t 1 -v 5 -c 0.125 -g 8192.0 -r 32.0 -d 2	98.4
0.125	512.0	32.0	2	-q -t 1 -v 5 -c 0.125 -g 512.0 -r 32.0 -d 2	98.32
32768.0	2.0	2.0	2	-q -t 1 -v 5 -c 32768.0 -g 2.0 -r 2.0 -d 2	98.32
…
0.03125	0.03125	8.0	1	-q -t 1 -v 5 -c 0.03125 -g 0.03125 -r 8.0 -d 1	95.44
0.03125	0.03125	32.0	1	-q -t 1 -v 5 -c 0.03125 -g 0.03125 -r 32.0 -d 1	95.40
0.03125	0.03125	0.5	1	-q -t 1 -v 5 -c 0.03125 -g 0.03125 -r 0.5 -d 1	95.36

RBF

c	g	options	rate
2048.0	0.03125	-q -t 2 -v 5 -c 2048.0 -g 0.03125	98.74
2.0	0.03125	-q -t 2 -v 5 -c 2.0 -g 0.03125	98.62
8.0	0.03125	-q -t 2 -v 5 -c 8.0 -g 0.03125	98.6
…
32768.0	2048.0	-q -t 2 -v 5 -c 32768.0 -g 2048.0	20.0
32768.0	8192.0	-q -t 2 -v 5 -c 32768.0 -g 8192.0	20.0
32768.0	32768.0	-q -t 2 -v 5 -c 32768.0 -g 32768.0	20.0

Kernel function:

e^{- γ ‖ x, x_{*} ‖^{2}}

According to the experiment result, when the

γ

becoming larger, the ACC rate is much better than smaller

γ

Because the RBF kernel's formula is similar to Gaussion distribution ( i.e., they are in direct proportion ), and the
$γ$ is equal to
$\frac{1}{2 σ^{2}}$ , when
$γ$ is larger hints that
$σ^{2}$ is smaller. And when
$σ^{2}$ is smaller hints that the PDF of Gaussion distribution is sharper, so the overfitting may occurred.
Willy Chiu

Linear + RBF

c	options	rate
0.03125	-q -t 4 -v 5 -c 0.03125	97.18
0.125	-q -t 4 -v 5 -c 0.125	97.0
32768.0	-q -t 4 -v 5 -c 32768.0	96.66
8.0	-q -t 4 -v 5 -c 8.0	96.64
128.0	-q -t 4 -v 5 -c 128.0	96.64
512.0	-q -t 4 -v 5 -c 512.0	96.62
2.0	-q -t 4 -v 5 -c 2.0	96.56
8192.0	-q -t 4 -v 5 -c 8192.0	96.5
0.5	-q -t 4 -v 5 -c 0.5	96.46
32.0	-q -t 4 -v 5 -c 32.0	96.42
2048.0	-q -t 4 -v 5 -c 2048.0	96.42

Linear+RBF kernel seems to work well while doing cross validation, but i got such a trashlike result while predicting.

At the first, I thought it might be wrong in the precomputed format and I tried to replace the default RBF kernel function in libsvm into linear+RBF function and re-run th e RBF kernel ( It is linear+RBF kernel now ) for checking whether the predicting result is bad or not.

But it still bad…, I don't even know why. I guess the reason is that RBF is none-linear, so when it combine to a linear kernel, the kernel funciton works badly.
Willy Chiu

Executing result of each kernel
- RBF's trainging and predicting are the most time consuming. But it has the best ACC rate.

2020 Machine Learning Homework 5

tags: NCTU, Machine Learning, Gaussian Process, SVM, GitHub

Gaussian Process

Step1

Step2

Step3

Step4

Referrence

Support Vector Machines

Step1

Step2

Step3

Step4

Analysis

Referrence

Read more

NCTU Machine Learning

LINUX 核心設計 課程用書筆記

Play in a Sandbox!?

2020 Machine Learning Homework 7

tags: `NCTU`, `Machine Learning`, `Gaussian Process`, `SVM`, `GitHub`

LINUX 核心設計課程用書筆記