---
title: Project12
tags: teach:MF
---
# **ML & Fintech: Project by 陳姸榛**
#### Keywords: trading, machine learning, cryptocurrency
:::info
Final report for the Machine Learning and Fintech course, Department of Information Management and Finance, NYCU. Jointly edited by 陳姸榛, 李亦涵, and 鄧惠文. Last updated 2021/12/26.
:::
---
## **I. Motivations**
Study the relationship between ETH and other assets that may influence cryptocurrency prices.

---
## II. Data visualization
#### **Data description**
| Variables | Column   | Descriptions                        |            Type             |
| --------- | -------- | ----------------------------------- |:---------------------------:|
| $y$       | ETH-USD  | Price of Ethereum (USD per ETH)     | Continuous Time Series Data |
| $x_1$     | BTC-USD  | Price of Bitcoin (USD per BTC)      | Continuous Time Series Data |
| $x_2$     | LTC-USD  | Price of Litecoin (USD per LTC)     | Continuous Time Series Data |
| $x_3$     | USDT-USD | Price of Tether (USD per USDT)      | Continuous Time Series Data |
| $x_4$     | BCH-USD  | Price of Bitcoin Cash (USD per BCH) | Continuous Time Series Data |
---
#### **Descriptive Statistics**
```
data_close.describe()
```
| | BCH-USD | BTC-USD | ETH-USD | LTC-USD | USDT-USD |
| ----- | ----------- | ------------ | ----------- |:-----------:|:-----------:|
| count | 1452.000000 | 1452.000000 | 1452.000000 | 1452.000000 | 1452.000000 |
| mean | 547.122219 | 16591.015222 | 781.672215 | 102.915659 | 1.002036 |
| min | 77.365776 | 3236.761719 | 84.308296 | 23.464331 | 0.966644 |
| 25% | 243.593655 | 7036.397583 | 186.544605 | 51.583675 | 0.999999 |
| 50% | 401.534653 | 9339.627441 | 332.875504 | 76.074245 | 1.000922 |
| 75% | 640.060410 | 16425.549316 | 816.327744 | 144.799469 | 1.003481 |
| max | 3923.070068 | 65992.835938 | 4414.746582 | 386.450775 | 1.077880 |
---
#### **Original data plot**
- Line plot for the data

- Pairwise scatter plots: ETH-USD appears correlated with BTC-USD and LTC-USD.

- The data appears to be skewed to the right, so we apply a log transformation.
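A minimal sketch of the log transformation, with a synthetic frame standing in for the real `data_close`:

```python
import numpy as np
import pandas as pd

# synthetic prices standing in for the real data_close frame
data_close = pd.DataFrame({"ETH-USD": [100.0, 110.0, 121.0]})
data_log = np.log(data_close)  # element-wise natural log compresses the right tail
```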
---
#### **Log data plot**

- After log-transformation, the data seems to be more symmetrically distributed.

- The heat map of correlations
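The correlation matrix behind the heat map can be computed with `DataFrame.corr`; a sketch with two synthetic columns (the report uses the five log price series):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
eth = rng.normal(size=100).cumsum()     # synthetic log-price path
btc = eth + 0.1 * rng.normal(size=100)  # strongly correlated stand-in
frame = pd.DataFrame({"ETH-USD": eth, "BTC-USD": btc})
corr = frame.corr()                     # matrix fed to the heat map plot
```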

---
## **III. Problem formulation and methods**
#### **Problem formulation**
We would like to build a model to predict $y$ from $x=(x_1,x_2,x_3,x_4)$.
[*(See here if not understanding the variables)*](https://hackmd.io/JeqjS1aBQFmfKqKcrE__Dg?view#Data-description)
### **Benchmark method**
The benchmark model is the *multi-linear regression*:$$y = \alpha + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4$$
$\space$
$\space$
#### **In-sample and Out-of-sample Analysis**
- Training data : test data = 75% : 25%
- Shuffle the data randomly *(random seed = 514)*
```
from sklearn.model_selection import train_test_split
seed = 514
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state = seed)
```
---
### **Pre-process of the data**
Since the data is ***time-series data***, we first test the ***stationarity*** of the log data.
$\space$
**Original data**
```
for col in data_clean1.columns:
    adf, pvalue1, critical_values = adf_val(data_clean1[col], str(col) + ' time series', str(col) + ' acf', str(col) + ' pacf', col)
    pvalue = acorr_val(data_clean1[col])
```
##### *ETH/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:-------------------:| ------------------ |:--------------------------------------------------------------------------------:|
| 0.05354943779534627 | 0.9627428163077016 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |
---
##### *BTC/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:--------------------:| ------------------ |:-------------------------------------------------------------------------------:|
| -0.23050704222161775 | 0.9347862138150373 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |
---
##### *LTC/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:-------------------:|:-------------------:|:--------------------------------------------------------------------------------:|
| -1.7958067828078763 | 0.38254548790064546 | '1%': -3.4348835326305642, '5%': -2.863542248636555, '10%': -2.5678359819686065 |
---
##### *USDT/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:------------------:| --------------------- |:------------------------------------------------------------------------------:|
| -4.772480038610099 | 6.137652242070711e-05 | '1%': -3.434911997169608, '5%': -2.863554810504947, '10%': -2.567842671398422 |
---
##### *BCH/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:------------------:|:------------------:|:--------------------------------------------------------------------------------:|
| -2.308429376405863 | 0.1692637409234664 | '1%': -3.4348772553489617, '5%': -2.8635394783531085, '10%': -2.5678345067434516 |
---
#### ***We can see that the data is non-stationary, so we take first differences of the data.***
---
$\space$
**Difference data**
```
import numpy as np

for col in data_clean1.columns:
    if col != 'USDT-USD':
        d = np.diff(data_clean1[col])
        adf, pvalue1, critical_values = adf_val(d, str(col) + ' time series', str(col) + ' acf', str(col) + ' pacf', col)
        pvalue = acorr_val(d)
```
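Note that `np.diff` returns one fewer observation than its input, so the differenced series must be re-aligned before regression; a quick check:

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])
diffs = np.diff(prices)  # first differences x_t - x_{t-1} -> [2., -1., 4.]
```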
##### *ETH/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:-------------------:|:----------------------:|:-------------------------------------------------------------------------------:|
| -11.307999067734775 | 1.2592745961913662e-20 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |
---
##### *BTC/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:-------------------:| --------------------- |:-------------------------------------------------------------------------------:|
| -10.937074646165145 | 9.500282250714303e-20 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |
---
##### *LTC/USD*
*score*
| adf (t-score) | pvalue | critical_values |
|:-------------------:| ---------------------- |:-------------------------------------------------------------------------------:|
| -14.285460468186528 | 1.3022363376294658e-26 | '1%': -3.4348835326305642, '5%': -2.863542248636555, '10%': -2.5678359819686065 |
---
##### *USDT/USD*
###### ***Since USDT/USD is already stationary, we do not difference it.***
---
##### *BCH/USD*

*score*
| adf (t-score) | pvalue | critical_values |
|:-------------------:| ---------------------- |:--------------------------------------------------------------------------------:|
| -18.118718253775377 | 2.5252040988004006e-30 | '1%': -3.4348772553489617, '5%': -2.8635394783531085, '10%': -2.5678345067434516 |
---
#### ***The data is stationary after differencing.***
After pre-processing the data, we start building the models.
---
### **Analysis**
##### ***Stochastic Gradient Descent Regression (SGD)***
Randomly chooses one sample at a time to compute its gradient and updates the parameters until a minimum is found.
> It may not find the global minimum because of the noise.

The outputs of the regression are the weights, which measure how much each feature influences $y$.
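As an illustration only (not the report's code), one run of SGD for a linear model with squared error, using synthetic data and hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
true_w = np.array([0.7, 0.14, 0.0, -0.24])     # hypothetical weights
y = X @ true_w + 0.01 * rng.normal(size=200)   # noisy target

w = np.zeros(4)
lr = 0.01                                      # learning rate
for epoch in range(50):
    for i in rng.permutation(len(X)):          # one randomly chosen sample per update
        grad = 2.0 * (X[i] @ w - y[i]) * X[i]  # gradient of the squared error
        w -= lr * grad
```

After training, `w` ends up near `true_w`; the residual jitter around the minimum reflects the single-sample noise mentioned above.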
##### ***Support Vector Regression (SVR)***
Uses the same principle as Support Vector Machines: data points that fall inside the ε-tube contribute no loss.
> C is the penalty parameter and gamma is the kernel coefficient.

Radial basis function (RBF) kernel: $$k(x_i, x_j) = \exp(-\gamma\|x_i-x_j\|^{2})$$
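The kernel value can be evaluated directly; it equals 1 for identical inputs and decays toward 0 as the points move apart:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=1.0):
    # k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    diff = np.asarray(xi) - np.asarray(xj)
    return float(np.exp(-gamma * np.dot(diff, diff)))
```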
---
### **Results**
#### **In-sample results**
*Formula :*$$y = -5.75379325 + 0.92423158\times x_1 + 0.24077037\times x_2 - 0.34743644\times x_3 + 0.34650823\times x_4$$
```
from sklearn import linear_model
import sklearn.metrics as sm

regr1 = linear_model.LinearRegression()
regr1.fit(X, Y)
y_pred1 = regr1.predict(X)
MSE = sm.mean_squared_error(ss_y.inverse_transform(Y), ss_y.inverse_transform(y_pred1))
r2 = sm.r2_score(Y, y_pred1)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y), ss_y.inverse_transform(y_pred1))
```
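The scaler `ss_y` used above is not defined in this excerpt; it is presumably a fitted `StandardScaler`, along the lines of the following sketch (the names `X_raw` and `y_raw` are assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4))  # stand-in for the real feature matrix
y_raw = rng.normal(size=(100, 1))  # stand-in for the real target

ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X_raw)      # zero-mean, unit-variance features
Y = ss_y.fit_transform(y_raw)      # scaled target
```

`ss_y.inverse_transform` then maps predictions back to the original scale so that MSE and MAE are reported in price units, matching the code above.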
| Measures | Multi-Linear Regression |
|:--------------------:| ----------------------- |
| Mean Squared Error | 0.10913527215565517 |
| R2 Score | 0.9052687219059528 |
| Mean Absolute Error  | 0.28229806474958885     |
---
#### **Out-of-sample comparisons**
Consider a simple 75%-25% split on the data.
$\space$
***with Multi-Linear Regression :***$$y = 1.6277666\times10^{-15} + 0.70468677\times x_1 + 0.1376736 \times x_2 + 0.00367748\times x_3 - 0.23784049\times x_4$$
```
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, Y_train)
y_pred2 = regr2.predict(X_test)
MSE = sm.mean_squared_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(y_pred2))
r2 = sm.r2_score(Y_test, y_pred2)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(y_pred2))
```
$\space$
***with Stochastic Gradient Descent Regression :***$$y = -0.00448248 + 0.65282598\times x_1 + 0.1854342 \times x_2 + 0.00132704\times x_3 - 0.24815886\times x_4$$
```
sgdr = linear_model.SGDRegressor(loss='epsilon_insensitive', penalty='l1')
sgdr.fit(X_train, Y_train)
sgdr_y_predict = sgdr.predict(X_test)
MSE = sm.mean_squared_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(sgdr_y_predict))
r2 = sm.r2_score(Y_test, sgdr_y_predict)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(sgdr_y_predict))
```
$\space$
***with Support Vector Regression :***
```
from sklearn.svm import SVR

svr = SVR(kernel='rbf', C=100, gamma='auto')
svr.fit(X_train, Y_train)
svr_y_predict = svr.predict(X_test)
MSE = sm.mean_squared_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(svr_y_predict))
r2 = sm.r2_score(Y_test, svr_y_predict)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(svr_y_predict))
```
$\space$
| Measures | Multi-Linear Regression | Stochastic Gradient Descent Regression | Support Vector Regression |
| -------------------- | ----------------------- | -------------------------------------- |:-------------------------:|
| Mean Squared Error | 0.11015421516874048 | 0.1128351796212331 | 0.025146669744170205 |
| R2 Score | 0.8934814292099172 | 0.890888950098872 | 0.9756832971196244 |
| Mean Absolute Error | 0.27982007215817556 | 0.2782875289456675 | 0.11039280278085983 |
---
#### **Cross Validation**
Split the data into 10 folds and cross-validate each model.
```
from sklearn.model_selection import KFold, cross_val_score

folds = KFold(n_splits=10, shuffle=True, random_state=100)
score1 = cross_val_score(regr2, X_train, Y_train, scoring='r2', cv=folds)
score2 = cross_val_score(sgdr, X_train, Y_train, scoring='r2', cv=folds)
score3 = cross_val_score(svr, X_train, Y_train, scoring='r2', cv=folds)
score1.mean()
score2.mean()
score3.mean()
```
| | Multi-Linear Regression | Stochastic Gradient Descent Regression | Support Vector Regression |
| ------- | ----------------------- |:--------------------------------------:| ------------------------- |
| $R^{2}$ | 0.9065375596312022 | 0.9034638799712005 | 0.9754203685846681 |
---
## IV. Conclusion
**We can see that the SVR model fits the data and produces the best predictions of the three, so choosing SVR as the forecasting model may be a good choice.**
---
## V. Reference
- [Apply ML to Crypto](https://www.softkraft.co/applying-machine-learning-to-cryptocurrency-trading/)
- [5 methods crypto trader must know](https://www.coindesk.com/tech/2020/10/16/five-machine-learning-methods-crypto-traders-should-know-about/)
- [Plotly for pacf & acf](https://community.plotly.com/t/plot-pacf-plot-acf-autocorrelation-plot-and-lag-plot/24108/2)
- [Time Series Analysis](https://www.itread01.com/content/1545816248.html)
- [資料分類 Support Vector Machines](https://ithelp.ithome.com.tw/articles/10203507)
- [python機器學習API介紹27:高級篇——非線性回歸SVR](https://looknews.cc/zh-tw/youxi/594346.html)
- [SVM有監督學習 LinearSVC, LinearSVR,SVC,SVR -- 024](https://blog.csdn.net/u010986753/article/details/105021495)
- [梯度下降法(GD,SGD,Mini-Batch GD)線上性迴歸中的使用](https://www.itread01.com/content/1550087845.html)
- [機器/深度學習-基礎數學(三):梯度最佳解相關算法(gradient descent optimization algorithms)](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E5%9F%BA%E7%A4%8E%E6%95%B8%E5%AD%B8-%E4%B8%89-%E6%A2%AF%E5%BA%A6%E6%9C%80%E4%BD%B3%E8%A7%A3%E7%9B%B8%E9%97%9C%E7%AE%97%E6%B3%95-gradient-descent-optimization-algorithms-b61ed1478bd7)
- [Cross-Validation with Linear Regression](https://www.kaggle.com/jnikhilsai/cross-validation-with-linear-regression)
- [機器學習筆記之SVM(SVR)演算法](https://www.itread01.com/content/1546145477.html)
- [機器學習之路: python 線性回歸LinearRegression, 隨機參數回歸SGDRegressor 預測波士頓房價](https://www.itread01.com/content/1525013763.html)
---
## VI. Data and Code
#### Data links
- [ETH/USD Data](https://finance.yahoo.com/quote/ETH-USD?p=ETH-USD)
- [BTC/USD Data](https://finance.yahoo.com/quote/BTC-USD?p=BTC-USD)
- [USDT/USD Data](https://finance.yahoo.com/quote/USDT-USD?p=USDT-USD)
- [LTC/USD Data](https://finance.yahoo.com/quote/LTC-USD?p=LTC-USD)
- [BCH/USD Data](https://finance.yahoo.com/quote/BCH-USD?p=BCH-USD)
---