---
title: Project12
tags: teach:MF
---

# **ML & Fintech: Project by 陳姸榛**

#### Keywords: trading, machine learning, cryptocurrency

:::info
Final report for the Machine Learning and Fintech course, Department of Information Management and Finance, NYCU. Co-edited by 陳姸榛, 李亦涵, and 鄧惠文. Last updated 2021/12/26.
:::

---

## **I. Motivations**

We study the relationship between ETH and the other cryptocurrencies that may influence its price.

![eth](https://i-invdn-com.investing.com/news/Ethereum_800x533_L_1556445201.jpg)

---

## II. Data visualization

#### **Data description**

| Variables | Column   | Descriptions                         |             Type            |
| --------- | -------- | ------------------------------------ |:---------------------------:|
| $y$       | ETH-USD  | Price of Ethereum (USDT per ETH)     | Continuous time-series data |
| $x_1$     | BTC-USD  | Price of Bitcoin (USDT per BTC)      | Continuous time-series data |
| $x_2$     | LTC-USD  | Price of Litecoin (USDT per LTC)     | Continuous time-series data |
| $x_3$     | USDT-USD | Price of Tether US (USDT per USD)    | Continuous time-series data |
| $x_4$     | BCH-USD  | Price of Bitcoin Cash (USDT per BCH) | Continuous time-series data |

---

#### **Descriptive Statistics**

```data_close.describe()```

|       | BCH-USD     | BTC-USD      | ETH-USD     |   LTC-USD   |   USDT-USD  |
| ----- | ----------- | ------------ | ----------- |:-----------:|:-----------:|
| count | 1452.000000 | 1452.000000  | 1452.000000 | 1452.000000 | 1452.000000 |
| mean  | 547.122219  | 16591.015222 | 781.672215  | 102.915659  | 1.002036    |
| min   | 77.365776   | 3236.761719  | 84.308296   | 23.464331   | 0.966644    |
| 25%   | 243.593655  | 7036.397583  | 186.544605  | 51.583675   | 0.999999    |
| 50%   | 401.534653  | 9339.627441  | 332.875504  | 76.074245   | 1.000922    |
| 75%   | 640.060410  | 16425.549316 | 816.327744  | 144.799469  | 1.003481    |
| max   | 3923.070068 | 65992.835938 | 4414.746582 | 386.450775  | 1.077880    |

---

#### **Original data plot**

- Line plot for the data ![](https://i.imgur.com/7TagESL.png)
- Pairwise scatter plots: ETH-USD appears correlated with BTC-USD and LTC-USD. ![](https://i.imgur.com/42zbIFN.png)
- The data appears to be skewed to the right, so we apply a log transformation.
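As a minimal sketch of the log-transformation step, using synthetic right-skewed data in place of the real Yahoo Finance closes (the column names follow the data description above, and `data_close` matches the variable used in the `describe()` call):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" standing in for the real closes.
cols = ["BCH-USD", "BTC-USD", "ETH-USD", "LTC-USD", "USDT-USD"]
rng = np.random.default_rng(514)
data_close = pd.DataFrame(np.exp(rng.normal(5.0, 1.0, size=(1452, 5))), columns=cols)

# Element-wise natural log: a monotone transform that pulls in the
# long right tail of the price distributions.
data_log = np.log(data_close)

print(data_close["ETH-USD"].skew())  # large and positive (right-skewed)
print(data_log["ETH-USD"].skew())    # close to zero after the transform
```

The same one-liner applied to the real `data_close` produces the more symmetric distributions shown in the log-data plots below.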
---

#### **Log data plot**

![](https://i.imgur.com/awb8hCs.png)

- After the log transformation, the data is more symmetrically distributed. ![](https://i.imgur.com/VAOalQb.png)
- Heat map of the correlations ![](https://i.imgur.com/1jsiOcl.png)

---

## **III. Problem formulation and methods**

#### **Problem formulation**

We would like to build a model to predict $y$ from $x=(x_1,x_2,x_3,x_4)$. [*(See here if the variables are unclear)*](https://hackmd.io/JeqjS1aBQFmfKqKcrE__Dg?view#Data-description)

### **Benchmark method**

The benchmark model is the *multi-linear regression*:$$y = \alpha + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4$$

#### **In-sample and Out-of-sample Analysis**

- Training data : test data = 75% : 25%
- Randomize the split *(random seed = 514)*

```
from sklearn.model_selection import train_test_split

seed = 514
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=seed)
```

---

### **Pre-processing of the data**

Since the data is ***time-series data***, we first test the ***stationarity*** of the log data.
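The `adf_val` helper used below is the project's own wrapper whose definition is not shown; in practice one would call `adfuller` from statsmodels. To make the idea concrete, here is a minimal NumPy sketch of the (un-augmented) Dickey-Fuller test: regress $\Delta y_t$ on $y_{t-1}$ with an intercept, and compare the t-statistic of the $y_{t-1}$ coefficient against the critical values reported in the tables below (about $-2.86$ at the 5% level):

```python
import numpy as np

def df_test(y):
    """t-statistic of the y_{t-1} coefficient in the regression
    dy_t = a + rho * y_{t-1} + e_t.  Values well below ~-2.86 reject
    the unit-root (non-stationarity) null at the 5% level."""
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - 2)           # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # coefficient covariance
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(514)
walk = np.cumsum(rng.normal(size=1000))   # random walk: has a unit root
noise = rng.normal(size=1000)             # white noise: stationary

print(df_test(walk))    # t-statistic near zero: cannot reject the unit root
print(df_test(noise))   # t-statistic far below -2.86: clearly stationary
```

This is only an illustration; the project's results use the augmented version of the test (with lagged differences) via `adf_val`.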
**Original data**

```
for col in data_clean1.columns:
    adf, pvalue1, critical_values = adf_val(data_clean1[col], str(col) + ' time series',
                                            str(col) + ' acf', str(col) + ' pacf', col)
    pvalue = acorr_val(data_clean1[col])
```

##### *ETH/USD*
![](https://i.imgur.com/3jUKzOy.png =350x250)![](https://i.imgur.com/Qy31tH2.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| 0.05354943779534627 | 0.9627428163077016 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |

---

##### *BTC/USD*
![](https://i.imgur.com/1DpbU7Q.png =350x250)![](https://i.imgur.com/jGZqaRl.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -0.23050704222161775 | 0.9347862138150373 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |

---

##### *LTC/USD*
![](https://i.imgur.com/8S4iLl5.png =350x250)![](https://i.imgur.com/M8UCJ1b.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -1.7958067828078763 | 0.38254548790064546 | '1%': -3.4348835326305642, '5%': -2.863542248636555, '10%': -2.5678359819686065 |

---

##### *USDT/USD*
![](https://i.imgur.com/vnHVemc.png =350x250)![](https://i.imgur.com/eMSPTJe.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -4.772480038610099 | 6.137652242070711e-05 | '1%': -3.434911997169608, '5%': -2.863554810504947, '10%': -2.567842671398422 |

---

##### *BCH/USD*
![](https://i.imgur.com/3Am8pZr.png =350x250)![](https://i.imgur.com/kYEyyMA.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -2.308429376405863 | 0.1692637409234664 | '1%': -3.4348772553489617, '5%': -2.8635394783531085, '10%': -2.5678345067434516 |

---

#### ***The ADF tests cannot reject the unit-root null for any series except USDT/USD, so the data is non-stationary and we take first differences.***

---

**Differenced data**

```
for col in data_clean1.columns:
    if col != 'USDT-USD':
        d = np.diff(data_clean1[col])
        adf, pvalue1, critical_values = adf_val(d, str(col) + ' time series',
                                                str(col) + ' acf', str(col) + ' pacf', col)
        pvalue = acorr_val(d)
```

##### *ETH/USD*
![](https://i.imgur.com/bq2s75j.png =350x250)![](https://i.imgur.com/6KlbmLy.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -11.307999067734775 | 1.2592745961913662e-20 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |

---

##### *BTC/USD*
![](https://i.imgur.com/hOF3iOB.png =350x250)![](https://i.imgur.com/x83O6Ef.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -10.937074646165145 | 9.500282250714303e-20 | '1%': -3.4348961395618476, '5%': -2.863547812296987, '10%': -2.5678389447194556 |

---

##### *LTC/USD*
![](https://i.imgur.com/IXcy70S.png =350x250)![](https://i.imgur.com/bq3LsRa.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -14.285460468186528 | 1.3022363376294658e-26 | '1%': -3.4348835326305642, '5%': -2.863542248636555, '10%': -2.5678359819686065 |

---
##### *USDT/USD*

###### ***USDT/USD is already stationary, so we do not difference it.***

---

##### *BCH/USD*
![](https://i.imgur.com/edXj7qj.png =350x250)![](https://i.imgur.com/TWPWWNh.png =350x250)

*score*

| adf (t-score) | pvalue | critical_values |
|:---:|:---:|:---:|
| -18.118718253775377 | 2.5252040988004006e-30 | '1%': -3.4348772553489617, '5%': -2.8635394783531085, '10%': -2.5678345067434516 |

---

#### ***After differencing, all series are stationary.***

With the data pre-processed, we can start building the models.

---

### **Analysis**

##### ***Stochastic Gradient Descent Regression (SGD)***

SGD randomly selects one sample at a time, computes the gradient on that sample, and updates the parameters until a minimum is reached.

> Because single-sample gradients are noisy, it may not find the global minimum.

![SGD](https://i.imgur.com/QwuZjDn.png)

The output of the regression is a set of weights; each weight measures how much its parameter influences $y$.

##### ***Support Vector Regression (SVR)***

SVR uses the same principle as Support Vector Machines: points that fall inside the $\epsilon$-tube incur no loss.

> C is the penalty parameter; gamma is the coefficient of the kernel function.
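To make the roles of C and gamma concrete, here is a minimal scikit-learn sketch on toy data (the C, gamma, and epsilon values here are illustrative, not the ones used for the crypto series):

```python
import numpy as np
from sklearn.svm import SVR

# Toy 1-D regression target: y = sin(x) plus a little noise.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# C penalizes points that fall outside the epsilon-tube; gamma sets the
# width of the rbf kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
svr = SVR(kernel='rbf', C=100, gamma='auto', epsilon=0.1)
svr.fit(X, y)
print(svr.score(X, y))  # in-sample R^2
```

A larger C fits the training points more tightly (risking overfitting), while a larger gamma makes the kernel more local; both are natural candidates for cross-validated tuning.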
Radial-basis function (rbf) kernel: $$k(x_i, x_j) = \exp(-\gamma\|x_i-x_j\|^{2})$$

---

### **Results**

#### **In-sample results**

*Formula:*$$y = -5.75379325 + 0.92423158\times x_1 + 0.24077037\times x_2 - 0.34743644\times x_3 + 0.34650823\times x_4$$

```
# linear_model is sklearn.linear_model; sm (metrics module) and ss_y
# (the scaler fitted on y) are defined earlier in the notebook, so the
# errors are reported on the original price scale via inverse_transform.
regr1 = linear_model.LinearRegression()
regr1.fit(X, Y)
y_pred1 = regr1.predict(X)

MSE = sm.mean_squared_error(ss_y.inverse_transform(Y), ss_y.inverse_transform(y_pred1))
r2 = sm.r2_score(Y, y_pred1)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y), ss_y.inverse_transform(y_pred1))
```

| Measures | Multi-Linear Regression |
|:---:|---|
| Mean Squared Error | 0.10913527215565517 |
| R2 Score | 0.9052687219059528 |
| Mean Absolute Error | 0.28229806474958885 |

---

#### **Out-of-sample comparisons**

Consider a simple 75%-25% split of the data.

***with Multi-Linear Regression:***$$y = 1.6277666\times10^{-15} + 0.70468677\times x_1 + 0.1376736\times x_2 + 0.00367748\times x_3 - 0.23784049\times x_4$$

```
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, Y_train)
y_pred2 = regr2.predict(X_test)

MSE = sm.mean_squared_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(y_pred2))
r2 = sm.r2_score(Y_test, y_pred2)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(y_pred2))
```

***with Stochastic Gradient Descent Regression:***$$y = -0.00448248 + 0.65282598\times x_1 + 0.1854342\times x_2 + 0.00132704\times x_3 - 0.24815886\times x_4$$

```
sgdr = linear_model.SGDRegressor(loss='epsilon_insensitive', penalty='l1')
sgdr.fit(X_train, Y_train)
sgdr_y_predict = sgdr.predict(X_test)

MSE = sm.mean_squared_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(sgdr_y_predict))
r2 = sm.r2_score(Y_test, sgdr_y_predict)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(sgdr_y_predict))
```

***with Support Vector Regression:***

```
svr = SVR(kernel='rbf', C=100, gamma='auto')
svr.fit(X_train, Y_train)
svr_y_predict = svr.predict(X_test)

MSE = sm.mean_squared_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(svr_y_predict))
r2 = sm.r2_score(Y_test, svr_y_predict)
MAE = sm.mean_absolute_error(ss_y.inverse_transform(Y_test), ss_y.inverse_transform(svr_y_predict))
```

| Measures | Multi-Linear Regression | Stochastic Gradient Descent Regression | Support Vector Regression |
| --- | --- | --- |:---:|
| Mean Squared Error | 0.11015421516874048 | 0.1128351796212331 | 0.025146669744170205 |
| R2 Score | 0.8934814292099172 | 0.890888950098872 | 0.9756832971196244 |
| Mean Absolute Error | 0.27982007215817556 | 0.2782875289456675 | 0.11039280278085983 |

---

#### **Cross Validation**

Split the data into 10 folds and score each model across them.

```
folds = KFold(n_splits=10, shuffle=True, random_state=100)
score1 = cross_val_score(regr2, X_train, Y_train, scoring='r2', cv=folds)
score2 = cross_val_score(sgdr, X_train, Y_train, scoring='r2', cv=folds)
score3 = cross_val_score(svr, X_train, Y_train, scoring='r2', cv=folds)

score1.mean()
score2.mean()
score3.mean()
```

| | Multi-Linear Regression | Stochastic Gradient Descent Regression | Support Vector Regression |
| --- | --- |:---:| --- |
| $R^{2}$ | 0.9065375596312022 | 0.9034638799712005 | 0.9754203685846681 |

---

## IV. Conclusion

**The SVR model fits the data and predicts the target noticeably better than the two linear models on every measure, so SVR appears to be the best choice of forecasting model among the three.**

---

## V. References
- [Applying Machine Learning to Cryptocurrency Trading](https://www.softkraft.co/applying-machine-learning-to-cryptocurrency-trading/)
- [Five Machine Learning Methods Crypto Traders Should Know About](https://www.coindesk.com/tech/2020/10/16/five-machine-learning-methods-crypto-traders-should-know-about/)
- [Plotly for PACF & ACF plots](https://community.plotly.com/t/plot-pacf-plot-acf-autocorrelation-plot-and-lag-plot/24108/2)
- [Time Series Analysis](https://www.itread01.com/content/1545816248.html)
- [Data Classification with Support Vector Machines](https://ithelp.ithome.com.tw/articles/10203507)
- [Python Machine Learning API 27 (Advanced): Nonlinear Regression with SVR](https://looknews.cc/zh-tw/youxi/594346.html)
- [Supervised Learning with SVM: LinearSVC, LinearSVR, SVC, SVR](https://blog.csdn.net/u010986753/article/details/105021495)
- [Gradient Descent (GD, SGD, Mini-Batch GD) in Linear Regression](https://www.itread01.com/content/1550087845.html)
- [Machine/Deep Learning Basic Math (3): Gradient Descent Optimization Algorithms](https://chih-sheng-huang821.medium.com/%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92-%E5%9F%BA%E7%A4%8E%E6%95%B8%E5%AD%B8-%E4%B8%89-%E6%A2%AF%E5%BA%A6%E6%9C%80%E4%BD%B3%E8%A7%A3%E7%9B%B8%E9%97%9C%E7%AE%97%E6%B3%95-gradient-descent-optimization-algorithms-b61ed1478bd7)
- [Cross-Validation with Linear Regression](https://www.kaggle.com/jnikhilsai/cross-validation-with-linear-regression)
- [Machine Learning Notes: the SVM (SVR) Algorithm](https://www.itread01.com/content/1546145477.html)
- [Machine Learning in Python: Predicting Boston Housing Prices with LinearRegression and SGDRegressor](https://www.itread01.com/content/1525013763.html)

---

## VI. Data and Code

#### Data links

- [ETH/USD Data](https://finance.yahoo.com/quote/ETH-USD?p=ETH-USD)
- [BTC/USD Data](https://finance.yahoo.com/quote/BTC-USD?p=BTC-USD)
- [USDT/USD Data](https://finance.yahoo.com/quote/USDT-USD?p=USDT-USD)
- [LTC/USD Data](https://finance.yahoo.com/quote/LTC-USD?p=LTC-USD)
- [BCH/USD Data](https://finance.yahoo.com/quote/BCH-USD?p=BCH-USD)

---