---
title: Project02
tags: teach:MF
---

###### tags: `機器學習與金融科技`

# ML and FinTech: Project by 杜知翰

> **keywords**: Stock Prediction, Machine Learning Portfolio, LSTM Model

## 1. Motivations

My first exposure to machine learning was through finance (the paper I read at the time was [**EIIE**](https://arxiv.org/abs/1706.10059)). There are, however, plenty of articles, such as [A.i人工智慧真的能預測股市嗎](https://vocus.cc/article/5f815330fd8978000160a60c) ("Can AI really predict the stock market?") and [Predicting the Stock Market is Hard: Creating a Machine-Learning Model (Probably) Won’t Help](https://towardsdatascience.com/predicting-the-stock-market-is-hard-creating-a-machine-learning-model-probably-wont-help-e449039c9fe3), that argue against using AI to predict the stock market, on the grounds that stock prices follow a random walk and cannot be predicted from past data. In this course project I therefore want to look more closely at stock prediction models: predict stock prices with several different models, compare the results, and find the most suitable model.

#### Selected stock: TSLA

#### Benchmark: linear regression model

---

## 2. EDA

* Data is collected from Yahoo Finance
* I use the pandas-datareader package to download the stock data
* Stock data for TSLA and Google is used for the analysis

### TSLA

![](https://i.imgur.com/OZ1FbGT.png =50%x)![](https://i.imgur.com/02kwSBm.png =50%x)
![](https://i.imgur.com/zxOugM1.png)

### Google

![](https://i.imgur.com/FZWYgk4.png =50%x)![](https://i.imgur.com/ymQm0bK.png =50%x)
![](https://i.imgur.com/rzhbvTO.png)

---

## 3. Problem formulation

Predict y from x=(x~1~,x~2~,...,x~p~), where p = 1, 4, or 60.

### Three input formats (p) and five models are used, and their prediction accuracy is compared

#### Three input formats:

* Use the price from 1 day earlier to predict today's price
* Use high, low, open, volume to predict the adj close price
* Use 60 consecutive days of data to predict today's price
    * MinMaxScaler is applied first so that the adj close price falls between 0 and 1

#### Five models:

* Linear Regression
* Decision Tree Regression
* RNN (LSTM)
    * 2 LSTM layers, 2 Dense layers
    * optimizer: adam
    * loss = mean_squared_error
* DNN
    * 5 Dense layers
    * activation function: relu
    * optimizer: adam
    * loss = mean_squared_error
* CNN
    * 1 Conv1D layer
        * filters=64, kernel_size=3
    * 1 MaxPooling1D layer
        * pool size=2
    * 1 Flatten layer
    * 2 Dense layers
    * activation function: relu
    * optimizer: adam
    * loss = mean_squared_error

#### Data split

* I use stock data from 2018-01-01 to 2021-12-17. The last 10 days are held out as the out-of-sample test set and the rest is used as training data. The full data set is also used as the in-sample test set.

#### Evaluation metric

* The MAE of the test results is compared, and the predictions are also plotted against the true values

#### Benchmark: Linear Regression, using high, low, open, volume to predict the adj close price

---

## 4. Analysis and Conclusion

### TSLA

#### Predicting adj close from high, low, open, volume

##### Linear Regression (Benchmark)

![](https://i.imgur.com/NqvJJCb.png =50%x)![](https://i.imgur.com/37U4Sei.png =50%x)

##### Decision Tree Regression

![](https://i.imgur.com/ZEu4B2y.png =50%x)![](https://i.imgur.com/0K3PMEX.png =50%x)

#### Predicting today's price from the previous day's price

##### Linear Regression

![](https://i.imgur.com/j6yIV1g.png =50%x)![](https://i.imgur.com/R5OhpJb.png =50%x)

##### Decision Tree Regression

![](https://i.imgur.com/fIWkTq3.png =50%x)![](https://i.imgur.com/4VtXLTr.png =50%x)

#### Predicting today's price from 60 consecutive days of data

##### RNN (LSTM)

![](https://i.imgur.com/TogXeEh.png =50%x)![](https://i.imgur.com/O3l4TTW.png =50%x)

##### DNN

![](https://i.imgur.com/Emgxcdx.png =50%x)![](https://i.imgur.com/CjZJIOu.png =50%x)

##### CNN

![](https://i.imgur.com/1fneJJk.png =50%x)![](https://i.imgur.com/GUO304n.png =50%x)

#### In Sample Test MAE

> TSLA average Close price: 290.62

| Model and input | MAE | MAE/AVE(Close Price) |
|:-:|:-:|:-:|
| Linear Regression (adj close from high, low, open, volume) | 2.71 | 0.0093 |
| Decision Tree Regression (adj close from high, low, open, volume) | 0.88 | 0.0030 |
| Linear Regression (today's price from the previous day's price) | 7.94 | 0.0273 |
| Decision Tree Regression (today's price from the previous day's price) | 2.62 | 0.0090 |
| RNN (LSTM) (today's price from 60 consecutive days of data) | 10.28 | 0.0354 |
| DNN (today's price from 60 consecutive days of data) | 14.84 | 0.0511 |
| CNN (today's price from 60 consecutive days of data) | 13.28 | 0.0457 |

#### Out Of Sample Test MAE

> TSLA average Close price over the last 10 days: 991.09

| Model and input | MAE | MAE/AVE(Close Price) |
|:-:|:-:|:-:|
| Linear Regression (adj close from high, low, open, volume) | 14.27 | 0.0144 |
| Decision Tree Regression (adj close from high, low, open, volume) | 42.09 | 0.0425 |
| Linear Regression (today's price from the previous day's price) | 27.51 | 0.0278 |
| Decision Tree Regression (today's price from the previous day's price) | 57.27 | 0.0578 |
| RNN (LSTM) (today's price from 60 consecutive days of data) | 32.41 | 0.0327 |
| DNN (today's price from 60 consecutive days of data) | 42.01 | 0.0424 |
| CNN (today's price from 60 consecutive days of data) | 47.50 | 0.0479 |

### Google

#### Predicting adj close from high, low, open, volume

##### Linear Regression (Benchmark)

![](https://i.imgur.com/Rrr3Qn0.png =50%x)![](https://i.imgur.com/hHerSHU.png =50%x)

##### Decision Tree Regression

![](https://i.imgur.com/roTx1Mt.png =50%x)![](https://i.imgur.com/LBgEwP2.png =50%x)

#### Predicting today's price from the previous day's price

##### Linear Regression

![](https://i.imgur.com/FhuRkHw.png =50%x)![](https://i.imgur.com/iHXAtLf.png =50%x)

##### Decision Tree Regression

![](https://i.imgur.com/JbQGZXJ.png =50%x)![](https://i.imgur.com/j0ZXCNR.png =50%x)

#### Predicting today's price from 60 consecutive days of data

##### RNN (LSTM)

![](https://i.imgur.com/xCDTwic.png =50%x)![](https://i.imgur.com/T6zk8Lh.png =50%x)

##### DNN

![](https://i.imgur.com/0BsoX1K.png =50%x)![](https://i.imgur.com/cEgrzfc.png =50%x)

##### CNN

![](https://i.imgur.com/BKLPUar.png =50%x)![](https://i.imgur.com/TWLs1DS.png =50%x)

#### In Sample Test MAE

> Google average Close price: 1561.68

| Model and input | MAE | MAE/AVE(Close Price) |
|:-:|:-:|:-:|
| Linear Regression (adj close from high, low, open, volume) | 6.77 | 0.0043 |
| Decision Tree Regression (adj close from high, low, open, volume) | 2.53 | 0.0016 |
| Linear Regression (today's price from the previous day's price) | 18.67 | 0.0120 |
| Decision Tree Regression (today's price from the previous day's price) | 5.95 | 0.0038 |
| RNN (LSTM) (today's price from 60 consecutive days of data) | 25.77 | 0.0165 |
| DNN (today's price from 60 consecutive days of data) | 28.37 | 0.0182 |
| CNN (today's price from 60 consecutive days of data) | 21.84 | 0.0140 |

#### Out Of Sample Test MAE

> Google average Close price: 2928.04

| Model and input | MAE | MAE/AVE(Close Price) |
|:-:|:-:|:-:|
| Linear Regression (adj close from high, low, open, volume) | 18.29 | 0.0062 |
| Decision Tree Regression (adj close from high, low, open, volume) | 34.00 | 0.0116 |
| Linear Regression (today's price from the previous day's price) | 36.15 | 0.0123 |
| Decision Tree Regression (today's price from the previous day's price) | 65.71 | 0.0224 |
| RNN (LSTM) (today's price from 60 consecutive days of data) | 49.68 | 0.0170 |
| DNN (today's price from 60 consecutive days of data) | 55.41 | 0.0189 |
| CNN (today's price from 60 consecutive days of data) | 34.76 | 0.0119 |

### Conclusion

1. In the In Sample Test, the most accurate model is Decision Tree Regression (using high, low, open, volume data) and the least accurate is the Deep Neural Network.
2. In the Out Of Sample Test, the most accurate model is Linear Regression (using high, low, open, volume data) and the least accurate is Decision Tree Regression (using the previous day's price).
3. Decision Tree performs well in the In Sample Test for every input format, but its accuracy drops sharply on unseen data, which suggests it overfits the training data.
4. Of the two stocks, Google is predicted more accurately than TSLA. This is quite possibly because TSLA surged suddenly within the last one or two years, so there is not yet enough data.
5. Among the DNN, CNN, and RNN models, the MAE tables show the RNN (LSTM) as the most accurate for TSLA and the CNN as the most accurate for Google, in both the In Sample and Out Of Sample Tests.
6. I think the Deep Learning models still have room for improvement, for example:
    * Use more data (e.g. more days of data, company financial statements, various financial ratios)
    * Compute features derived from the price data (e.g. 3-day and 5-day moving averages)

## Reference

* My Code: [EDA](https://colab.research.google.com/drive/1LwKT2TLX1g_woCf1cU5TUwgtpCxS94YX?usp=sharing), [TSLA colab](https://colab.research.google.com/drive/1FeGQkXK87wEDO930_Xr2V0EZ1JmC0AlH?usp=sharing), [GOOG colab](https://colab.research.google.com/drive/1akOjsnGaQboFRZG5g3CkC-SuTmdeKnyX?usp=sharing)
* [Predicting Stock Prices with Linear Regression in Python](https://www.alpharithms.com/predicting-stock-prices-with-linear-regression-214618/)
* [Deep Learning](https://www.nature.com/articles/nature14539)
* [Convolutional neural networks: an overview and application in radiology](https://insightsimaging.springeropen.com/articles/10.1007/s13244-018-0639-9)
* [Short-term stock market price trend prediction using a comprehensive deep learning system](https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00333-6)
* [Predicting Stock Prices Using Machine Learning](https://neptune.ai/blog/predicting-stock-prices-using-machine-learning)

---
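
As a rough illustration of the setup in Section 3 (p-day input windows, a 10-day out-of-sample hold-out, and MAE divided by the average close price), here is a minimal sketch in plain Python. It is not the code from the notebooks linked above: the function names (`make_windows`, `split_last_n`, `mae`, `relative_mae`), the toy price series, and the naive "repeat the previous day's price" predictor are all hypothetical and stand in for the real data and models.

```python
from statistics import mean


def make_windows(prices, p):
    """Pair each price with the p prices immediately before it (x has length p)."""
    xs = [prices[i - p:i] for i in range(p, len(prices))]
    ys = [prices[i] for i in range(p, len(prices))]
    return xs, ys


def split_last_n(xs, ys, n_test=10):
    """Hold out the final n_test samples as the out-of-sample test set."""
    return (xs[:-n_test], ys[:-n_test]), (xs[-n_test:], ys[-n_test:])


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted prices."""
    return mean(abs(t - q) for t, q in zip(y_true, y_pred))


def relative_mae(y_true, y_pred):
    """MAE divided by the average true price, as in the MAE/AVE column."""
    return mae(y_true, y_pred) / mean(y_true)


# Hypothetical toy series (not real TSLA or Google prices): rises 2.0 per day.
prices = [100.0 + 2.0 * i for i in range(80)]

xs, ys = make_windows(prices, p=60)
(train_x, train_y), (test_x, test_y) = split_last_n(xs, ys, n_test=10)

# Naive stand-in predictor: tomorrow's price equals the last price in the window.
naive_pred = [window[-1] for window in test_x]

print(len(train_x), len(test_x))
print(mae(test_y, naive_pred))
print(relative_mae(test_y, naive_pred))
```

Setting `p=1` gives the "previous day's price" formulation, and any fitted model can replace the naive predictor as long as it returns one prediction per test window.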