Modal decomposition-based hybrid model for stock index prediction

# Modal decomposition-based hybrid model for stock index prediction

### :bookmark: Abstract

* (Stock prediction is a challenging problem) Stock index prediction is considered one of the most challenging issues in the financial sector owing to its noise, volatility, and instability.
* (Weaknesses of traditional methods: weak denoising and weak feature mining) Traditional stock index prediction methods, such as statistical and machine learning methods, cannot achieve a high denoising effect and cannot mine enough features from the stock data, resulting in poor prediction performance.
* (Strengths of deep learning; hybrid models leave considerable room for improvement) Deep learning, with its strong learning ability, has become an effective tool for predicting non-stationary and nonlinear stock indices. However, prediction accuracy can still be improved if a single deep learning prediction model is replaced with a hybrid model.

1. (CEEMDAN-DAE-LSTM; empirical mode decomposition into intrinsic mode functions) Therefore, this study proposes a novel deep learning hybrid model for stock index prediction named CEEMDAN-DAE-LSTM. In this hybrid model, the stock index is first decomposed using complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) into a series of intrinsic mode functions (IMFs) arranged from high to low frequency.
2. (DAE removes redundant data and extracts features) Next, the deep autoencoder (DAE) is applied to remove redundant data and extract deep-level features.
3. (Features are fed into LSTM to predict the next trading day) Then, high-level abstract features are separately fed into long short-term memory (LSTM) networks to predict the stock returns of the next trading day.
4. (All components are combined to obtain the prediction) Finally, the final predicted value is obtained by synthesizing the value of each component.
* (Empirical results) Empirical research results on six stock indices representing both developed and emerging markets showed that our model is superior to other reference models in terms of prediction accuracy and stock index trends.
* (Good predictions even with large volatility and across different stock markets) Furthermore, it has higher prediction performance for stock indices with greater volatility. In general, this model could be applicable to various stock markets with different degrees of development.

### :bookmark: Introduction

* (The stock market is a noisy, nonlinear system) The stock market is a noisy, nonlinear dynamic system (Si and Yin, 2013).
* (Prediction methods can break this hypothesis) However, many different stock index prediction methods have been proposed to break the efficient market hypothesis.
* (Stock index prediction can lower the risk of investment decisions and raise asset returns) The methods for accurate prediction of stock indices, which can reduce risks associated with investment decisions and increase asset returns, have become a research hotspot in recent years.
* (Prediction methods fall into three categories: statistical, machine learning, and deep learning; they keep improving and are used in many fields) The methods for stock index prediction can be roughly categorized into three types: statistical based, machine learning based, and deep learning based. These methods have been continuously improved by researchers, and are widely used in various fields (Banan et al., 2020; Taormina and Chau, 2015; Wang et al., 2020).
* (Among statistical methods, non-stationary time series models are widely adopted) Among these methods, the ARIMA model of non-stationary time series modeling has been extensively adopted. (Advantages: low complexity and fast computation) The statistical stock index forecasting methods usually have a low complexity and a fast calculation speed. (Disadvantages: prediction is limited because too many factors, such as politics, rumors, and fiscal policy, affect the index) However, they have limitations on stock prediction, because stock indices are often affected by rumors, political changes, and fiscal policies, making them difficult to predict.
* (ML) (Disadvantages: manual feature extraction and poor adaptability) The machine learning approaches also have disadvantages such as manual data feature extraction and poor adaptability.
* (Deep learning is the mainstay of stock market prediction because it can express complex problems and extract essential features from them) Deep learning, as a deep structure learning technology, has gradually become the main research direction for stock index prediction owing to its higher ability to express complex problems and extract essential features from complex data (Nti et al., 2019).
* (Deep learning combined with a theoretical framework of denoising, feature extraction, and time series fitting achieves higher accuracy) The deep learning hybrid prediction model can adopt a theoretical framework for data denoising, deep feature extraction, and financial time series fitting, combining the advantages of each module to improve the accuracy of stock index prediction.

![](https://i.imgur.com/I8xgVHd.png)

### :bookmark: Figure 1

* (Because stock indices fluctuate on multiple scales, the raw data goes through three steps and is decomposed into a series of IMFs) Raw data is processed through three steps, i.e., modal decomposition, feature extraction, and time series fitting, to produce a prediction model. **Since stock indices have the characteristics of multi-scale volatility, modal decomposition is used to decompose the raw data into a series of intrinsic mode functions (IMFs).** Each IMF describes features of the raw data at a certain frequency.

1. (Wavelet) (Modal decomposition removes noise, which helps accuracy. Wavelet analysis decomposes data into different frequencies but lacks adaptability and does not suit all data. EMD, a time-domain method based on the signal's own time scales, turns the data into local feature signals, overcomes the wavelet's shortcomings, and adapts better to the data.) With modal decomposition, the noise of the data is eliminated, which is conducive to improving the prediction accuracy. **The modal decomposition methods for stock indices include wavelet transform and empirical mode decomposition (EMD)** (Huang et al., 1998). Wavelet transform can decompose data into wavelets of different frequencies. Nevertheless, because the basis function of the wavelet transform is **not adaptable**, there is no basis function suitable for all scenarios.
2. (EMD) Instead, the EMD method, as a time-domain processing method based on the **time scale characteristics** of a signal itself, overcomes the shortcomings of wavelet transform and has stronger adaptability by adaptively decomposing the original complex signals into IMFs of different scale characteristics.
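To make the notion of an IMF above a little more concrete: in the EMD paper cited above (Huang et al., 1998), an IMF is required, among other things, to have a number of extrema and a number of zero crossings that differ by at most one. The sketch below checks only that first condition on a sampled signal. It is not an EMD implementation (the sifting with spline envelopes is not shown), and the function names and test signals are hypothetical, chosen only for illustration.

```python
import math


def count_extrema_and_zero_crossings(x):
    """Count strict local extrema and sign changes in a sampled signal."""
    extrema = 0
    for prev, cur, nxt in zip(x, x[1:], x[2:]):
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            extrema += 1
    # A zero crossing is a sign change between two consecutive samples.
    crossings = sum(1 for a, b in zip(x, x[1:]) if a * b < 0)
    return extrema, crossings


def satisfies_first_imf_condition(x):
    """True if extrema and zero crossings differ by at most one."""
    extrema, crossings = count_extrema_and_zero_crossings(x)
    return abs(extrema - crossings) <= 1


# Hypothetical signals, not data from the paper.
t = [i / 200 for i in range(200)]
pure_tone = [math.sin(2 * math.pi * 5 * ti + 0.3) for ti in t]
offset_tone = [2.0 + v for v in pure_tone]  # same shape, but never crosses zero

print(satisfies_first_imf_condition(pure_tone))
print(satisfies_first_imf_condition(offset_tone))
```

A zero-mean oscillation passes the check, while the same oscillation riding on a constant offset does not, which is why a trend or residual term is separated from the oscillatory components.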
* (Two feature-engineering options, DAE and PCA. PCA reduces high-dimensional data to a low-dimensional space and improves accuracy, but stock indices are usually nonlinear, so dimensionality reduction alone introduces errors. High-dimensional data can also be linear: because it usually cannot be computed on directly, linear algebra is used to arrange the vectors into a matrix and operate on their linear combinations.) In the next step, feature extraction is employed to discover the in-depth characteristics of the data. **Principal component analysis (PCA) (Wold et al., 1987) and autoencoder (Le, 2015) are two common feature extraction solutions**. Zheng and He (2021) used PCA and a recurrent neural network (RNN) to construct a model, and found that PCA can map high-dimensional data to a low-dimensional space through linear projection to improve the stock index prediction accuracy. Nevertheless, when PCA is applied to stock data, it produces some errors because **it can only reduce the dimensions of linear data, but stock indices are highly nonlinear.**
* (An autoencoder with wavelet transform and LSTM performs well on six stock indices) Bao et al. (2017) adopted the autoencoder for the first time, along with wavelet transform and LSTM, to predict stock indices. The results on six stock market indices showed that the model was much better in terms of prediction accuracy and profitability than other reference models.
* (A DAE is a stack of autoencoders. Liu et al. applied it to cigarette spectral data and captured more nonlinear features representing cigarette quality than PCA. A DAE suits both linear and nonlinear data.) As an improved version of the autoencoder, **the deep autoencoder (DAE) is a superposition of multiple autoencoders** to obtain higher performance. Liu et al. (2017) proposed a neural network based on the DAE for feature extraction of spectral data in cigarettes. After comparing DAE with PCA, it was observed that DAE can extract more nonlinear data features to represent cigarette quality. **The DAE is more appropriate for processing complex data because it can characterize not only linear transformations but also nonlinear ones.**
* (RNN and LSTM are the main methods for time series. An RNN is a network specialized for time series information, but long training sequences cause vanishing gradients. LSTM can learn longer-term data and has therefore become the most popular time series model.) RNN (LeCun et al., 2015) and long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are regarded as the main approaches for time series fitting.
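An RNN is a time-dependent network specially used to process time series information. As a variant of RNN, **LSTM solves the gradient vanishing problem of RNN**, and it can learn longer-term time series information, which makes it the most popular choice for a variety of tasks.
* (Key prior work: Yan and Aasma built CEEMD-PCA-LSTM. CEEMD first decomposes the series into IMFs, which after transformation are fed to LSTM to predict the next day's value. PCA, however, only suits linear data, so the feature engineering introduces errors.) Yan and Aasma (2020) constructed a deep learning hybrid prediction model named **CEEMD-PCA-LSTM**, where the stock returns were first decomposed by CEEMD into IMFs of different frequencies, then more data features were extracted from them by PCA, and finally each component was used as the input of the subsequent module to predict stock returns on the next trading day. However, as a data feature extraction module, **PCA can only act on linear data. When faced with high-volatility stock indices, PCA-based feature extraction will result in certain errors.**
* (In earlier work an ARMA model handled the linear data and LSTM the nonlinear data, but there is room for improvement because ARMA is not suited to nonlinear, highly volatile data. A DAE is therefore used to extract linear and nonlinear features by stacking multiple hidden layers.) In our previous work (Lv et al., 2022), **an autoregressive moving average (ARMA) model is utilized to process linear data, and LSTM networks are used to process nonlinear data.** However, this solution still has room for improvement, since ARMA is a statistical method and cannot fully extract features from a stock index with **high volatility**. To overcome these limitations, **DAE is used as a feature extraction module to characterize both linear and nonlinear transformations**; it extracts deep features by **superimposing multiple hidden layers** and is more suitable for processing complex stock data.

#### **At present, few efforts have been made to investigate whether DAE could be applied to stock index prediction. Therefore, this paper contributes to this area and introduces DAE into stock index prediction.**

### :bookmark: CEEMDAN-DAE-LSTM

* CEEMDAN -> is used as a data preprocessing module to **smooth non-stationary stock indices** and to **decompose them into a series of IMFs from high to low frequency**.
* DAE -> is employed as a **feature extraction module** to characterize both **linear and nonlinear transformations**; it extracts deep features **by superimposing multiple hidden layers**.
* LSTM -> is utilized to perform one-step-ahead prediction for each component. **Finally, all prediction results are combined to get the final prediction value.**
* DATA -> six representative stock indices. The experimental results showed that our proposed model outperforms other reference models in terms of prediction accuracy, especially for **stock indices with greater volatility.**

### :bookmark: Preliminaries

CEEMDAN -> EMD can decompose a complex signal into IMFs, each with its own characteristic time scale, but it may suffer from mode mixing: different time scales appear in the same IMF, or one characteristic scale is split across several IMFs, so an IMF no longer represents a single characteristic scale. EEMD and later CEEMDAN were therefore introduced. They add white noise during decomposition, which resolves mode mixing and gives a more complete decomposition with a smaller error.

DAE -> A DAE is a stack of autoencoders. It learns from noisy and clean data, trying to restore the data as it was before corruption. The input to the first layer is corrupted and a reconstruction layer is formed; the output of the reconstruction layer is the input to the next hidden layer (the next autoencoder), and this repeats layer by layer. Reconstruction mainly learns the information in the middle hidden layer: learning proceeds by encoding (compressing and blurring the input) and then decoding (trying to recover the original data). The output x̂ is then used to compute the reconstruction error, and a gradient-based optimizer updates the encoder and decoder parameters. Main applications: data compression and information retrieval.

![](https://i.imgur.com/rCCi6IR.png)

LSTM -> RNNs are mainly used for time series. LSTM addresses the vanishing and exploding gradient problems that arise when an RNN is trained on long sequences, so it can predict over longer horizons than an RNN. An LSTM consists of a forget gate, an input gate, and an output gate.

* Forget gate: decides which information in the cell state is remembered or discarded. The input is converted to a one-hot code, and a sigmoid provides weighted control (element-wise multiplication: ×1 keeps, ×0 forgets) to decide whether the data passes.
* Input gate: decides which new information is stored in the state, controlled by a sigmoid and a tanh.
* Output gate: the cell state is updated first; a sigmoid then determines which information is output, and this is combined with the tanh of the cell state to give the final prediction.

Vectors and one-hot encoding ->

![](https://i.imgur.com/Pu70oDz.png)
![](https://i.imgur.com/c9UUtWD.png)
![](https://i.imgur.com/gEGGztd.png)

Simplified RNN model

![](https://i.imgur.com/L0hmwxt.png)

Squashing function (hyperbolic tangent, tanh, drawn as a full circle) -> **(for small x, y is correspondingly small; as x grows without bound, y stays between -1 and 1)**

Besides the model itself, this diagram includes a symbol not mentioned earlier. The squiggle denotes a squashing function, which helps the whole neural network work better. Its job is to confine the model's voting results to a fixed range. For example, if a vote comes out as 0.5, we can draw a vertical line at x = 0.5 on the squashing function and read off the corresponding y value, which is the squashed value. For small values, the original and squashed values are usually close, but as the value grows the squashed value gets closer and closer to 1. As the value tends toward negative infinity, the squashed value approaches -1. In every case the squashed value lies between -1 and 1.

![](https://i.imgur.com/TaoYIgC.png)

Squashing is very useful in a process like a neural network that operates on the same values repeatedly. For example, if an option receives two votes on every pass, its value is multiplied by two each time, and as the process repeats the number is easily blown up to an astronomical size. By keeping values between -1 and 1, we can multiply them any number of times without worrying that they will grow without bound in the loop. **This is an example of negative feedback, or attenuating feedback.**

In the figure above, the key addition is the memory and forgetting path, which helps the model remember what happened several cycles ago. To explain how the memory part works, we first need a few new symbols.

Gate -> **(control is achieved by applying element-wise addition and element-wise multiplication to the one-hot code)**

Each pipe has a tap that can be fully open, fully closed, or anywhere in between, letting the signal flow or blocking it. In this example a fully open tap has multiplier 1 and a fully closed tap has multiplier 0. With the element-wise multiplication described earlier, the first 0.8 is multiplied by the fully open 1 and passes through unchanged as 0.8, while the last 0.8 is multiplied by 0 and is blocked, giving 0. The middle 0.8 is multiplied by 0.5, giving a smaller, attenuated signal of 0.4. This set of gates lets us control whether a signal flows, which is very useful.

![](https://i.imgur.com/zKiT4BQ.png)
![](https://i.imgur.com/a48xgo1.png)
![](https://i.imgur.com/0buQiZp.png)

Squashing function (sigmoid, the logistic function, drawn as a circle with a flat bottom, output between 0 and 1) -> To implement the gates above we need values **between 0 and 1**, so there is a second squashing function here. Its symbol is a circle with a flat bottom, and it is called the logistic function. It is very similar to the hyperbolic tangent function mentioned earlier, except that its output lies between 0 and 1 rather than between -1 and 1.

![](https://i.imgur.com/G3YVhxK.png)

1. CEEMDAN -> data preprocessing: adaptively adds white noise to resolve mode mixing, and decomposes the non-stationary stock data into IMFs for denoising.
2. DAE -> feature engineering: compresses each IMF to a given dimensionality to remove redundant data, then decodes to restore it, thereby extracting the IMF's features. As a deep learning model, the DAE can obtain features from highly nonlinear stock indices.
3. A two-layer LSTM processes the time series data and addresses the vanishing and exploding gradient problems of long-term dependencies.
4. A Dense layer combines the predictions of the individual components to produce the final predicted value.

### :bookmark: Data

### **Data source**

1. Shanghai Composite Index (SH)
2. Shenzhen Stock Exchange Index (SZ)
3. Hang Seng Index (HSI)
4. Nikkei 225 Index (N225)
5. Dow Jones Index (DJI)
6. S&P 500 Index (SPX)

* Reasons
    1. The first reason is that they cover both developed markets (DJI, SPX, N225, HSI) and emerging markets (SH, SZ), which can reflect the adaptability of the proposed model.
    2. The most relevant state-of-the-art work is CEEMD-PCA-LSTM (Yan and Aasma, 2020); this study uses DAE where that work uses PCA, and the indices were chosen for fair comparisons with it.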
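Each of the closing series listed above is converted to logarithmic returns and then split with a forward rolling window, as the data processing notes that follow describe. A minimal sketch of both steps is given here. It assumes the usual definition r_t = ln(P_t) - ln(P_{t-1}), and it assumes the window advances by a fixed step; the source gives the 3.5-year / 0.5-year / 0.5-year proportions but not the step size or any scaling of the returns. The function names and the sample prices are hypothetical.

```python
import math


def log_returns(closes):
    """Log return r_t = ln(P_t) - ln(P_{t-1}) for consecutive closing values."""
    return [math.log(b) - math.log(a) for a, b in zip(closes, closes[1:])]


def rolling_splits(n_obs, train_len, val_len, test_len, step):
    """Yield (train, val, test) index ranges, rolling forward by `step`.

    Old observations drop out of the training range and newer ones enter
    each time the window advances. `step` is an assumption of this sketch.
    """
    start = 0
    while start + train_len + val_len + test_len <= n_obs:
        val_start = start + train_len
        test_start = val_start + val_len
        test_end = test_start + test_len
        yield (range(start, val_start),
               range(val_start, test_start),
               range(test_start, test_end))
        start += step


# Hypothetical closing values, not data from the paper.
closes = [100.0, 102.0, 101.0, 105.0]
returns = log_returns(closes)

# Windows counted in half-year blocks: 7 blocks (3.5 years) for training,
# 1 block for validation, 1 block for testing, advancing one block at a time.
splits = list(rolling_splits(n_obs=12, train_len=7, val_len=1, test_len=1, step=1))
for train, val, test in splits:
    print(list(train), list(val), list(test))
```

Working in log returns has the convenient property that the returns over consecutive days add up to the log return over the whole period.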
### **Data processing**

![](https://i.imgur.com/etUWfvp.png)

* (Taking the first-order difference of the logarithm accurately reflects changes in the stock index) We process the closing index of each stock index into a logarithmic return r using the first-order difference of the logarithm, which can accurately reflect the changes in the stock index.

![](https://i.imgur.com/I8OH6Yo.png)

* Logarithmic return (the smaller the standard deviation of the log return, the smaller the index volatility and the clearer the trend). It can be seen that the stock index has the characteristics of non-stationarity and volatility.
* (A forward rolling window keeps the model practical by continually taking in the newest data and discarding old data) A forward rolling window is utilized to ensure the practicability of the model, which can capture the latest trend of the index by continuously deleting old data and including new data.
* (The past 3.5 years form the training set, the next 0.5 year the validation set, and the following 0.5 year the test set) We selected the past 3.5 years of data as the training set for model training, and the data from the first half of the year as the validation set to adjust the model; finally, we used the trained model to predict the data in the latter half of the year.

### **:bookmark: Evaluation metrics**

* In this study, four metrics, i.e., root mean square error (RMSE), mean absolute error (MAE), normalized mean square error (NMSE), and direction consistency (DC), are used to evaluate the prediction accuracy of the model.
* RMSE (gives more weight to widely dispersed errors, so it precisely measures the error between predicted and actual values)
* MAE (makes the loss more stable and is more robust to outliers; it mainly reflects where the errors fall)
* NMSE (a value above 1 indicates a very poor model; smaller is better; it better assesses the error relative to the variability of the values)
* DC (measures how consistently the predicted and actual values agree on the direction of the index trend)

### :bookmark: EMD and CEEMDAN for data preprocessing

* Figure 7 (CEEMDAN decomposes the SH index into IMFs from high to low frequency; high frequency means high volatility and noise, low frequency means relative stability and the price trend) It illustrates the Shanghai Composite Index being decomposed into a set of IMFs sorted from high to low frequency and a residual term. The high-frequency IMFs have greater volatility and contain more noise; the low-frequency IMFs are relatively stable, representing the basic trend of stock prices.
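As a concrete reading of the four evaluation metrics listed in the Evaluation metrics section above, the sketch below implements them on return series. RMSE and MAE follow their standard definitions. NMSE and DC are given here in forms commonly used in this literature and are assumptions of this sketch, since the notes do not reproduce the paper's formulas: NMSE divides the squared error by the squared deviation of the actual series from its mean, and DC is the share of days on which the predicted and actual returns have the same sign. The sample values are hypothetical.

```python
import math


def rmse(actual, pred):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def nmse(actual, pred):
    """Squared error normalized by the spread of the actual series (assumed form)."""
    mean = sum(actual) / len(actual)
    num = sum((a - p) ** 2 for a, p in zip(actual, pred))
    den = sum((a - mean) ** 2 for a in actual)
    return num / den


def dc(actual, pred):
    """Share of days where predicted and actual returns agree in sign (assumed form)."""
    hits = sum(1 for a, p in zip(actual, pred) if a * p >= 0)
    return hits / len(actual)


# Hypothetical daily returns, not results from the paper.
actual = [0.01, -0.02, 0.015, -0.005]
pred = [0.008, -0.01, -0.002, -0.004]

print(rmse(actual, pred), mae(actual, pred), nmse(actual, pred), dc(actual, pred))
```

Under this form of NMSE, a model that always predicts the mean of the actual series scores exactly 1, which matches the note that a value above 1 signals a very poor model.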
IMF comparison:

* Pearson -> correlation between the decomposed IMF data and the original data; larger means stronger.
* Proportion of variance -> smaller means less fluctuation and a clearer trend.
* Kurtosis -> peakedness of the IMF's distribution; smaller is closer to that of a normal distribution.
* Skewness -> symmetry of the IMF's distribution; smaller means the skew is closer to that of a normal distribution.

In summary, CEEMDAN is more appropriate for dealing with non-stationary and nonlinear stock indices than EMD.

![](https://i.imgur.com/OCmyb4I.png)
![](https://i.imgur.com/C65LrPL.png)

### :bookmark: DAE

* (Input and output are given the same dimensionality so the output stays close to the original data) The encoding part of the DAE obtains the deep features of the data **by compressing the data to a certain dimensionality and removing redundant information**; then, it restores the data through the decoding part. **The DAE is set with the same input and output dimensions to make the output data as close to the original data as possible.**
* (The code layer dimensionality is varied and the RMSE computed for each setting; this is the key part of the DAE) This study sets the dimensionality of the code layer, **which is the main factor that affects the performance of the DAE**, to **(4, 5, 6, 7, 8)** in turn and calculates the **RMSE of the predictions for each dimensionality**.

![](https://i.imgur.com/xR4IkKD.png)

(Shows the optimal dimensions) Table 6 shows the optimal code layer dimensions under different data sets.

(The smaller the RMSE, the higher the prediction accuracy; with input and output dimensionality 10, the best code layer size is 6)

![](https://i.imgur.com/1udygl8.png)

### :bookmark: LSTM

* (A future time step is predicted from history, and the window size matters: too short is inaccurate, too long risks vanishing or exploding gradients; sizes 1 to 20 are tested) LSTM can predict the content of a future time step according to the historical sequence. **The size of the time window of LSTM is an important factor affecting prediction accuracy.**
* If the time window is too short, the prediction results are inaccurate. If it is too long, the risk of vanishing or exploding gradients is higher. Therefore, this paper tests window sizes from 1 to 20 to find the best time window value.
* (Table 7 shows the optimal window for each data set)

![](https://i.imgur.com/qsZTvhF.png)

(Fig. 9 uses the SH index as an example: the RMSE drops sharply as the window grows from 1 to 4, then rises slowly for windows larger than 4, so the optimal window is 4)

* According to Fig. 9, when the window size goes from 1 to 4, the RMSE drops sharply; but when the window size is larger than 4, the RMSE rises slowly. **Hence, the optimal window value of the SH index is 4.**

![](https://i.imgur.com/9NvurDD.png)
![](https://i.imgur.com/y1igtWb.png)

### :bookmark: Conclusion

(1) Our dataset uses only the closing index of stocks, which makes the input rather limited. However, stock data is very complex and is affected by many factors. Therefore, more data features can be extracted from richer data, such as macroeconomic variables and news sentiment.

(2) The model proposed in this paper has three independent modules, which makes it suitable for separate improvement. More powerful prediction tools can be included for further improvement in the future.