---
title: Project17
tags: teach:MF
---

# Choose the successful IPO

:::info
Final report for the Machine Learning and Financial Technology course, Department of Information Management and Finance, National Yang Ming Chiao Tung University. Co-edited by 胡藝馨, 李亦涵, and 鄧惠文. Last updated 2021/12/25.
:::

---

## 1. Motivations

---

[今年近半破發！分析師警告「小心IPO企業」](https://ec.ltn.com.tw/article/breakingnews/3674842) ("Nearly half have fallen below their offer price this year! Analysts warn: beware of IPO companies")

US IPO fundraising set a record last year, but experts point out that current market conditions are unfavorable for newly listed companies: about half of the companies that went public in 2021 now trade below their offer price.

We build a model to predict the short-term performance of IPO companies.

###### keywords: logistic regression / IPO

## 2. Exploratory data analysis

### (1.) Data Link

* IPO raw data: [data1](https://www.iposcoop.com/scoop-track-record-from-2000-to-present/)
* S&P500 historical data: [data2](https://finance.yahoo.com/quote/%5EGSPC/)

### (2.) Variable Descriptions

The regression uses the following variables as explanatory variables:

* X1: day of the week of the IPO (1=Monday, 2=Tuesday, 3=Wednesday, 4=Thursday, 5=Friday)
* X2: issuer
* X3: total underwriters
* X4: offer price
* X5: open price on the first day
* X6: close price on the first day
* X7: price change on the first day
* X8: price change on the first day (percent)
* X9: lead manager rate
* X10: change between open and offer (percent)
* X11: change between open and offer ($)
* X12: S&P500 close-to-open change (percent)
* X13: S&P500 weekly change (percent)
* X14: home run (price triples within a year) (1=home run, 0=no home run)

### (3.) EDA Result

#### Distribution of IPO listing days
#### IPOs are most often issued on Thursday and least often on Monday

![](https://i.imgur.com/HNFkaLF.png)

#### Bar chart of IPOs by year
#### Between 2000 and 2018, the most new stocks were listed in 2014 and 2015

![](https://i.imgur.com/ESGJH4g.png)

#### Only a few new stocks triple within a year
#### Long-term gains are limited, so we decided to look at the short term

![](https://i.imgur.com/NENZRaF.png)

#### Mean first-day return by year
#### The mean first-day percentage return is positive in every year

![](https://i.imgur.com/KPq0Fx4.png)

#### Median first-day return by year
#### Comparing against the median shows that large outliers skew the return distribution

![](https://i.imgur.com/hZ7Uck7.png)

#### Distribution of first-day returns
#### Returns cluster around 0, but with a long right tail

![](https://i.imgur.com/2vptFlQ.png)

## 3. Problem formulation

##### The raw IPO data alone does not carry enough information. We conjecture that the following factors may affect an IPO:
##### 1. Issuer reputation
##### 2. Number of underwriters
##### 3. S&P500 average over the previous week (investors' view of the recent overall market)
##### 4. S&P500 average on the day (investors' view of the overall market that day)

#### After adding the variables above, we use a binary model to predict whether the IPO is worth buying (Yes or No)

>Logistic regression serves as the benchmark
>A random forest classifier serves as the improved model, compared against the benchmark
>Binary classification measures:
> Accuracy
> Precision = TP/(TP+FP)
> Recall = TP/(TP+FN)
> F-1 score
> AUC (area under the ROC curve)
>75-25 train-test split
>2000-2015 as the training set
>2015-2020 as the test set (out-of-sample)

## 4. Data analysis

### (1.) Benchmark result (logistic regression)

|In-sample |Measurement score|
|-----|--------|
|Accuracy|0.560831|
|Precision|0.558099|
|Recall|0.621569|
|F-1 score|0.588126|
|AUC|0.560285|

|Out-sample |Measurement score|
|-----|--------|
|Accuracy|0.483544|
|Precision|0.459357|
|Recall|0.665753|
|F-1 score|0.543624|
|AUC|0.496406|

### (2.) New result (random forest)

|In-sample |Measurement score|
|-----|--------|
|Accuracy|0.655786|
|Precision|0.654974|
|Recall|0.671242|
|F-1 score|0.663008|
|AUC|0.499146|

|Out-sample |Measurement score|
|-----|--------|
|Accuracy|0.564458|
|Precision|0.552645|
|Recall|0.716993|
|F-1 score|0.624182|
|AUC|0.499146|

## 5. Conclusion

1. Comparing in-sample with out-of-sample results, both models score higher in-sample on accuracy, precision, and F-1 score. Recall is the exception: it is higher out-of-sample for both models.
2. Comparing logistic regression with random forest, the random forest scores higher on accuracy, precision, recall, and F-1 score in both samples, while its reported AUC stays near 0.5.

## 6. Reference

[Binary Model Insights (二元模型的深入解析)](https://docs.aws.amazon.com/zh_tw/machine-learning/latest/dg/binary-model-insights.html)

[Multiple decision trees are more powerful: random forest (多棵決策樹更厲害：隨機森林)](https://ithelp.ithome.com.tw/articles/10272586)

[Logistic regression model (邏輯斯迴歸模型 Logistic Regression)](https://pyecontech.com/2020/02/06/python_logistic_regression/)

## 7. Data and code
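The project's own code is not reproduced here. As a minimal sketch of the evaluation setup from section 3, the snippet below computes accuracy, precision = TP/(TP+FP), recall = TP/(TP+FN), and the F-1 score from 0/1 labels, and splits records chronologically. The names `binary_scores`, `split_by_year`, and the `"year"` key are hypothetical. Because the report lists 2015 in both the training and the test range, the sketch assumes 2015 belongs to the training set. AUC is left out since it needs predicted probabilities rather than hard labels.

```python
from dataclasses import dataclass


@dataclass
class BinaryScores:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_scores(y_true, y_pred):
    """Compute accuracy, precision, recall and F-1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return BinaryScores(accuracy, precision, recall, f1)


def split_by_year(rows, last_train_year):
    """Chronological split on a hypothetical "year" key.

    Assumption: rows dated last_train_year or earlier go to the training
    set, and strictly later rows go to the out-of-sample test set.
    """
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test
```

Calling `split_by_year(rows, 2015)` and then `binary_scores` on each part's labels and predictions would yield one in-sample and one out-of-sample set of scores, matching the layout of the tables in section 4.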