Typhoon Data Analysis

First Meeting

The collected typhoon data cover 1965 to 2020 and contain 141 records. The fields are described below:

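  • Field names (original column name in brackets)
    • Year [年份]
    • Typhoon number [颱風編號]
    • Typhoon name [颱風名稱]
    • Typhoon name, English [颱風名稱(英文)]
    • Intensity near Taiwan [近臺強度]
    • Minimum pressure near Taiwan, hPa [近臺最低氣壓(hPa)]
    • Radius of Beaufort force 7 winds near Taiwan, km [近臺7級風暴風半徑(km)]
    • Maximum wind speed near Taiwan, m/s [近臺最大風速(m/s)]
    • Wetland area swept, hectares [掃過濕地面積(公頃)]
    • Wetland area swept, hectares, second column with the same name [掃過濕地面積(公頃).1]
    • GDP swept, millions [掃過GDP(百萬)]
    • Adjusted agricultural loss, millions [農業損失(百萬)調整後]
    • Track classification for typhoons affecting Taiwan [侵臺路徑分類]
    • Warning period, start [警報期間(起)]
    • Warning period, end [警報期間(迄)]
    • Radius of Beaufort force 10 winds near Taiwan, km [近臺10級風暴風半徑(km)]
    • Number of warning reports issued [警報發布報數]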

Feature Analysis

  • Mild typhoons: 32 records

    • Agricultural loss (millions)
    • Minimum: 1.85
    • Maximum: 9508.69
    • Mean: 876.62
    • Median: 293.57
  • Moderate typhoons: 66 records

    • Agricultural loss (millions)
    • Minimum: 3.64
    • Maximum: 12822.73
    • Mean: 2240.57
    • Median: 845.08
  • Severe typhoons: 43 records

    • Agricultural loss (millions)
    • Minimum: 7.99
    • Maximum: 15085
    • Mean: 5198.52
    • Median: 4565.69
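
Per-intensity summaries like the ones above can be produced with a pandas group-by. The following is a minimal sketch, not the notes' actual code: the column names come from the field list, but the category labels ("mild", "moderate", "severe") and all numeric values are made-up toy data.

```python
import pandas as pd

# Toy records for illustration only; the real dataset has 141 rows.
# The intensity labels are an assumption about how the column is encoded.
df = pd.DataFrame({
    "近臺強度": ["mild", "mild", "moderate", "moderate", "severe", "severe"],
    "農業損失(百萬)調整後": [10.0, 30.0, 20.0, 60.0, 100.0, 300.0],
})

# Count, min, max, mean, and median of agricultural loss per intensity class.
summary = (
    df.groupby("近臺強度")["農業損失(百萬)調整後"]
      .agg(["count", "min", "max", "mean", "median"])
)
print(summary)
```

A large gap between mean and median within a group, as seen in the mild and moderate classes, is a quick sign of a right-skewed loss distribution.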

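Correlation Analysis

Correlation analysis measures how strongly each pair of features is related. The coefficient ranges from -1 to 1: the closer it is to 1, the stronger the positive correlation, and the closer it is to -1, the stronger the negative correlation between the two features.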

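A pairwise correlation matrix of this kind can be computed with `DataFrame.corr()`. The sketch below assumes the Pearson coefficient (the pandas default); the notes do not say which coefficient was used, and the values are toy data.

```python
import pandas as pd

# Toy values for illustration only; column names follow the field list.
df = pd.DataFrame({
    "近臺最低氣壓(hPa)": [990, 975, 960, 945, 930],
    "近臺最大風速(m/s)": [23, 30, 38, 45, 52],
    "農業損失(百萬)調整後": [50, 300, 900, 2500, 6000],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()
print(corr.round(2))
```

In this toy data, pressure falls as wind speed rises, so that pair has a coefficient close to -1, while every column has a coefficient of 1 with itself.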

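Feature Engineering

The original data offered too few features to contribute much to modeling, so we collected additional typhoon information from the Central Weather Bureau's open data: track classification, warning period (start), warning period (end), radius of Beaufort force 10 winds near Taiwan (km), and number of warning reports issued.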

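One way to attach such extra columns is a left join on a shared key. This is a sketch under assumptions: the notes do not say how the tables were joined, so the key (year plus typhoon number), the track codes, and all values here are hypothetical.

```python
import pandas as pd

# Toy base table (original features); values are made up.
base = pd.DataFrame({
    "年份": [2019, 2019, 2020],
    "颱風編號": [1, 2, 1],
    "農業損失(百萬)調整後": [120.0, 3400.0, 15.0],
})

# Toy table of extra features gathered from open data; values are made up.
extra = pd.DataFrame({
    "年份": [2019, 2019, 2020],
    "颱風編號": [1, 2, 1],
    "侵臺路徑分類": ["2", "3", "9"],
    "警報發布報數": [20, 31, 12],
})

# Left join keeps every original record; validate guards against duplicate keys.
merged = base.merge(extra, on=["年份", "颱風編號"], how="left", validate="one_to_one")
print(merged)
```

The `validate="one_to_one"` argument makes pandas raise an error if either table has repeated keys, which would otherwise silently duplicate rows.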

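Agricultural Loss Distribution

The original agricultural loss distribution is right-skewed overall, so we apply a log transform to make the distribution of the target variable more concentrated. The statistics below are consistent with log(1 + x).

  • Before transformation:
    • Count: 141.000000
    • Mean: 2833.094582
    • Standard deviation: 3811.248057
    • Minimum: 1.854000
    • 25th percentile: 188.852488
    • 50th percentile: 953.292082
    • 75th percentile: 3950.813265
    • Maximum: 15085.008875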

  • After transformation:
    • Count: 141.000000
    • Mean: 6.576912
    • Standard deviation: 2.110223
    • Minimum: 1.048722
    • 25th percentile: 5.246247
    • 50th percentile: 6.860970
    • 75th percentile: 8.281930
    • Maximum: 9.621523
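
The notes do not state which log transform was applied. As a quick check, the sketch below applies log(1 + x) to the reported minimum, quartiles, and maximum and compares the results with the reported transformed values. Only the numbers listed above are used; the variable names are illustrative.

```python
import math

# Order statistics reported before and after the transformation.
before = {"min": 1.854000, "25%": 188.852488, "50%": 953.292082,
          "75%": 3950.813265, "max": 15085.008875}
after = {"min": 1.048722, "25%": 5.246247, "50%": 6.860970,
         "75%": 8.281930, "max": 9.621523}

# Apply log(1 + x) to each raw value and print it next to the reported one.
for key, raw in before.items():
    transformed = math.log1p(raw)
    print(f"{key}: log1p = {transformed:.6f}, reported = {after[key]:.6f}")

# Largest absolute difference between computed and reported values.
max_gap = max(abs(math.log1p(before[k]) - after[k]) for k in before)

# expm1 inverts log1p, mapping predictions back to the original loss scale.
restored = math.expm1(math.log1p(before["max"]))
```

If the model is trained on the transformed target, its predictions need the inverse transform (`expm1`) before they can be read as losses in millions.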

We split the data into a training set and a test set at a 9:1 ratio and standardize each set:

  • Standardized training set (figure not available)

  • Standardized test set (figure not available)
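
A 9:1 split followed by standardization might look like the sketch below, which uses scikit-learn on synthetic data. The notes do not say how the scaler was fitted. This sketch assumes the common practice of fitting it on the training set only and reusing those statistics for the test set, so that no test information leaks into training. The feature count and random seed are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 141 records, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(141, 4))
y = rng.normal(size=141)

# 9:1 split between training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.shape, X_test_std.shape)
```

With this approach the training features have zero mean and unit variance exactly, while the test features are only approximately standardized.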

We trained several ML models on the data: Lasso (a linear model with L1 regularization), SVM, RandomForest, and XGBoost. The training and test results are as follows:

  • Lasso

    [Figure: Lasso training and test results, not available]

  • SVM

    [Figure: SVM training and test results, not available]

  • RandomForest

    [Figure: RandomForest training and test results, not available]

  • XGBoost

    [Figure: XGBoost training and test results, not available]

We observed overfitting with RandomForest and XGBoost, whereas the linear (Lasso) and SVM models could not fit even the training data.
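
Overfitting and underfitting of this kind show up when training and test R² are compared side by side. The sketch below does so on synthetic data with a weak signal. The data and hyperparameters are assumptions, not the notes' actual setup, and XGBoost is left out so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data: 141 records, 8 features, weak linear signal plus heavy noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(141, 8))
y = 0.5 * X[:, 0] + rng.normal(scale=2.0, size=141)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Hyperparameters are illustrative defaults, not tuned values.
models = {
    "Lasso": Lasso(alpha=0.1),
    "SVM": SVR(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Record (train R2, test R2) for each model.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (
        r2_score(y_train, model.predict(X_train)),
        r2_score(y_test, model.predict(X_test)),
    )
    print(f"{name}: train R2 = {scores[name][0]:.2f}, test R2 = {scores[name][1]:.2f}")
```

A high training R² paired with a much lower test R² points to overfitting, while a low R² on both sets points to underfitting.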

Finally, we tried an AutoML tool, hoping it would automatically search for the best model:

  • TPOT (AutoML)
    [Figure: TPOT training and test results, not available]

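Compared with RandomForest and XGBoost, AutoML reaches an R² of 0.01 on the testing data. This positive value is relatively better than the earlier models, all of which had negative R² on the testing data, but the model is still overfitting.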
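
To put these R² values in context: R² is 0 for a model that always predicts the mean of the true values and negative for a model that does worse than that baseline. The toy example below illustrates this with scikit-learn's `r2_score`; the numbers are made up for illustration.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]

# Always predicting the mean of y_true gives R2 = 0.
mean_baseline = [2.0, 2.0, 2.0]

# Predictions that are worse than the mean baseline give a negative R2.
worse_than_mean = [3.0, 2.0, 1.0]

r2_mean = r2_score(y_true, mean_baseline)
r2_worse = r2_score(y_true, worse_than_mean)
print(r2_mean, r2_worse)
```

By this measure, a test R² of 0.01 means the model is only marginally better than predicting the mean loss for every typhoon.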