Platform / Azure
===
###### tags: `ML / Time series`
###### tags: `ML`, `Time series`, `Azure`

[TOC]

## News
### [Azure Machine Learning service officially launches time-series forecasting](https://www.ithome.com.tw/news/131175)
- **Key points**
  - **Feature engineering** -> improves predictive power
    - **Time features**
    - Holiday features: sales forecasts are easily affected by holidays
      [![](https://i.imgur.com/f7uMAaz.png)](https://docs.microsoft.com/zh-tw/azure/open-datasets/dataset-catalog#supplemental-and-common-datasets)
    - **Lag variables over a time window**
      > #target_lags
      > [![](https://i.imgur.com/GL7ieAb.png)](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast#configuration-settings)
    - **Aggregation features over a time window**
      > #target_rolling_window_size
      > [![](https://i.imgur.com/HqgvuuW.png)](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast#configuration-settings)
      - Especially useful for energy demand forecasting
  - **Cross-validation**
    > Purpose: an important procedure for measuring and reducing model sampling error
    - Rolling Origin Cross Validation (the rolling-origin form of time-series cross-validation)
  - **Inference**
    - The forecast function can take a prediction context -> less real-time model retraining
      [![](https://i.imgur.com/py7ptAo.png)](https://i.imgur.com/py7ptAo.png)
      - **Motivation**: IoT **sensors** generate data continuously, so retraining on the huge dataset over and over is impractical
      - **Approach**: retrieve the forecast values, interpolating the training and prediction context where needed
- Example forecast settings

  | zh-TW UI label | English UI label | Value |
  |------------|------------|------|
  | 時間資料行 | Time column | Month |
  | 預測範圍 | Forecast horizon | 30 |
  | 預測目標延隔 | Forecast target lags | 24 |
  | 目標移動視窗大小 | Target rolling window size | 24 |
  | 季節性與趨勢 | Season and trend | Season and trend |
  | 國家或地區的假日 | Country or region for holidays | United States (US) |
  | 歷程資料的收集頻率 | Frequency of collected historic data | Month |
  | 目標彙總函式 | Target aggregation function | None |

### [Microsoft's cloud machine learning service adds several new time-series capabilities: Build more accurate forecasts with new capabilities in automated machine learning](https://azure.microsoft.com/zh-tw/blog/build-more-accurate-forecasts-with-new-capabilities-in-automated-machine-learning/)
> Source: [Azure Machine Learning service officially launches time-series forecasting](https://www.ithome.com.tw/news/131175)
> ![](https://i.imgur.com/ouS5WPD.png =75%x)

<hr>

## Azure Machine Learning documentation
### [Tutorial: Forecast demand with automated machine learning](https://docs.microsoft.com/zh-tw/azure/machine-learning/tutorial-automated-ml-forecast)
> Tutorials > Studio > Automated ML (UI) > Forecast demand (bike share data)
> #bikeshare

| Additional setting | Description | Value for the tutorial |
|--------|------|------------|
| Primary metric | Evaluation metric used to measure the machine learning algorithm. | Normalized root mean squared error |
| Explain best model | Automatically shows explainability for the best model created by automated ML. | Enabled |
| Blocked algorithms | Algorithms you want to exclude from the training job | Extreme Random Trees |
| Additional forecasting settings | These settings help improve the accuracy of the model.<br>**Forecast target lags**: how far back to construct the lags of the target variable<br>**Target rolling window**: the size of the rolling window over which features (such as max, min, and sum) are generated. | Forecast target lags: None<br>Target rolling window size: None |
| Exit criterion | The training job stops once a criterion is met. | Training job time (hours): 3<br>Metric score threshold: None |
| Validation | Choose the cross-validation type and the number of tests. | Validation type: k-fold cross-validation<br>Number of validations: 5 |
| Concurrency | Maximum number of parallel iterations executed per iteration | Max concurrent iterations: 6 |

### [Set up AutoML to train a time-series forecasting model with Python](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast)
> How-to guides > Train models > Automated machine learning > Auto-train a forecast model

<hr>

## [[Service] Azure Time Series Insights](https://azure.microsoft.com/zh-tw/services/time-series-insights/)

<hr>

## Dataset: AirPassengers
### Same algorithm with different normalizations: only the best R2 per algorithm is listed

| Algorithm | R2 |
|------------------------------------|----------|
| VotingEnsemble | 0.88346 |
| ExponentialSmoothing | 0.82321 |
| SparseNormalizer, XGBoostRegressor | 0.79509 |
| MinMaxScaler, ExtremeRandomTrees | 0.73817 |
| MaxAbsScaler, LightGBM | 0.73171 |
| MinMaxScaler, RandomForest | 0.70417 |
| SeasonalNaive | 0.61156 |
| MaxAbsScaler, ElasticNet | 0.60552 |
| ProphetModel | 0.52864 |
| AutoArima | -0.00395 |
| SeasonalAverage | -0.28031 |
| Naive | -0.63118 |
| MinMaxScaler, GradientBoosting | -0.64355 |
| Average | -1.00000 |

- ### [ForecastingModelClassNames class](https://docs.microsoft.com/zh-tw/python/api/azureml-automl-core/azureml.automl.core.shared.constants.modelclassnames.forecastingmodelclassnames?view=azure-ml-py)
  - Arimax
  - AutoArima
  - Average
  - ExponentialSmoothing
  - Naive
  - Prophet
  - SeasonalAverage
  - SeasonalNaive
  - TCNForecaster

### Normalization methods used for time series:

| Normalization method |
|-----------------------|
| SparseNormalizer |
| StandardScalerWrapper |
| MinMaxScaler |
| MaxAbsScaler |
| RobustScaler |
| PCA |

### The Best: VotingEnsemble
- ### Main components
  > XGBoostRegressor x 0.45 + ExponentialSmoothing x 0.55
  - **SparseNormalizer, XGBoostRegressor**
    - Ensemble weight: 0.45454545454545453
    - Data transformation:
      ```json=
      {
          "class_name": "SparseNormalizer",
          "module": "automl.client.core.common.model_wrappers",
          "param_args": [],
          "param_kwargs": {
              "norm": "l1"
          },
          "prepared_kwargs": {},
          "spec_class": "preproc"
      }
      ```
    - Training algorithm:
      ```json=
      {
          "class_name": "XGBoostRegressor",
          "module": "automl.client.core.common.model_wrappers",
          "param_args": [],
          "param_kwargs": {
              "booster": "gbtree",
              "colsample_bytree": 0.8,
              "eta": 0.05,
              "grow_policy": "lossguide",
              "max_bin": 63,
              "max_depth": 3,
              "max_leaves": 0,
              "n_estimators": 100,
              "objective": "reg:linear",
              "reg_alpha": 1.0416666666666667,
              "reg_lambda": 1.0416666666666667,
              "subsample": 0.5,
              "tree_method": "hist"
          },
          "prepared_kwargs": {},
          "spec_class": "sklearn"
      }
      ```
  - **ExponentialSmoothing**
    - Ensemble weight: 0.5454545454545454
      ```json=
      {
          "spec_class": "timeseries",
          "class_name": "ExponentialSmoothing",
          "module": "automl.client.core.common.forecasting_models",
          "param_args": [],
          "param_kwargs": {},
          "prepared_kwargs": {}
      }
      ```
- ### Metrics

  | Metric | zh-TW UI label | Value |
  |----------------------------------------|----------------------|----------|
  | Explained variance | 解釋的變異 | 0.90773 |
  | Mean absolute error | 平均絕對誤差 | 15.222 |
  | Mean absolute percentage error | 平均絕對百分比錯誤 | 3.9589 |
  | Median absolute error | 中間值絕對錯誤 | 14.007 |
  | Normalized mean absolute error | 標準化平均絕對錯誤 | 0.037960 |
  | Normalized median absolute error | 標準化中間值絕對錯誤 | 0.034929 |
  | Normalized root mean squared error | 標準化均方根誤差 | 0.045756 |
  | Normalized root mean squared log error | 標準化均方根記錄錯誤 | 0.029694 |
  | R2 score | R2 分數 | 0.88346 |
  | Root mean squared error | 均方根錯誤 | 18.348 |
  | Root mean squared log error | 均方根記錄錯誤 | 0.046696 |
  | Spearman correlation | Spearman 相互關聯 | 0.97343 |

### Experiment parameters ([docs: configuration settings](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast#configuration-settings))
![](https://i.imgur.com/ilmzOzE.png)

- **Time column (parameter: `time_column_name`)**
  > The column that holds the timestamps. It does not have to be numeric; something like 'yyyy-mm' also works.
  - Explanation from the UI:
    > The column that contains the timestamps of the time-series data.
  - Explanation from the documentation:
    > Used to specify the datetime column in the input data that is used to build the time series and infer its frequency.
- **Time series identifiers (parameter: `time_series_id_column_names`)**
  > When several rows share a timestamp, which columns tell them apart?
  - Explanation from the UI:
    > When the data has multiple rows with the same timestamp, these column names are used to uniquely identify the time series. If not defined, the dataset is assumed to be one time series.
  - Explanation from the documentation:
    > The column name(s) used to uniquely identify the time series in data that has multiple rows with the same timestamp. If time series identifiers are not defined, the dataset is assumed to be one time series. To learn more about single time series, see energy_demand_notebook.
- **Frequency (parameter: `freq`) (default: auto-detect)**
  ![](https://i.imgur.com/cmhWNUf.png)
  - Explanation from the UI:
    > The frequency at which the historic data was collected.
  - Explanation from the documentation:
    > The frequency of the time-series dataset. This parameter represents the period with which events are expected to occur, such as daily, weekly, yearly, and so on. The frequency must be a pandas offset alias. Learn more about [frequency](#frequency--目標-資料匯總).

:::warning
:bulb: If a composite time series is made up of data with different periods, how should the frequency be inferred?
:::

- **Forecast horizon (parameter: `forecast_horizon`)**
  > Forecast horizon: the forecast period, i.e. how far ahead to predict
  > ![](https://i.imgur.com/drtE1v4.png)
  > ([source](https://www.ithome.com.tw/news/131175))
  - Explanation from the UI:
    > The forecast horizon is the number of future periods you want to predict. This integer horizon is in units of the time-series frequency (for example daily, weekly).
  - Explanation from the documentation:
    > Defines how many periods forward you want to forecast. The horizon is in units of the time-series frequency. Units are based on the time interval of the training data that the forecaster should predict out, for example monthly or weekly.
- **Enable deep learning (parameter: `enable_dnn`)**
  For example TensorFlowDNN, TensorFlowLinearRegressor
  - [ ] Disabled: deep learning algorithms are blocked
  - [x] Enabled: deep learning algorithms are allowed

<hr>

- **Configuration settings**
  ![](https://i.imgur.com/zjT9Zb0.png)

---

- **Additional configuration**
  ![](https://i.imgur.com/ukk0VL9.png)
- **Forecast target lags (parameter: `target_lags`)**
  - Explanation from the UI:
    > The number of rows to lag the target values, based on the frequency of the data. The lag is expressed as a list or a single integer. Lags should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default.
  - Explanation from the documentation:
    > The number of rows to lag the target values, based on the frequency of the data. This is expressed as a single integer. Lags should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities in the previous 3 months. In this example, you may want to lag the target (demand) by 3 months so that the model trains on the correct relationship.

:::warning
:warning: **Note**
With "**Forecast target lags**" or "**Target rolling window size**" enabled, forecast can only predict the first time point; every later time point raises an error. In other words, if the last time point in the data is t0, only the next time point t1 can be predicted, not t2, t3, t4, ...
:::

:::warning
:warning: **Error message**
```json
{
  "error": "DataException:
    Message: No y values were provided. We expected non-null target values as prediction context because there is a gap between train and test and the forecaster depends on previous values of target. If it is expected, please run forecast() with ignore_data_errors=True. In this case the values in the gap will be imputed.
    InnerException: None
    ErrorResponse:
    {
      "error": {
        "code": "UserError",
        "message": "No y values were provided. We expected non-null target values as prediction context because there is a gap between train and test and the forecaster depends on previous values of target. If it is expected, please run forecast() with ignore_data_errors=True. In this case the values in the gap will be imputed.",
        "inner_error": {
          "code": "BadData",
          "inner_error": {
            "code": "TimeseriesNoDataContext"
          }
        },
        "reference_code": "6f772f24-febc-11ea-ae51-04d3b0c6010a"
      }
    }
}
```
:::

- **Target rolling window size (parameter: `target_rolling_window_size`)**
  - Explanation from the documentation:
    > The n historical periods used to generate forecast values, less than or equal to the training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.

<hr>

## [Debug] Dataset: Air-Passengers
### Added features x1, x2
> `x1 * 10 + x2 = target`

- ExponentialSmoothing
  - ### Test: 1958/09, target=404
    ![](https://i.imgur.com/EiwJudv.png)
  - ### Test: 1958/12, target=337
    ![](https://i.imgur.com/s0tFmvC.png)

### %Y-%m-%d
- ### ExponentialSmoothing: a flat line
  ```
  0 1958-09-01 | 404 <-> 440.8770886970721
  1 1958-10-01 | 359 <-> 440.8770886970721
  2 1958-11-01 | 310 <-> 440.8770886970721
  ...
  25 1960-10-01 | 461 <-> 440.8770886970721
  26 1960-11-01 | 390 <-> 440.8770886970721
  27 1960-12-01 | 432 <-> 440.8770886970721
  ```
  - R2: -0.0013420918978028773
  - RMSE: 78.58747944224389
- ### ProphetModel:
  ![](https://i.imgur.com/OIg2yXt.png)
  ```
  0 1958-09-01 | 404 <-> 419.305050802139
  1 1958-10-01 | 359 <-> 389.8801258535971
  2 1958-11-01 | 310 <-> 365.7606775069996
  3 1958-12-01 | 337 <-> 389.5987707144491
  4 1959-01-01 | 360 <-> 397.9983924864402
  ...
  23 1960-08-01 | 606 <-> 533.1826004359499
  24 1960-09-01 | 508 <-> 497.26132010487396
  25 1960-10-01 | 461 <-> 467.6794440593952
  26 1960-11-01 | 390 <-> 438.8454295124484
  27 1960-12-01 | 432 <-> 466.66006036002705
  ```
  - R2: 0.7359690326342093
  - RMSE: 40.35427572504492
- ### VotingEnsemble
  ![](https://i.imgur.com/5UF8BRd.png)
  ![](https://i.imgur.com/ZrASpWc.png)
  ExponentialSmoothing accounts for 93.33% of the ensemble weight
  ```
  0 1958-09-01 | 404 <-> 440.8131313641731
  1 1958-10-01 | 359 <-> 437.792446178136
  2 1958-11-01 | 310 <-> 435.61363437988695
  ...
  25 1960-10-01 | 461 <-> 438.41202564658755
  26 1960-11-01 | 390 <-> 435.4013754578191
  27 1960-12-01 | 432 <-> 437.1994129353259
  ```
  - R2: 0.06621549213181943
  - RMSE: 75.89015778714167

<hr>

## [Debug] Dataset: bikeshare
### [v1] Features include instant, casual, registered
> Train: 1/1/2011 ~ 12/31/2012

- ExponentialSmoothing
  - ### Test: 1/1/2013
    ![](https://i.imgur.com/cfvr8Z1.png)
  - ### Test: 1/2/2013
    ![](https://i.imgur.com/xGgYK6T.png)
  - ### Test: 1/31/2013
    ![](https://i.imgur.com/ywCfii8.png)
- arima:
  - dataset
    ```python=
    import pandas as pd

    # Relabel the 2011 rows as 2013 to build a test set that follows the training data
    df = pd.read_csv('bike-no.csv')
    df['date'] = df['date'].map(lambda d: d.replace('2011', '2013'))
    # Shift the year code by two (0 -> 2, 1 -> 3); anything else becomes -1
    df['yr'] = df['yr'].map(lambda y: 2 if y == 0 else (3 if y == 1 else -1))
    # 731 days later is 3 weekdays later (731 % 7 == 3)
    df['weekday'] = df['weekday'].map(lambda w: (w + 3) % 7)
    # Continue the running row index after the 731 training rows
    df['instant'] = df['instant'] + 731
    # Move 'instant' to the second column
    df.insert(1, 'instant', df.pop('instant'))
    ```
    ![](https://i.imgur.com/hHwDOtl.png)
  - predict
    ```python=
    import numpy as np

    def get_data_in_dict(no):
        """Return row `no` as a plain dict, leaving out the last column (the target)."""
        data_in_dict = {}
        for c, column in enumerate(df.columns[:-1]):
            value = df.iloc[no, c]
            # Convert NumPy integers to built-in int so the dict is JSON-serializable
            if isinstance(value, np.integer):
                value = int(value)
            data_in_dict[column] = value
        return data_in_dict

    print(get_data_in_dict(0))
    ```
    ```python=
    # `predict` is defined elsewhere and is not shown in these notes
    for no in range(15, 45):
        data_in_dict = get_data_in_dict(no)
        # print(data_in_dict)
        y_true = df['cnt'][no]
        y_pred = predict(data_in_dict)
        print(no, df['date'][no], '|', y_true, '<->', y_pred)
    ```
    ![](https://i.imgur.com/0TySPUt.png)

### [v2] Features exclude instant, casual, registered
> Train: 1/1/2011 ~ 8/7/2012 (585 rows)
> Test: 8/8/2012 ~ 12/31/2012 (146 rows)

- VotingEnsemble: test:
  ```
  0 8/8/2012 | 7534 <-> 7072.026316266562
  1 8/9/2012 | 7286 <-> 7145.842188045083
  2 8/10/2012 | 5786 <-> 6439.389945118439
  3 8/11/2012 | 6299 <-> 6594.901191409788
  4 8/12/2012 | 6544 <-> 6795.013792893975
  5 8/13/2012 | 6883 <-> 6798.6542891665495
  ...
  141 12/27/2012 | 2114 <-> 3016.0935052933332
  142 12/28/2012 | 3095 <-> 3489.22407675509
  143 12/29/2012 | 1341 <-> 3270.661316407552
  144 12/30/2012 | 1796 <-> 3251.650973562273
  145 12/31/2012 | 2729 <-> 3141.6894130786895
  ```
  - R2: 0.6321868248953026
  - RMSE: 1138.694670158633
- VotingEnsemble: train(X=['Train'], y='cnt'): test:
  ```
  0 8/8/2012 | 7534 <-> 7227.139957477068
  1 8/9/2012 | 7286 <-> 7138.3329699209835
  2 8/10/2012 | 5786 <-> 7143.269660720456
  3 8/11/2012 | 6299 <-> 6915.16689980422
  4 8/12/2012 | 6544 <-> 6700.404544003703
  5 8/13/2012 | 6883 <-> 7052.568184486838
  ...
  141 12/27/2012 | 2114 <-> 8652.378613155686
  142 12/28/2012 | 3095 <-> 8594.877106937422
  143 12/29/2012 | 1341 <-> 8355.526908730173
  144 12/30/2012 | 1796 <-> 8229.97351797278
  145 12/31/2012 | 2729 <-> 8382.32600419411
  ```
  - R2: -1.4780705071281757
  - RMSE: 2955.632793480974
- ExponentialSmoothing: test:
  ```
  0 8/8/2012 | 7534 <-> 6893.2517263733325
  1 8/9/2012 | 7286 <-> 6893.2517263733325
  2 8/10/2012 | 5786 <-> 6893.2517263733325
  3 8/11/2012 | 6299 <-> 6893.2517263733325
  4 8/12/2012 | 6544 <-> 6893.2517263733325
  5 8/13/2012 | 6883 <-> 6893.2517263733325
  ...
  141 12/27/2012 | 2114 <-> 6893.2517263733325
  142 12/28/2012 | 3095 <-> 6893.2517263733325
  143 12/29/2012 | 1341 <-> 6893.2517263733325
  144 12/30/2012 | 1796 <-> 6893.2517263733325
  145 12/31/2012 | 2729 <-> 6893.2517263733325
  ```
  - R2: -0.286438163772343
  - RMSE: 2129.5512915365766

<hr>

## Bug
- Late in the run, the job gets stuck in an infinite loop
  ![](https://i.imgur.com/JBMxHMm.png)
  `It was eventually aborted after exceeding the 24h time limit`

<hr>

## References (still to digest)
- ### [Time series analysis in Azure Data Explorer](https://docs.microsoft.com/zh-tw/azure/data-explorer/time-series-analysis)
- ### [Set up AutoML to train a time-series forecasting model with Python](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast)
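
<hr>

## [Debug] Cross-checking R2 and RMSE

The R2 and RMSE figures listed under each output in the [Debug] sections can be recomputed from the printed `y_true <-> y_pred` pairs. The sketch below is a minimal plain-Python version, not the code AutoML itself runs; the helper names (`rmse`, `r2_score`, `implied_r2`) are made up for illustration. It also relies on the identity R2 = 1 - MSE / Var(y): assuming two models were scored on the same test set, they share Var(y), so one model's (RMSE, R2) pair implies the R2 that goes with another model's RMSE. The sample numbers are copied from the Air-Passengers and bikeshare [v2] outputs; only the six Air-Passengers rows visible in the notes are used, so that score differs from the one reported for the full test set.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of the predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (never above 1)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def implied_r2(rmse_ref, r2_ref, rmse_other):
    """R2 implied for another model scored on the same test set.

    The reference (RMSE, R2) pair fixes Var(y) through R2 = 1 - MSE / Var(y).
    """
    target_variance = rmse_ref ** 2 / (1.0 - r2_ref)
    return 1.0 - rmse_other ** 2 / target_variance


# The six Air-Passengers rows visible in the notes, with the flat
# ExponentialSmoothing forecast printed next to them.
y_true = [404, 359, 310, 461, 390, 432]
flat_pred = [440.8770886970721] * len(y_true)
print("RMSE:", rmse(y_true, flat_pred))
print("R2:  ", r2_score(y_true, flat_pred))

# bikeshare [v2]: derive the VotingEnsemble R2 from its RMSE, using the
# ExponentialSmoothing (RMSE, R2) pair as the reference.
print("implied R2:", implied_r2(2129.5512915365766, -0.286438163772343,
                                1138.694670158633))
```

A flat forecast that misses the test mean always scores R2 below 0, and R2 can never exceed 1, which makes the identity a handy sanity check when metrics are copied by hand.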