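Time Series Datasets
===
###### tags: `ML / Time Series`
###### tags: `ML`, `Time Series`, `Datasets`

<br>

[TOC]

<br>

## Birth Population Estimates
- [Taiwan births and deaths & the sixty-year trend](https://morrisjfwong.com/blog/台灣出生死亡人口數-六十年趨勢/)
  ![](https://i.imgur.com/yo9PFMx.png)
- [National Development Council population charts](https://ct.org.tw/html/news/3-3.php?cat=10&article=1354805)
  ![](https://i.imgur.com/b24QRJB.png)

<br>
<hr>
<br>

## Highway Traffic Volume
- [Freeway Bureau analyzes 9 billion ETC records to predict when highways will be congested](https://www.ithome.com.tw/article/98069)
  ![](https://i.imgur.com/Y7RnieA.png)

  Other factors:
  - Traffic characteristics
  - Tourism and recreation
  - Weather
  - Online public sentiment
  - Closed-circuit television (CCTV)
  - Intercity bus and Taiwan Railways passenger volume
  - Visitor counts at attractions
  - Lodging occupancy rate
  - Weather forecast
  - Keyword search frequency

[Directorate-General of Personnel Administration calendar for ROC year 104 (2015)](https://www.319papago.idv.tw/holiday/2015-hr/2015_HR.html)

<br>
<hr>
<br>

## [[Kaggle] Air Passengers](https://www.kaggle.com/rakannimer/air-passengers)
> Monthly number of airline passengers

### Data analysis
- ### [TimeSeries Analysis 📈A Complete Guide 📚](https://www.kaggle.com/andreshg/timeseries-analysis-a-complete-guide)
- ### [Day 24: Sales Forecasting (2): "Time Series Analysis" Techniques](https://ithelp.ithome.com.tw/articles/10195635)
  - Monthly airline passenger counts
    ![](https://i.imgur.com/8uasF3r.png =50%x)
  - Daily TSMC stock prices, 1994-2013
    ![](https://i.imgur.com/5Lq1Dgj.png =50%x)

<br>

### read_csv
- ### Load the CSV (default parameters)
  ```python=
  import pandas

  # Load every column with the default integer index
  df = pandas.read_csv('air-passengers.csv')
  ```
  ![](https://i.imgur.com/qVHpcu2.png)
  - Air traffic peaks every year in July and August
  - Air traffic is lowest every year in November, December, or February
  - As yearly traffic grows, the amplitude of the seasonal swing grows with it
    - 10% of 100 passengers is ±10
    - 10% of 500 passengers is ±50

- ### Load the CSV and keep only the specified column
  ```python=
  import pandas
  import matplotlib.pyplot as plt

  # usecols=[1] keeps only the second column (the passenger counts)
  df = pandas.read_csv('air-passengers.csv', usecols=[1])
  plt.plot(df)
  plt.show()
  ```
  ![](https://i.imgur.com/J0EzAsQ.png)
  ![](https://i.imgur.com/NMd4d4h.png)

- ### Load the CSV and use the time axis as the row labels
  ```python=
  import pandas
  import matplotlib.pyplot as plt

  # index_col='Month' turns the Month column into the DataFrame index
  df = pandas.read_csv('air-passengers.csv', index_col='Month')
  plt.plot(df)
  plt.show()
  ```
  ![](https://i.imgur.com/xECDhHi.png)
  ![](https://i.imgur.com/TTcKjJi.png)

- ### Apply first-order differencing
  ```python=
  import pandas
  import matplotlib.pyplot as plt

  df = pandas.read_csv('air-passengers.csv', usecols=[1])

  # First-order difference: each value minus the previous one
  # (the first row becomes NaN)
  df1 = df.diff(1)
  plt.plot(df1)
  plt.show()
  ```
  ![](https://i.imgur.com/c5gXMKV.png)

- ### Plot the training and test data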
  ```python=
  import pandas
  import matplotlib.pyplot as plt

  df_train = pandas.read_csv('air-passengers-train.csv', index_col=[0])
  df_test = pandas.read_csv('air-passengers-test.csv', index_col=[0])

  # plot the training data
  plt.figure(figsize=(12, 5), dpi=100)
  plt.plot(df_train, label='train')

  # prepare the future data; it starts at the last training point
  # so that the two curves connect
  ## x values
  month = [df_train.index[-1]]
  month.extend(df_test.index)

  ## y values (.iloc[-1] selects the last row by position)
  passengers = [df_train["#Passengers"].iloc[-1]]
  passengers.extend(df_test["#Passengers"])

  df_future = pandas.DataFrame({ 'passengers': passengers }, index=month)
  plt.plot(df_future, label='test')

  plt.legend(loc='upper left')
  # one tick per year, placed at each January
  plt.xticks([str(y) + '-01' for y in range(1949, 1962, 1)])
  plt.grid()
  plt.show()
  ```
  [![](https://i.imgur.com/IrfvWqG.png)](https://i.imgur.com/IrfvWqG.png)

<br>

### Forecasting
- ### [[LSTM] Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/)
  - LSTM Network for Regression
    - Train Score: 22.93 RMSE
    - Test Score: 47.53 RMSE
  - LSTM for Regression Using the Window Method
    - Train Score: 24.19 RMSE
    - Test Score: 58.03 RMSE
  - LSTM for Regression with Time Steps
    - Train Score: 23.69 RMSE
    - Test Score: 58.88 RMSE
  - LSTM with Memory Between Batches
    - Train Score: 20.74 RMSE
    - Test Score: 52.23 RMSE
  - Stacked LSTMs with Memory Between Batches
    - Train Score: 20.49 RMSE
    - Test Score: 56.35 RMSE

<br>
<hr>
<br>

## [[Kaggle] Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand/data)

<br>
<hr>
<br>

## [Kaggle] Weather
### [Tabular Playground Series - Jul 2021](https://www.kaggle.com/c/tabular-playground-series-jul-2021)
> Predict air pollution levels (carbon monoxide, benzene, nitrogen oxides) from temperature, humidity, and sensor readings

<br>
<hr>
<br>

## [Kaggle] Sales
### [Sales Time Series Forecasting](https://www.kaggle.com/c/sales-time-series-forecasting-ca-afcs2020/overview)

### [Store Sales - Time Series Forecasting](https://www.kaggle.com/c/store-sales-time-series-forecasting/overview)

### [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy)
> Forecast daily sales for the next 28 days

### [M5 Forecasting - Uncertainty](https://www.kaggle.com/c/m5-forecasting-uncertainty)
> Forecast daily sales for the next 28 days and provide uncertainty estimates for those forecasts

### [Walmart Recruiting - Store Sales Forecasting](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/overview)

### [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)

<br>
<hr>
<br>

## [Kaggle] Time-by-item data
### [Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting/overview)
> Forecast web traffic (daily page views) for roughly 145,000 Wikipedia articles

<br>
<hr>
<br>

## [[Azure] Rental demand for a bike-sharing service](https://docs.microsoft.com/zh-tw/azure/machine-learning/tutorial-automated-ml-forecast)
- MachineLearningNotebooks
  - how-to-use-azureml
    - automated-machine-learning
      - [forecasting-bike-share](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-bike-share)
        - [bike-no.csv](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-bike-share/bike-no.csv)
- Column descriptions
  - [Decision tree regression with Python Spark MLlib](https://www.itread01.com/content/1544348007.html)
    > Root mean squared error: RMSE=117.804043648
  - [[Python, hmm~ machine learning] Analyzing shared-bike deployment volume with Python 3](https://www.twblogs.net/a/5c3f47cabd9eee35b3a66981)
    > R-squared: 0.39423906942549125
    > RMSE: 142.08580203044002
  - [Final Project - Data analysis and visualization in R](https://rstudio-pubs-static.s3.amazonaws.com/98994_613e4a13f448452c937233f146f80d59.html)
- Terminology
  - [Let's Go Look at the Clouds](http://mail.atm.ncu.edu.tw/~hong/atmhmpg/clouds/ahead.htm)
    - When clouds cover less than one tenth of the sky, it is called clear (晴朗; Clear, CLR)
    - When clouds cover one tenth to five tenths of the sky, it is called scattered (疏雲; Scattered, SCT, partly cloudy)
    - When clouds cover six tenths to nine tenths of the sky, it is called broken (裂雲; Broken, BKN, cloudy)
    - When clouds cover more than nine tenths of the sky, it is called overcast (陰天; Overcast, OVC)
  - [scattered cloud - 疏雲 - National Academy for Educational Research bilingual glossary](https://terms.naer.edu.tw/detail/986615/)
    - scattered cloud: 疏雲

<br>

## [[Azure] 600,000 records of traffic from an arbitrary web service](https://docs.microsoft.com/zh-tw/azure/data-explorer/time-series-analysis)

| TimeStamp                | BrowserVer             | OsVer      | Country        |
|--------------------------|------------------------|------------|----------------|
| 2016-08-21T00:00:10.625Z | Chrome 51.0            | Windows 10 | United States  |
| 2016-08-21T00:00:31.295Z | Internet Explorer 11.0 | Windows 10 | United States  |
| 2016-08-21T00:00:42.879Z | Chrome 52.0            | Windows 10 | United States  |
| 2016-08-21T00:00:47.017Z | Chrome 51.0            | Windows 10 | United States  |
| 2016-08-21T00:01:06.642Z | Chrome 52.0            | Windows 7  | United Kingdom |
| 2016-08-21T00:02:02.893Z | Internet Explorer 11.0 | Windows 7  |                |
| 2016-08-21T00:02:59.909Z | Chrome 52.0            | Windows 7  | United Kingdom |
| 2016-08-21T00:03:03.315Z | Firefox 48.0           | Windows 10 | United States  |
| 2016-08-21T00:03:07.263Z | Firefox 48.0           | Windows 10 | United States  |
| 2016-08-21T00:03:13.471Z | Internet Explorer 11.0 | Windows 7  |                |

[![](https://i.imgur.com/Ajf7R2M.png)](https://i.imgur.com/Ajf7R2M.png)

<br>
<hr>
<br>

## [[UCI] Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality)
> Source: [A Multivariate Time Series Guide to Forecasting and Modeling (with Python codes)](https://www.analyticsvidhya.com/blog/2018/09/multivariate-time-series-guide-forecasting-modeling-python-codes/)
> #multivariate time series
> #`from statsmodels.tsa.vector_ar.var_model import VAR`

![](https://i.imgur.com/VFIcVey.png)

| Date      | Time     | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T    | RH   | AH     |
|-----------|----------|--------|-------------|----------|----------|---------------|---------|--------------|---------|--------------|-------------|------|------|--------|
| 3/10/2004 | 18:00:00 | 2.6    | 1360        | 150      | 11.9     | 1046          | 166     | 1056         | 113     | 1692         | 1268        | 13.6 | 48.9 | 0.7578 |
| 3/10/2004 | 19:00:00 | 2      | 1292        | 112      | 9.4      | 955           | 103     | 1174         | 92      | 1559         | 972         | 13.3 | 47.7 | 0.7255 |
| 3/10/2004 | 20:00:00 | 2.2    | 1402        | 88       | 9.0      | 939           | 131     | 1140         | 114     | 1555         | 1074        | 11.9 | 54.0 | 0.7502 |
| 3/10/2004 | 21:00:00 | 2.2    | 1376        | 80       | 9.2      | 948           | 172     | 1092         | 122     | 1584         | 1203        | 11.0 | 60.0 | 0.7867 |
| 3/10/2004 | 22:00:00 | 1.6    | 1272        | 51       | 6.5      | 836           | 131     | 1205         | 116     | 1490         | 1110        | 11.2 | 59.6 | 0.7888 |
| 3/10/2004 | 23:00:00 | 1.2    | 1197        | 38       | 4.7      | 750           | 89      | 1337         | 96      | 1393         | 949         | 11.2 | 59.2 | 0.7848 |
| 3/11/2004 | 0:00:00  | 1.2    | 1185        | 31       | 3.6      | 690           | 62      | 1462         | 77      | 1333         | 733         | 11.3 | 56.8 | 0.7603 |
| 3/11/2004 | 1:00:00  | 1      | 1136        | 31       | 3.3      | 672           | 62      | 1453         | 76      | 1333         | 730         | 10.7 | 60.0 | 0.7702 |
| 3/11/2004 | 2:00:00  | 0.9    | 1094        | 24       | 2.3      | 609           | 45      | 1579         | 60      | 1276         | 620         | 10.7 | 59.7 | 0.7648 |

<br>
<hr>
<br>

## [tensorflow] The weather dataset
> Source: [Time series forecasting](https://www.tensorflow.org/tutorials/structured_data/time_series)
> Download: https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip

<br>

### [Column descriptions](https://keras.io/examples/timeseries/timeseries_weather_forecasting/)

| Index | Features        | Format              | Description |
|-------|-----------------|---------------------|-------------|
| 1     | Date Time       | 01.01.2009 00:10:00 | Date-time reference |
| 2     | p (mbar)        | 996.52              | The pascal SI derived unit of pressure used to quantify internal pressure. Meteorological reports typically state atmospheric pressure in millibars. |
| 3     | T (degC)        | -8.02               | Temperature in Celsius |
| 4     | Tpot (K)        | 265.4               | Potential temperature in Kelvin |
| 5     | Tdew (degC)     | -8.9                | Temperature in Celsius relative to humidity. Dew Point is a measure of the absolute amount of water in the air, the DP is the temperature at which the air cannot hold all the moisture in it and water condenses. |
| 6     | rh (%)          | 93.3                | Relative Humidity is a measure of how saturated the air is with water vapor, the %RH determines the amount of water contained within collection objects. |
| 7     | VPmax (mbar)    | 3.33                | Saturation vapor pressure |
| 8     | VPact (mbar)    | 3.11                | Vapor pressure |
| 9     | VPdef (mbar)    | 0.22                | Vapor pressure deficit |
| 10    | sh (g/kg)       | 1.94                | Specific humidity |
| 11    | H2OC (mmol/mol) | 3.12                | Water vapor concentration |
| 12    | rho (g/m ** 3)  | 1307.75             | Air density |
| 13    | wv (m/s)        | 1.03                | Wind speed |
| 14    | max. wv (m/s)   | 1.75                | Maximum wind speed |
| 15    | wd (deg)        | 152.3               | Wind direction in degrees |

<br>

### read_csv
```python=
import os
import pandas as pd
import tensorflow as tf

# Download the zip archive (cached by Keras) and extract it
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)

# Drop the ".zip" suffix to get the path of the extracted CSV
csv_path, _ = os.path.splitext(zip_path)
df = pd.read_csv(csv_path)
df
```
![](https://i.imgur.com/FRQBv33.png)

<br>

### Forecasting
- [[tensorflow] Time series forecasting](https://www.tensorflow.org/tutorials/structured_data/time_series)
- [[Keras] Timeseries forecasting for weather prediction](https://keras.io/examples/timeseries/timeseries_weather_forecasting/)
  > Column descriptions
- [Deep learning in Python: advanced usage of recurrent neural networks](https://pythonmana.com/2021/08/20210816191316709q.html)
  > "Deep Learning with Python" (2nd edition, by François Chollet)

<br>
<hr>
<br>

## [Shampoo Sales Dataset](https://machinelearningmastery.com/persistence-time-series-forecasting-with-python/)
![](https://i.imgur.com/KtsNzgs.png)

<br>
<hr>
<br>

## FEDOT
- [FEDOT/cases/data/](https://github.com/nccr-itmo/FEDOT/tree/master/cases/data)
  - classification
  - regression
  - time-series
- [Fedot time-series forecasting electricity case](https://github.com/ITMO-NSS-team/fedot_electro_ts_case) (wind power + coal-fired power)
- [FEDOT Timeseries benchmarking](https://github.com/ITMO-NSS-team/Fedot-TS-Benchmark)
  - fred_case
  - river_level_forecasting
  - sensors
  - stock_price

<br>
<hr>
<br>

## [Sktime](https://pypi.org/project/sktime/)
### [[doc] datasets](https://www.sktime.org/en/stable/api_reference/datasets.html)
- `dir(sktime.datasets)`
  ```=
  'load_PBS_dataset',
  'load_UCR_UEA_dataset',
  'load_acsf1',
  'load_airline',
  'load_arrow_head',
  'load_basic_motions',
  'load_electric_devices_segmentation',
  'load_gun_point_segmentation',
  'load_gunpoint',
  'load_italy_power_demand',
  'load_japanese_vowels',
  'load_longley',
  'load_lynx',
  'load_macroeconomic',
  'load_osuleaf',
  'load_shampoo_sales',
  'load_unit_test',
  'load_uschange'
  ```

### load_airline
```python=
import pandas
import sktime.datasets

y = sktime.datasets.load_airline() # pandas.core.series.Series
pandas.DataFrame(y, index=y.index)
```
![](https://i.imgur.com/d9ougyP.png)

<br>
<hr>
<br>

## Simulated time series data
> #artificially generated time series
> #synthetic time series

### Source
- ### [Time series forecasting with FEDOT. Guide](https://github.com/ITMO-NSS-team/fedot-examples/blob/main/notebooks/latest/3_intro_ts_forecasting.ipynb)
  > artificially generated time series

<br>

### Building up the data
> sin + (cos1 + cos2) + random_noise

- ### sin
  ![](https://i.imgur.com/PH2Mx6k.png)

- ### cos = cos1 + cos2
  ![](https://i.imgur.com/TdxRs7U.png)
  ![](https://i.imgur.com/esYH6in.png =48%x) ![](https://i.imgur.com/lMZdRmX.png =48%x)

- ### sin + cos
  ![](https://i.imgur.com/eyefoNZ.png)
  ![](https://i.imgur.com/m0sgiKT.png)

- ### random_noise
  ![](https://i.imgur.com/gru9lkF.png)

- ### sin + cos + random_noise
  ![](https://i.imgur.com/GjzHiYM.png)

### Code
```python=
import numpy as np
import matplotlib.pyplot as plt

def generate_synthetic_data(length: int = 2000, periods: int = 10):
    """
    The function generates a synthetic univariate time series

    :param length: the length of the array (even number)
    :param periods: the number of periods in the sine wave

    :return synthetic_data: an array without gaps
    """

    # First component
    sinusoidal_data = np.linspace(-periods * np.pi, periods * np.pi, length)
    sinusoidal_data = np.sin(sinusoidal_data)

    # Second component: two cosine segments, each half the total length
    cos_1_data = np.linspace(-periods * np.pi/2, periods/2 * np.pi/2, int(length/2))
    cos_1_data = np.cos(cos_1_data)
    cos_2_data = np.linspace(periods/2 * np.pi/2, periods * np.pi/2, int(length/2))
    cos_2_data = np.cos(cos_2_data)
    cosine_data = np.hstack((cos_1_data, cos_2_data))

    # Third component: Gaussian noise
    random_noise = np.random.normal(loc=0.0, scale=0.1, size=length)

    # Combining a sine wave, cos wave and random noise
    synthetic_data = sinusoidal_data + cosine_data + random_noise
    return synthetic_data

# Get such numpy array
synthetic_time_series = generate_synthetic_data()

# We will predict 100 values in the future
len_forecast = 100

# Let's divide our data into train and test samples
train_data = synthetic_time_series[:-len_forecast]
test_data = synthetic_time_series[-len_forecast:]

# Plot time series
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(np.arange(0, len(train_data)), train_data, label = 'Train')
plt.plot(np.arange(len(train_data), len(train_data)+len(test_data)), test_data, label = 'Test')
plt.ylabel('Parameter', fontsize = 15)
plt.xlabel('Time index', fontsize = 15)
plt.legend(fontsize = 15)
plt.title('Synthetic time series')
plt.grid()
plt.show()
```
[![](https://i.imgur.com/TpTvzoS.png)](https://i.imgur.com/TpTvzoS.png)

- To export as CSV (without timestamps)
  ```python=
  import pandas

  # The default integer index is written as the first CSV column
  df = pandas.DataFrame(train_data, columns=['value'])
  df.to_csv("synthetic_time_series_train.csv")

  df = pandas.DataFrame(test_data, columns=['value'])
  df.to_csv("synthetic_time_series_test.csv")
  ```

- To export as CSV (with timestamps)
  ```python=
  import pandas

  df = pandas.DataFrame(train_data, columns=['value'])

  # Label each row with a 'YYYY-MM' string, starting at 2000-01
  index = []
  y = 2000
  m = 1
  for i in range(len(df)):
      index.append('%d-%02d' % (y, m))
      m += 1
      if m > 12:
          y += 1
          m = 1

  df.index = index
  df.to_csv("synthetic_time_series_train.csv")

  # --------------------------------------------------

  df = pandas.DataFrame(test_data, columns=['value'])

  # y and m carry over from the loop above, so the test labels
  # continue right after the last training month
  index = []
  for i in range(len(df)):
      index.append('%d-%02d' % (y, m))
      m += 1
      if m > 12:
          y += 1
          m = 1

  df.index = index
  df.to_csv("synthetic_time_series_test.csv")
  ```
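
The manual year/month counter in the timestamped export can probably be replaced with pandas' own period utilities. The following is only a minimal sketch under stated assumptions: `monthly_labels` is a hypothetical helper name that does not come from the original notebook, the start month `2000-01` mirrors the loop above, and the short `train_demo` / `test_demo` arrays are placeholders standing in for `train_data` / `test_data`.

```python
import numpy as np
import pandas as pd


def monthly_labels(start: str, count: int) -> list:
    """Return `count` consecutive 'YYYY-MM' labels beginning at `start`.

    Hypothetical helper for illustration; not part of the FEDOT guide.
    """
    periods = pd.period_range(start=start, periods=count, freq="M")
    # str() of a monthly Period gives the 'YYYY-MM' form
    return [str(p) for p in periods]


# Placeholder arrays (assumption): swap in train_data / test_data
train_demo = np.arange(14, dtype=float)
test_demo = np.arange(3, dtype=float)

# Generate one continuous run of labels, then split it so the test
# labels continue right after the last training month
labels = monthly_labels("2000-01", len(train_demo) + len(test_demo))

df_train = pd.DataFrame({"value": train_demo}, index=labels[:len(train_demo)])
df_test = pd.DataFrame({"value": test_demo}, index=labels[len(train_demo):])
```

Each frame can then be written with `to_csv` exactly as in the export snippets. Generating the labels in a single call and slicing them avoids carrying the `y` and `m` counters from one loop into the next.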