Time Series Datasets
===
###### tags: `ML`, `time series`, `datasets`
<br>
[TOC]
<br>
## Birth Population Estimates
- [Taiwan Births and Deaths & Sixty-Year Trend](https://morrisjfwong.com/blog/台灣出生死亡人口數-六十年趨勢/)

- [National Development Council population charts](https://ct.org.tw/html/news/3-3.php?cat=10&article=1354805)

<br>
<hr>
<br>
## Highway Traffic Volume
- [Freeway Bureau analyzes 9 billion ETC records to predict when freeways will be congested](https://www.ithome.com.tw/article/98069)

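Other factors:
- Traffic characteristics
- Tourism and recreation
- Weather
- Online public opinion
- Closed-circuit television (CCTV)
- Intercity bus and Taiwan Railways (TRA) ridership
- Visitor counts at tourist sites
- Lodging occupancy rate
- Weather forecasts
- Keyword search frequency

[Directorate-General of Personnel Administration calendar for 2015 (ROC year 104)](https://www.319papago.idv.tw/holiday/2015-hr/2015_HR.html)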
<br>
<hr>
<br>
## [[Kaggle] Air Passengers](https://www.kaggle.com/rakannimer/air-passengers)
> Monthly number of airline passengers
### Data analysis
- ### [TimeSeries Analysis 📈A Complete Guide 📚](https://www.kaggle.com/andreshg/timeseries-analysis-a-complete-guide)
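- ### [Day 24: Sales Forecasting (2) -- "Time Series Analysis" Techniques](https://ithelp.ithome.com.tw/articles/10195635)
  - Monthly airline passenger data
  ![](https://i.imgur.com/SCkQfgH.png)

  - TSMC daily stock price data, 1994-2013
  ![](https://i.imgur.com/SVeB4fB.png)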
<br>
### read_csv
- ### Load CSV (default parameters)
```python=
import pandas
df = pandas.read_csv('air-passengers.csv')
```

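- July and August are the yearly peak in air travel
- November, December, or February are the yearly low
- As passenger volume rises year over year, the amplitude of the seasonal swing grows as well
  - 10% of 100 passengers is ±10
  - 10% of 500 passengers is ±50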
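A swing that scales with the level, as in the ±10 versus ±50 example above, is commonly described as multiplicative seasonality. The following is a minimal sketch on made-up numbers (not the Air Passengers file) of how a log transform turns proportional swings into swings of constant size; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical series: the level grows from 100 to 500 and every point
# swings by +/-10% of its level (illustrative numbers, not the real dataset).
level = np.array([100.0, 100.0, 500.0, 500.0])
swing = np.array([+0.10, -0.10, +0.10, -0.10])
series = level * (1.0 + swing)

# In the original units the peak-to-trough distance grows with the level ...
abs_amplitude_low = series[0] - series[1]
abs_amplitude_high = series[2] - series[3]

# ... but after a log transform it is the same at both levels.
log_series = np.log(series)
log_amplitude_low = log_series[0] - log_series[1]
log_amplitude_high = log_series[2] - log_series[3]

print(abs_amplitude_low, abs_amplitude_high)
print(log_amplitude_low, log_amplitude_high)
```

Whether a log transform is appropriate for a given model is a separate decision; the sketch only shows why it is often tried on series like this one.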
- ### Load CSV and keep only the specified column
```python=
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('air-passengers.csv', usecols=[1])
plt.plot(df)
plt.show()
```
 
- ### Load CSV and use the time axis as the row label
```python=
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('air-passengers.csv', index_col='Month')
plt.plot(df)
plt.show()
```
 
- ### Apply first-order differencing
```python=
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('air-passengers.csv', usecols=[1])
# first-order difference: each value minus the previous one
df1 = df.diff(1)
plt.plot(df1)
plt.show()
```
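To make the effect of `diff(1)` concrete, here is a small sketch on a handful of illustrative values (standing in for the passenger column, not read from the CSV). It also shows one way to undo the differencing, which is needed when a model is trained on differences but the forecast should be reported in original units.

```python
import pandas

# Illustrative monthly counts, standing in for the passenger column.
s = pandas.Series([112, 118, 132, 129, 121])

# diff(1) subtracts the previous value; the first entry has no predecessor (NaN).
d = s.diff(1)

# Undo the differencing: cumulative sum of the differences plus the first value.
restored = d.fillna(0).cumsum() + s.iloc[0]

print(d.tolist())
print(restored.tolist())
```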

- ### Plot training and test data
```python=
import pandas
import matplotlib.pyplot as plt
df_train = pandas.read_csv('air-passengers-train.csv', index_col=[0])
df_test = pandas.read_csv('air-passengers-test.csv', index_col=[0])
# plot the training data
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(df_train, label='train')
# prepare the future data; it starts at the last training point
# so that the two lines connect on the plot
## x values
month = [df_train.index[-1]]
month.extend(df_test.index)
## y values (.iloc[-1] selects the last row by position)
passengers = [df_train["#Passengers"].iloc[-1]]
passengers.extend(df_test["#Passengers"])
df_future = pandas.DataFrame({
    'passengers': passengers
}, index=month)
plt.plot(df_future, label='test')
plt.legend(loc='upper left')
plt.xticks([str(y) + '-01' for y in range(1949, 1962, 1)])
plt.grid()
plt.show()
```
![](https://i.imgur.com/IrfvWqG.png)
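The note does not show how `air-passengers-train.csv` and `air-passengers-test.csv` were produced. One plausible way, sketched here under assumptions, is a chronological split that keeps the last rows as the test set. The data below is a made-up stand-in and `n_test` is a hypothetical hold-out size.

```python
import pandas

# Hypothetical stand-in for the full file: 24 monthly rows with made-up values.
# The column name "#Passengers" follows the plotting snippet.
months = pandas.period_range(start='1949-01', periods=24, freq='M').strftime('%Y-%m')
df = pandas.DataFrame({'#Passengers': range(100, 124)}, index=months)
df.index.name = 'Month'

# Keep the time order: the last n_test rows become the test set (no shuffling).
n_test = 6  # assumed hold-out size
df_train = df.iloc[:-n_test]
df_test = df.iloc[-n_test:]

print(df_train.index[-1], df_test.index[0])
```

Splitting by position rather than at random keeps future values out of the training set, which matters for any forecasting evaluation.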
<br>
### Forecasting
- ### [[LSTM] Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras](https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/)
  - LSTM Network for Regression
    - Train Score: 22.93 RMSE
    - Test Score: 47.53 RMSE
  - LSTM for Regression Using the Window Method
    - Train Score: 24.19 RMSE
    - Test Score: 58.03 RMSE
  - LSTM for Regression with Time Steps
    - Train Score: 23.69 RMSE
    - Test Score: 58.88 RMSE
  - LSTM with Memory Between Batches
    - Train Score: 20.74 RMSE
    - Test Score: 52.23 RMSE
  - Stacked LSTMs with Memory Between Batches
    - Train Score: 20.49 RMSE
    - Test Score: 56.35 RMSE
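The scores above are root mean squared errors. As a reminder of what the number measures, here is a minimal sketch of the computation; the function name `rmse` and the sample values are illustrative and are not taken from the linked tutorial.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and predicted values, only to exercise the function.
actual = [100, 120, 140]
predicted = [103, 116, 140]
score = rmse(actual, predicted)
print('RMSE: %.2f' % score)
```

Because the errors are squared before averaging, RMSE is in the same units as the series and penalizes large misses more heavily than small ones.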
<br>
<hr>
<br>
## [[Kaggle] Bike Sharing Demand](https://www.kaggle.com/c/bike-sharing-demand/data)
<br>
<hr>
<br>
## [Kaggle] Weather
### [Tabular Playground Series - Jul 2021](https://www.kaggle.com/c/tabular-playground-series-jul-2021)
> Predict air pollution values (carbon monoxide, benzene, nitrogen oxides) from temperature, humidity, and sensor readings
<br>
<hr>
<br>
## [Kaggle] Sales
### [Sales Time Series Forecasting](https://www.kaggle.com/c/sales-time-series-forecasting-ca-afcs2020/overview)
### [Store Sales - Time Series Forecasting](https://www.kaggle.com/c/store-sales-time-series-forecasting/overview)
### [M5 Forecasting - Accuracy](https://www.kaggle.com/c/m5-forecasting-accuracy)
> Forecast daily sales for the next 28 days
### [M5 Forecasting - Uncertainty](https://www.kaggle.com/c/m5-forecasting-uncertainty)
> Forecast daily sales for the next 28 days and provide uncertainty estimates for those forecasts
### [Walmart Recruiting - Store Sales Forecasting](https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/overview)
### [Rossmann Store Sales](https://www.kaggle.com/c/rossmann-store-sales)
<br>
<hr>
<br>
## [Kaggle] Time-by-item data
### [Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting/overview)
> Forecast web traffic (daily page views) for roughly 145,000 Wikipedia articles
<br>
<hr>
<br>
## [[Azure] Bike-sharing rental demand](https://docs.microsoft.com/zh-tw/azure/machine-learning/tutorial-automated-ml-forecast)
- MachineLearningNotebooks
  - how-to-use-azureml
    - automated-machine-learning
      - [forecasting-bike-share](https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/automated-machine-learning/forecasting-bike-share)
        - [bike-no.csv](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-bike-share/bike-no.csv)
- Column descriptions
  - [Decision tree regression analysis with Python Spark MLlib](https://www.itread01.com/content/1544348007.html)
    > Root mean squared error, RMSE = 117.804043648
  - [[Python machine learning] Analyzing shared-bike deployment volume with python3](https://www.twblogs.net/a/5c3f47cabd9eee35b3a66981)
    > R-squared: 0.39423906942549125
    > RMSE: 142.08580203044002
  - [Final Project - Data analysis and visualization in R](https://rstudio-pubs-static.s3.amazonaws.com/98994_613e4a13f448452c937233f146f80d59.html)
- Terminology
  - [Let's Go Watch the Clouds](http://mail.atm.ncu.edu.tw/~hong/atmhmpg/clouds/ahead.htm)
    - Less than one tenth of the sky covered by cloud: clear (晴朗; Clear, CLR)
    - One tenth to five tenths of the sky covered: scattered (疏雲; Scattered, SCT, partly cloudy)
    - Six tenths to nine tenths of the sky covered: broken (裂雲; Broken, BKN, cloudy)
    - More than nine tenths of the sky covered: overcast (陰天; Overcast, OVC)
  - [scattered cloud - 疏雲 - National Academy for Educational Research bilingual glossary](https://terms.naer.edu.tw/detail/986615/)
    - scattered cloud: 疏雲
<br>
## [[Azure] 600,000 records of traffic from an arbitrary web service](https://docs.microsoft.com/zh-tw/azure/data-explorer/time-series-analysis)
| TimeStamp | BrowserVer | OsVer | Country |
|--------------------------|------------------------|------------|----------------|
| 2016-08-21T00:00:10.625Z | Chrome 51.0 | Windows 10 | United States |
| 2016-08-21T00:00:31.295Z | Internet Explorer 11.0 | Windows 10 | United States |
| 2016-08-21T00:00:42.879Z | Chrome 52.0 | Windows 10 | United States |
| 2016-08-21T00:00:47.017Z | Chrome 51.0 | Windows 10 | United States |
| 2016-08-21T00:01:06.642Z | Chrome 52.0 | Windows 7 | United Kingdom |
| 2016-08-21T00:02:02.893Z | Internet Explorer 11.0 | Windows 7 | |
| 2016-08-21T00:02:59.909Z | Chrome 52.0 | Windows 7 | United Kingdom |
| 2016-08-21T00:03:03.315Z | Firefox 48.0 | Windows 10 | United States |
| 2016-08-21T00:03:07.263Z | Firefox 48.0 | Windows 10 | United States |
| 2016-08-21T00:03:13.471Z | Internet Explorer 11.0 | Windows 7 | |
![](https://i.imgur.com/Ajf7R2M.png)
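Raw request logs like the table above only become a time series once they are counted per time bin. The sketch below does this in pandas as a stand-in; it is not the approach of the linked article, and the 1-minute bin size is an assumption chosen so that the few sample rows produce more than one bin.

```python
import pandas

# First few TimeStamp values from the sample table.
timestamps = [
    '2016-08-21T00:00:10.625Z',
    '2016-08-21T00:00:31.295Z',
    '2016-08-21T00:00:42.879Z',
    '2016-08-21T00:00:47.017Z',
    '2016-08-21T00:01:06.642Z',
    '2016-08-21T00:02:02.893Z',
]

# One event per row, indexed by its timestamp.
events = pandas.Series(1, index=pandas.to_datetime(timestamps))

# Count requests per minute (assumed bin size, for illustration only).
per_minute = events.resample('1min').sum()
print(per_minute)
```

The same pattern extends to per-browser or per-country series by grouping on those columns before resampling.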
<br>
<hr>
<br>
## [[UCI] Air Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Air+Quality)
> Source: [A Multivariate Time Series Guide to Forecasting and Modeling (with Python codes)](https://www.analyticsvidhya.com/blog/2018/09/multivariate-time-series-guide-forecasting-modeling-python-codes/)
> #multivariate time series
> #`from statsmodels.tsa.vector_ar.var_model import VAR`

| Date | Time | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH |
|-----------|----------|--------|-------------|----------|----------|---------------|---------|--------------|---------|--------------|-------------|------|------|--------|
| 3/10/2004 | 18:00:00 | 2.6 | 1360 | 150 | 11.9 | 1046 | 166 | 1056 | 113 | 1692 | 1268 | 13.6 | 48.9 | 0.7578 |
| 3/10/2004 | 19:00:00 | 2 | 1292 | 112 | 9.4 | 955 | 103 | 1174 | 92 | 1559 | 972 | 13.3 | 47.7 | 0.7255 |
| 3/10/2004 | 20:00:00 | 2.2 | 1402 | 88 | 9.0 | 939 | 131 | 1140 | 114 | 1555 | 1074 | 11.9 | 54.0 | 0.7502 |
| 3/10/2004 | 21:00:00 | 2.2 | 1376 | 80 | 9.2 | 948 | 172 | 1092 | 122 | 1584 | 1203 | 11.0 | 60.0 | 0.7867 |
| 3/10/2004 | 22:00:00 | 1.6 | 1272 | 51 | 6.5 | 836 | 131 | 1205 | 116 | 1490 | 1110 | 11.2 | 59.6 | 0.7888 |
| 3/10/2004 | 23:00:00 | 1.2 | 1197 | 38 | 4.7 | 750 | 89 | 1337 | 96 | 1393 | 949 | 11.2 | 59.2 | 0.7848 |
| 3/11/2004 | 0:00:00 | 1.2 | 1185 | 31 | 3.6 | 690 | 62 | 1462 | 77 | 1333 | 733 | 11.3 | 56.8 | 0.7603 |
| 3/11/2004 | 1:00:00 | 1 | 1136 | 31 | 3.3 | 672 | 62 | 1453 | 76 | 1333 | 730 | 10.7 | 60.0 | 0.7702 |
| 3/11/2004 | 2:00:00 | 0.9 | 1094 | 24 | 2.3 | 609 | 45 | 1579 | 60 | 1276 | 620 | 10.7 | 59.7 | 0.7648 |
<br>
<hr>
<br>
## [tensorflow] The weather dataset
> Reference: [Time series forecasting](https://www.tensorflow.org/tutorials/structured_data/time_series)
> Data download: https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip
<br>
### [Column descriptions](https://keras.io/examples/timeseries/timeseries_weather_forecasting/)
| Index | Features| Format| Description|
|-------|---------|-------|------------|
| 1| Date Time | 01.01.2009 00:10:00 | Date-time reference|
| 2| p (mbar) | 996.52| The pascal SI derived unit of pressure used to quantify internal pressure. Meteorological reports typically state atmospheric pressure in millibars. |
| 3| T (degC) | -8.02 | Temperature in Celsius |
| 4| Tpot (K) | 265.4 | Potential temperature in Kelvin|
| 5| Tdew (degC) | -8.9| Temperature in Celsius relative to humidity. Dew Point is a measure of the absolute amount of water in the air, the DP is the temperature at which the air cannot hold all the moisture in it and water condenses. |
| 6| rh (%) | 93.3| Relative Humidity is a measure of how saturated the air is with water vapor, the %RH determines the amount of water contained within collection objects. |
| 7| VPmax (mbar) | 3.33| Saturation vapor pressure|
| 8| VPact (mbar) | 3.11| Vapor pressure |
| 9| VPdef (mbar) | 0.22| Vapor pressure deficit |
| 10| sh (g/kg) | 1.94| Specific humidity|
| 11| H2OC (mmol/mol) | 3.12| Water vapor concentration|
| 12| rho (g/m ** 3) | 1307.75 | Air density |
| 13| wv (m/s) | 1.03| Wind speed |
| 14| max. wv (m/s) | 1.75| Maximum wind speed |
| 15| wd (deg) | 152.3 | Wind direction in degrees|
<br>
### read_csv
```python=
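import os
import pandas as pd
import tensorflow as tf
# Download the archive and extract it
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)
# Assumes the extracted CSV sits next to the zip, with the same name minus ".zip"
csv_path, _ = os.path.splitext(zip_path)
df = pd.read_csv(csv_path)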
df
```

<br>
### Forecasting
- [[tensorflow] Time series forecasting](https://www.tensorflow.org/tutorials/structured_data/time_series)
- [[Keras] Timeseries forecasting for weather prediction](https://keras.io/examples/timeseries/timeseries_weather_forecasting/)
  > Column descriptions
- [Python Deep Learning: Advanced Use of Recurrent Neural Networks](https://pythonmana.com/2021/08/20210816191316709q.html)
  > *Deep Learning with Python* (2nd edition, by François Chollet)
<br>
<hr>
<br>
## [Shampoo Sales Dataset](https://machinelearningmastery.com/persistence-time-series-forecasting-with-python/)

<br>
<hr>
<br>
## FEDOT
- [FEDOT/cases/data/](https://github.com/nccr-itmo/FEDOT/tree/master/cases/data)
- classification
- regression
- time-series
- [Fedot time-series forecasting electricity case](https://github.com/ITMO-NSS-team/fedot_electro_ts_case) (wind power + coal-fired power)
- [FEDOT Timeseries benchmarking](https://github.com/ITMO-NSS-team/Fedot-TS-Benchmark)
- fred_case
- river_level_forecasting
- sensors
- stock_price
<br>
<hr>
<br>
## [Sktime](https://pypi.org/project/sktime/)
### [[doc] datasets](https://www.sktime.org/en/stable/api_reference/datasets.html)
- `dir(sktime.datasets)`
```=
'load_PBS_dataset',
'load_UCR_UEA_dataset',
'load_acsf1',
'load_airline',
'load_arrow_head',
'load_basic_motions',
'load_electric_devices_segmentation',
'load_gun_point_segmentation',
'load_gunpoint',
'load_italy_power_demand',
'load_japanese_vowels',
'load_longley',
'load_lynx',
'load_macroeconomic',
'load_osuleaf',
'load_shampoo_sales',
'load_unit_test',
'load_uschange'
```
### load_airline
```python=
import pandas
import sktime.datasets
y = sktime.datasets.load_airline() # pandas.core.series.Series
# wrap the Series in a DataFrame for tabular display
pandas.DataFrame(y, index=y.index)
```

<br>
<hr>
<br>
## Simulated time series data
> #artificially generated time series
> #synthetic time series
### Source
- ### [Time series forecasting with FEDOT. Guide](https://github.com/ITMO-NSS-team/fedot-examples/blob/main/notebooks/latest/3_intro_ts_forecasting.ipynb)
> artificially generated time series
<br>
### Building up the data step by step
> sin + (cos1 + cos2) + random_noise
- ### sin

- ### cos = cos1 + cos2

 
- ### sin + cos


- ### random_noise

- ### sin + cos + random_noise

### Code
```python=
import numpy as np
import matplotlib.pyplot as plt
def generate_synthetic_data(length: int = 2000, periods: int = 10):
    """
    The function generates a synthetic univariate time series
    :param length: the length of the array (even number)
    :param periods: the number of periods in the sine wave
    :return synthetic_data: an array without gaps
    """
    # First component
    sinusoidal_data = np.linspace(-periods * np.pi, periods * np.pi, length)
    sinusoidal_data = np.sin(sinusoidal_data)
    # Second component: two cosine segments joined end to end
    cos_1_data = np.linspace(-periods * np.pi/2, periods/2 * np.pi/2, int(length/2))
    cos_1_data = np.cos(cos_1_data)
    cos_2_data = np.linspace(periods/2 * np.pi/2, periods * np.pi/2, int(length/2))
    cos_2_data = np.cos(cos_2_data)
    cosine_data = np.hstack((cos_1_data, cos_2_data))
    # Third component: Gaussian noise
    random_noise = np.random.normal(loc=0.0, scale=0.1, size=length)
    # Combining a sine wave, cos wave and random noise
    synthetic_data = sinusoidal_data + cosine_data + random_noise
    return synthetic_data
# Get such numpy array
synthetic_time_series = generate_synthetic_data()
# We will predict 100 values in the future
len_forecast = 100
# Let's divide our data into train and test samples
train_data = synthetic_time_series[:-len_forecast]
test_data = synthetic_time_series[-len_forecast:]
# Plot time series
plt.figure(figsize=(12, 5), dpi=100)
plt.plot(np.arange(0, len(train_data)), train_data, label = 'Train')
plt.plot(np.arange(len(train_data), len(train_data)+len(test_data)), test_data, label = 'Test')
plt.ylabel('Parameter', fontsize = 15)
plt.xlabel('Time index', fontsize = 15)
plt.legend(fontsize = 15)
plt.title('Synthetic time series')
plt.grid()
plt.show()
```
![](https://i.imgur.com/TpTvzoS.png)
- To export as CSV (without timestamps)
```python=
import pandas
df = pandas.DataFrame(train_data, columns=['value'])
df.to_csv("synthetic_time_series_train.csv")
df = pandas.DataFrame(test_data, columns=['value'])
df.to_csv("synthetic_time_series_test.csv")
```
- To export as CSV (with timestamps)
```python=
import pandas
# Training set: label the rows month by month, starting at 2000-01
df = pandas.DataFrame(train_data, columns=['value'])
index = []
y = 2000
m = 1
for i in range(len(df)):
    index.append('%d-%02d' % (y, m))
    m += 1
    if m > 12:
        y += 1
        m = 1
df.index = index
df.to_csv("synthetic_time_series_train.csv")
# --------------------------------------------------
# Test set: y and m carry over, so the labels continue after the training set
df = pandas.DataFrame(test_data, columns=['value'])
index = []
for i in range(len(df)):
    index.append('%d-%02d' % (y, m))
    m += 1
    if m > 12:
        y += 1
        m = 1
df.index = index
df.to_csv("synthetic_time_series_test.csv")
```
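The manual year/month counter can also be expressed with pandas period arithmetic. This is an alternative sketch, not the original author's method; the helper `monthly_labels` and the lengths `n_train` and `n_test` are hypothetical (in the snippet above they would be `len(train_data)` and `len(test_data)`).

```python
import pandas

def monthly_labels(start, n):
    """Return n consecutive 'YYYY-MM' labels beginning at start."""
    return pandas.period_range(start=start, periods=n, freq='M').strftime('%Y-%m')

# Hypothetical lengths, kept small so the result is easy to inspect.
n_train, n_test = 14, 3
train_index = monthly_labels('2000-01', n_train)

# The test labels continue right after the last training month.
next_month = pandas.Period(train_index[-1], freq='M') + 1
test_index = monthly_labels(next_month, n_test)

print(list(train_index[-2:]), list(test_index))
```

Either index can then be assigned to `df.index` before calling `to_csv`, exactly as in the loop-based version.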