[ML] 2. Datasets
===
> [[ML / Category]](/hu2F_DDbSI6nHzO2NbJ_Jg)
###### tags: `ML / Category`
###### tags: `ML`, `資料集`, `dataset`, `sklearn`, `python`
<br>
[TOC]
<br>
# Supplementary pandas usage
> [[hackmd] Python / pandas](/G8fBsOCnQvideryUGNZ_9A)
## Basic DataFrame operations
- [pandas: Get the number of rows, columns, all elements (size) of DataFrame](https://note.nkmk.me/en/python-pandas-len-shape-size/)
- [Python | Pandas dataframe.insert()](https://www.geeksforgeeks.org/python-pandas-dataframe-insert/)
### [Default column dtypes](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)
```python
import pandas as pd

df = pd.DataFrame({'float': [1.0],
'int': [1],
'datetime': [pd.Timestamp('20180310')],
'string': ['foo']})
df.dtypes
#float float64
#int int64
#datetime datetime64[ns]
#string object
#dtype: object
```
### Get the number of rows
```
len(df)
```
or
```
df.shape[0]
```
### Get the number of columns
```python
len(df.columns)
```
or
```python
df.shape[1]
```
### Get the last column
```
df[df.columns[-1]]
```
```
df.iloc[:, len(df.columns) - 1]
```
### Get the number of elements
```python
df.size
```
or
```python
df.shape[0]*df.shape[1]
```
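The row, column, and element counts above can be cross-checked on a small frame. This is a minimal sketch, assuming a made-up three-row DataFrame; the column names `sex` and `age` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the accessors above.
df = pd.DataFrame({'sex': ['F', 'M', 'F'], 'age': [23, 35, 41]})

n_rows = len(df)                 # same as df.shape[0]
n_cols = len(df.columns)         # same as df.shape[1]
n_elements = df.size             # rows * columns
last_col = df[df.columns[-1]]    # same data as df.iloc[:, -1]

print(n_rows, n_cols, n_elements)
print(last_col.tolist())
```

`df.iloc[:, -1]` is a shorter spelling of `df.iloc[:, len(df.columns) - 1]`, because negative positions count from the end.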
### Update an entire column
```python
df.iloc[:, 0] = '#' + df.iloc[:, 0]
df
```
```python
df['sex'] = '#' + df['sex']
df
```
### Delete a column
```python=
df.drop(axis='columns', columns=['species'], inplace=True)
```
- References
- [pandas.DataFrame.drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)
```python
df.drop(df.columns[i], axis=1)
#or
df.drop(axis=1, columns=df.columns[i])
```
- References
- [python dataframe pandas drop column using int](https://stackoverflow.com/questions/20297317)
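
The two styles of `drop()` above differ in one way that is easy to miss: without `inplace=True` the call returns a new DataFrame and leaves the original unchanged. A minimal sketch, assuming a made-up frame (only the `species` column name is borrowed from the snippet above):

```python
import pandas as pd

# Hypothetical toy data; 'species' mirrors the column name used above.
df = pd.DataFrame({'length': [5.1, 4.9], 'width': [3.5, 3.0],
                   'species': ['setosa', 'setosa']})

# Without inplace=True, drop() returns a new DataFrame; df itself is untouched.
by_name = df.drop(columns=['species'])

# Dropping by integer position: look the label up through df.columns first.
i = 0
by_position = df.drop(columns=df.columns[i])

print(list(df.columns))
print(list(by_name.columns))
print(list(by_position.columns))
```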
### Add a column or a row
```python=
import pandas
df = pandas.DataFrame(columns=['c1', 'c2', 'c3'])
#df.iloc[0] = [1,2,3] # failed
df.loc[0] = [1,2,3]
df.loc[1] = [5,6,7]
display(df)
df['c4'] = [4,8]
display(df)
df.insert(2, 'c2.5', [2.5, 6.7])
df
```

- References
- [Adding new column to existing DataFrame in Pandas](https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/)
- [Create pandas Dataframe by appending one row at a time](https://stackoverflow.com/questions/10715965/)
- [pandas.DataFrame.loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
- [pandas.DataFrame.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)
> .iloc will raise IndexError if a requested indexer is out-of-bounds,
## [No show data analysis](https://www.kaggle.com/khaledelhasafy/no-show-data-anlysis)
### Inspect
- ```.shape```
- Shows the number of rows and columns
- ```.head(10)```

- ```.tail(10)```
- ```.info()```

- ```.memory_usage()```
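### Statistics
- ```.describe()```
Count, mean, standard deviation, minimum and maximum, and the first, second (median), and third quartiles

- ```.hist(figsize=(10,10))```
Shows an overview of each column's distribution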

### Analysis
- ```.duplicated()```
Checks whether the data contains duplicate rows
- ```.groupby(['column1_name', 'column2_name'], as_index=False)```
Grouped statistics
- ```.value_counts(['Gender', 'SMS_received'], normalize=True)```
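
The three analysis calls above can be tried together on a tiny frame. This is a sketch under stated assumptions: the rows are made up, and only the column names `Gender` and `SMS_received` come from the `value_counts()` call above (`Age` is added for the group mean).

```python
import pandas as pd

# Hypothetical rows; not taken from the Kaggle dataset.
df = pd.DataFrame({
    'Gender':       ['F', 'F', 'M', 'M', 'F', 'F'],
    'SMS_received': [0, 1, 0, 0, 1, 1],
    'Age':          [30, 45, 22, 22, 51, 45],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = int(df.duplicated().sum())

# Mean age per (Gender, SMS_received) group, with the keys kept as columns.
mean_age = df.groupby(['Gender', 'SMS_received'], as_index=False)['Age'].mean()

# Share of rows in each (Gender, SMS_received) combination.
shares = df.value_counts(['Gender', 'SMS_received'], normalize=True)

print(n_duplicates)
print(mean_age)
print(shares)
```

With `normalize=True` the result holds proportions that sum to 1 instead of raw counts.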
## [No show data analysis 2](https://www.kaggle.com/somrikbanerjee/predicting-show-up-no-show)
<br>
<hr>
<br>
# Custom datasets
## Regression
### 10-simple-xy-dataset
> Source: 9789864341405, Python機器學習 (Python Machine Learning), Chapter 10, page 289
#### python code
```python=
X = [258.0, 270.0, 294.0, 320.0, 342.0, 368.0, 396.0, 446.0, 480.0, 586.0]
y = [236.4, 234.4, 252.8, 298.6, 314.2, 342.2, 360.8, 368.0, 391.2, 390.8]
import numpy as np
import matplotlib.pyplot as plt
# simple x/y dataset
X = np.array(X)
X = X[:, np.newaxis]
y = np.array(y)
plt.scatter(X, y, label='training points')
plt.legend(loc='lower right')
plt.show()
```
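Since this dataset sits under the regression heading, a quick way to sanity-check it is to fit a straight line through the points. This is a minimal sketch, assuming an ordinary least-squares line is a reasonable first model; the choice of `np.polyfit` is mine and is not taken from the book.

```python
import numpy as np

X = np.array([258.0, 270.0, 294.0, 320.0, 342.0,
              368.0, 396.0, 446.0, 480.0, 586.0])
y = np.array([236.4, 234.4, 252.8, 298.6, 314.2,
              342.2, 360.8, 368.0, 391.2, 390.8])

# A degree-1 polynomial fit is an ordinary least-squares straight line.
slope, intercept = np.polyfit(X, y, deg=1)
y_pred = slope * X + intercept

# Coefficient of determination (R^2) on the training points.
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f'slope={slope:.3f}, intercept={intercept:.1f}, R^2={r2:.3f}')
```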

<br>
<hr>
<br>
# sklearn (scikit-learn)
## Dataset overview
### Return as X, y
```python=
import sklearn.datasets  # import the submodule explicitly
X, y = sklearn.datasets.load_iris(return_X_y = True)
print('X:')
print(X[0:10])
print('y:')
print(y)
```
Output:
```
X:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
y:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
```
<br>
### Return as a data structure
```python=
import sklearn.datasets  # import the submodule explicitly
iris = sklearn.datasets.load_iris()
dataset = iris
print('dir(dataset):\n', dir(dataset), '\n', sep='')
print('.filename:\n', dataset.filename, '\n', sep='')
print('.feature_names:\n', dataset.feature_names, '\n', sep='')
print('.data[0:10]:\n', dataset.data[0:10], '\n', sep='')
print('.target_names:\n', dataset.target_names, '\n', sep='')
print('.target:\n', dataset.target, '\n', sep='')
print('.frame:\n', dataset.frame, '\n', sep='')
```
Output:
```
dir(dataset):
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
.filename:
/home/diatango_lin/tj_tsai/software/anaconda3/envs/rapidgenomics/lib/python3.7/site-packages/sklearn/datasets/data/iris.csv
.feature_names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
.data[0:10]:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5. 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]]
.target_names:
['setosa' 'versicolor' 'virginica']
.target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
.frame:
None
```
<br>
### Pretty-printing a pandas DataFrame
```python=
import sklearn.datasets  # import the submodule explicitly
iris = sklearn.datasets.load_iris()
import pandas
from IPython.display import display
display(pandas.DataFrame(iris.data))
```
Output:

References
- [Display the Pandas DataFrame in table style](https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-table-style/)
<br>
### Convert to a complete DataFrame
- `site-packages/sklearn/datasets/data/iris.csv`

- to_dataframe()
```python=
import sklearn.datasets, pandas, numpy
from IPython.display import display
iris = sklearn.datasets.load_iris()
df = pandas.read_csv(
iris.filename, skiprows=1, header=None,
names=numpy.append(iris.feature_names, ['target']))
display(df)
```
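Reading the bundled CSV works, but it depends on where scikit-learn stores the file. An alternative sketch, assuming scikit-learn 0.23 or newer, is to pass `as_frame=True` so that the loader builds the DataFrame itself:

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects (scikit-learn >= 0.23).
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus a final 'target' column.
df = iris.frame

print(df.shape)
print(list(df.columns))
```

With the default `as_frame=False`, `.frame` is `None`, which matches the `.frame` output shown earlier in this page.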

<br>
## Classification datasets
### iris
- code
```
import sklearn.datasets  # import the submodule explicitly
iris = sklearn.datasets.load_iris()
```
- ### .feature_names
- 'sepal length (cm)'
- 'sepal width (cm)'
- 'petal length (cm)'
- 'petal width (cm)'
- ### .target_names:
['setosa', 'versicolor', 'virginica']
Chinese common names: ['山鳶尾', '變色鳶尾', '維吉尼亞鳶尾']
- ### dataframe

- ### References
- [範例資料:鴛尾花資料集(iris data set)](https://hackmd.io/@mutolisp/SyowFbuAb?type=view)

<br>

<br>
### breast_cancer
- code
```
import sklearn.datasets  # import the submodule explicitly
cancer = sklearn.datasets.load_breast_cancer()
```
- ### .feature_names
- The 30 features are the mean, the standard error (the "... error" names), and the "worst" value (mean of the three largest values) of 10 measurements per image.
- 'mean radius'
- 'mean texture'
- 'mean perimeter'
- 'mean area'
- 'mean smoothness'
- 'mean compactness'
- 'mean concavity'
- 'mean concave points'
- 'mean symmetry'
- 'mean fractal dimension'
- 'radius error'
- 'texture error'
- 'perimeter error'
- 'area error'
- 'smoothness error'
- 'compactness error'
- 'concavity error'
- 'concave points error'
- 'symmetry error'
- 'fractal dimension error'
- 'worst radius'
- 'worst texture'
- 'worst perimeter'
- 'worst area'
- 'worst smoothness'
- 'worst compactness'
- 'worst concavity'
- 'worst concave points'
- 'worst symmetry'
- 'worst fractal dimension'
- ### .target_names:
['malignant' 'benign']
(target 0 = malignant, 1 = benign)
- ### dataframe
[](https://i.imgur.com/TvzMAYm.png)
<br>
### wine
- code: dataset
```python
import sklearn.datasets  # import the submodule explicitly
wine = sklearn.datasets.load_wine()
```
- code: dataset / df_xy
```python
from sklearn.datasets import load_wine
import pandas
wine = load_wine()
df = pandas.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
df
```
- ### .feature_names
- 'alcohol'
- 'malic_acid'
- 'ash'
- 'alcalinity_of_ash'
- 'magnesium'
- 'total_phenols'
- 'flavanoids'
- 'nonflavanoid_phenols'
- 'proanthocyanins'
- 'color_intensity'
- 'hue'
- 'od280/od315_of_diluted_wines'
- 'proline'
- ### .target_names:
['class_0' 'class_1' 'class_2']
- ### dataframe

<br>
## Regression datasets
### boston
- ### dataset
- News
- [Things You Didn't Know About the Boston Housing Dataset](https://towardsdatascience.com/things-you-didnt-know-about-the-boston-housing-dataset-2e87a6f960e8)
- [StatLib---Datasets Archive](http://lib.stat.cmu.edu/datasets/)
- [boston](http://lib.stat.cmu.edu/datasets/boston)
```
Variables in order:
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV Median value of owner-occupied homes in $1000's
```
- ### code1
```python=
import warnings
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2; this needs an older version.
from sklearn.datasets import load_boston
import pandas as pd
warnings.filterwarnings("ignore")
boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target
df
```

- ### code2
```python=
# load_boston was removed in scikit-learn 1.2; this needs an older version.
from sklearn.datasets import load_boston
import pandas as pd
boston = load_boston()
df = pd.read_csv(boston.filename, skiprows=1, header=0)
df
```
- ### [code3: official example](https://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_boston.html)
```python=
import warnings
from sklearn.datasets import load_boston
with warnings.catch_warnings():
    # You should probably not use this dataset.
    warnings.filterwarnings("ignore")
    X, y = load_boston(return_X_y=True)
print(X.shape) #(506, 13)
```
Replacement suggested in the official deprecation notice:
```python=
import pandas as pd
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
```
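The slicing in the replacement snippet works because each record in the raw file spans two physical lines: 11 values on the first and 3 on the second. The sketch below reproduces the same stitching offline on a made-up two-record string; the numbers are placeholders, not real Boston data.

```python
import io

import numpy as np
import pandas as pd

# Two fake records in the same two-line layout (11 values, then 3 values).
raw_text = (
    "1 2 3 4 5 6 7 8 9 10 11\n"
    "12 13 24.0\n"
    "101 102 103 104 105 106 107 108 109 110 111\n"
    "112 113 21.6\n"
)
raw_df = pd.read_csv(io.StringIO(raw_text), sep=r"\s+", header=None)

# Even rows carry features 1-11; odd rows carry features 12-13 and the target.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]

print(data.shape)
print(target)
```

The short second lines are padded with NaN by `read_csv`, which is why only the first three columns of the odd rows are used.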
- ### misc
- ### [[2021 iThome 鐵人賽] 全民瘋AI系列2.0 系列](https://ithelp.ithome.com.tw/users/20107247/ironman/4723)
- [[Day 16] 每個模型我全都要 - 堆疊法 (Stacking)](https://ithelp.ithome.com.tw/m/articles/10274009)
- R2
- Training set score: 0.9608703782891547
- Test set score: 0.9371735287625855
- MSE
- Training set MSE: 3.389581229598408
- Test set MSE: 3.9225215768179433
<br>
## Other datasets
- load_diabetes
- Example: [Automatic Logging](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging)
- load_digits
- load_files
- load_linnerud
- load_sample_image
- load_sample_images
- load_svmlight_file
- load_svmlight_files
<br>
## Randomly generated datasets
- [Python sklearn RandomForestClassifier non-reproducible results](https://stackoverflow.com/questions/47433920)
```python
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
n_informative=2, n_redundant=0,
random_state=0, shuffle=False)
```
- [13 Ensemble Learning (集體學習)](https://laoweizz.blogspot.com/2018/12/ensemble.html)
```python=
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
n_clusters_per_class=3, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)
print(X_test.shape)
clf = DecisionTreeClassifier(random_state=11)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))
```
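Both snippets above pin `random_state`, which is what makes a generated dataset reproducible. A minimal sketch of that behaviour, reusing the parameters of the first snippet (the second seed value `1` is an arbitrary choice for comparison):

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(n_samples=1000, n_features=4, n_informative=2,
              n_redundant=0, shuffle=False)

X_a, y_a = make_classification(random_state=0, **params)
X_b, y_b = make_classification(random_state=0, **params)   # same seed
X_c, y_c = make_classification(random_state=1, **params)   # different seed

same_seed_identical = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_identical = np.array_equal(X_a, X_c)

print(X_a.shape, same_seed_identical, different_seed_identical)
```

The same idea applies to `train_test_split` and to the estimators: each has its own `random_state`, and all of them need to be fixed for a fully repeatable run.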
<br>
<hr>
<br>
# [Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
> UCI Machine Learning Repository
<br>
<hr>
## Classification
### [Iris](https://archive.ics.uci.edu/ml/datasets/Iris)

- [Attribute description](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names)
| Column index | Description |
| ------ | ------------------ |
| 1 | sepal length in cm |
| 2 | sepal width in cm |
| 3 | petal length in cm |
| 4 | petal width in cm |
| 5 | class<br>- Iris Setosa<br>- Iris Versicolour<br>- Iris Virginica |
- [Dataset download](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)
#### Python code / loading the data
```python=
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df.tail()
```



#### Python code / building the inputs and outputs
```python=
import numpy as np
import pandas as pd
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
# -1: Iris-setosa
# 0: Iris-versicolor
# 1: Iris-virginica
y = np.where(y == 'Iris-setosa', -1, np.where(y == 'Iris-versicolor', 0, 1))
import matplotlib.pyplot as plt
plt.scatter(X[ 0: 50, 0], X[ 0: 50, 2], marker='o', color='blue', label='setosa')
plt.scatter(X[ 50:100, 0], X[ 50:100, 2], marker='x', color='red', label='versicolor')
plt.scatter(X[100:150, 0], X[100:150, 2], marker='o', color='green', label='virginica')
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.legend(loc='lower right')
plt.show()
```
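The nested `np.where` call is the step that turns the three species strings into the numeric labels -1, 0, and 1. A small sketch of just that step on a hand-written label array, together with an equivalent dictionary lookup (the dictionary variant is my addition, not part of the original snippet):

```python
import numpy as np

labels = np.array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
                   'Iris-setosa'])

# Outer test picks -1; the inner test picks 0; everything else becomes 1.
encoded = np.where(labels == 'Iris-setosa', -1,
                   np.where(labels == 'Iris-versicolor', 0, 1))

# Equivalent lookup-table version, easier to extend to more classes.
mapping = {'Iris-setosa': -1, 'Iris-versicolor': 0, 'Iris-virginica': 1}
encoded_via_dict = np.array([mapping[name] for name in labels])

print(encoded.tolist())
print(encoded_via_dict.tolist())
```

One caveat of the nested form: any unexpected label silently falls into the final `1` branch, whereas the dictionary version raises a `KeyError`.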

#### [seaborn](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/visualizations/#seaborn)
```python=
import seaborn as sns
sns.set(style="white")
df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=True)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3)
g.map_upper(sns.regplot)
display(g.fig)
```

<br>
### [Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments/discussion/39322)
[](https://i.imgur.com/FZ6WBzc.png)



- Test summary
- ### random-forest:
- train: 99.99%
- predict: 78.82%
- ### auto-sklearn:
- 1 hour
- train: 87.23%
- predict: 80.45%
- 20 hours
- train: 83.65%
- Best validation score: 0.797441
- predict: 80.75%
<br>
### [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult) (binary)

- ### [Tabular Data Example](https://towardsdatascience.com/autogluon-deep-learning-automl-5cdb4e2388ec#065f)
- data
| data | shape | download link |
| ---- | ----- | ------------- |
| train | 39073 x 15 | https://autogluon.s3.amazonaws.com/datasets/AdultIncomeBinaryClassification/train_data.csv |
| test | 9769 x 15 | https://autogluon.s3.amazonaws.com/datasets/AdultIncomeBinaryClassification/test_data.csv |
- features
1. age
2. workclass
3. fnlwgt
4. education
5. education-num
6. marital-status
7. occupation
8. relationship
9. race
10. sex
11. capital-gain
12. capital-loss
13. hours-per-week
14. native-country
- target
- class
- `<=50K`
- `>50K`
- models
[](https://i.imgur.com/o8f3L9v.png)
- metrics
- AutoGluon (GBM+NN_TORCH)
- accuracy: 0.8752175248234211
- balanced_accuracy: 0.7985774242740231
- mcc: 0.6384055943366135
- roc_auc: 0.9292811684599376
- f1: 0.7128386336866903
- precision: 0.785158277114686
- recall: 0.6527178602243313
- AutoGluon (`auto_stack=False`)
- accuracy: 0.8763435356740711
- AutoGluon (`auto_stack=True`)
- accuracy: 0.8760364418057119
- AutoSklearn
- accuracy: 0.873170232367694 (24H@twcc)
<br>
<hr>
## Regression
### [Real estate valuation](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set)

- Attribute description
- Input:
X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
X2=the house age (unit: year)
X3=the distance to the nearest MRT station (unit: meter)
X4=the number of convenience stores in the living circle on foot (integer)
X5=the geographic coordinate, latitude. (unit: degree)
X6=the geographic coordinate, longitude. (unit: degree)
- Output:
Y= house price of unit area (10000 New Taiwan Dollar/Ping,
where Ping is a local unit, 1 Ping = 3.3 meter squared)
- [Dataset download](https://archive.ics.uci.edu/ml/machine-learning-databases/00477/Real%20estate%20valuation%20data%20set.xlsx)
<br>
### Housing dataset
- [Dataset download](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data) (whitespace-delimited, 506 rows x 14 columns)
- Attribute description


- Input:
- Columns: 1^st^ (CRIM) to 13^th^ (LSTAT)
- Output:
- Column: 14^th^ (MEDV)
#### python code
```python=
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
df = pd.read_csv(url, header=None, sep=r'\s+')  # delim_whitespace is deprecated in recent pandas
df.columns = [
'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX',
'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT', 'MEDV']
# debug
df.columns
df.iloc[0]
df.iloc[0, 0] # =0.00632
df.iloc[0]['CRIM'] # =0.00632
df.head()
# -----
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid', context='notebook')
cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
sns.pairplot(df[cols], height=2.5)
plt.show()
# -----
import numpy as np
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(
cm, cbar=True, annot=True, square=True,
fmt='.2f', annot_kws={'size': 15},
yticklabels=cols, xticklabels=cols)
plt.show()
```
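The heatmap above is drawn from `np.corrcoef(df[cols].values.T)`. The transpose matters because `np.corrcoef` treats each row as one variable. The sketch below checks, on synthetic stand-in data, that this matches the Pearson matrix pandas computes with `DataFrame.corr()`; only the column names are borrowed from the housing set, and the values are generated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data, not the real housing values.
rm = rng.normal(6.0, 0.7, size=200)
df = pd.DataFrame({
    'RM': rm,
    'LSTAT': 40.0 - 5.0 * rm + rng.normal(0.0, 2.0, size=200),
    'MEDV': 9.0 * rm - 30.0 + rng.normal(0.0, 3.0, size=200),
})

cm = np.corrcoef(df.values.T)   # rows of the transposed array are variables
cm_pandas = df.corr().values    # pandas uses Pearson correlation by default

print(np.round(cm, 2))
```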


<br>
### [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview)
<br>
<hr>
<br>
# Kaggle (medical)
## [Medical Cost Personal Datasets](https://www.kaggle.com/mirichoi0218/insurance)
- Insurance Forecast by using Linear Regression
- Task: Regression (predict insurance charges)
- Number of records: 1338
- Columns: age,sex,bmi,children,smoker,region,charges
## [Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments)
- Why do 30% of patients miss their scheduled appointments?
- Task: Classification (predict whether a patient shows up for a booked appointment)
- Number of records: 110527
- Columns: PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age... 13 + 1 in total
- Hipertension: hypertension
- Diabetes: diabetes
- Alcoholism: alcoholism
- ### Data profile
```python
import pandas_profiling
pandas_profiling.ProfileReport(df)
```

- ### [Experiment results](https://hackmd.io/wgExODglS_iltzcWYApEpA)
- Highest accuracy: 80.8%
- SGD
- BernoulliNB
- SVM / SVC
- MLP
## [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
- Predict the onset of diabetes based on diagnostic measures
- Task: Classification (whether the patient has diabetes)
- Number of records: 768
- Columns: Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
## [Cardiovascular Disease dataset](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset)
- Task: Classification (presence or absence of cardiovascular disease)
- Number of records: 70000
- Columns: id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio
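
The column list above is written with semicolons, which suggests the CSV is semicolon-delimited (an inference from this page, worth confirming against the downloaded file). A minimal sketch of reading such a file, using two made-up rows instead of real records:

```python
import io

import pandas as pd

# Two made-up rows in the column layout listed above (not real records).
sample = (
    "id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio\n"
    "0;18000;2;170;70.0;120;80;1;1;0;0;1;0\n"
    "1;20000;1;160;82.5;140;90;3;1;0;0;1;1\n"
)

# sep=';' is needed; with the default comma everything lands in one column.
df = pd.read_csv(io.StringIO(sample), sep=';')

print(df.shape)
print(df.columns.tolist())
```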
## [Heart Failure Prediction](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data)
- Task: Classification (predict heart failure)
- Number of records: 299
- Columns: age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure... 12 + 1 in total
## [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci)
- Task: Classification (predict heart disease)
- Number of records: 303
- Columns: age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
<br>
# Kaggle (non-medical)
## [Predicting Term Deposit Subscriptions](https://www.kaggle.com/janiobachmann/bank-marketing-dataset/discussion)
- 11162 records in total, each with 17 columns
Integer x 7
String x 6
Boolean x 4
## Palmer Archipelago (Antarctica) penguin data
- ### [[Kaggle] Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data)
- ### [[github] allisonhorst / palmerpenguins](https://github.com/allisonhorst/palmerpenguins/blob/master/README.md)

- [data-raw/penguins.R](https://github.com/allisonhorst/palmerpenguins/blob/master/data-raw/penguins.R)
- Mentions the dataset sources
- Adelie penguin data from: https://doi.org/10.6073/pasta/abc50eed9138b75f54eaada0841b9b86
- Gentoo penguin data from: https://doi.org/10.6073/pasta/2b1cff60f81640f182433d23e68541ce
- Chinstrap penguin data from: https://doi.org/10.6073/pasta/409c808f8fc9899d02401bdb04580af7
- See the EDI Data Portal below
- ### [EDI Data Portal](https://portal.edirepository.org/nis/home.jsp) (latest datasets)
- **Adelie penguin data:**
https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f
Version: knb-lter-pal.219.5 (2020-06-08)
- **Gentoo penguin data:**
https://doi.org/10.6073/pasta/9fc8f9b5a2fa28bdca96516649b6599b
Version: knb-lter-pal.220.7 (2020-10-24)
- **Chinstrap penguin data:**
https://doi.org/10.6073/pasta/ce9b4713bb8c065a8fcfd7f50bf30dde
Version: knb-lter-pal.221.8 (2020-10-24)
- ### Column descriptions
- [https://allisonhorst.github.io/palmerpenguins/reference/penguins_raw.html](https://allisonhorst.github.io/palmerpenguins/reference/penguins_raw.html)
- "View Full Metadata" on the EDI Data Portal also describes the columns
- For example, the Adelie penguin [Full Metadata](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-pal.219.5)
<br>

- Data analysis
- [penguin dataset : The new Iris](https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris)
## [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) (house price prediction)
- Training set: 1460 rows x 81 columns
- Test set: 1459 rows x 80 columns
- ### Walkthrough
- [3.16. 实战Kaggle比赛:房价预测](https://zh.d2l.ai/chapter_deep-learning-basics/kaggle-house-price.html) :+1: :100:
- ### Google AutoML Tables
- ### Correlation (compared with AzureML, this is probably the Pearson correlation)

- ### Azure AutoML
- ### Missing feature value imputation


- ### High-cardinality feature detection


- Id, LotFrontage, GarageYrBlt

- ### Feature importance
- top features
- **GrLivArea**: Above grade (ground) living area square feet
- **GarageCars**: Size of garage in car capacity
- **TotalBsmtSF**: Total square feet of basement area
- **ExterQual**: Evaluates the quality of the material on the exterior
- **YearBuilt**: Original construction date
- StackEnsemble

- VotingEnsemble

- XGBoostRegressor

- ### Designer: [Statistical correlation: Pearson correlation](https://docs.microsoft.com/zh-tw/azure/machine-learning/algorithm-module-reference/filter-based-feature-selection)

```json
[
{
"SalePrice": 1.0,
"OverallQual": 0.790981600583805,
"GrLivArea": 0.7086244776126522,
"GarageCars": 0.6404091972583521,
"GarageArea": 0.6234314389183618,
"TotalBsmtSF": 0.6135805515591953,
"1stFlrSF": 0.6058521846919146,
"FullBath": 0.560663762748446,
"TotRmsAbvGrd": 0.5337231555820281,
"YearBuilt": 0.5228973328794969,
"YearRemodAdd": 0.5071009671113861,
"MasVnrArea": 0.4774930470957156,
"Fireplaces": 0.4669288367515278,
"BsmtFinSF1": 0.38641980624215333,
"WoodDeckSF": 0.3244134445681299,
"2ndFlrSF": 0.31933380283206775,
"OpenPorchSF": 0.31585622711605527,
"HalfBath": 0.28410767559478245,
"LotArea": 0.26384335387140573,
"CentralAir": 0.251328163840155,
"BsmtFullBath": 0.22712223313149427,
"BsmtUnfSF": 0.21447910554696886,
"BedroomAbvGr": 0.16821315430074,
"KitchenAbvGr": 0.13590737084214122,
"EnclosedPorch": 0.12857795792595672,
"ScreenPorch": 0.1114465711429112,
"PoolArea": 0.09240354949187321,
"MSSubClass": 0.08428413512659517,
"OverallCond": 0.07785589404867801,
"MoSold": 0.046432245223819335,
"3SsnPorch": 0.04458366533574842,
"YrSold": 0.028922585168730326,
"LowQualFinSF": 0.025606130000679548,
"Id": 0.021916719443431102,
"MiscVal": 0.021189579640303255,
"BsmtHalfBath": 0.01684415429735901,
"BsmtFinSF2": 0.011378121450215153,
"KitchenQual": 0.0,
"Functional": 0.0,
"FireplaceQu": 0.0,
"GarageType": 0.0,
"GarageYrBlt": 0.0,
"GarageFinish": 0.0,
"GarageQual": 0.0,
"GarageCond": 0.0,
"PavedDrive": 0.0,
"PoolQC": 0.0,
"Fence": 0.0,
"MiscFeature": 0.0,
"SaleType": 0.0,
"HeatingQC": 0.0,
"Electrical": 0.0,
"Heating": 0.0,
"MSZoning": 0.0,
"LotFrontage": 0.0,
"Street": 0.0,
"Alley": 0.0,
"LotShape": 0.0,
"LandContour": 0.0,
"Utilities": 0.0,
"LotConfig": 0.0,
"LandSlope": 0.0,
"Neighborhood": 0.0,
"Condition1": 0.0,
"Condition2": 0.0,
"BldgType": 0.0,
"HouseStyle": 0.0,
"RoofStyle": 0.0,
"RoofMatl": 0.0,
"Exterior1st": 0.0,
"Exterior2nd": 0.0,
"MasVnrType": 0.0,
"ExterQual": 0.0,
"ExterCond": 0.0,
"Foundation": 0.0,
"BsmtQual": 0.0,
"BsmtCond": 0.0,
"BsmtExposure": 0.0,
"BsmtFinType1": 0.0,
"BsmtFinType2": 0.0,
"SaleCondition": 0.0
}
]
```
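A ranking like the one above can be approximated locally with pandas. In the listing, many text-valued columns score exactly 0.0, which is consistent with the score only being meaningful for numeric columns; that reading is an inference from the listing, not something the designer output states. The sketch below therefore keeps numeric columns only and uses synthetic values; only the column names come from the Kaggle table.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=300)

# Synthetic stand-in for the Kaggle table; values are generated, not real.
df = pd.DataFrame({
    'GrLivArea': area,
    'YrSold': rng.integers(2006, 2011, size=300),
    'Street': rng.choice(['Pave', 'Grvl'], size=300),   # non-numeric column
    'SalePrice': 1000.0 * area + rng.normal(0, 20000, size=300),
})

# Pearson correlation is only defined for numeric columns.
numeric = df.select_dtypes(include='number')

# Absolute Pearson r of every numeric column against the target, highest first.
scores = numeric.corr()['SalePrice'].abs().sort_values(ascending=False)

print(scores)
```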
- ### Designer: [Statistical correlation: chi-squared](https://docs.microsoft.com/zh-tw/azure/machine-learning/algorithm-module-reference/filter-based-feature-selection)

```json=
[
{
"SalePrice": 1.0,
"LotFrontage": 3011.821838830588,
"GrLivArea": 2684.988619593563,
"OverallQual": 2076.4370435123515,
"2ndFlrSF": 1729.135623005293,
"Neighborhood": 1718.9416421931924,
"GarageYrBlt": 1444.01487442007,
"MasVnrArea": 1232.978646691799,
"TotalBsmtSF": 1208.6424283049907,
"GarageCars": 1154.0818931305969,
"GarageArea": 1117.3938558971226,
"BsmtQual": 1089.2201890097551,
"ExterQual": 1042.6353408934951,
"1stFlrSF": 1039.0358146067674,
"KitchenQual": 986.0213093615508,
"YearBuilt": 844.8493742946195,
"FullBath": 834.075902967989,
"TotRmsAbvGrd": 812.9894565705961,
"PoolArea": 761.8265722853805,
"GarageFinish": 707.8632871546307,
"FireplaceQu": 678.7779866327791,
"BsmtFinSF1": 662.9655270176974,
"YearRemodAdd": 661.1373017171591,
"MSSubClass": 660.5744268968961,
"GarageType": 634.2167951745353,
"Foundation": 564.3189858452354,
"Exterior2nd": 513.098610732704,
"BsmtFinType1": 490.98687575736005,
"LotArea": 489.7292604702702,
"OpenPorchSF": 476.0364376643361,
"Exterior1st": 470.05636643687404,
"Fireplaces": 452.174469195772,
"BsmtUnfSF": 403.7488934069761,
"PoolQC": 374.3439140033527,
"HeatingQC": 367.62791254066514,
"WoodDeckSF": 359.9288380271235,
"OverallCond": 339.3392846787617,
"MasVnrType": 336.06740869693033,
"MSZoning": 318.4825084183071,
"BsmtExposure": 313.85916169813373,
"GarageQual": 304.205287648579,
"SaleType": 272.2658167083576,
"SaleCondition": 266.56895904550186,
"GarageCond": 264.01574101164795,
"HouseStyle": 260.6223484176385,
"BedroomAbvGr": 213.11479134627658,
"CentralAir": 202.04060045537705,
"HalfBath": 188.09435622298125,
"LotShape": 184.24822870560422,
"BsmtCond": 176.7307642825022,
"RoofMatl": 173.2228239075082,
"PavedDrive": 166.95685885309413,
"BsmtFinType2": 160.1892640411418,
"Electrical": 157.74469500376358,
"ScreenPorch": 156.6847308357881,
"EnclosedPorch": 147.07828586418742,
"RoofStyle": 132.39898289223495,
"Condition1": 119.59281692687587,
"Fence": 118.35962619971271,
"BsmtFullBath": 111.4363605643106,
"BldgType": 92.76996588414389,
"MoSold": 88.41172079015419,
"BsmtFinSF2": 82.90237961824074,
"Functional": 82.05618468076621,
"LotConfig": 77.14912520796912,
"Id": 72.06877839487606,
"LowQualFinSF": 71.6470236213088,
"Alley": 69.71665915090168,
"ExterCond": 63.42406887735416,
"LandContour": 61.538330919443055,
"Heating": 60.68961833983211,
"KitchenAbvGr": 55.78543944013338,
"3SsnPorch": 51.077139267999875,
"Condition2": 41.64240250802226,
"YrSold": 37.35295246655157,
"MiscVal": 28.367347205758172,
"BsmtHalfBath": 27.5360756465552,
"MiscFeature": 19.55442424515907,
"LandSlope": 19.518483799644102,
"Street": 5.84078739669776,
"Utilities": 2.017911056480437
}
]
```
- ### House Prices scoring results
| Platform | Model | Self-evaluated: Std RMSE | Kaggle: RMSE | Rank |
| -------- | -------- | -------- | -------- | -------- |
| ==<span style="white-space: nowrap;">ASUS AI Maker</span>== | AutoRegressor | 15013.36<br>(non-std) | 0.12742 :+1: | 1909 / 8200 |
| Google Vertex AI | - | 23,895.473<br>(non-std) | 0.13608 | 4333 / 10815 <br>(2021/05/24) |
| Azure AutoML | StackEnsemble | 0.03519 | 0.13771 | 3459 / 8200 |
| Azure AutoML | MaxAbsScaler,<br>LightGBM | 0.03753 | 0.13841 | 3537 / 8200 |
| Azure AutoML | VotingEnsemble | 25239(non-std) / 0.03505 | 0.13876 | 3607 / 8200 |
| Azure AutoML | MaxAbsScaler,<br>XGBoostRegressor | 0.03879 | 0.14746 | 4732 / 8200 |
| Google AutoML (revised) | - | 41445.203<br>(non-std) | 0.15973 | 7454 / 10810<br>(2021/05/24) |
| Azure AutoML | MaxAbsScaler,<br>RandomForest | 0.04158 | 0.16989 | 6230 / 8200 |
| Google AutoML (old) | :warning: <span style="color: red">cannot handle<br>unknown labels</span> | - | - | - |
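
The "Kaggle: RMSE" column is on a very different scale from the non-standardized self-evaluated RMSE because this competition scores the RMSE between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of both metrics, using made-up prices purely to show the difference in scale:

```python
import numpy as np

# Made-up prices, only to illustrate the two scales.
y_true = np.array([200000.0, 150000.0, 320000.0])
y_pred = np.array([210000.0, 140000.0, 300000.0])

# Plain RMSE, in the same units as the prices.
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# RMSE of log prices, the form used for the Kaggle leaderboard score.
log_rmse = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

print(f'RMSE in price units: {rmse:.2f}')
print(f'RMSE of log prices:  {log_rmse:.5f}')
```

Taking logs means a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.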
## [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data)
> ### Titanic
### Download links
- [pandas-videos / data / titanic_train.csv](https://github.com/justmarkham/pandas-videos/blob/master/data/titanic_train.csv)
- Short URL: http://bit.ly/kaggletrain ([source](https://leemeng.tw/practical-pandas-tutorial-for-aspiring-data-scientists.html))
- [michhar / titanic.csv](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5)
- [plotly / datasets](https://github.com/plotly/datasets/blob/master/titanic.csv)
## Wine Quality
> [[HackMD] Red Wine Quality](https://hackmd.io/GYh8h8zBT3ucNF1w09r1AQ)
### Download links
- https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
- https://www.kaggle.com/datasets/rajyellow46/wine-quality
- https://www.kaggle.com/datasets/yasserh/wine-quality-dataset
- https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset
<br>
<hr>
<br>
# Azure sample datasets
## [Open Datasets catalog](https://docs.microsoft.com/zh-tw/azure/open-datasets/dataset-catalog#AzureNotebooks)
### Transportation
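### Health and genomics
### Labor and economics
### Population and safety
### Supplemental and common datasets
| Dataset | Description |
|-------|-------------|
| Diabetes | The Diabetes dataset has 442 samples with 10 features, which makes it well suited to getting started with machine learning algorithms. |
| OJ Sales Simulated | This dataset is derived from the Dominick's OJ dataset and includes extra simulated data. Its goal is to make it easy to train thousands of models on Azure Machine Learning. |
| MNIST database of handwritten digits | The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image. |
| Microsoft News recommendation dataset | The Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It serves as a benchmark dataset for news recommendation and supports research on news recommendation and recommender systems. |
| **Public Holidays** | Worldwide public holiday data from the PyPI holidays package and Wikipedia, covering 38 countries or regions from 1970 to 2099. |
| Russian Open Speech To Text | Russian Open STT is a large open speech-to-text dataset for the Russian language. |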
<br>
## Regression data
- ### [Sample pipelines and datasets for Azure Machine Learning designer](https://docs.microsoft.com/zh-tw/azure/machine-learning/samples-designer)
| Task type | Sample title | Description |
| -------- | -------- | -------- |
| Regression | [NYC taxi data](https://docs.microsoft.com/zh-tw/azure/machine-learning/tutorial-auto-train-models) | Uses AutoML / regression |
| Regression | [Automobile price prediction (basic)](https://github.com/Azure/MachineLearningDesigner/blob/master/articles/samples/regression-automobile-price-prediction-basic.md) | Predicts car prices with linear regression. |
| Regression | [Automobile price prediction (advanced)](https://github.com/Azure/MachineLearningDesigner/blob/master/articles/samples/regression-automobile-price-prediction-compare-algorithms.md) | Predicts car prices with decision forest and boosted decision tree regressors, then compares the two models to find the better algorithm. |
<br>
## Time series data
> #data #dataset #time-series #ml #train #predict #forecast
>
- ### Hourly energy demand
> https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/nyc_energy.csv
> timeStamp (timestamp), demand (energy demand), precip (precipitation), temp (temperature)
> Hourly energy demand and basic weather data.
- Source:
[[doc] Configuration settings](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast#configuration-settings)

- [auto-ml-forecasting-energy-demand.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb)

- [data from New York City](http://mis.nyiso.com/public/P-58Blist.htm)
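
For a time series like this one, the usual first step is to parse the timestamp column and use it as the index. A minimal sketch, assuming the four-column layout described above; the rows below are made up and are not real readings from `nyc_energy.csv`.

```python
import io

import pandas as pd

# Made-up rows in the column layout described above (not real readings).
sample = (
    "timeStamp,demand,precip,temp\n"
    "2012-01-01 00:00:00,4900.0,0.0,46.1\n"
    "2012-01-01 01:00:00,4700.0,0.0,45.9\n"
    "2012-01-02 00:00:00,5100.0,0.1,40.0\n"
    "2012-01-02 01:00:00,4900.0,0.0,39.0\n"
)

# Parse timeStamp as datetimes and use it as the index.
df = pd.read_csv(io.StringIO(sample), parse_dates=['timeStamp'],
                 index_col='timeStamp')

# Aggregate the hourly rows to one row per calendar day.
daily = df.resample('D').mean()

print(daily)
```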
<br>
<hr>
<br>
# Dataset augmentation
- 2021/06/19 - [Facebook AI Open Sources AugLy: A New Python Library For Data Augmentation To Develop Robust Machine Learning Models](https://www.marktechpost.com/2021/06/19/facebook-ai-open-sources-augly-a-new-python-library-for-data-augmentation-to-develop-robust-machine-learning-models/)
- AugLy is a new open-source data augmentation library that combines audio, image, video, and text, becoming increasingly significant in several AI research fields.
- Sample
[](https://i.imgur.com/BtQ6KpQ.png)
[](https://i.imgur.com/Kww9u7p.jpg)
<br>
<hr>
<br>
# Other datasets
## Classification
### Credit card fraud detection
- ### [[2021 iThome 鐵人賽] 全民瘋AI系列2.0 系列](https://ithelp.ithome.com.tw/users/20107247/ironman/4723)
- [[Day 17] 輕量化的梯度提升機 - LightGBM](https://ithelp.ithome.com.tw/m/articles/10274577)
- dataset
https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv
<br>
## Regression
<br>
## Time series
<br>
<hr>
<br>
# Others' perspectives
- [這樣用資料才能幫企業解決問題](https://aiacademy.tw/how-companies-use-data/)
[](https://i.imgur.com/3LBMup9.png)
<hr>
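> **Data is not the same as value; it becomes valuable only when used in the right place**
>
> Chen Sheng-Wei (陳昇瑋), CEO of Taiwan AI Academy, cautions in the book 《人工智慧在台灣》 (Artificial Intelligence in Taiwan) that **many companies assume the data they collect is valuable in itself**. In fact, data must be processed, analyzed, and developed before it becomes a **final product**, which might be an analysis report or a recommendation for a specific decision.
>
> In other words, without proper "processing" and "extraction and analysis", the value of data remains undeveloped and undetermined, and whoever uses it must be able to apply the "right data" in the "right way" to the "right scenario".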
