[ML] 2. Datasets
===

> [[ML / Category]](/hu2F_DDbSI6nHzO2NbJ_Jg)

###### tags: `ML / Category`
###### tags: `ML`, `資料集`, `dataset`, `sklearn`, `python`

[TOC]

# Supplementary pandas usage
> [[hackmd] Python / pandas](/G8fBsOCnQvideryUGNZ_9A)

## Basic DataFrame operations
- [pandas: Get the number of rows, columns, all elements (size) of DataFrame](https://note.nkmk.me/en/python-pandas-len-shape-size/)
- [Python | Pandas dataframe.insert()](https://www.geeksforgeeks.org/python-pandas-dataframe-insert/)

### [Default column dtypes & default values](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html)
```python
import pandas as pd

df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})
df.dtypes
#float              float64
#int                  int64
#datetime    datetime64[ns]
#string              object
#dtype: object
```

### Get the number of rows
```python
len(df)
```
or
```python
df.shape[0]
```

### Get the number of columns
```python
len(df.columns)
```
or
```python
df.shape[1]
```

### Get the last column
```python
df[df.columns[-1]]
```
```python
df.iloc[:, -1]
```

### Get the number of elements
```python
df.size
```
or
```python
df.shape[0]*df.shape[1]
```

### Update an entire column
```python
# by position (the column must hold strings)
df.iloc[:, 0] = '#' + df.iloc[:, 0]
df
```
```python
# by column name
df['sex'] = '#' + df['sex']
df
```

### Drop a column
```python=
df.drop(columns=['species'], inplace=True)
```
- References
    - [pandas.DataFrame.drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html)

```python
# drop by integer position i; returns a new DataFrame, df itself is unchanged
df.drop(df.columns[i], axis=1)
#or
df.drop(columns=df.columns[i])
```
- References
    - [python dataframe pandas drop column using int](https://stackoverflow.com/questions/20297317)

### Add a column or a row
```python=
import pandas
df = pandas.DataFrame(columns=['c1', 'c2', 'c3'])
#df.iloc[0] = [1,2,3]  # fails: .iloc cannot add a row that does not exist yet
df.loc[0] = [1,2,3]    # .loc appends a row when the label is new
df.loc[1] = [5,6,7]
display(df)

df['c4'] = [4,8]       # append a column at the end
display(df)

df.insert(2, 'c2.5', [2.5, 6.7])  # insert a column at position 2
df
```
![](https://i.imgur.com/rWr2NJP.png)

- References
    - [Adding new column to existing DataFrame in
Pandas](https://www.geeksforgeeks.org/adding-new-column-to-existing-dataframe-in-pandas/)
    - [Create pandas Dataframe by appending one row at a time](https://stackoverflow.com/questions/10715965/)
    - [pandas.DataFrame.loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)
    - [pandas.DataFrame.iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)
        > .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing.

## [No show data analysis](https://www.kaggle.com/khaledelhasafy/no-show-data-anlysis)

### Inspect
- `.shape`
    - shows the number of rows and columns
- `.head(10)`
  ![](https://i.imgur.com/liXj96f.png)
- `.tail(10)`
- `.info()`
  ![](https://i.imgur.com/5KZSZ3p.png)
- `.memory_usage()`

### Statistics
- `.describe()`
  count, mean, standard deviation, minimum and maximum, and the first / second (median) / third quartiles
  ![](https://i.imgur.com/FukSjt0.png)
- `.hist(figsize=(10,10))`
  shows an overview of each column's distribution
  ![](https://i.imgur.com/jOX4AFi.png)

### Analysis
- `.duplicated()`
  checks whether the data contains duplicate rows
- `.groupby(['column1_name', 'column2_name'], as_index=False)`
  grouped statistics
- `.value_counts(['Gender', 'SMS_received'], normalize=True)`

## [No show data analysis 2](https://www.kaggle.com/somrikbanerjee/predicting-show-up-no-show)

<hr>

# Custom datasets
## Regression
### 10-simple-xy-dataset
> Source: 9789864341405, Python機器學習, Chapter 10, page 289

#### python code
```python=
import numpy as np
import matplotlib.pyplot as plt

X = [258.0, 270.0, 294.0, 320.0, 342.0, 368.0, 396.0, 446.0, 480.0, 586.0]
y = [236.4, 234.4, 252.8, 298.6, 314.2, 342.2, 360.8, 368.0, 391.2, 390.8]

# simple x/y dataset
X = np.array(X)
X = X[:, np.newaxis]  # reshape to (n_samples, 1)
y = np.array(y)

plt.scatter(X, y, label='training points')
plt.legend(loc='lower right')
plt.show()
```
![](https://i.imgur.com/9e6WGAg.png)

<hr>

# sklearn (scikit-learn)
## Dataset overview
### Returned as X, y
```python=
import sklearn.datasets  # "import sklearn" alone does not load the datasets submodule

X, y = sklearn.datasets.load_iris(return_X_y=True)
print('X:')
print(X[0:10])
print('y:')
print(y)
```
Output:
```
X:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
y:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
```

### Returned as a data structure
```python=
import sklearn.datasets

iris = sklearn.datasets.load_iris()
dataset = iris
print('dir(dataset):\n', dir(dataset), '\n', sep='')
print('.filename:\n', dataset.filename, '\n', sep='')
print('.feature_names:\n', dataset.feature_names, '\n', sep='')
print('.data[0:10]:\n', dataset.data[0:10], '\n', sep='')
print('.target_names:\n', dataset.target_names, '\n', sep='')
print('.target:\n', dataset.target, '\n', sep='')
print('.frame:\n', dataset.frame, '\n', sep='')
```
Output:
```
dir(dataset):
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']

.filename:
/home/diatango_lin/tj_tsai/software/anaconda3/envs/rapidgenomics/lib/python3.7/site-packages/sklearn/datasets/data/iris.csv

.feature_names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

.data[0:10]:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]

.target_names:
['setosa' 'versicolor' 'virginica']

.target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

.frame:
None
```

### Pretty-printing a pandas DataFrame
```python=
import sklearn.datasets
import pandas
from IPython.display import display

iris = sklearn.datasets.load_iris()
display(pandas.DataFrame(iris.data))
```
Output:
![](https://i.imgur.com/JYGdYt7.png)

References
- [Display the Pandas DataFrame in table style](https://www.geeksforgeeks.org/display-the-pandas-dataframe-in-table-style/)

### Converting to a complete DataFrame
- `site-packages/sklearn/datasets/data/iris.csv`
  ![](https://i.imgur.com/caQ33rI.png)
- to_dataframe()
    ```python=
    import sklearn.datasets, pandas, numpy
    from IPython.display import display

    iris = sklearn.datasets.load_iris()
    # Assumes an older scikit-learn where iris.filename is the full path to iris.csv
    # (as in the output above); newer releases store only the file name there.
    df = pandas.read_csv(
        iris.filename,
        skiprows=1,   # skip the header row of the bundled CSV
        header=None,
        names=numpy.append(iris.feature_names, ['target']))
    display(df)
    ```
  ![](https://i.imgur.com/VPz2Pih.png)

## Classification datasets
### iris
- code
    ```python
    import sklearn.datasets
    iris = sklearn.datasets.load_iris()
    ```
- ### .feature_names
    - 'sepal length (cm)'
    - 'sepal width (cm)'
    - 'petal length (cm)'
    - 'petal width (cm)'
- ### .target_names:
    ['setosa', 'versicolor', 'virginica'] (Iris setosa, Iris versicolor, Iris virginica)
- ### dataframe
    ![](https://i.imgur.com/HnV9vK0.png)
- ### References
    - [範例資料：鴛尾花資料集(iris data set)](https://hackmd.io/@mutolisp/SyowFbuAb?type=view)
      ![](https://i.imgur.com/WHaPRHH.jpg)
      ![](https://i.imgur.com/D3GcNNH.jpg)

### breast_cancer
- code
    ```python
    import sklearn.datasets
    cancer = sklearn.datasets.load_breast_cancer()
    ```
- ### .feature_names
    - 'mean radius'
    - 'mean texture'
    - 'mean perimeter'
    - 'mean area'
    - 'mean smoothness'
    - 'mean
compactness'
    - 'mean concavity'
    - 'mean concave points'
    - 'mean symmetry'
    - 'mean fractal dimension'
    - 'radius error'
    - 'texture error'
    - 'perimeter error'
    - 'area error'
    - 'smoothness error'
    - 'compactness error'
    - 'concavity error'
    - 'concave points error'
    - 'symmetry error'
    - 'fractal dimension error'
    - 'worst radius'
    - 'worst texture'
    - 'worst perimeter'
    - 'worst area'
    - 'worst smoothness'
    - 'worst compactness'
    - 'worst concavity'
    - 'worst concave points'
    - 'worst symmetry'
    - 'worst fractal dimension'
    - Note: per the dataset description, `error` is the standard error of a measurement and `worst` is the mean of its three largest values; neither refers to a mistake or to poor quality.
- ### .target_names:
    ['malignant' 'benign']
- ### dataframe
    [![](https://i.imgur.com/TvzMAYm.png)](https://i.imgur.com/TvzMAYm.png)

### wine
- code: dataset
    ```python
    import sklearn.datasets
    wine = sklearn.datasets.load_wine()
    ```
- code: dataset / df_xy
    ```python
    from sklearn.datasets import load_wine
    import pandas

    wine = load_wine()
    df = pandas.DataFrame(wine.data, columns=wine.feature_names)
    df['target'] = wine.target
    df
    ```
- ### .feature_names
    - 'alcohol'
    - 'malic_acid'
    - 'ash'
    - 'alcalinity_of_ash'
    - 'magnesium'
    - 'total_phenols'
    - 'flavanoids'
    - 'nonflavanoid_phenols'
    - 'proanthocyanins'
    - 'color_intensity'
    - 'hue'
    - 'od280/od315_of_diluted_wines'
    - 'proline'
- ### .target_names:
    ['class_0' 'class_1' 'class_2']
- ### dataframe
    ![](https://i.imgur.com/mwCcNNT.png)

## Regression datasets
### boston
- ### dataset
    - News
        - [Things You Didn't Know About the Boston Housing Dataset](https://towardsdatascience.com/things-you-didnt-know-about-the-boston-housing-dataset-2e87a6f960e8)
    - [StatLib---Datasets Archive](http://lib.stat.cmu.edu/datasets/)
        - [boston](http://lib.stat.cmu.edu/datasets/boston)
          ```
          Variables in order:
          CRIM     per capita crime rate by town
          ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
          INDUS    proportion of non-retail business acres per town
          CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
          NOX      nitric oxides concentration (parts per 10 million)
          RM       average number of rooms per dwelling
          AGE      proportion of owner-occupied units built prior to 1940
          DIS      weighted distances to five Boston employment centres
          RAD      index of accessibility to radial highways
          TAX      full-value property-tax rate per $10,000
          PTRATIO  pupil-teacher ratio by town
          B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
          LSTAT    % lower status of the population
          MEDV     Median value of owner-occupied homes in $1000's
          ```
- ### code1
    ```python=
    # Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so code1 to code3 only run on older versions.
    import warnings
    from sklearn.datasets import load_boston
    import pandas as pd

    warnings.filterwarnings("ignore")
    boston = load_boston()
    df = pd.DataFrame(boston.data, columns=boston.feature_names)
    df['MEDV'] = boston.target
    df
    ```
    ![](https://i.imgur.com/HEZXICR.png)
- ### code2
    ```python=
    from sklearn.datasets import load_boston
    import pandas as pd

    boston = load_boston()
    # read the bundled CSV directly; the first row is skipped, the second row is the header
    df = pd.read_csv(boston.filename, skiprows=1, header=0)
    df
    ```
- ### [code3: official code](https://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_boston.html)
    ```python=
    import warnings
    from sklearn.datasets import load_boston

    with warnings.catch_warnings():
        # You should probably not use this dataset.
        warnings.filterwarnings("ignore")
        X, y = load_boston(return_X_y=True)
    print(X.shape)
    #(506, 13)
    ```
    Replacement snippet from the official deprecation notice:
    ```python=
    import pandas as pd
    import numpy as np

    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    # raw string so that \s is passed to the regex engine unchanged
    raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
    # each record spans two physical lines in the source file
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]
    ```
- ### misc
    - ### [[2021 iThome 鐵人賽] 全民瘋AI系列2.0 系列](https://ithelp.ithome.com.tw/users/20107247/ironman/4723)
        - [[Day 16] 每個模型我全都要 - 堆疊法 (Stacking)](https://ithelp.ithome.com.tw/m/articles/10274009)
            - R2
                - Training set score: 0.9608703782891547
                - Test set score: 0.9371735287625855
            - MSE
                - Training set MSE: 3.389581229598408
                - Test set MSE: 3.9225215768179433

## Other datasets
- load_diabetes
    - Usage example: [Automatic Logging](https://www.mlflow.org/docs/latest/tracking.html#automatic-logging)
- load_digits
- load_files
- load_linnerud
- load_sample_image
- load_sample_images
- load_svmlight_file
- load_svmlight_files

## Randomly generated datasets
- [Python sklearn RandomForestClassifier non-reproducible results](https://stackoverflow.com/questions/47433920)
    ```python
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=0, shuffle=False)
    ```
- [13 Ensemble Learning (集體學習)](https://laoweizz.blogspot.com/2018/12/ensemble.html)
    ```python=
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    X, y = make_classification(n_samples=1000, n_features=100, n_informative=20,
                               n_clusters_per_class=3, random_state=11)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=11)
    print(X_test.shape)

    clf = DecisionTreeClassifier(random_state=11)
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    print(classification_report(y_test, predictions))
    ```

<hr>

# [Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
> UCI Machine Learning Repository

<hr>

## Classification
### [Iris](https://archive.ics.uci.edu/ml/datasets/Iris)
![](https://i.imgur.com/DRAAOZ4.png)

- [Attribute description](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names)

    | Column index | Description |
    | ------ | ------------------ |
    | 1 | sepal length in cm |
    | 2 | sepal width in cm |
    | 3 | petal length in cm |
    | 4 | petal width in cm |
    | 5 | class: Iris Setosa, Iris Versicolour, Iris Virginica |

- [Download the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data)

#### Python code / loading the data
```python=
import pandas as pd

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)
df.tail()
```
![](https://i.imgur.com/ybRguBM.png)
![](https://i.imgur.com/EVFCy0V.png)
![](https://i.imgur.com/d8d6WcF.png)

#### Python code / building the inputs and outputs
```python=
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header=None)

X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values

# encode the class labels as integers:
# -1: Iris-setosa
#  0: Iris-versicolor
#  1: Iris-virginica
y = np.where(y == 'Iris-setosa', -1, np.where(y == 'Iris-versicolor', 0, 1))

# the file is ordered by class, 50 rows each
plt.scatter(X[  0: 50, 0], X[  0: 50, 2], marker='o', color='blue', label='setosa')
plt.scatter(X[ 50:100, 0], X[ 50:100, 2], marker='x', color='red', label='versicolor')
plt.scatter(X[100:150, 0], X[100:150, 2], marker='o', color='green', label='virginica')
plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.legend(loc='lower right')
plt.show()
```
![](https://i.imgur.com/EHpTOY3.png)

#### [seaborn](https://docs.microsoft.com/zh-tw/azure/databricks/notebooks/visualizations/#seaborn)
```python=
import seaborn as sns
sns.set(style="white")

df = sns.load_dataset("iris")
g = sns.PairGrid(df, diag_sharey=True)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=3)
g.map_upper(sns.regplot)

display(g.fig)  # display() is provided by the notebook environment
```
![](https://i.imgur.com/20l7ONF.png)

### [Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments/discussion/39322)
[![](https://i.imgur.com/FZ6WBzc.png)](https://i.imgur.com/FZ6WBzc.png)
![](https://i.imgur.com/76aiZBD.png)
![](https://i.imgur.com/BqmbPW2.png)
![](https://i.imgur.com/Pw48FIU.png)

- Test summary
    - ### random-forest:
        - train: 99.99%
        - predict: 78.82%
    - ### auto-sklearn:
        - 1 hour
            - train: 87.23%
            - predict: 80.45%
        - 20 hours
            - train: 83.65%
            - Best validation score: 0.797441
            - predict: 80.75%

### [Adult Data Set](https://archive.ics.uci.edu/ml/datasets/adult) (binary)
![](https://i.imgur.com/dtv9MZV.png)

- ### [Tabular Data Example](https://towardsdatascience.com/autogluon-deep-learning-automl-5cdb4e2388ec#065f)
    - data

        | data | shape | download link |
        | ---- | ----- | ------------- |
        | train | 39073 x 15 | https://autogluon.s3.amazonaws.com/datasets/AdultIncomeBinaryClassification/train_data.csv |
        | test | 9769 x 15 | https://autogluon.s3.amazonaws.com/datasets/AdultIncomeBinaryClassification/test_data.csv |

    - features
        1. age
        2. workclass
        3. fnlwgt
        4. education
        5. education-num
        6. marital-status
        7. occupation
        8. relationship
        9. race
        10. sex
        11. capital-gain
        12. capital-loss
        13. hours-per-week
        14. native-country
    - target
        - class
            - `<=50K`
            - `>50K`
    - models
      [![](https://i.imgur.com/o8f3L9v.png)](https://i.imgur.com/o8f3L9v.png)
    - metrics
        - AutoGluon (GBM+NN_TORCH)
            - accuracy: 0.8752175248234211
            - balanced_accuracy: 0.7985774242740231
            - mcc: 0.6384055943366135
            - roc_auc: 0.9292811684599376
            - f1: 0.7128386336866903
            - precision: 0.785158277114686
            - recall: 0.6527178602243313
        - AutoGluon (`auto_stack=False`)
            - accuracy: 0.8763435356740711
        - AutoGluon (`auto_stack=True`)
            - accuracy: 0.8760364418057119
        - AutoSklearn
            - accuracy: 0.873170232367694 (24H@twcc)

<hr>

## Regression
### [Real estate valuation](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set)
![](https://i.imgur.com/lZqabKX.png)

- Attribute description
    - Input:
        - X1=the transaction date (for example, 2013.250=2013 March, 2013.500=2013 June, etc.)
        - X2=the house age (unit: year)
        - X3=the distance to the nearest MRT station (unit: meter)
        - X4=the number of convenience stores in the living circle on foot (integer)
        - X5=the geographic coordinate, latitude. (unit: degree)
        - X6=the geographic coordinate, longitude.
(unit: degree)
    - Output:
        - Y=house price of unit area (10000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 meter squared)
- [Download the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/00477/Real%20estate%20valuation%20data%20set.xlsx)

### Housing dataset
- [Download the dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data) (whitespace-delimited, 506 rows x 14 columns)
- Attribute description
  ![](https://i.imgur.com/k9tyqFC.png)
  ![](https://i.imgur.com/IvKv0m0.png)
    - Input:
        - Columns: 1^st^ (CRIM) ~ 13^th^ (LSTAT)
    - Output:
        - Column: 14^th^ (MEDV)

#### python code
```python=
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_csv(url, header=None, sep=r'\s+')
df.columns = [
    'CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
    'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# debug
df.columns
df.iloc[0]
df.iloc[0, 0]        # =0.00632
df.iloc[0]['CRIM']   # =0.00632
df.head()

# ----- pairwise scatter plots of selected columns
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style='whitegrid', context='notebook')
cols = ['LSTAT', 'INDUS', 'NOX', 'RM', 'MEDV']
sns.pairplot(df[cols], height=2.5)
plt.show()

# ----- correlation matrix heatmap of the same columns
import numpy as np

cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.5)
hm = sns.heatmap(
    cm, cbar=True, annot=True, square=True, fmt='.2f',
    annot_kws={'size': 15}, yticklabels=cols, xticklabels=cols)
plt.show()
```
![](https://i.imgur.com/EgZkPRV.png)
![](https://i.imgur.com/0pcvvBo.png)

### [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/overview)

<hr>

# Kaggle (medical)
## [Medical Cost Personal Datasets](https://www.kaggle.com/mirichoi0218/insurance)
- Insurance Forecast by using Linear Regression
- Task: Regression (estimate insurance charges)
- Records: 1338
- Columns: age,sex,bmi,children,smoker,region,charges

## [Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments)
- Why do 30% of patients miss their scheduled appointments?
- Task: Classification (predict whether a patient shows up after booking an appointment)
- Records: 110527
- Columns: PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age... (13 + 1 in total)
    - Hipertension: hypertension (spelled this way in the dataset)
    - Diabetes: diabetes
    - Alcoholism: alcoholism
- ### Data profile
    ```python
    # pandas-profiling has since been renamed ydata-profiling (import ydata_profiling)
    import pandas_profiling
    pandas_profiling.ProfileReport(df)
    ```
    ![](https://i.imgur.com/zex2Tl3.png)
- ### [Experiment results](https://hackmd.io/wgExODglS_iltzcWYApEpA)
    - Best accuracy: 80.8%
        - SGD
        - BernoulliNB
        - SVM / SVC
        - MLP

## [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
- Predict the onset of diabetes based on diagnostic measures
- Task: Classification (whether the patient has diabetes)
- Records: 768
- Columns: Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome

## [Cardiovascular Disease dataset](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset)
- Task: Classification (presence or absence of cardiovascular disease)
- Records: 70000
- Columns: id;age;gender;height;weight;ap_hi;ap_lo;cholesterol;gluc;smoke;alco;active;cardio

## [Heart Failure Prediction](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data)
- Task: Classification (predict heart failure)
- Records: 299
- Columns: age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure, etc. (12 + 1 in total)

## [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci)
- Task: Classification (predict heart disease)
- Records: 303
- Columns: age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target

# Kaggle (non-medical)
## [Predicting Term Deposit Subscriptions](https://www.kaggle.com/janiobachmann/bank-marketing-dataset/discussion)
- 11162 records in total, each with 17 columns
    - Integer x 7
    - String x 6
    - Boolean x 4

## Palmer Archipelago (Antarctica) penguin data
- ### [[Kaggle] Palmer Archipelago (Antarctica) penguin data](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data)
- ### [[github] allisonhorst / palmerpenguins](https://github.com/allisonhorst/palmerpenguins/blob/master/README.md)
    ![](https://i.imgur.com/8O88bsE.png)
    - [data-raw/penguins.R](https://github.com/allisonhorst/palmerpenguins/blob/master/data-raw/penguins.R)
        - This script lists the dataset sources
            - Adelie penguin data from: https://doi.org/10.6073/pasta/abc50eed9138b75f54eaada0841b9b86
            - Gentoo penguin data from: https://doi.org/10.6073/pasta/2b1cff60f81640f182433d23e68541ce
            - Chinstrap penguin data from: https://doi.org/10.6073/pasta/409c808f8fc9899d02401bdb04580af7
        - See the EDI Data Portal entry below
- ### [EDI Data Portal](https://portal.edirepository.org/nis/home.jsp) (latest datasets)
    - **Adelie penguin data:** https://doi.org/10.6073/pasta/98b16d7d563f265cb52372c8ca99e60f
      Version: knb-lter-pal.219.5 (2020-06-08)
    - **Gentoo penguin data:** https://doi.org/10.6073/pasta/9fc8f9b5a2fa28bdca96516649b6599b
      Version: knb-lter-pal.220.7 (2020-10-24)
    - **Chinstrap penguin data:** https://doi.org/10.6073/pasta/ce9b4713bb8c065a8fcfd7f50bf30dde
      Version: knb-lter-pal.221.8 (2020-10-24)
- ### Column descriptions
    - [https://allisonhorst.github.io/palmerpenguins/reference/penguins_raw.html](https://allisonhorst.github.io/palmerpenguins/reference/penguins_raw.html)
    - The "View Full Metadata" page on the EDI Data Portal also describes the columns
        - For example, the Adelie penguin [Full Metadata](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-pal.219.5)
          ![](https://i.imgur.com/rU1qWz1.png)
- Data analysis
    - [penguin dataset : The new Iris](https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris)

## [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) (house price prediction)
- Training set: 1460 rows x 81 columns
- Test set: 1459 rows x 80 columns
- ### Walkthrough
    - [3.16.
实战Kaggle比赛：房价预测](https://zh.d2l.ai/chapter_deep-learning-basics/kaggle-house-price.html) :+1: :100: - ### Google AutoML Tables - ### 相關性 (對照 AzureML，這個應該是皮耳森相互關聯) ![](https://i.imgur.com/B7s7JwU.png) - ### Azure AutoML - ### 遺漏特徵值插補 ![](https://i.imgur.com/W4auPuO.png) ![](https://i.imgur.com/ReaMiA5.png) - ### 高基數特徵偵測 ![](https://i.imgur.com/xyD28Ls.png) ![](https://i.imgur.com/9C456IT.png) - Id, LotFrontage, GarageYrBlt ![](https://i.imgur.com/i6Usxsw.png) - ### 特徵重要度 - top features - **GrLivArea**: Above grade (ground) living area square feet - **GarageCars**: Size of garage in car capacity - **TotalBsmtSF**: Total square feet of basement area - **ExterQual**: Evaluates the quality of the material on the exterior - **YearBuilt**: Original construction date - StackEnsemble ![](https://i.imgur.com/i6RkU0z.png) - VotingEnsemble ![](https://i.imgur.com/4lIgWGt.png) - XGBoostRegressor ![](https://i.imgur.com/AcfxUhM.png) - ### 設計工具：[統計分析關聯性：皮耳森相互關聯(Pearson correlation)](https://docs.microsoft.com/zh-tw/azure/machine-learning/algorithm-module-reference/filter-based-feature-selection) ![](https://i.imgur.com/okPr6pU.png) ```yaml [ { "SalePrice": 1.0, "OverallQual": 0.790981600583805, "GrLivArea": 0.7086244776126522, "GarageCars": 0.6404091972583521, "GarageArea": 0.6234314389183618, "TotalBsmtSF": 0.6135805515591953, "1stFlrSF": 0.6058521846919146, "FullBath": 0.560663762748446, "TotRmsAbvGrd": 0.5337231555820281, "YearBuilt": 0.5228973328794969, "YearRemodAdd": 0.5071009671113861, "MasVnrArea": 0.4774930470957156, "Fireplaces": 0.4669288367515278, "BsmtFinSF1": 0.38641980624215333, "WoodDeckSF": 0.3244134445681299, "2ndFlrSF": 0.31933380283206775, "OpenPorchSF": 0.31585622711605527, "HalfBath": 0.28410767559478245, "LotArea": 0.26384335387140573, "CentralAir": 0.251328163840155, "BsmtFullBath": 0.22712223313149427, "BsmtUnfSF": 0.21447910554696886, "BedroomAbvGr": 0.16821315430074, "KitchenAbvGr": 0.13590737084214122, "EnclosedPorch": 0.12857795792595672, 
"ScreenPorch": 0.1114465711429112, "PoolArea": 0.09240354949187321, "MSSubClass": 0.08428413512659517, "OverallCond": 0.07785589404867801, "MoSold": 0.046432245223819335, "3SsnPorch": 0.04458366533574842, "YrSold": 0.028922585168730326, "LowQualFinSF": 0.025606130000679548, "Id": 0.021916719443431102, "MiscVal": 0.021189579640303255, "BsmtHalfBath": 0.01684415429735901, "BsmtFinSF2": 0.011378121450215153, "KitchenQual": 0.0, "Functional": 0.0, "FireplaceQu": 0.0, "GarageType": 0.0, "GarageYrBlt": 0.0, "GarageFinish": 0.0, "GarageQual": 0.0, "GarageCond": 0.0, "PavedDrive": 0.0, "PoolQC": 0.0, "Fence": 0.0, "MiscFeature": 0.0, "SaleType": 0.0, "HeatingQC": 0.0, "Electrical": 0.0, "Heating": 0.0, "MSZoning": 0.0, "LotFrontage": 0.0, "Street": 0.0, "Alley": 0.0, "LotShape": 0.0, "LandContour": 0.0, "Utilities": 0.0, "LotConfig": 0.0, "LandSlope": 0.0, "Neighborhood": 0.0, "Condition1": 0.0, "Condition2": 0.0, "BldgType": 0.0, "HouseStyle": 0.0, "RoofStyle": 0.0, "RoofMatl": 0.0, "Exterior1st": 0.0, "Exterior2nd": 0.0, "MasVnrType": 0.0, "ExterQual": 0.0, "ExterCond": 0.0, "Foundation": 0.0, "BsmtQual": 0.0, "BsmtCond": 0.0, "BsmtExposure": 0.0, "BsmtFinType1": 0.0, "BsmtFinType2": 0.0, "SaleCondition": 0.0 } ] ``` - ### 設計工具：[統計分析關聯性：卡方平方(Chi squared)](https://docs.microsoft.com/zh-tw/azure/machine-learning/algorithm-module-reference/filter-based-feature-selection) ![](https://i.imgur.com/vOCWpvt.png) ```json= [ { "SalePrice": 1.0, "LotFrontage": 3011.821838830588, "GrLivArea": 2684.988619593563, "OverallQual": 2076.4370435123515, "2ndFlrSF": 1729.135623005293, "Neighborhood": 1718.9416421931924, "GarageYrBlt": 1444.01487442007, "MasVnrArea": 1232.978646691799, "TotalBsmtSF": 1208.6424283049907, "GarageCars": 1154.0818931305969, "GarageArea": 1117.3938558971226, "BsmtQual": 1089.2201890097551, "ExterQual": 1042.6353408934951, "1stFlrSF": 1039.0358146067674, "KitchenQual": 986.0213093615508, "YearBuilt": 844.8493742946195, "FullBath": 834.075902967989, "TotRmsAbvGrd": 
812.9894565705961, "PoolArea": 761.8265722853805, "GarageFinish": 707.8632871546307, "FireplaceQu": 678.7779866327791, "BsmtFinSF1": 662.9655270176974, "YearRemodAdd": 661.1373017171591, "MSSubClass": 660.5744268968961, "GarageType": 634.2167951745353, "Foundation": 564.3189858452354, "Exterior2nd": 513.098610732704, "BsmtFinType1": 490.98687575736005, "LotArea": 489.7292604702702, "OpenPorchSF": 476.0364376643361, "Exterior1st": 470.05636643687404, "Fireplaces": 452.174469195772, "BsmtUnfSF": 403.7488934069761, "PoolQC": 374.3439140033527, "HeatingQC": 367.62791254066514, "WoodDeckSF": 359.9288380271235, "OverallCond": 339.3392846787617, "MasVnrType": 336.06740869693033, "MSZoning": 318.4825084183071, "BsmtExposure": 313.85916169813373, "GarageQual": 304.205287648579, "SaleType": 272.2658167083576, "SaleCondition": 266.56895904550186, "GarageCond": 264.01574101164795, "HouseStyle": 260.6223484176385, "BedroomAbvGr": 213.11479134627658, "CentralAir": 202.04060045537705, "HalfBath": 188.09435622298125, "LotShape": 184.24822870560422, "BsmtCond": 176.7307642825022, "RoofMatl": 173.2228239075082, "PavedDrive": 166.95685885309413, "BsmtFinType2": 160.1892640411418, "Electrical": 157.74469500376358, "ScreenPorch": 156.6847308357881, "EnclosedPorch": 147.07828586418742, "RoofStyle": 132.39898289223495, "Condition1": 119.59281692687587, "Fence": 118.35962619971271, "BsmtFullBath": 111.4363605643106, "BldgType": 92.76996588414389, "MoSold": 88.41172079015419, "BsmtFinSF2": 82.90237961824074, "Functional": 82.05618468076621, "LotConfig": 77.14912520796912, "Id": 72.06877839487606, "LowQualFinSF": 71.6470236213088, "Alley": 69.71665915090168, "ExterCond": 63.42406887735416, "LandContour": 61.538330919443055, "Heating": 60.68961833983211, "KitchenAbvGr": 55.78543944013338, "3SsnPorch": 51.077139267999875, "Condition2": 41.64240250802226, "YrSold": 37.35295246655157, "MiscVal": 28.367347205758172, "BsmtHalfBath": 27.5360756465552, "MiscFeature": 19.55442424515907, "LandSlope": 
19.518483799644102, "Street": 5.84078739669776, "Utilities": 2.017911056480437 } ]
```

- ### House Prices (house price prediction) scoring results

    | Platform | Model | Self-evaluated: Std RMSE | Kaggle: RMSE | Rank |
    | -------- | -------- | -------- | -------- | -------- |
    | ==ASUS AI Maker== | AutoRegressor | 15013.36 (non-std) | 0.12742 :+1: | 1909 / 8200 |
    | Google Vertex AI | - | 23,895.473 (non-std) | 0.13608 | 4333 / 10815 (2021/05/24) |
    | Azure AutoML | StackEnsemble | 0.03519 | 0.13771 | 3459 / 8200 |
    | Azure AutoML | MaxAbsScaler, LightGBM | 0.03753 | 0.13841 | 3537 / 8200 |
    | Azure AutoML | VotingEnsemble | 25239 (non-std) / 0.03505 | 0.13876 | 3607 / 8200 |
    | Azure AutoML | MaxAbsScaler, XGBoostRegressor | 0.03879 | 0.14746 | 4732 / 8200 |
    | Google AutoML (revised) | - | 41445.203 (non-std) | 0.15973 | 7454 / 10810 (2021/05/24) |
    | Azure AutoML | MaxAbsScaler, RandomForest | 0.04158 | 0.16989 | 6230 / 8200 |
    | Google AutoML (old) | :warning: cannot handle unknown labels | - | - | - |

## [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data)
### Download links
- [pandas-videos / data / titanic_train.csv](https://github.com/justmarkham/pandas-videos/blob/master/data/titanic_train.csv)
    - Short URL: http://bit.ly/kaggletrain ([source](https://leemeng.tw/practical-pandas-tutorial-for-aspiring-data-scientists.html))
- [michhar / titanic.csv](https://gist.github.com/michhar/2dfd2de0d4f8727f873422c5d959fff5)
- [plotly / datasets](https://github.com/plotly/datasets/blob/master/titanic.csv)

## Wine Quality
> [[HackMD] Red Wine Quality](https://hackmd.io/GYh8h8zBT3ucNF1w09r1AQ)

### Download links
- https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
- https://www.kaggle.com/datasets/rajyellow46/wine-quality
- https://www.kaggle.com/datasets/yasserh/wine-quality-dataset
- https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset

<hr>

# Azure dataset examples
## [Open Datasets catalog overview](https://docs.microsoft.com/zh-tw/azure/open-datasets/dataset-catalog#AzureNotebooks)
### Transportation
### Health and genomics
### Labor and economics
### Population and safety
### Supplemental and common datasets

| Dataset | Description |
|-------|-------------|
| Diabetes | The Diabetes dataset has 442 samples with 10 features, which makes it well suited to getting started with machine learning algorithms. |
| OJ Sales Simulated Data | This dataset is derived from the Dominick's OJ dataset and includes extra simulated data. Its goal is to provide a dataset that makes it easy to train thousands of models on Azure Machine Learning. |
| MNIST database of handwritten digits | The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image. |
| Microsoft News Recommendation Dataset | The Microsoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It serves as a benchmark dataset for news recommendation and supports research on news recommendation and recommender systems. |
| **Public Holidays** | Worldwide public holiday data from the PyPI holidays package and Wikipedia, covering 38 countries or regions from 1970 to 2099. |
| Russian Open Speech To Text | Russian Open STT is a large open speech-to-text dataset for the Russian language. |

## Regression data
- ### [Sample pipelines and datasets for Azure Machine Learning designer](https://docs.microsoft.com/zh-tw/azure/machine-learning/samples-designer)

    | Task type | Sample title | Description |
    | -------- | -------- | -------- |
    | Regression | [NYC taxi data](https://docs.microsoft.com/zh-tw/azure/machine-learning/tutorial-auto-train-models) | Uses AutoML / regression |
    | Regression | [Automobile price prediction (basic)](https://github.com/Azure/MachineLearningDesigner/blob/master/articles/samples/regression-automobile-price-prediction-basic.md) | Predict car prices using linear regression. |
    | Regression | [Automobile price prediction (advanced)](https://github.com/Azure/MachineLearningDesigner/blob/master/articles/samples/regression-automobile-price-prediction-compare-algorithms.md) | Predict car prices using decision forest and boosted decision tree regressors. Compare the two models to find the best algorithm. |

## Time series data
> #data #dataset #time-series #ml #train #predict #forecast

- ### Hourly energy demand
    > https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/nyc_energy.csv
    > Columns: timeStamp (timestamp), demand (energy demand), precip (precipitation), temp (temperature)
    > Hourly energy demand and basic weather data.

    - Source: [[doc] Configuration settings](https://docs.microsoft.com/zh-tw/azure/machine-learning/how-to-auto-train-forecast#configuration-settings)
      ![](https://i.imgur.com/m0m7CFQ.png)
        - [auto-ml-forecasting-energy-demand.ipynb](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/forecasting-energy-demand/auto-ml-forecasting-energy-demand.ipynb)
          ![](https://i.imgur.com/R480R8i.png)
            - [data from New York
City](http://mis.nyiso.com/public/P-58Blist.htm)

<hr>

# Data augmentation
- 2021/06/19 - [Facebook AI Open Sources AugLy: A New Python Library For Data Augmentation To Develop Robust Machine Learning Models](https://www.marktechpost.com/2021/06/19/facebook-ai-open-sources-augly-a-new-python-library-for-data-augmentation-to-develop-robust-machine-learning-models/)
    - AugLy is a new open-source data augmentation library covering audio, image, video, and text, modalities that are becoming increasingly significant in several AI research fields.
    - Sample
      [![](https://i.imgur.com/BtQ6KpQ.png)](https://i.imgur.com/BtQ6KpQ.png)
      [![](https://i.imgur.com/Kww9u7p.jpg)](https://i.imgur.com/Kww9u7p.jpg)

<hr>

# Other datasets
## Classification
### Credit card fraud detection
- ### [[2021 iThome 鐵人賽] 全民瘋AI系列2.0 系列](https://ithelp.ithome.com.tw/users/20107247/ironman/4723)
    - [[Day 17] 輕量化的梯度提升機 - LightGBM](https://ithelp.ithome.com.tw/m/articles/10274577)
        - dataset
          https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv

## Regression
## Time series

<hr>

# Perspectives from others
- [這樣用資料才能幫企業解決問題](https://aiacademy.tw/how-companies-use-data/)
  [![](https://i.imgur.com/3LBMup9.png)](https://i.imgur.com/3LBMup9.png)

<hr>

> **Data does not equal value; it only has value when used in the right place**
>
> 陳昇瑋 (Sheng-Wei Chen), CEO of Taiwan AI Academy, cautions in the book 《人工智慧在台灣》 that **many companies assume the data they collect is valuable in itself**. In fact, data has to be processed, analyzed, and developed before it becomes a **final product**, which might be an analysis report or a recommendation for a specific decision.
>
> In other words, data that has not been properly "processed" and "extracted and analyzed" has a value that is still undeveloped and undetermined, and the people who use it must also be able to apply the "right data" in the "right way" to the "right scenario".

![](https://i.imgur.com/8b2Sxx6.png)