Wine Classes
===

###### tags: `ML / 資料集`
###### tags: `ML`, `資料集`, `dataset`, `sklearn`, `python`, `釀酒`

:::info
**Wine source**: wines made from three different grape cultivars grown in the same region of Italy.
**Analysis**: the chemical analysis produced about 30 chemical properties, but only 13 of them currently have complete data.
**Goal**: given these 13 chemical properties, predict which of the 3 wine types a sample belongs to.
:::

[TOC]

## [UCI] Wine Data Set
> https://archive.ics.uci.edu/ml/datasets/wine

- [scikit-learn ships the same data as UCI](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)
    > The copy of UCI ML Wine Data Set dataset is downloaded and modified to fit standard format from: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
- dataset info
    - These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
    - I think that the initial data set had around 30 variables, but for some reason I only have the 13 dimensional version. I had a list of what the 30 or so variables were, but a.) I lost it, and b.), I would not know which 13 variables are included in the set.

<hr>

## features

| Column name                  | Chinese term   | Data type |
| ---------------------------- | -------------- | --------- |
| Alcohol                      | 酒精           | float     |
| Malic acid                   | 蘋果酸         | float     |
| Ash                          | 灰             | float     |
| Alcalinity of ash            | 灰的鹼度       | float     |
| Magnesium                    | 鎂             | float     |
| Total phenols                | 總酚           | float     |
| Flavanoids                   | 黃酮類化合物   | float     |
| Nonflavanoid phenols         | 非黃酮酚類     | float     |
| Proanthocyanins              | 原花青素       | float     |
| Color intensity              | 成色, 色彩濃度 | float     |
| Hue                          | 色相           | float     |
| OD280/OD315 of diluted wines | 吸光度         | float     |
| Proline                      | 脯氨酸         | float     |
| Class                        | 分類           | int       |

Pronunciation notes:

- alcohol [ˋælkə͵hɔl]
- alcalinity, alkalinity [͵ælkəˋlɪnətɪ]
- ash [æʃ]
- diluted [daɪˋlutɪd]
- flavanoids: apparently a kind of flavonoid (?)
    - https://en.wikipedia.org/wiki/Flavonoid
- magnesium [mægˋniʃɪəm]
- malic [ˋmælɪk]
- phenol, phenols [ˋfinɔl]
- proline [ˈproˌlin]
- References for the term translations
    - [虎尾科技大學管理學院 - 資料庫中心 - 資料說明](http://management.nfu.edu.tw/ezfiles/22/1022/img/53/121545875.pdf) (National Formosa University, College of Management, Database Center: data description)

## sklearn
> [`sklearn.datasets.load_wine`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html)

### dataset info

| Classes           | 3              |
|-------------------|----------------|
| Samples per class | [59,71,48]     |
| Samples total     | 178            |
| Dimensionality    | 13             |
| Features          | real, positive |

### dataframe

```python
from sklearn.datasets import load_wine
from pandas import DataFrame
from IPython.display import display

wine = load_wine()
# Keys of the returned Bunch:
# 'data': df_X
# 'feature_names': [
#     'alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium',
#     'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins',
#     'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline'
# ]
# 'target': df_y
# 'target_names': array(['class_0', 'class_1', 'class_2'], dtype='<U7'),
# 'frame': None

print("attributes:", dir(wine))
#print(wine.DESCR)

# Combine the 13 features and the class label into one table
df_Xy = DataFrame(wine.data, columns=wine.feature_names)
df_Xy['target'] = wine.target
display(df_Xy)
```

[![](https://i.imgur.com/QUhFqjm.png)](https://i.imgur.com/QUhFqjm.png)

### Basic statistics

```python
df_Xy.describe()
```

[![](https://i.imgur.com/ZIlYFob.png)](https://i.imgur.com/ZIlYFob.png)

### Splitting the data

```python=
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from pandas import DataFrame
from IPython.display import display

wine = load_wine()
df_X = DataFrame(wine.data, columns=wine.feature_names)
df_y = DataFrame(wine.target, columns=['target'])

# Stratified split so each class keeps its proportion in train and test
train_df_X, test_df_X, train_df_y, test_df_y = train_test_split(
    df_X, df_y, random_state=0, shuffle=True, stratify=df_y)

display('train_df_X:', train_df_X)
display('train_df_y:', train_df_y)
display('test_df_X:', test_df_X.tail())
display('test_df_y:', test_df_y.tail())

# Save the four parts as CSV files
train_df_X.to_csv('train_X.csv', index=False)
train_df_y.to_csv('train_y.csv', index=False)
test_df_X.to_csv('test_X.csv', index=False)
test_df_y.to_csv('test_y.csv', index=False)
```

### Feature importance

| features(en) | features(cht) | importance |
| -------- | ---------- | ---------- |
| proline | 脯氨酸 | 16.30% |
| flavanoids | 黃酮類化合物 | 12.52% |
| alcohol | 酒精 | 11.72% |
| color_intensity | 成色 | 10.90% |
| od280/od315_of_diluted_wines | 吸光度 | 9.89% |
| hue | 色相 | 9.43% |
| total_phenols | 總酚 | 6.91% |
| malic_acid | 蘋果酸 | 4.63% |
| alcalinity_of_ash | 灰的鹼度 | 4.42% |
| nonflavanoid_phenols | 非黃酮酚類 | 3.74% |
| magnesium | 鎂 | 3.64% |
| proanthocyanins | 原花青素 | 3.31% |
| ash | 灰 | 2.59% |

```python=
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score

clf = ExtraTreesClassifier(random_state=0)

# train
# -----
# ravel() turns the single-column DataFrame into a 1d array; without it
# sklearn warns: "A column-vector y was passed when a 1d array was expected.
# Please change the shape of y to (n_samples,), for example using ravel()."
clf.fit(train_df_X, train_df_y.values.ravel())

# predict
# -------
y_true = test_df_y.values.ravel()
y_pred = clf.predict(test_df_X)

# evaluate
# --------
print('acc:', accuracy_score(y_true, y_pred))
print('f1-weighted:', f1_score(y_true, y_pred, average='weighted'))

# feature importances
# -------------------
# sum(clf.feature_importances_) = 100%
feature_importances = zip(clf.feature_names_in_, clf.feature_importances_)
feature_importances = sorted(feature_importances, key=lambda item: item[1], reverse=True)
for k, v in feature_importances:
    #print(k, v)
    print(f'| {k} | {v*100:0.2f}% |')
```

<hr>

## Measured feature importance

### Single-feature importance (horizontal table)

```python=
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score

# Reuses train_df_X, train_df_y, test_df_X, test_df_y from "Splitting the data"
y_true = test_df_y.values.ravel()

metrics = [
    ('acc', lambda y_true, y_pred: accuracy_score(y_true, y_pred)),
    ('f1', lambda y_true, y_pred: f1_score(y_true, y_pred, average='weighted')),
]

# init the table: one row per metric, one column per feature
leaderboard = DataFrame(columns=train_df_X.columns)
for metric in metrics:
    row_label = metric[0]
    leaderboard.loc[row_label] = [0.0] * len(train_df_X.columns)

# train & predict, using one feature at a time
for col_name in train_df_X.columns:
    X = train_df_X[[col_name]]
    y = train_df_y.values.ravel()
    clf = ExtraTreesClassifier(random_state=0)
    clf.fit(X, y)
    y_pred = clf.predict(test_df_X[[col_name]])

    # fill the result for each metric
    for row_label, evaluate in metrics:
        # .loc[row, col] writes in place; chained indexing may write to a copy
        leaderboard.loc[row_label, col_name] = evaluate(y_true, y_pred)

display(leaderboard)

# normalize to 100% (in percentage) for each metric
leaderboard.columns.name = '(%)'
for row_label, _ in metrics:
    # sum all columns for each metric
    total = sum(leaderboard.loc[row_label])
    # calculate the percentage, truncated to two decimals
    for col_name in leaderboard.columns:
        value = leaderboard.loc[row_label, col_name]
        leaderboard.loc[row_label, col_name] = int(value * 10000 / total) / 100

print('normalize to 100% (in percentage) for each metric')
display(leaderboard)
```

[![](https://i.imgur.com/fbcXP9x.png)](https://i.imgur.com/fbcXP9x.png)

| features(en) | features(cht) | importance |
| -------- | ---------- | ---------- |
| color_intensity | 成色 | 9.79% |
| hue | 色相 | 9.22% |
| alcohol | 酒精 | 8.93% |
| proanthocyanins | 原花青素 | 8.93% |
| flavanoids | 黃酮類化合物 | 8.35% |
| malic_acid | 蘋果酸 | 8.35% |
| proline | 脯氨酸 | 8.06% |
| total_phenols | 總酚 | 7.78% |
| alcalinity_of_ash | 灰的鹼度 | 7.20% |
| od280/od315_of_diluted_wines | 吸光度 | 6.91% |
| magnesium | 鎂 | 6.34% |
| ash | 灰 | 5.18% |
| nonflavanoid_phenols | 非黃酮酚類 | 4.89% |

### Single-feature importance (vertical table)

```python=
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score, f1_score

y_true = test_df_y.values.ravel()

metrics = [
    ('acc', lambda y_true, y_pred: accuracy_score(y_true, y_pred)),
    ('f1', lambda y_true, y_pred: f1_score(y_true, y_pred, average='weighted')),
]

# init the table: one column per metric, one row per feature
columns = [metric[0] for metric in metrics]
leaderboard = DataFrame(columns=columns)

# train & predict, using one feature at a time
for feature_name in train_df_X.columns:
    X = train_df_X[[feature_name]]
    y = train_df_y.values.ravel()
    clf = ExtraTreesClassifier(random_state=0)
    clf.fit(X, y)
    y_pred = clf.predict(test_df_X[[feature_name]])

    # append one row per feature, with one score per metric
    leaderboard.loc[feature_name] = [evaluate(y_true, y_pred) for _, evaluate in metrics]

display(leaderboard)

# normalize to 100% (in percentage) for each metric
leaderboard.columns.name = '(%)'
for col_name in leaderboard.columns:
    # sum all rows (features) for each metric
    total = sum(leaderboard[col_name])
    # calculate the percentage, truncated to two decimals
    for feature_name in leaderboard.index:
        value = leaderboard.loc[feature_name, col_name]
        leaderboard.loc[feature_name, col_name] = int(value * 10000 / total) / 100

print('normalize to 100% (in percentage) for each metric:')
display(leaderboard)
```

![](https://i.imgur.com/84mjI3l.png)
![](https://i.imgur.com/sP968gB.png)

<hr>

## Kaggle data

### Related download links

- https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
- https://www.kaggle.com/datasets/rajyellow46/wine-quality
- https://www.kaggle.com/datasets/yasserh/wine-quality-dataset
- https://www.kaggle.com/datasets/fedesoriano/spanish-wine-quality-dataset

<hr>

## References

![](https://i.imgur.com/cn6foKK.png)

- [關於麴，你應該知道的幾件事情](https://sakemaru.me/tw/blog/learn/關於麴，你應該知道的幾件事情/) (A few things you should know about koji)
- [綠色葡萄酒：稚嫩尚青草食男](https://www.gq.com.tw/blog/ingrid/detail-521.html) (Green wine: young, fresh, and gentle)
- [【葡萄酒釀造】紅葡萄酒釀造流程](http://www.winenote.com.tw/post/557/) (Winemaking: the red wine production process)
    ![](https://i.imgur.com/scfFJiq.png)
- [【品酒菜鳥入門第一課】紅葡萄酒釀造六大步驟](http://tellmewine.com.tw/tw/news/show.aspx?nuit=2&num=48) (Wine tasting for beginners, lesson 1: the six major steps of making red wine)
    - Because red wine gets its color from the grape skins, the most critical step is successfully extracting the tannins and pigments from the skins.
- [酒的顏色要怎樣看?](https://www.facebook.com/vinexwineacademy/posts/3444368148920339/) (How do you read the color of a wine?)
    ![](https://i.imgur.com/TVTXDc7.png)
- [葡萄酒颜色须知之基本要素](https://www.wine-world.com/course/rm/20130506112542000) (Wine color essentials: the basic elements)
- [葡萄品種特徵入門(一）](https://davincwine.com/%e8%91%a1%e8%90%84%e5%93%81%e7%a8%ae%e7%89%b9%e5%be%b5%e5%85%a5%e9%96%80/) (An introduction to grape variety characteristics, part 1)
    - Even when grown under different names and vinified with different techniques, a given wine always shows certain traits inherent to the grape's character.
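<hr>

## Appendix: percentage normalization

The single-feature importance code earlier in this note converts each metric's scores into percentage shares with `int(value * 10000 / total) / 100`, which truncates to two decimals instead of rounding. The block below is a minimal, dependency-free sketch of that arithmetic. The helper name `normalize_to_percent` and the three scores are made up for illustration; they are not results from the wine dataset.

```python
def normalize_to_percent(scores):
    """Convert raw scores to percentage shares, truncated to two decimals.

    `scores` maps a (hypothetical) feature name to a non-negative score.
    """
    total = sum(scores.values())
    if total == 0:
        raise ValueError("scores sum to zero; cannot normalize")
    # int() truncates toward zero, mirroring int(value * 10000 / total) / 100
    return {name: int(value * 10000 / total) / 100
            for name, value in scores.items()}


# Illustrative scores only (assumed values, not measured accuracies)
example_scores = {"feature_a": 0.50, "feature_b": 0.25, "feature_c": 0.75}
shares = normalize_to_percent(example_scores)
print(shares)
print("total:", sum(shares.values()))
```

With these made-up scores, `feature_b` comes out as 16.66 rather than 16.67, so the shares add up to slightly less than 100. This is consistent with the single-feature importance table in this note, whose entries sum to 99.93% rather than 100%.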