---
tags: Data Mining, Python
disqus: HackMD
---

Data Mining Homework #2
===

Classification
---
* Goal
  * Understand what classification systems do and the difference between the real behavior of a classification model and the observed data
* Description
  * Construct a classification model to observe the difference between the real ‘right’ data and the modeled data

Flow
---
* Step 1: Design a set of rules to classify data, e.g., classify students with good performance.
  * You should design k features/attributes for your problem first.
  * Use ‘absolutely right’ rules to generate your positive and negative data (the number of data points = M)
* Step 2: Use the data generated in Step 1 to construct your classification model
  * A decision tree is the basic requirement; you can add more classification models.
* Step 3: Compare the rules in the decision tree from Step 2 with the rules you used to generate your ‘right’ data
* Step 4: Discuss anything you can

Example
---
* Select a good apple
* Your “absolutely right” rule (R)
  * Color: dark red
  * Knocking sound: sharp
  * Head color: green
  * Weight: medium (hidden)
* Use (R) to generate your data
* Add more attributes (20+ is better)
* Use classifiers to classify your data
  * Decision tree
  * Naïve Bayes
  * ...

```python=
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, average_precision_score, precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

base_path = ''
file_path = base_path + 'data/bike_buyers_clean.csv'
dataSet = pd.read_csv(file_path, sep=',')
dataSet.head()
```

![](https://i.imgur.com/BY9MGlM.png)

```python=
dataSet.shape
```

(1000, 13)

Drop the meaningless column (ID)
---

```python=
dataSet = dataSet.drop('ID', axis=1)
dataSet.head()
```

![](https://i.imgur.com/ZUYuodA.png)

```python=
dataSet.shape
```

(1000, 12)

Analyze the numerical columns
---

```python=
numerical = [var for var in dataSet.columns if dataSet[var].dtype != 'O']
dataSet[numerical].head()
```

![](https://i.imgur.com/Nandlj5.png)

Analyze the non-numerical columns
---

```python=
categorical = [var for var in dataSet.columns if dataSet[var].dtype == 'O']
dataSet[categorical].head()
```

![](https://i.imgur.com/9A6ETPV.png)

Analyze the frequency of the categorical variables
---

```python=
for var in categorical:
    print(dataSet[var].value_counts())
    # np.float was removed from recent NumPy releases; the built-in float works the same here
    print(dataSet[var].value_counts() / float(len(dataSet)))
    print()
```

```
Married    539
Single     461
Name: Marital Status, dtype: int64
Married    0.539
Single     0.461
Name: Marital Status, dtype: float64

Male      509
Female    491
Name: Gender, dtype: int64
Male      0.509
Female    0.491
Name: Gender, dtype: float64

Bachelors              306
Partial College        265
High School            179
Graduate Degree        174
Partial High School     76
Name: Education, dtype: int64
Bachelors              0.306
Partial College        0.265
High School            0.179
Graduate Degree        0.174
Partial High School    0.076
Name: Education, dtype: float64

Professional      276
Skilled Manual    255
Clerical          177
Management        173
Manual            119
Name: Occupation, dtype: int64
Professional      0.276
Skilled Manual    0.255
Clerical          0.177
Management        0.173
Manual            0.119
Name: Occupation, dtype: float64

Yes    685
No     315
Name: Home Owner, dtype: int64
Yes    0.685
No     0.315
Name: Home Owner, dtype: float64

0-1 Miles     366
5-10 Miles    192
1-2 Miles     169
2-5 Miles     162
10+ Miles     111
Name: Commute Distance, dtype: int64
0-1 Miles     0.366
5-10 Miles    0.192
1-2 Miles     0.169
2-5 Miles     0.162
10+ Miles     0.111
Name: Commute Distance, dtype: float64

North America    508
Europe           300
Pacific          192
Name: Region, dtype: int64
North America    0.508
Europe           0.300
Pacific          0.192
Name: Region, dtype: float64

No     519
Yes    481
Name: Purchased Bike, dtype: int64
No     0.519
Yes    0.481
Name: Purchased Bike, dtype: float64
```
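As a side note, `value_counts(normalize=True)` produces the same proportion table directly, without the manual division; a minimal sketch reusing the `categorical` list and `dataSet` from above:

```python=
# Minimal sketch: counts and relative frequencies via value_counts(normalize=True)
for var in categorical:
    print(dataSet[var].value_counts())                # absolute counts
    print(dataSet[var].value_counts(normalize=True))  # relative frequencies
    print()
```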
Label Encoding
---

```python=
categorical = [var for var in dataSet.columns if dataSet[var].dtype == 'O']

label_encoder = preprocessing.LabelEncoder()
dataSet['Marital Status'] = label_encoder.fit_transform(dataSet['Marital Status'])
dataSet['Gender'] = label_encoder.fit_transform(dataSet['Gender'])
dataSet['Education'] = label_encoder.fit_transform(dataSet['Education'])
dataSet['Occupation'] = label_encoder.fit_transform(dataSet['Occupation'])
dataSet['Home Owner'] = label_encoder.fit_transform(dataSet['Home Owner'])
dataSet['Commute Distance'] = label_encoder.fit_transform(dataSet['Commute Distance'])
dataSet['Region'] = label_encoder.fit_transform(dataSet['Region'])
dataSet['Purchased Bike'] = label_encoder.fit_transform(dataSet['Purchased Bike'])
dataSet.head()
```

![](https://i.imgur.com/MOIAe7R.png)

Analyze the distribution of Age
---

```python=
dataSet['Age'].describe()
```

```
count    1000.000000
mean       44.190000
std        11.353537
min        25.000000
25%        35.000000
50%        43.000000
75%        52.000000
max        89.000000
Name: Age, dtype: float64
```

```python=
dataSet['Age'] = pd.cut(x=dataSet['Age'], bins=[0, 30, 40, 50, 60, 100, 150], labels=[0, 1, 2, 3, 4, 5])
dataSet['Age'] = dataSet['Age'].astype('int64')
dataSet['Age'].isnull().sum()
```

```
0
```

Analyze the distribution of Income
---

```python=
dataSet['Income'].describe()
```

```
count      1000.000000
mean      56140.000000
std       31081.609779
min       10000.000000
25%       30000.000000
50%       60000.000000
75%       70000.000000
max      170000.000000
Name: Income, dtype: float64
```

```python=
dataSet['Income'] = pd.cut(x=dataSet['Income'], bins=[0, 30000, 50000, 75000, 100000, 150000, 200000], labels=[1, 2, 3, 4, 5, 6])
dataSet['Income'] = dataSet['Income'].astype('int64')
dataSet['Income'].isnull().sum()
```

```
0
```

```python=
dataSet[numerical].isnull().sum()
```

```
Income      0
Children    0
Cars        0
Age         0
dtype: int64
```

```python=
dataSet[categorical].isnull().sum()
```

```
Marital Status      0
Gender              0
Education           0
Occupation          0
Home Owner          0
Commute Distance    0
Region              0
Purchased Bike      0
dtype: int64
```

```python=
dataSet.dtypes
```

```
Marital Status      int64
Gender              int64
Income              int64
Children            int64
Education           int64
Occupation          int64
Home Owner          int64
Cars                int64
Commute Distance    int64
Region              int64
Age                 int64
Purchased Bike      int64
dtype: object
```

```python=
def loadData(ds):
    train_text = ds.drop(['Purchased Bike'], axis=1)
    label_text = ds['Purchased Bike']
    return train_text, label_text
```

```python=
# Load dataset
train_text, label_text = loadData(dataSet)
```

```python=
train_text.head()
```

![](https://i.imgur.com/a4oYbRw.png)

```python=
label_text.head()
```

```
0    0
1    0
2    0
3    1
4    1
Name: Purchased Bike, dtype: int64
```

```python=
Train_x, Test_x, Train_y, Test_y = train_test_split(train_text, label_text, test_size=0.33, random_state=42)

clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3)  # Try max_depth = , min_samples_leaf =
clf_gini.fit(Train_x, Train_y)
gini_score = accuracy_score(Test_y, clf_gini.predict(Test_x))

clf_entr = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf_entr.fit(Train_x, Train_y)
entr_score = accuracy_score(Test_y, clf_entr.predict(Test_x))

print('Score of GINI =', gini_score, '\nScore of Entropy =', entr_score)
```

```
Score of GINI = 0.6060606060606061 
Score of Entropy = 0.6060606060606061
```
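The `# Try max_depth = , min_samples_leaf =` comment above invites some tuning. A minimal grid-search sketch follows; it reuses `Train_x`/`Train_y`/`Test_x`/`Test_y` from the split above, and the parameter ranges are illustrative assumptions, not values prescribed by the assignment:

```python=
# Minimal sketch: tune max_depth and min_samples_leaf with a cross-validated grid search
from sklearn.model_selection import GridSearchCV

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 4, 5, 6, None],
    'min_samples_leaf': [1, 5, 10, 20],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(Train_x, Train_y)

print('Best parameters:', grid.best_params_)
print('CV accuracy    :', grid.best_score_)
print('Test accuracy  :', accuracy_score(Test_y, grid.best_estimator_.predict(Test_x)))
```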
```python=
# Decision Tree Classifier

# instantiate
dtc = DecisionTreeClassifier()

# fit
dtc.fit(Train_x, Train_y)

# predict
Pred_y = dtc.predict(Test_x)

# print
print('Accuracy score:{0:.3f}'.format(accuracy_score(Test_y, Pred_y)))
print('Precision score:{0:.3f}'.format(precision_score(Test_y, Pred_y, average='weighted')))
print('Recall score:{0:.3f}'.format(recall_score(Test_y, Pred_y, average='weighted')))
print('f1-score:{0:.3f}'.format(f1_score(Test_y, Pred_y, average='weighted')))
```

```
Accuracy score:0.664
Precision score:0.664
Recall score:0.664
f1-score:0.664
```

```python=
print('Classification Report:\n', classification_report(Test_y, Pred_y))
```

```
Classification Report:
               precision    recall  f1-score   support

           0       0.68      0.65      0.66       167
           1       0.65      0.68      0.67       163

    accuracy                           0.66       330
   macro avg       0.66      0.66      0.66       330
weighted avg       0.66      0.66      0.66       330
```

```python=
import matplotlib.pyplot as plt
from sklearn import tree

# 'ID' was dropped earlier, so it is not among the model's features
fn = ['Marital Status', 'Gender', 'Income', 'Children', 'Education', 'Occupation',
      'Home Owner', 'Cars', 'Commute Distance', 'Region', 'Age']
# Label encoding maps No -> 0 and Yes -> 1, so class 0 is "Not Bought" and class 1 is "Bought"
cn = ['Not Bought', 'Bought']

fig = plt.figure(figsize=(25, 20))
_ = tree.plot_tree(clf_gini, feature_names=fn, class_names=cn, filled=True)
# The plotted tree can be clf_gini, clf_entr, or dtc
```

![](https://i.imgur.com/jAQgKMr.png)

Classification model evaluation
---

```python=
print('Accuracy Score : ' + str(accuracy_score(Test_y, Pred_y)))

# Confusion matrix
cm = confusion_matrix(Test_y, Pred_y)
print(cm)
```

```
Accuracy Score : 0.6636363636363637
[[108  59]
 [ 52 111]]
```

```python=
import seaborn as sns

# In sklearn's confusion_matrix the rows are the actual classes and the columns are
# the predictions, ordered 0 (No) then 1 (Yes)
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual Negative (0)', 'Actual Positive (1)'],
                         columns=['Predicted Negative (0)', 'Predicted Positive (1)'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()
```

![](https://i.imgur.com/4q0lMZ2.png)

```python=
print(classification_report(Test_y, Pred_y))
```

```
              precision    recall  f1-score   support

           0       0.68      0.65      0.66       167
           1       0.65      0.68      0.67       163

    accuracy                           0.66       330
   macro avg       0.66      0.66      0.66       330
weighted avg       0.66      0.66      0.66       330
```
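Step 3 of the flow asks for a comparison between the rules learned by the tree and the rules used to generate the ‘right’ data. One convenient way to read the learned rules is to dump the tree as text; a minimal sketch, assuming the `clf_gini` model and the `fn` feature-name list defined above:

```python=
# Minimal sketch: print the learned decision rules as text so they can be
# compared with the hand-written 'absolutely right' rules from Step 1
from sklearn.tree import export_text

rules = export_text(clf_gini, feature_names=fn)
print(rules)
```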