---
tags: Data Mining, Python
disqus: HackMD
---
Data Mining Homework #2
===
Classification
---
* Goal
    * Understand what a classification system does and how the model learned from observed data can differ from the true rules that generated the data
* Description
    * Construct a classification model and observe the difference between the ‘right’ rules used to generate the data and the rules the model learns
Flow
---
* Step 1: Design a set of rules to classify data, e.g., rules that identify students with good performance.
    * Design k features/attributes for your problem first.
    * Use the ‘absolutely right’ rules to generate your positive and negative data (M records in total).
* Step 2: Use the data generated in Step 1 to construct your classification model.
    * A decision tree is the basic requirement; you may add more classification models.
* Step 3: Compare the rules in the decision tree from Step 2 with the rules you used to generate your ‘right’ data.
* Step 4: Discuss anything you can.
Example
---
* Select a good apple
    * Your “absolute right” rule (R)
        * Color: dark red
        * Knock sound: sharp
        * Head color: green
        * Weight: medium (hidden)
    * Use (R) to generate your data (a minimal sketch follows this list)
    * Add more attributes (20+ is better)
    * Use classifiers to classify your data
        * Decision tree
        * Naïve Bayes
        * ...
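
Below is a minimal sketch of Step 1 using the apple example above. The attribute values, the skewed sampling probabilities, the 20 noise columns, and the output path `data/apples.csv` are all assumptions made for illustration, not part of the assignment data.
```python=
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
M = 1000  # number of records to generate

# Draw the attributes that rule (R) uses; probabilities are skewed so that
# both classes appear reasonably often (values chosen only for illustration)
apples = pd.DataFrame({
    'Color': rng.choice(['dark red', 'light red', 'green'], size=M, p=[0.7, 0.2, 0.1]),
    'Knock': rng.choice(['sharp', 'dull'], size=M, p=[0.7, 0.3]),
    'Head Color': rng.choice(['green', 'brown'], size=M, p=[0.7, 0.3]),
    'Weight': rng.choice(['light', 'medium', 'heavy'], size=M, p=[0.2, 0.6, 0.2]),
})

# Add attributes that are irrelevant to (R), so the classifier has noise to ignore
for i in range(20):
    apples[f'Noise{i}'] = rng.integers(0, 3, size=M)

# Label each record with the 'absolute right' rule (R);
# 'Weight' can later be hidden from the classifier, as in the example
apples['Good'] = ((apples['Color'] == 'dark red') &
                  (apples['Knock'] == 'sharp') &
                  (apples['Head Color'] == 'green') &
                  (apples['Weight'] == 'medium')).astype(int)

apples.to_csv('data/apples.csv', index=False)
```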
```python=
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, average_precision_score, precision_score, recall_score, f1_score, classification_report
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
base_path = ''
file_path = base_path + 'data/bike_buyers_clean.csv'
dataSet = pd.read_csv(file_path, sep=',')
dataSet.head()
```

```python=
dataSet.shape
```
```
(1000, 13)
```
Drop Meaningless Columns
---
```python=
dataSet = dataSet.drop('ID', axis=1)
dataSet.head()
```

```python=
dataSet.shape
```
```
(1000, 12)
```
Analyze Numerical Columns
---
```python=
numerical = [var for var in dataSet.columns if dataSet[var].dtype!='O']
dataSet[numerical].head()
```

Analyze Categorical (Non-numerical) Columns
---
```python=
categorical = [var for var in dataSet.columns if dataSet[var].dtype=='O']
dataSet[categorical].head()
```

Analyze the Frequency of Categorical Variables
---
```python=
for var in categorical:
    print(dataSet[var].value_counts())
    print(dataSet[var].value_counts() / len(dataSet))
    print()
```
```
Married 539
Single 461
Name: Marital Status, dtype: int64
Married 0.539
Single 0.461
Name: Marital Status, dtype: float64
Male 509
Female 491
Name: Gender, dtype: int64
Male 0.509
Female 0.491
Name: Gender, dtype: float64
Bachelors 306
Partial College 265
High School 179
Graduate Degree 174
Partial High School 76
Name: Education, dtype: int64
Bachelors 0.306
Partial College 0.265
High School 0.179
Graduate Degree 0.174
Partial High School 0.076
Name: Education, dtype: float64
Professional 276
Skilled Manual 255
Clerical 177
Management 173
Manual 119
Name: Occupation, dtype: int64
Professional 0.276
Skilled Manual 0.255
Clerical 0.177
Management 0.173
Manual 0.119
Name: Occupation, dtype: float64
Yes 685
No 315
Name: Home Owner, dtype: int64
Yes 0.685
No 0.315
Name: Home Owner, dtype: float64
0-1 Miles 366
5-10 Miles 192
1-2 Miles 169
2-5 Miles 162
10+ Miles 111
Name: Commute Distance, dtype: int64
0-1 Miles 0.366
5-10 Miles 0.192
1-2 Miles 0.169
2-5 Miles 0.162
10+ Miles 0.111
Name: Commute Distance, dtype: float64
North America 508
Europe 300
Pacific 192
Name: Region, dtype: int64
North America 0.508
Europe 0.300
Pacific 0.192
Name: Region, dtype: float64
No 519
Yes 481
Name: Purchased Bike, dtype: int64
No 0.519
Yes 0.481
Name: Purchased Bike, dtype: float64
```
Label Encoding
---
```python=
categorical = [var for var in dataSet.columns if dataSet[var].dtype=='O']
# Encode each categorical column as integer codes (categories are numbered in alphabetical order)
label_encoder = preprocessing.LabelEncoder()
dataSet['Marital Status'] = label_encoder.fit_transform(dataSet['Marital Status'])
dataSet['Gender'] = label_encoder.fit_transform(dataSet['Gender'])
dataSet['Education'] = label_encoder.fit_transform(dataSet['Education'])
dataSet['Occupation'] = label_encoder.fit_transform(dataSet['Occupation'])
dataSet['Home Owner'] = label_encoder.fit_transform(dataSet['Home Owner'])
dataSet['Commute Distance'] = label_encoder.fit_transform(dataSet['Commute Distance'])
dataSet['Region'] = label_encoder.fit_transform(dataSet['Region'])
dataSet['Purchased Bike'] = label_encoder.fit_transform(dataSet['Purchased Bike'])
dataSet.head()
```
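
When comparing the learned tree with the original rules later, it helps to know which category each integer code stands for. A small variation on the cell above (to be run instead of, not after, that cell) keeps one encoder per column; the `encoders` dict is just an illustrative name.
```python=
# Alternative to the cell above: keep one LabelEncoder per column so the
# original category names can be recovered when reading the tree's rules
encoders = {}
for var in categorical:
    encoders[var] = preprocessing.LabelEncoder()
    dataSet[var] = encoders[var].fit_transform(dataSet[var])

# e.g. the categories of 'Education', in the order of their integer codes
print(encoders['Education'].classes_)
```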

Analyze the Distribution of Age
---
```python=
dataSet['Age'].describe()
```
```
count 1000.000000
mean 44.190000
std 11.353537
min 25.000000
25% 35.000000
50% 43.000000
75% 52.000000
max 89.000000
Name: Age, dtype: float64
```
```python=
dataSet['Age'] = pd.cut(x = dataSet['Age'], bins = [0, 30, 40, 50, 60, 100, 150], labels = [0, 1, 2, 3, 4, 5])
dataSet['Age'] = dataSet['Age'].astype('int64')
dataSet['Age'].isnull().sum()
```
```
0
```
Analyze the Distribution of Income
---
```python=
dataSet['Income'].describe()
```
```
count 1000.000000
mean 56140.000000
std 31081.609779
min 10000.000000
25% 30000.000000
50% 60000.000000
75% 70000.000000
max 170000.000000
Name: Income, dtype: float64
```
```python=
dataSet['Income'] = pd.cut(x = dataSet['Income'], bins = [0, 30000, 50000, 75000, 100000, 150000, 200000], labels = [1, 2, 3, 4, 5, 6])
dataSet['Income'] = dataSet['Income'].astype('int64')
dataSet['Income'].isnull().sum()
```
```
0
```
```python=
dataSet[numerical].isnull().sum()
```
```
Income 0
Children 0
Cars 0
Age 0
dtype: int64
```
```python=
dataSet[categorical].isnull().sum()
```
```
Marital Status 0
Gender 0
Education 0
Occupation 0
Home Owner 0
Commute Distance 0
Region 0
Purchased Bike 0
dtype: int64
```
```python=
dataSet.dtypes
```
```
Marital Status int64
Gender int64
Income int64
Children int64
Education int64
Occupation int64
Home Owner int64
Cars int64
Commute Distance int64
Region int64
Age int64
Purchased Bike int64
dtype: object
```
```python=
def loadData(ds):
    # Split the DataFrame into features and the 'Purchased Bike' label
    train_text = ds.drop(['Purchased Bike'], axis=1)
    label_text = ds['Purchased Bike']
    return train_text, label_text
```
```python=
# Load dataset
train_text, label_text = loadData(dataSet)
```
```python=
train_text.head()
```

```python=
label_text.head()
```
```
0 0
1 0
2 0
3 1
4 1
Name: Purchased Bike, dtype: int64
```
```python=
Train_x, Test_x, Train_y, Test_y = train_test_split(train_text, label_text, test_size=0.33, random_state=42)
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3) # Try max_depth = , min_samples_leaf =
clf_gini.fit(Train_x, Train_y)
gini_score = accuracy_score(Test_y,clf_gini.predict(Test_x))
clf_entr = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf_entr.fit(Train_x, Train_y)
entr_score = accuracy_score(Test_y,clf_entr.predict(Test_x))
print('Score of GINI =',gini_score,'\nScore of Entropy =',entr_score)
```
```
Score of GINI = 0.6060606060606061
Score of Entropy = 0.6060606060606061
```
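The comment in the cell above suggests experimenting with `max_depth` and `min_samples_leaf`. A minimal sketch of such a sweep, reusing the split defined above; the candidate depths are arbitrary choices.
```python=
# Compare test accuracy for a few tree depths (candidate values are arbitrary)
for depth in [2, 3, 5, 7, None]:
    clf = DecisionTreeClassifier(criterion='gini', max_depth=depth, random_state=42)
    clf.fit(Train_x, Train_y)
    print('max_depth =', depth, 'accuracy =',
          round(accuracy_score(Test_y, clf.predict(Test_x)), 3))
```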
```python=
# Decision Tree Classifier
# instantiate
dtc = DecisionTreeClassifier()
# fit
dtc.fit(Train_x, Train_y)
# predict
Pred_y = dtc.predict(Test_x)
# print
print('Accuracy score:{0:.3f}'.format(accuracy_score(Test_y, Pred_y)))
print('Precision score:{0:.3f}'.format(precision_score(Test_y, Pred_y, average='weighted')))
print('Recall score:{0:0.3f}'.format(recall_score(Test_y, Pred_y, average='weighted')))
print('f1-score:{0:.3f}'.format(f1_score(Test_y, Pred_y, average='weighted')))
```
```
Accuracy score:0.664
Precision score:0.664
Recall score:0.664
f1-score:0.664
```
```python=
print('Classification Report:\n', classification_report(Test_y, Pred_y))
```
```
Classification Report:
precision recall f1-score support
0 0.68 0.65 0.66 167
1 0.65 0.68 0.67 163
accuracy 0.66 330
macro avg 0.66 0.66 0.66 330
weighted avg 0.66 0.66 0.66 330
```
```python=
import matplotlib.pyplot as plt
from sklearn import tree
fn = ['Marital Status', 'Gender', 'Income', 'Children', 'Education',
      'Occupation', 'Home Owner', 'Cars', 'Commute Distance', 'Region', 'Age']  # 'ID' was dropped earlier
cn = ['Not Bought', 'Bought']  # 'No' was encoded as 0, 'Yes' as 1
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf_gini, feature_names=fn, class_names=cn, filled=True)  # clf_gini, clf_entr, or dtc can be plotted here
```
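
For Step 3 of the flow (comparing the learned rules with the rules used to build the data), the tree can also be dumped as plain text, which is often easier to scan than the plot. A minimal sketch, assuming the `clf_gini` and `fn` defined above:
```python=
from sklearn.tree import export_text

# Print the learned decision rules as indented text
print(export_text(clf_gini, feature_names=fn))
```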

Classification Model Evaluation
---
```python=
print('Accuracy Score : ' + str(accuracy_score(Test_y,Pred_y)))
# Confusion matrix
cm = confusion_matrix(Test_y,Pred_y)
print(cm)
```
```
Accuracy Score : 0.6636363636363637
[[108 59]
[ 52 111]]
```
```python=
import seaborn as sns
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual Negative (No)', 'Actual Positive (Yes)'],
                         columns=['Predicted Negative (No)', 'Predicted Positive (Yes)'])
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.show()
```

```python=
print(classification_report(Test_y, Pred_y))
```
```
precision recall f1-score support
0 0.68 0.65 0.66 167
1 0.65 0.68 0.67 163
accuracy 0.66 330
macro avg 0.66 0.66 0.66 330
weighted avg 0.66 0.66 0.66 330
```
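
The example also lists Naïve Bayes as an additional classifier beyond the decision tree. A minimal sketch on the same train/test split; GaussianNB is used here only because it runs on the integer-coded features without extra preprocessing, treating the codes as continuous values, which is a simplification.
```python=
from sklearn.naive_bayes import GaussianNB

# Naive Bayes on the same split, as an additional classifier
nb = GaussianNB()
nb.fit(Train_x, Train_y)
nb_pred = nb.predict(Test_x)
print('Naive Bayes accuracy: {0:.3f}'.format(accuracy_score(Test_y, nb_pred)))
```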