---
title: 'readme'
disqus: hackmd
---
Big Data and Business Analytics (Group 5)
===
HackMD version (easier to read): https://hackmd.io/@BZX5xmFVQ6-Y4RO6QBIJ_A/SkFmRHSyB
### Requirement
1. Python 3.6
2. Jupyter Notebook
### Required Packages
1. pandas
2. numpy
3. json
4. time, datetime
6. graphviz *installing via brew install graphviz is recommended*
6. pydotplus, Image
7. pickle
8. sklearn
9. statsmodels
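The notebooks assume these packages are imported at the top. A minimal import cell might look like the sketch below (the exact imports differ slightly between notebooks; Image is assumed to come from IPython.display for displaying the exported tree charts inline):
```python
import datetime
import json
import pickle
import time
from datetime import date

import numpy as np
import pandas as pd
import pydotplus
import statsmodels.api as sm
from IPython.display import Image  # assumed source of "Image" for inline chart display
from sklearn.model_selection import train_test_split
```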
## Advertisement Effectiveness Analysis (1)
### File Name
Advertisement analysis 1.ipynb
### Preprocessing
#### Behavior Data
Remove the columns of the behavior data that are not needed:
```python
BD = BD.drop(columns=['ProductId', 'SearchKeyWord','CategoryId','TransactionNum','TransactionRevenue',
'VisitorId', 'ProductQuantity'])
```
Keep only the four advertising channels (Facebook, Instagram, Line, LineShopping) and the three behavior types (ViewSalePage, Cart, Purchase) in the behavior data, and drop rows containing NA values:
```python
BD = BD[BD['TrafficSourceCategory'] != 'Direct']
BD = BD[BD['TrafficSourceCategory'] != 'GoogleOrganic']
BD = BD[BD['TrafficSourceCategory'] != 'Email']
BD = BD[BD['TrafficSourceCategory'] != 'Others']
BD = BD[BD['BehaviorType'] != 'ViewSalePageCategory']
BD = BD[BD['BehaviorType'] != 'Fav']
BD = BD[BD['BehaviorType'] != 'Search']
BD = BD.dropna()
```
Add columns for the year, month, and hour of each behavior:
```python
BD['HitDateTime'] = pd.to_datetime(BD['HitDateTime'], errors = 'coerce')
BD['buy_year'] = pd.DatetimeIndex(BD['HitDateTime']).year
BD['buy_month'] = pd.DatetimeIndex(BD['HitDateTime']).month
BD['buy_hour'] = pd.DatetimeIndex(BD['HitDateTime']).hour
```
#### Member Data
Remove rows containing NA values:
```python
MD = MD.dropna()
```
Compute each member's age and the number of years since registration, and remove rows with implausible ages:
```python
MD['Birthday'] = pd.to_datetime(MD['Birthday'], errors = 'coerce')
MD['birth_year'] = pd.DatetimeIndex(MD['Birthday']).year
MD = MD[MD['birth_year'] > 1900]
MD = MD[MD['birth_year'] < 2019]
MD['age'] = 2019 - MD['birth_year']
MD['RegisterDate'] = pd.to_datetime(MD['RegisterDate'], errors = 'coerce')
MD['year_of_reg'] = 2019 - pd.DatetimeIndex(MD['RegisterDate']).year
```
#### Combined Data
Set OnlineMemberId as the index of both the member data and the behavior data, then join the two tables:
```python
BD = BD.set_index('OnlineMemberId')
MD = MD.set_index('OnlineMemberId')
# combine BD and MD
data = BD.join(MD)
```
Add a Purchase column:
```python
Purchase = []
for behavior in data['BehaviorType']:
    if behavior == 'Purchase':
        Purchase.append(1)
    else:
        Purchase.append(0)
data['Purchase'] = Purchase
del Purchase
```
Because there are far more non-purchase records than purchase records, the non-purchase records are randomly downsampled to roughly a 3:1 ratio to avoid biasing the model:
```python
not_buy = data[data['Purchase'] == 0]
not_buy_sample = not_buy.sample(frac=0.06)
data = data[data['Purchase'] == 1]
data = pd.concat([data, not_buy_sample], axis=0)
del not_buy, not_buy_sample
```
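A quick sanity check (not part of the original notebook) to confirm the resulting class ratio is roughly 3:1:
```python
# non-purchase rows (Purchase = 0) should be roughly three times the purchase rows (Purchase = 1)
print(data['Purchase'].value_counts())
print(data['Purchase'].value_counts(normalize=True))
```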
#### Create Dummy variables
Convert the categorical variables into dummy variables and add the corresponding columns:
```python
final_train = pd.get_dummies(data, columns=['TrafficSourceCategory','RegisterSourceTypeDef',
'Gender', 'SourceType','MemberCardLevel'])
```
#### Split the Dataset
```python
X = final_train.loc[:, final_train.columns != 'Purchase']
y = final_train.loc[:, final_train.columns == 'Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
### Model
Logistic Regression
* cols is a list containing the names of all variables to include in the model; each name must exactly match a column name in final_train, and variables may be added or removed as desired.
```python
cols = X_train.columns
X = X_train[cols]
y = y_train['Purchase']
```
### Output
Fit the regression and print summary report:
```python
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
```
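As an optional follow-up (not in the original notebook), the fitted coefficients can be converted to odds ratios, which are often easier to interpret:
```python
import numpy as np

# exp(coefficient) is the multiplicative change in the odds of Purchase
# for a one-unit increase in the corresponding variable
odds_ratios = np.exp(result.params)
print(odds_ratios.sort_values(ascending=False))
```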
## Advertisement Effectiveness Analysis (2)
### File Name
Advertisement analysis 2.ipynb
### Preprocessing
#### A. Remove non-member records
```python
non_member=[]
for i in range(len(fb_view)):
    online_id=fb_view.iloc[i]['OnlineMemberId']
    if str(online_id)=='nan':
        non_member.append(i)
fb_member_view=fb_view.drop(fb_view.index[non_member])
```
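An equivalent vectorized alternative, assuming non-members are exactly the rows whose OnlineMemberId is NaN:
```python
# keep only rows that have an OnlineMemberId, i.e. member sessions
fb_member_view = fb_view.dropna(subset=['OnlineMemberId'])
```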
#### B. Build a dictionary mapping OnlineMemberId to UUID
```python
online2uuid=dict()
for i in range(len(Member)):
    onlineID=Member.iloc[i]['OnlineMemberId']
    if onlineID not in online2uuid.keys():
        online2uuid[onlineID]=Member.iloc[i]['UUID']
```
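The same mapping can also be built without an explicit loop; the sketch below keeps the first UUID seen for each OnlineMemberId, matching the check in the loop above:
```python
# drop_duplicates(keep='first') mirrors the "if onlineID not in online2uuid" condition
online2uuid = (
    Member.drop_duplicates(subset='OnlineMemberId', keep='first')
          .set_index('OnlineMemberId')['UUID']
          .to_dict()
)
```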
Compute each customer's age:
```python
from datetime import date

def calculate_age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
```
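A usage sketch, assuming the member table has a Birthday column that can be parsed into dates (the Age column name is an assumption):
```python
import pandas as pd

# parse birthdays; invalid dates become NaT and are skipped when computing the age
Member['Birthday'] = pd.to_datetime(Member['Birthday'], errors='coerce')
Member['Age'] = Member['Birthday'].apply(
    lambda b: calculate_age(b) if pd.notnull(b) else None
)
```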
#### C. Record the month in which each transaction occurred
```python
month_column=[]
for i in range(len(Order)):
    # convert the date from a string into a datetime object
    strdate=Order.iloc[i]['TradesDate']
    datee = datetime.datetime.strptime(strdate, "%Y/%m/%d")
    month=datee.month
    month_column.append(month)
```
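An equivalent vectorized version, assuming month_column is ultimately assigned back to Order as a month column:
```python
import pandas as pd

# parse TradesDate once and extract the month directly
Order['month'] = pd.to_datetime(Order['TradesDate'], format='%Y/%m/%d').dt.month
```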
#### D. Add a "Buy" column based on the order data: if the customer makes a purchase within two weeks after this view, Buy is set to 1; otherwise it is set to 0
```python
# For each view, check whether the customer later made a purchase.
# second_sample is a random sample from the behavioral data of sessions that were referred
# from FB and then browsed on the Web/APP. second_sample is sorted by UUID, so when checking
# whether a person purchased, we only need to scan that UUID's records, which shortens the search.
buy=[]
for i in range(len(second_sample)):
    print('number {}'.format(i))
    onlineID=second_sample.iloc[i]['OnlineMemberId']
    if onlineID not in online2uuid.keys():
        buy.append(0)
    else:
        uuid=online2uuid[onlineID]  # convert OnlineMemberId to UUID
        find_person=False
        view_time=second_sample.iloc[i]['datetime']
        for j in range(len(new_order_noreturn)):
            if new_order_noreturn.iloc[j]['UUID']==uuid:
                find_person=True
                buy_time=new_order_noreturn.iloc[j]['datetime']
                dis=(buy_time-view_time).days
                if dis>=0 and dis<=14:
                    # a purchase was made within two weeks of the view
                    buy.append(1)
                    break
                elif j==len(new_order_noreturn)-1:
                    buy.append(0)
                    break
            elif find_person==True:
                # all records under this UUID have been checked, but none occurred within 14 days after the view
                buy.append(0)
                break
            elif j==len(new_order_noreturn)-1 and find_person==False:
                # this person has no purchase record at all, so find_person is still False
                buy.append(0)
                break
```
### Split Train and Test Data
```python
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second, test_size=0.2, random_state=42)
```
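X_second and y_second are built from second_sample in the notebook; a rough sketch of their shape, assuming the feature columns listed in Part A below and the Buy label from step D:
```python
feature_cols = ['Female', 'Age', 'APP', 'SessionNumber', 'ProductPrice',
                'month', 'frequency', 'Qty', 'average']  # assumed feature set (see Part A)
X_second = second_sample[feature_cols]
y_second = second_sample[['Buy']]
```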
### Model and Output
Part A. Logistic Regression: examine how different explanatory variables influence whether a purchase occurs
* cols is a list containing the names of the variables to include in the model; each name must exactly match a column name in second_sample, and variables may be added or removed as desired.
```python
cols=['Female','Age','APP','SessionNumber','ProductPrice','month','frequency','Qty','average']
X = X_train[cols]
y = y_train['Buy']
```
Fit the regression and print summary report:
```python
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
```
Part B. Naive Bayes, SVM, Random Forest, and KNN: predict, from the chosen X variables, whether a purchase occurs within two weeks of viewing a product
```python
print("===== NB =====")
gnb = GaussianNB()
gnb.fit(X_train, y_train)
result2 = gnb.predict(X_test)
print("===== SVM =====")
my_svm = SVC()
my_svm.fit(X_train, y_train)
result1 = my_svm.predict(X_test)
print("===== RF =====")
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
result4 = clf.predict(X_test)
print("===== KNN =====")
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
result3 = neigh.predict(X_test)
```
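The predictions above can then be scored; a minimal evaluation sketch (the choice of metrics here is an assumption, not necessarily what the notebook reports):
```python
from sklearn.metrics import accuracy_score, f1_score

for model_name, pred in [('NB', result2), ('SVM', result1), ('RF', result4), ('KNN', result3)]:
    print(model_name,
          'accuracy = {:.4f}'.format(accuracy_score(y_test, pred)),
          'f1 = {:.4f}'.format(f1_score(y_test, pred)))
```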
## Consumer Behavior Analysis
### File Name
customer_behavior_analysis.ipynb
### Setting
After confirming the settings below, you can simply "Restart and Run All":
* Mind the file paths when loading the input files
* When calling draw_tree() to export chart images, make sure the output folder exists (see the sketch after this list)
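For the second point, a small cell like the following (the folder name matches the path used in the training loop below) creates the output folder if it is missing:
```python
import os

# draw_tree() writes its png files into this folder
os.makedirs('./tree chart', exist_ok=True)
```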
### Model
* Models used
To add another model, it must be added to both names and classifiers:
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

names = [
"Naive Bayes",
"Random Forest",
"Decision Tree",
"AdaBoost"
]
classifiers = [
GaussianNB(),
RandomForestClassifier(max_depth=5, n_estimators=16, max_features=3, random_state=5),
DecisionTreeClassifier(max_depth=5, random_state=5),
AdaBoostClassifier(),
]
```
* Training and Testing
```python
from sklearn.metrics import precision_recall_fscore_support

for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    evals = precision_recall_fscore_support(y_test, y_predict, average='weighted')
    precision, recall, f1score = "{:.4f}".format(evals[0]), "{:.4f}".format(evals[1]), "{:.4f}".format(evals[2])
    if name == "Random Forest":
        # n is assumed to be a run counter defined elsewhere in the notebook
        draw_tree(clf, "./tree chart/rf({}).png".format(n))
```
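The Output section below mentions evaluations.csv; one way the per-model scores might be collected and written (the exact column names are assumptions):
```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

records = []
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    evals = precision_recall_fscore_support(y_test, clf.predict(X_test), average='weighted')
    records.append({'model': name, 'precision': evals[0], 'recall': evals[1], 'f1score': evals[2]})

# record all results, matching the evaluations.csv listed under Output
pd.DataFrame(records).to_csv('evaluations.csv', index=False)
```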
* Plot the Tree
```python
import pydotplus
from sklearn import tree

def draw_tree(clf, output_name):
    # export the first tree of the ensemble as Graphviz dot data
    # (limit_vocab is the list of feature names defined elsewhere in the notebook)
    dot_data = tree.export_graphviz(
        clf.estimators_[0], out_file=None,
        feature_names=limit_vocab, class_names={0: "Not Buy", 1: "Buy"},
        filled=True, rounded=True, special_characters=True
    )
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png(output_name)
    print(output_name, "saved!")
```
### Output
* png files (the tree chart of the first estimator of the Random Forest)
* evaluations.csv (records all evaluation results)