---
title: 'readme'
disqus: hackmd
---
Big Data and Business Analytics (Group 5)
===
HackMD version (easier to read): https://hackmd.io/@BZX5xmFVQ6-Y4RO6QBIJ_A/SkFmRHSyB
### Requirement
1. Python 3.6
2. Jupyter Notebook
### Required Packages
1. pandas
2. numpy
3. json
4. time, datetime
6. graphviz *installing via brew install graphviz is recommended*
6. pydotplus, Image
7. pickle
8. sklearn
9. statsmodels
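The notebooks assume these packages are imported at the top. A minimal import cell might look like the sketch below (the exact imports differ slightly between notebooks; Image is assumed to come from IPython.display for displaying the exported tree charts inline):
```python
import datetime
import json
import pickle
import time
from datetime import date

import numpy as np
import pandas as pd
import pydotplus
import statsmodels.api as sm
from IPython.display import Image  # assumed source of "Image" for inline chart display
from sklearn.model_selection import train_test_split
```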
## Advertisement Effectiveness Analysis (1)
### File Name
Advertisement analysis 1.ipynb
### Preprocessing
#### Behavior Data
Remove the columns of the behavior data that are not needed:
```python
BD = BD.drop(columns=['ProductId', 'SearchKeyWord','CategoryId','TransactionNum','TransactionRevenue',
'VisitorId', 'ProductQuantity'])
```
Keep only the four advertising channels (Facebook, Instagram, Line, LineShopping) and the three behavior types (ViewSalePage, Cart, Purchase) in the behavior data, and drop rows containing NA values:
```python
BD = BD[BD['TrafficSourceCategory'] != 'Direct']
BD = BD[BD['TrafficSourceCategory'] != 'GoogleOrganic']
BD = BD[BD['TrafficSourceCategory'] != 'Email']
BD = BD[BD['TrafficSourceCategory'] != 'Others']
BD = BD[BD['BehaviorType'] != 'ViewSalePageCategory']
BD = BD[BD['BehaviorType'] != 'Fav']
BD = BD[BD['BehaviorType'] != 'Search']
BD = BD.dropna()
```
Add columns for the year, month, and hour of each behavior:
```python
BD['HitDateTime'] = pd.to_datetime(BD['HitDateTime'], errors = 'coerce')
BD['buy_year'] = pd.DatetimeIndex(BD['HitDateTime']).year
BD['buy_month'] = pd.DatetimeIndex(BD['HitDateTime']).month
BD['buy_hour'] = pd.DatetimeIndex(BD['HitDateTime']).hour
```
#### Member Data
Remove rows containing NA values:
```python
MD = MD.dropna()
```
Compute each member's age and the number of years since registration, and remove rows with implausible ages:
```python
MD['Birthday'] = pd.to_datetime(MD['Birthday'], errors = 'coerce')
MD['birth_year'] = pd.DatetimeIndex(MD['Birthday']).year
MD = MD[MD['birth_year'] > 1900]
MD = MD[MD['birth_year'] < 2019]
MD['age'] = 2019 - MD['birth_year']
MD['RegisterDate'] = pd.to_datetime(MD['RegisterDate'], errors = 'coerce')
MD['year_of_reg'] = 2019 - pd.DatetimeIndex(MD['RegisterDate']).year
```
#### Combined Data
Set OnlineMemberId as the index of both the member data and the behavior data, then join the two tables:
```python
BD = BD.set_index('OnlineMemberId')
MD = MD.set_index('OnlineMemberId')
# combine BD and MD
data = BD.join(MD)
```
Add a Purchase column:
```python
Purchase = []
for behavior in data['BehaviorType']:
    if behavior == 'Purchase':
        Purchase.append(1)
    else:
        Purchase.append(0)
data['Purchase'] = Purchase
del Purchase
```
Because there are far more non-purchase records than purchase records, the non-purchase records are randomly downsampled to roughly a 3:1 ratio to avoid biasing the model:
```python
not_buy = data[data['Purchase'] == 0]
not_buy_sample = not_buy.sample(frac=0.06)
data = data[data['Purchase'] == 1]
data = pd.concat([data, not_buy_sample], axis=0)
del not_buy, not_buy_sample
```
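A quick sanity check (not part of the original notebook) to confirm the resulting class ratio is roughly 3:1:
```python
# non-purchase rows (Purchase = 0) should be roughly three times the purchase rows (Purchase = 1)
print(data['Purchase'].value_counts())
print(data['Purchase'].value_counts(normalize=True))
```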
#### Create Dummy variables
Convert the categorical variables into dummy variables and add the corresponding columns:
```python
final_train = pd.get_dummies(data, columns=['TrafficSourceCategory','RegisterSourceTypeDef',
'Gender', 'SourceType','MemberCardLevel'])
```
#### Split the Dataset
```python
X = final_train.loc[:, final_train.columns != 'Purchase']
y = final_train.loc[:, final_train.columns == 'Purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
### Model
Logistic Regression
* cols is a list containing the names of all variables to include in the model; each name must exactly match a column name in final_train, and variables may be added or removed as desired.
```python
cols = X_train.columns
X = X_train[cols]
y = y_train['Purchase']
```
### Output
Fit the regression and print summary report:
```python
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
```
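As an optional follow-up (not in the original notebook), the fitted coefficients can be converted to odds ratios, which are often easier to interpret:
```python
import numpy as np

# exp(coefficient) is the multiplicative change in the odds of Purchase
# for a one-unit increase in the corresponding variable
odds_ratios = np.exp(result.params)
print(odds_ratios.sort_values(ascending=False))
```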
## Advertisement Effectiveness Analysis (2)
### File Name
Advertisement analysis 2.ipynb
### Preprocessing
#### A. Remove non-member records
```python
non_member=[]
for i in range(len(fb_view)):
    online_id=fb_view.iloc[i]['OnlineMemberId']
    if str(online_id)=='nan':
        non_member.append(i)
fb_member_view=fb_view.drop(fb_view.index[non_member])
```
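An equivalent vectorized alternative, assuming non-members are exactly the rows whose OnlineMemberId is NaN:
```python
# keep only rows that have an OnlineMemberId, i.e. member sessions
fb_member_view = fb_view.dropna(subset=['OnlineMemberId'])
```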
#### B. Build a dictionary mapping OnlineMemberId to UUID
```python
online2uuid=dict()
for i in range(len(Member)):
    onlineID=Member.iloc[i]['OnlineMemberId']
    if onlineID not in online2uuid.keys():
        online2uuid[onlineID]=Member.iloc[i]['UUID']
```
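The same mapping can also be built without an explicit loop; the sketch below keeps the first UUID seen for each OnlineMemberId, matching the check in the loop above:
```python
# drop_duplicates(keep='first') mirrors the "if onlineID not in online2uuid" condition
online2uuid = (
    Member.drop_duplicates(subset='OnlineMemberId', keep='first')
          .set_index('OnlineMemberId')['UUID']
          .to_dict()
)
```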
Compute each customer's age:
```python
from datetime import date

def calculate_age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
```
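A usage sketch, assuming the member table has a Birthday column that can be parsed into dates (the Age column name is an assumption):
```python
import pandas as pd

# parse birthdays; invalid dates become NaT and are skipped when computing the age
Member['Birthday'] = pd.to_datetime(Member['Birthday'], errors='coerce')
Member['Age'] = Member['Birthday'].apply(
    lambda b: calculate_age(b) if pd.notnull(b) else None
)
```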
#### C. Record the month in which each transaction occurred
```python
month_column=[]
for i in range(len(Order)):
    # convert the date from a string into a datetime object
    strdate=Order.iloc[i]['TradesDate']
    datee = datetime.datetime.strptime(strdate, "%Y/%m/%d")
    month=datee.month
    month_column.append(month)
```
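An equivalent vectorized version, assuming month_column is ultimately assigned back to Order as a month column:
```python
import pandas as pd

# parse TradesDate once and extract the month directly
Order['month'] = pd.to_datetime(Order['TradesDate'], format='%Y/%m/%d').dt.month
```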
#### D. Add a "Buy" column based on the order data: if the customer makes a purchase within two weeks after this view, Buy is set to 1; otherwise it is set to 0
```python
# For each view, check whether the customer later made a purchase.
# second_sample is a random sample from the behavioral data of sessions that were referred
# from FB and then browsed on the Web/APP. second_sample is sorted by UUID, so when checking
# whether a person purchased, we only need to scan that UUID's records, which shortens the search.
buy=[]
for i in range(len(second_sample)):
    print('number {}'.format(i))
    onlineID=second_sample.iloc[i]['OnlineMemberId']
    if onlineID not in online2uuid.keys():
        buy.append(0)
    else:
        uuid=online2uuid[onlineID]  # convert OnlineMemberId to UUID
        find_person=False
        view_time=second_sample.iloc[i]['datetime']
        for j in range(len(new_order_noreturn)):
            if new_order_noreturn.iloc[j]['UUID']==uuid:
                find_person=True
                buy_time=new_order_noreturn.iloc[j]['datetime']
                dis=(buy_time-view_time).days
                if dis>=0 and dis<=14:
                    # a purchase was made within two weeks of the view
                    buy.append(1)
                    break
                elif j==len(new_order_noreturn)-1:
                    buy.append(0)
                    break
            elif find_person==True:
                # all records under this UUID have been checked, but none occurred within 14 days after the view
                buy.append(0)
                break
            elif j==len(new_order_noreturn)-1 and find_person==False:
                # this person has no purchase record at all, so find_person is still False
                buy.append(0)
                break
```
### Split Train and Test Data
```python
X_train, X_test, y_train, y_test = train_test_split(X_second, y_second, test_size=0.2, random_state=42)
```
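X_second and y_second are built from second_sample in the notebook; a rough sketch of their shape, assuming the feature columns listed in Part A below and the Buy label from step D:
```python
feature_cols = ['Female', 'Age', 'APP', 'SessionNumber', 'ProductPrice',
                'month', 'frequency', 'Qty', 'average']  # assumed feature set (see Part A)
X_second = second_sample[feature_cols]
y_second = second_sample[['Buy']]
```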
### Model and Output
Part A. Logistic Regression: examine how different explanatory variables influence whether a purchase occurs
* cols is a list containing the names of the variables to include in the model; each name must exactly match a column name in second_sample, and variables may be added or removed as desired.
```python
cols=['Female','Age','APP','SessionNumber','ProductPrice','month','frequency','Qty','average']
X = X_train[cols]
y = y_train['Buy']
```
Fit the regression and print summary report:
```python
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())
```
Part B. Naive Bayes, SVM, Random Forest, and KNN: predict, from the chosen X variables, whether a purchase occurs within two weeks of viewing a product
```python
print("===== NB =====")
gnb = GaussianNB()
gnb.fit(X_train, y_train)
result2 = gnb.predict(X_test)
print("===== SVM =====")
my_svm = SVC()
my_svm.fit(X_train, y_train)
result1 = my_svm.predict(X_test)
print("===== RF =====")
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
result4 = clf.predict(X_test)
print("===== KNN =====")
neigh = KNeighborsClassifier()
neigh.fit(X_train, y_train)
result3 = neigh.predict(X_test)
```
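The predictions above can then be scored; a minimal evaluation sketch (the choice of metrics here is an assumption, not necessarily what the notebook reports):
```python
from sklearn.metrics import accuracy_score, f1_score

for model_name, pred in [('NB', result2), ('SVM', result1), ('RF', result4), ('KNN', result3)]:
    print(model_name,
          'accuracy = {:.4f}'.format(accuracy_score(y_test, pred)),
          'f1 = {:.4f}'.format(f1_score(y_test, pred)))
```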
## Consumer Behavior Analysis
### File Name
customer_behavior_analysis.ipynb
### Setting
After confirming the settings below, you can simply "Restart and Run All":
* Mind the file paths when loading the input files
* When calling draw_tree() to export chart images, make sure the output folder exists (see the sketch after this list)
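For the second point, a small cell like the following (the folder name matches the path used in the training loop below) creates the output folder if it is missing:
```python
import os

# draw_tree() writes its png files into this folder
os.makedirs('./tree chart', exist_ok=True)
```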
### Model
* Models used
To add another model, it must be added to both names and classifiers:
```python
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

names = [
"Naive Bayes",
"Random Forest",
"Decision Tree",
"AdaBoost"
]
classifiers = [
GaussianNB(),
RandomForestClassifier(max_depth=5, n_estimators=16, max_features=3, random_state=5),
DecisionTreeClassifier(max_depth=5, random_state=5),
AdaBoostClassifier(),
]
```
* Training and Testing
```python
from sklearn.metrics import precision_recall_fscore_support

for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    y_predict = clf.predict(X_test)
    evals = precision_recall_fscore_support(y_test, y_predict, average='weighted')
    precision, recall, f1score = "{:.4f}".format(evals[0]), "{:.4f}".format(evals[1]), "{:.4f}".format(evals[2])
    if name == "Random Forest":
        # n is assumed to be a run counter defined elsewhere in the notebook
        draw_tree(clf, "./tree chart/rf({}).png".format(n))
```
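The Output section below mentions evaluations.csv; one way the per-model scores might be collected and written (the exact column names are assumptions):
```python
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

records = []
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    evals = precision_recall_fscore_support(y_test, clf.predict(X_test), average='weighted')
    records.append({'model': name, 'precision': evals[0], 'recall': evals[1], 'f1score': evals[2]})

# record all results, matching the evaluations.csv listed under Output
pd.DataFrame(records).to_csv('evaluations.csv', index=False)
```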
* Plot the Tree
```python
import pydotplus
from sklearn import tree

def draw_tree(clf, output_name):
    # export the first tree of the ensemble as Graphviz dot data
    # (limit_vocab is the list of feature names defined elsewhere in the notebook)
    dot_data = tree.export_graphviz(
        clf.estimators_[0], out_file=None,
        feature_names=limit_vocab, class_names={0: "Not Buy", 1: "Buy"},
        filled=True, rounded=True, special_characters=True
    )
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_png(output_name)
    print(output_name, "saved!")
```
### Output
* png files (the tree chart of the first estimator of the Random Forest)
* evaluations.csv (records all evaluation results)