大數據行銷作業三 b05504066 李旻翰

大數據行銷作業三 b05504066 李旻翰 === ```python ``` ## HW3 b05504066 李旻翰 ```python import pandas as pd import numpy as np %config Completer.use_jedi = False %config IPCompleter.greedy=True pd.set_option('display.max_rows', 200) df = pd.read_excel("Hw3.xlsx",sheet_name='step1') # df.dropna(axis='columns') # np.where(df['Customer_ID'].notnull())[0].shape df ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Customer_ID</th> <th>Date</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>92</td> <td>20060101</td> </tr> <tr> <th>1</th> <td>198</td> <td>20060102</td> </tr> <tr> <th>2</th> <td>338</td> <td>20060104</td> </tr> <tr> <th>3</th> <td>338</td> <td>20060104</td> </tr> <tr> <th>4</th> <td>338</td> <td>20060104</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> </tr> <tr> <th>3072</th> <td>8202</td> <td>20071230</td> </tr> <tr> <th>3073</th> <td>8202</td> <td>20071230</td> </tr> <tr> <th>3074</th> <td>6828</td> <td>20071231</td> </tr> <tr> <th>3075</th> <td>7854</td> <td>20071231</td> </tr> <tr> <th>3076</th> <td>7854</td> <td>20071231</td> </tr> </tbody> </table> <p>3077 rows × 2 columns</p> </div> ## Step 1 Remove the duplicated records and change the format of date ```python # remove the duplicates record df['Customer_ID'] = pd.Series([int(i) for i in df['Customer_ID']]) df['Date'] = pd.to_datetime(df['Date'],format='%Y%m%d').dt.date df1 = df.drop_duplicates(ignore_index=True) df1 ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Customer_ID</th> <th>Date</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>92</td> <td>2006-01-01</td> </tr> <tr> <th>1</th> <td>198</td> <td>2006-01-02</td> </tr> <tr> <th>2</th> <td>338</td> <td>2006-01-04</td> </tr> <tr> <th>3</th> <td>527</td> <td>2006-01-05</td> </tr> <tr> <th>4</th> <td>62</td> <td>2006-01-05</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> </tr> <tr> <th>1810</th> <td>7854</td> <td>2007-12-30</td> </tr> <tr> <th>1811</th> <td>7923</td> <td>2007-12-30</td> </tr> <tr> <th>1812</th> <td>8202</td> <td>2007-12-30</td> </tr> <tr> <th>1813</th> <td>6828</td> <td>2007-12-31</td> </tr> <tr> <th>1814</th> <td>7854</td> <td>2007-12-31</td> </tr> </tbody> </table> <p>1815 rows × 2 columns</p> </div> ## Step 2 sort, remove the low frequent customer (<3) ```python # first sort the value by id and date df2= df1.sort_values(by=['Customer_ID','Date'],ignore_index=True) # remove the low frequncy id (can't do CRI with appearance < 3 ) valuecount=df2['Customer_ID'].value_counts() to_remove = valuecount[valuecount<3].index # drop the index (reset_index to update the index ) df2.drop(df2.query('Customer_ID in @to_remove').index,inplace=True) df2.reset_index(drop=True,inplace=True) # calculate the interval between date, if different id, append 9999 a=[] for i in range(0,len(df2)-1): if df2.iloc[i,0] == df2.iloc[i+1,0]: a.append((df2.iloc[i+1,1]-df2.iloc[i,1]).days) else: a.append(9999) a.append(9999) #remove 9999 df2['int'] = pd.Series(a) df2.drop(df2.query('int==9999').index,inplace=True) df2.reset_index(drop=True,inplace=True) # df2 df2 ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Customer_ID</th> <th>Date</th> <th>int</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>62</td> <td>2006-01-05</td> <td>50</td> </tr> <tr> <th>1</th> <td>62</td> <td>2006-02-24</td> <td>40</td> </tr> <tr> <th>2</th> <td>62</td> <td>2006-04-05</td> <td>234</td> </tr> <tr> <th>3</th> <td>62</td> <td>2006-11-25</td> <td>15</td> </tr> <tr> <th>4</th> <td>62</td> <td>2006-12-10</td> <td>27</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>1588</th> <td>7854</td> <td>2007-12-18</td> <td>1</td> </tr> <tr> <th>1589</th> <td>7854</td> <td>2007-12-19</td> <td>4</td> </tr> <tr> <th>1590</th> <td>7854</td> <td>2007-12-23</td> <td>5</td> </tr> <tr> <th>1591</th> <td>7854</td> <td>2007-12-28</td> <td>2</td> </tr> <tr> <th>1592</th> <td>7854</td> <td>2007-12-30</td> <td>1</td> </tr> </tbody> </table> <p>1593 rows × 3 columns</p> </div> ## Step 3 Calculate the weight and prepared to calculate MLE ```python # need to copy, otherwise it would affect df2 df3 = df2.copy() # add weight, the closest date has the highest weight id = df3.iloc[0,0] count = 0 weight=[] id = df3.iloc[count,0] i=1 while(count<len(df3)): if df3.iloc[count,0] ==id: weight.append(i) i+=1 count+=1 else: id = df3.iloc[count,0] i=1 # multiply the date interval and weight to get the weighted data df3['weight']= pd.Series(weight) df3['intXweight'] = pd.Series([df3.iloc[i]['int']*df3.iloc[i]['weight']for i in range(len(df3))]) df3 ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>Customer_ID</th> <th>Date</th> <th>int</th> <th>weight</th> <th>intXweight</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>62</td> <td>2006-01-05</td> <td>50</td> <td>1</td> <td>50</td> </tr> <tr> <th>1</th> <td>62</td> <td>2006-02-24</td> <td>40</td> <td>2</td> <td>80</td> </tr> <tr> <th>2</th> <td>62</td> <td>2006-04-05</td> <td>234</td> <td>3</td> <td>702</td> </tr> <tr> <th>3</th> <td>62</td> <td>2006-11-25</td> <td>15</td> <td>4</td> <td>60</td> </tr> <tr> <th>4</th> <td>62</td> <td>2006-12-10</td> <td>27</td> <td>5</td> <td>135</td> </tr> <tr> <th>...</th> <td>...</td> <td>...</td> <td>...</td> <td>...</td> <td>...</td> </tr> <tr> <th>1588</th> <td>7854</td> <td>2007-12-18</td> <td>1</td> <td>26</td> <td>26</td> </tr> <tr> <th>1589</th> <td>7854</td> <td>2007-12-19</td> <td>4</td> <td>27</td> <td>108</td> </tr> <tr> <th>1590</th> <td>7854</td> <td>2007-12-23</td> <td>5</td> <td>28</td> <td>140</td> </tr> <tr> <th>1591</th> <td>7854</td> <td>2007-12-28</td> <td>2</td> <td>29</td> <td>58</td> </tr> <tr> <th>1592</th> <td>7854</td> <td>2007-12-30</td> <td>1</td> <td>30</td> <td>30</td> </tr> </tbody> </table> <p>1593 rows × 5 columns</p> </div> ## Step 4 store it to excel and do the PivotTable analysis in excel ```python with pd.ExcelWriter('Hw3_new.xlsx') as writer: df1.to_excel(writer,sheet_name='step1',index=False) df2.to_excel(writer,sheet_name='step2',index=False) df3.to_excel(writer,sheet_name='step3',index=False) ``` ##Step 5 After finish calculating CAI, calculate 漸趨活躍群，穩定消費群，漸趨靜止群， these three groups'average buying interval and amounts. Read from HW3_final.xlsx ```python df5 = pd.read_excel("HW3_final.xlsx",sheet_name='Relative cumulative frequency').iloc[0:-2,0:2] # 20% = 38 80% =112 from diagram in sheet # get the id list of each group lowFre = df5.iloc[:39]['Customer_ID'] midFre = df5.iloc[39:113]['Customer_ID'] highFre = df5.iloc[113:]['Customer_ID'] #calculate the average time interval lowInterval = df3.iloc[df3.query('Customer_ID in @lowFre').index]['int'].mean() midInterval = df3.iloc[df3.query('Customer_ID in @midFre').index]['int'].mean() highInterval = df3.iloc[df3.query('Customer_ID in @highFre').index]['int'].mean() #calculate the average monetary df6 = pd.read_csv(r'..\real_time_reports.csv') lowMonetary = df6.iloc[df6.query('Customer_ID in @lowFre').index]['Amount'].mean() midMonetary = df6.iloc[df6.query('Customer_ID in @midFre').index]['Amount'].mean() highMonetary = df6.iloc[df6.query('Customer_ID in @highFre').index]['Amount'].mean() # make the table df_cai = pd.DataFrame({'CAI群別':['漸趨活卻群','穩定消費群','漸趨靜止群'], '客戶人數':[38,75,37], '比例':['25%','50%','25%'], '消費日平均消費金額':[highMonetary,midMonetary,lowMonetary], '平均消費間隔天數':[highInterval,midInterval,lowInterval]}) df_cai ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>CAI群別</th> <th>客戶人數</th> <th>比例</th> <th>消費日平均消費金額</th> <th>平均消費間隔天數</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>漸趨活卻群</td> <td>38</td> <td>25%</td> <td>1859.506294</td> <td>50.488127</td> </tr> <tr> <th>1</th> <td>穩定消費群</td> <td>75</td> <td>50%</td> <td>1460.404591</td> <td>49.285714</td> </tr> <tr> <th>2</th> <td>漸趨靜止群</td> <td>37</td> <td>25%</td> <td>1673.435443</td> <td>41.627002</td> </tr> </tbody> </table> </div> ###### tags: `大數據行銷`