[TOC]

# Recommender System Final Project

## Module Imports

Tool: Jupyter Notebook

```python
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline

# data splitting
from sklearn.model_selection import train_test_split
# model evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
# automatic standardization
from sklearn.preprocessing import StandardScaler
# SMOTE over-sampling
from imblearn.over_sampling import SMOTE
from sklearn import svm
```

## Data Description

### Data Loading

```python=
# u.data: rating data
rating_header = ["user_id", "item_id", "rating", "timestamp"]
rating = pd.read_csv("u.data.data", sep='\t', header=None, names=rating_header)
print(rating.head())
print('\n')
rating.info()
print("__________________________________________________________________________\n")

# u.item: movie data
movie_header = ["item_id", "title", "release_date", "video_release_date", "IMDb_URL", "unknown",
                "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary",
                "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western"]
movies = pd.read_csv('u.item.item', sep='|', header=None, encoding='latin1', names=movie_header)
print(movies.head())
print('\n')
movies.info()
print("__________________________________________________________________________\n")

# u.user: user data
user_header = ["user_id", "age", "gender", "occupation", "zip_code"]
user = pd.read_csv('u.user.user', sep='|', header=None, encoding='latin1', names=user_header)
print(user.head())
print('\n')
user.info()
print("__________________________________________________________________________\n")
```

### Data Overview

#### Age

```python
user.hist('age')
```

![](https://i.imgur.com/P2RcFII.png)

#### Occupation

Students make up the largest group.

```python
occupation_count = user[["user_id", "occupation"]].groupby("occupation", as_index=False).size()
plt.pie(occupation_count["size"], labels=occupation_count["occupation"])
plt.title("Occupation Distribution")
plt.axis("equal")
plt.show()
```

![](https://i.imgur.com/awgQ0bJ.png)

#### Gender

There are more male users than female users.

```python
# users_num (gender mapped to 0/1) is built in the preprocessing section below
users_num.hist('gender')
plt.title('Gender')
```

![](https://i.imgur.com/csrPD2R.png)

#### Correlation Analysis

##### Pearson

Preprocess the merged data first, then pick out the features whose absolute correlation with rating is above the median correlation.

```python
# merge the tables
# not sure yet whether to use the cleaned ratings, so the raw ratings are used for now
all_data = pd.merge(pd.merge(user, rating), movies)

# drop string and unused columns
all_data = all_data.drop(['title', 'video_release_date', 'release_date', 'IMDb_URL',
                          'unknown', 'timestamp', 'zip_code'], axis=1)

# occupations
occupation = pd.read_csv("u.occupation.occupation", header=None)
# list of occupation categories, flattened to 1-D
occupation_list = occupation.values.ravel()
all_data["occupation"].replace(occupation_list, list(range(0, len(occupation_list))), inplace=True)
all_data["gender"].replace(['F', 'M'], [0, 1], inplace=True)
all_data["rating"] = all_data["rating"].astype(int)

# feature correlations
all_corr = all_data.corr().abs()
label_name = 'rating'
all_corr_higher_mean = all_corr > all_corr.median()
corr_list = all_corr_higher_mean[label_name]
corr_list = corr_list[corr_list == True].index.tolist()
higher_mean_corr_list = []
for higher_mean_name in corr_list:
    higher_mean_corr_list.append({'name': higher_mean_name, 'value': all_corr[label_name][higher_mean_name]})
higher_mean_corr_list.sort(key=lambda k: (k.get('value', 0)), reverse=True)
# drop the first entry, which is rating itself (correlation 1.0)
higher_mean_corr_list = higher_mean_corr_list[1:]
higher_mean_corr_list
```

![](https://i.imgur.com/JPfKzBB.png)

:::success
Item ID, age, occupation, and the movie-genre columns show the highest correlation with rating.
:::
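As a design note, the above-median filter and manual sort can also be expressed as a single pandas expression. This is only a sketch of an equivalent ranking, reusing `all_data` from the block above:

```python
# Rank every feature by its absolute correlation with rating (same data as above).
rating_corr = all_data.corr()['rating'].abs().drop('rating').sort_values(ascending=False)
print(rating_corr.head(10))
```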
### Data Preprocessing

#### Convert gender and occupation to numeric values

```python
# gender
users_num = user.copy()
# map F/M to 0/1
users_num["gender"].replace(['F', 'M'], [0, 1], inplace=True)

# occupation
occupation = pd.read_csv("u.occupation.occupation", header=None)
# list of occupation categories, flattened to 1-D
occupation_list = occupation.values.ravel()
# map each occupation to an integer code
users_num["occupation"].replace(occupation_list, list(range(0, len(occupation_list))), inplace=True)
```

#### Remove rating outliers

```python
# box plot of the ratings and outlier cleaning:
# values outside 1.5 * IQR are capped at the whisker values
df = rating

def box_plot(col_name):
    bp = df.boxplot(column=[col_name])
    bp.plot()
    plt.show()

def clean_outlier(col_name):
    q1, q2, q3 = df[col_name].quantile([0.25, 0.5, 0.75])
    IQR = q3 - q1
    lower_cap = q1 - 1.5 * IQR
    upper_cap = q3 + 1.5 * IQR
    df[col_name] = df[col_name].apply(lambda x: upper_cap if x > upper_cap else (lower_cap if (x < lower_cap) else x))

box_plot('rating')
clean_outlier('rating')
box_plot('rating')

# after cleaning, write the result to "cleaned_rating.csv"
df.to_csv("cleaned_rating.csv", index=False, encoding="utf8")
```

![](https://i.imgur.com/vA8Udst.png)

:::success
A rating value of 1 is an outlier.
:::

#### Data Standardization

##### Standardize the rating column

```python
# z-score standardization of rating
users_rating_scale = rating.copy()
# mean of rating
mu = users_rating_scale['rating'].mean()
# standard deviation of rating
std = users_rating_scale['rating'].std()
z_score_normalized = (users_rating_scale['rating'] - mu) / std
users_rating_scale['rating'] = z_score_normalized
# write the result to rating_scaled.csv
users_rating_scale.to_csv("rating_scaled.csv", index=False, encoding="utf8")
users_rating_scale.head()
```

![](https://i.imgur.com/rcoX9Q3.png)

### Data Aggregation

:::info
Questions we want to answer
:::

#### Mean and median rating of each movie by gender

```python
# for each movie, compute the mean and median rating of male and female viewers
mean_rating_gender = all_data.pivot_table(values='rating', index='title',
                                          columns='gender', aggfunc=[np.mean, np.median])
# show the first 10 rows
mean_rating_gender.head(10)
```

![](https://i.imgur.com/pV9UPLI.png)

#### Top five movies by male average rating, with the corresponding female averages

```python
# mean rating by gender
mean_rating_gender = all_data.pivot_table(values='rating', index='title',
                                          columns='gender', aggfunc='mean')
# top five by male average
mean_rating_gender.sort_values(by='M', ascending=False).head()
```

![](https://i.imgur.com/VOyIjat.png)

:::success
Few of the movies men rate highest are also rated highly by women.
:::

#### Top five movies by female average rating, with the corresponding male averages

```python
# mean rating by gender
mean_rating_gender = all_data.pivot_table(values='rating', index='title',
                                          columns='gender', aggfunc='mean')
# top five by female average
mean_rating_gender.sort_values(by='F', ascending=False).head()
```

![](https://i.imgur.com/8G4Ao95.png)

#### Movies with the largest gender gap in ratings

```python
# top ten movies where male and female tastes differ the most
# (positive diff = rated higher by women)
mean_rating_gender['diff'] = mean_rating_gender.F - mean_rating_gender.M
mean_rating_gender.sort_values(by='diff', ascending=False).head(10)
```

![](https://i.imgur.com/xIgOZIx.png)

#### More ratings means more popular

```python
# popularity ranking (the more ratings a movie has, the more popular it is)
rating_movie = all_data.groupby('title').size()
# more than 100 ratings counts as popular
top_rating = rating_movie[rating_movie > 100]
# top ten
top_10_rating = top_rating.sort_values(ascending=False)
top_10_rating.head(10)
```

![](https://i.imgur.com/a7WV9bs.png)

#### Ratings of the 20 highest-scoring movies

```python
# the 20 highest-scoring movies
mean_rating = all_data.pivot_table(values='rating', index='title', aggfunc='mean')
top_20_highscore = mean_rating.sort_values(by='rating', ascending=False)
top_20_highscore.head(20)
```

![](https://i.imgur.com/68Es2LG.png)

#### Number of ratings of the 20 highest-scoring movies

```python
# rating counts of the 20 highest-scoring movies
# -> a high-scoring movie may have a tiny audience that simply rates it very highly
rating_movie[top_20_highscore.head(20).index]
```

![](https://i.imgur.com/cZUfQ3n.png)

:::success
Some high-scoring movies have very few ratings, but those few ratings are very high.
:::
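Following up on the observation above, a common remedy is to rank by average rating only among movies that have enough ratings. A minimal sketch, reusing `all_data` and the 100-rating threshold from the popularity ranking above:

```python
# Rank by mean rating, but only among movies with more than 100 ratings.
stats = all_data.groupby('title')['rating'].agg(['mean', 'count'])
popular_and_good = stats[stats['count'] > 100].sort_values(by='mean', ascending=False)
print(popular_and_good.head(10))
```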
## Recommendation Methods

### User-Based

:::info
This method can be used when the user already has a substantial viewing/rating history.
:::

Find the 50 users most similar to the target user, weight the movies they liked by those similarity scores, and sum the weighted scores to produce the recommendations.

```python=
'''
Use the Pearson correlation coefficient to measure how similar every other user
in the dataset is to the target user, take the 50 most similar users to compute
weighted recommendation scores, then sort and recommend the ten highest-scoring movies.
'''
import numpy as np
import math

def loadData():
    f = open('u.data.data')
    data = []
    for i in range(100000):
        h = f.readline().split('\t')
        h = list(map(int, h))
        data.append(h[0:3])
    f.close()
    return data

def loadMovieName():
    f = open('u.item.item', encoding='ISO-8859-1')
    name = []
    for i in range(1682):
        h = f.readline()
        k = ''
        m = 0
        for j in range(100):
            k += str(h[j])
            if str(h[j]) == '|':
                m += 1
            if m == 2:
                break
        name.append(k)
    f.close()
    return name

# reshape the data: one row per user, holding that user's rating for every movie (a 943 x 1682 matrix)
def manageDate(data):
    outdata = []
    for i in range(943):
        outdata.append([])
        for j in range(1682):
            outdata[i].append(0)
    for h in data:
        outdata[h[0] - 1][h[1] - 1] = h[2]
    return outdata

# first compute the means of the two lists
def calcMean(x, y):
    sum_x = sum(x)
    sum_y = sum(y)
    n = len(x)
    x_mean = float(sum_x + 0.0) / n
    y_mean = float(sum_y + 0.0) / n
    return x_mean, y_mean

# then compute the Pearson correlation coefficient
def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)
    n = len(x)
    sumTop = 0.0
    sumBottom = 0.0
    x_pow = 0.0
    y_pow = 0.0
    for i in range(n):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
    for i in range(n):
        x_pow += math.pow(x[i] - x_mean, 2)
    for i in range(n):
        y_pow += math.pow(y[i] - y_mean, 2)
    sumBottom = math.sqrt(x_pow * y_pow)
    if sumBottom == 0:
        # guard against zero variance (a user whose vector is constant)
        return 0
    p = sumTop / sumBottom
    return p

def calcAttribute(dataSet, num):
    prr = []
    # number of rows (users) and columns (movies) in the dataset
    n, m = np.shape(dataSet)
    # the target user's rating vector
    y = dataSet[num - 1]
    # compute the Pearson correlation between every user and the target user
    for j in range(n):
        x = dataSet[j]
        prr.append(calcPearson(x, y))
    return prr

# take the 50 most similar users, compute weighted recommendation scores,
# sort, and recommend the 10 highest-scoring movies the user has not seen yet
def choseMovie(outdata, num):
    prr = calcAttribute(outdata, num)
    sim_list = []
    mid = []
    out_list = []
    movie_rank = []
    for i in range(1682):
        movie_rank.append([i, 0])
    k = 0
    for i in range(943):
        sim_list.append([i, prr[i]])
    # sort users by similarity, descending (bubble sort)
    for i in range(943):
        for j in range(942 - i):
            if sim_list[j][1] < sim_list[j + 1][1]:
                mid = sim_list[j]
                sim_list[j] = sim_list[j + 1]
                sim_list[j + 1] = mid
    # index 0 is the target user itself, so start from 1
    for i in range(1, 51):
        print(i, sim_list[i][1])
        for j in range(0, 1682):
            movie_rank[j][1] = movie_rank[j][1] + outdata[sim_list[i][0]][j] * sim_list[i][1] / 50
    # sort movies by weighted score, descending (bubble sort)
    for i in range(1682):
        for j in range(1681 - i):
            if movie_rank[j][1] < movie_rank[j + 1][1]:
                mid = movie_rank[j]
                movie_rank[j] = movie_rank[j + 1]
                movie_rank[j + 1] = mid
    # take the top ten movies the target user has not rated yet
    for i in range(1682):
        if outdata[num - 1][movie_rank[i][0]] == 0:
            mark = 0
            for d in out_list:
                if d[0] == movie_rank[i][0]:
                    mark = 1
            if mark != 1:
                k += 1
                out_list.append(movie_rank[i])
        if k == 10:
            break
    return out_list

def printMovie(out_list, name):
    print('Based on the data we think you may like these movies:')
    for i in range(10):
        print(name[out_list[i][0]], "rank score:", out_list[i][1])

i_data = loadData()
name = loadMovieName()
out_data = manageDate(i_data)
# user_id of the target user
user = 100
out_list = choseMovie(out_data, user)
printMovie(out_list, name)
# print('end_______________\n')
```

![](https://i.imgur.com/SURQe5U.png)
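As a design note, the explicit Pearson loop and bubble sorts above can be replaced by a single NumPy call that computes every user-user correlation at once. A sketch, reusing `out_data` and `user` from the script above:

```python
import numpy as np

ratings = np.array(out_data, dtype=float)      # 943 x 1682 user-item matrix
sims = np.nan_to_num(np.corrcoef(ratings))     # Pearson correlation between every pair of users
top_50 = np.argsort(-sims[user - 1])[1:51]     # 50 most similar users (index 0 is the user itself)
print(top_50[:10], sims[user - 1][top_50[:10]])
```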
### Item-Based

:::info
When the user has no history at all, movies can be recommended purely from the movies' cosine_similarity; the method can also be used once the user has built up a substantial viewing history.
:::

Take n movies the user likes and sum their similarity vectors, then pick the top m movies. If the user likes more than n movies, n of them are chosen at random, and movies the user has already seen are excluded so the same movies are not recommended back to the user.

```python=
import pandas as pd
import numpy as np
import numpy.matlib
import random
import sys
from sklearn.metrics.pairwise import cosine_similarity

# input parameters
# 1~943
target_user = 169
# 1~5
like_number_threshold = 3
# 15~25
compare_movie_threshold = 25
# 5~15
amount_of_recommended_movies = 10

# load the data
data_header = ["user_id", "item_id", "rating", "timestamp"]
data_pd = pd.read_csv("u.data.data", sep='\t', header=None, names=data_header)
movie_header = ["item_id", "title", "release_date", "video_release_date", "IMDb_URL", "unknown",
                "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary",
                "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western"]
movies = pd.read_csv('u.item.item', sep='|', header=None, encoding='latin1', names=movie_header)
print('Data loaded')

# matrix dimensions (defined at the top level so every function below can use them)
user_id_max = data_pd['user_id'].max()
item_id_max = data_pd['item_id'].max()

# select the movies the target user likes
condition1 = data_pd['user_id'] == target_user
condition2 = data_pd['rating'] > like_number_threshold
user_like_pd = data_pd[condition1 & condition2]
# if the user has no rating data at all, stop
if(len(user_like_pd) == 0):
    sys.exit('No rating data for this user, aborting')
# print('Liked movies selected', user_like_pd)

def output_matrix(data_pd):
    # build the item-user matrix
    rank_matrix = np.matlib.zeros((item_id_max, user_id_max))
    # fill in the ratings
    for i in range(len(data_pd)):
        data = data_pd.iloc[i]
        # ids start at 1, so subtract 1 to get the matrix position
        rank_matrix[data['item_id'] - 1, data['user_id'] - 1] = data['rating']
    print('Rating matrix filled')
    # cosine similarity between movies
    movie_cos_sim = cosine_similarity(rank_matrix)
    print('Cosine similarity computed')
    return movie_cos_sim

movie_cos_sim = output_matrix(data_pd)

def sum_movie_cosine_similarity(user_like_pd, movie_cos_sim):
    pd_count_number = len(user_like_pd)
    print('Number of movies the user likes: ', pd_count_number)
    total_movie_cosine_similarity = np.matlib.zeros((1, item_id_max))
    # if the user likes fewer than compare_movie_threshold movies, sum over all of them
    if(pd_count_number < compare_movie_threshold):
        print('Summing similarity over all liked movies')
        for i in range(0, pd_count_number):
            _item_id = user_like_pd.iloc[i]['item_id']
            total_movie_cosine_similarity += movie_cos_sim[_item_id - 1]  # subtract 1 to index this movie
    # if the user likes at least compare_movie_threshold movies, sum over a random selection
    # (random.randint may pick the same movie twice; the revised version below uses random.sample)
    if(pd_count_number >= compare_movie_threshold):
        print('Summing similarity over a random selection of liked movies')
        for i in range(0, compare_movie_threshold):
            radom_number = random.randint(0, pd_count_number - 1)
            _item_id = user_like_pd.iloc[radom_number]['item_id']
            print('item_id:', _item_id, movie_cos_sim[_item_id - 1].sum())
            total_movie_cosine_similarity += movie_cos_sim[_item_id - 1]  # subtract 1 to index this movie
    total_movie_cosine_similarity_pd = pd.DataFrame(total_movie_cosine_similarity[0])
    total_movie_cosine_similarity_pd = total_movie_cosine_similarity_pd.T
    return total_movie_cosine_similarity_pd

total_movie_cosine_similarity_pd = sum_movie_cosine_similarity(user_like_pd, movie_cos_sim)
print('Summed the similarities of the movies the user likes')

def movie_cos_without_user_like(user_like_pd, total_movie_cosine_similarity_pd):
    # set the similarity of already-seen movies to 0
    user_like_item_list = user_like_pd['item_id'].values
    # convert ids to index positions
    user_like_item_list = user_like_item_list - 1
    for item_index in user_like_item_list:
        total_movie_cosine_similarity_pd[0][item_index] = 0
    # keep only entries greater than 0
    movie_similarity_pd = total_movie_cosine_similarity_pd[total_movie_cosine_similarity_pd > 0]
    movie_similarity_pd = movie_similarity_pd.dropna()
    return movie_similarity_pd

clear_movie_cos = movie_cos_without_user_like(user_like_pd, total_movie_cosine_similarity_pd)
# sort from high to low
clear_movie_cos = clear_movie_cos.sort_values(by=0, ascending=False)
print('Zeroed out the movies the user has already seen and kept only positive values')
# print('clear_movie_cos: \n', clear_movie_cos)

top_movies = movies.filter(clear_movie_cos[0:amount_of_recommended_movies].index.tolist(), axis=0)
print('Took the top amount_of_recommended_movies recommended movies')
# print('top_movies: \n', top_movies)

recommands = pd.merge(top_movies, clear_movie_cos, left_index=True, right_index=True)
recommands = recommands.rename(columns={0: 'cosine_similarity'})
recommands.filter(['cosine_similarity', 'title'])
```

![](https://i.imgur.com/Ha4Sctl.png)
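Filling the dense matrix one row at a time with `iloc` in a Python loop is slow on 100,000 ratings; the same item-user matrix can be built with a single pivot. A sketch, reusing `data_pd`, `user_id_max`, and `item_id_max` from the block above:

```python
# Build the same item-user rating matrix with a pivot instead of a Python loop.
rank_matrix = (data_pd
               .pivot_table(index='item_id', columns='user_id', values='rating', fill_value=0)
               .reindex(index=range(1, item_id_max + 1),
                        columns=range(1, user_id_max + 1), fill_value=0)
               .to_numpy())
movie_cos_sim = cosine_similarity(rank_matrix)
```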
### Item-Based, AI-Revised Version

```python=
import pandas as pd
import numpy as np
import random
import sys
from sklearn.metrics.pairwise import cosine_similarity

# input parameters
# 1~943
target_user = 169
# 1~5
like_number_threshold = 3
# 15~25
compare_movie_threshold = 25
# 5~15
amount_of_recommended_movies = 10

# load the data
data_header = ["user_id", "item_id", "rating", "timestamp"]
data_pd = pd.read_csv("u.data.data", sep='\t', header=None, names=data_header)
movie_header = ["item_id", "title", "release_date", "video_release_date", "IMDb_URL", "unknown",
                "Action", "Adventure", "Animation", "Children's", "Comedy", "Crime", "Documentary",
                "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western"]
movies = pd.read_csv('u.item.item', sep='|', header=None, encoding='latin1', names=movie_header)
print('Data loaded')

# select the movies the target user likes
condition1 = data_pd['user_id'] == target_user
condition2 = data_pd['rating'] > like_number_threshold
user_like_pd = data_pd[condition1 & condition2]
# if the user has no rating data at all, terminate the program
if user_like_pd.empty:
    sys.exit('No rating data for this user, aborting')
print('Liked movies selected', user_like_pd)

# build the item-user matrix
user_id_max = data_pd['user_id'].max()
item_id_max = data_pd['item_id'].max()
rank_matrix = np.zeros((item_id_max, user_id_max))
# fill in the ratings
for i in range(len(data_pd)):
    data = data_pd.iloc[i]
    # ids start at 1, so subtract 1 to get the matrix position
    rank_matrix[data['item_id'] - 1, data['user_id'] - 1] = data['rating']
print('Rating matrix filled')

# cosine similarity between movies
movie_cos_sim = cosine_similarity(rank_matrix)
print('Cosine similarity computed')

# sum the similarities of the movies the user likes
pd_count_number = len(user_like_pd)
print('Number of movies the user likes: ', pd_count_number)
total_movie_cosine_similarity = np.zeros((1, item_id_max))
# if the user likes fewer than compare_movie_threshold movies, sum over all of them
if pd_count_number < compare_movie_threshold:
    for i in range(pd_count_number):
        data = user_like_pd.iloc[i]
        total_movie_cosine_similarity += movie_cos_sim[data['item_id'] - 1]
# otherwise sum over a random sample of compare_movie_threshold liked movies
else:
    # pick compare_movie_threshold distinct movies at random
    random_index_list = random.sample(range(pd_count_number), compare_movie_threshold)
    for i in random_index_list:
        data = user_like_pd.iloc[i]
        total_movie_cosine_similarity += movie_cos_sim[data['item_id'] - 1]
print('Similarity summation finished')

# note: unlike the previous version, this one does not exclude movies the user has already rated
# sort by similarity and take the top amount_of_recommended_movies movies
recommands = pd.DataFrame(total_movie_cosine_similarity[0], columns=['cosine_similarity'])
recommands = pd.merge(recommands, movies[['item_id']], left_index=True, right_index=True)
recommands = recommands.sort_values(by=['cosine_similarity'], ascending=False)
recommands = recommands.iloc[:amount_of_recommended_movies]

# print the titles of the recommended movies
print('Recommended movies:')
for i in range(len(recommands)):
    data = recommands.iloc[i]
    print(movies[movies['item_id'] == data['item_id']]['title'].values[0])
```

![](https://i.imgur.com/Ema14iP.png)

### SVC

:::info
We first use SVC as a plain multi-class classifier, but labels 1 and 2 have far fewer samples, so the trained model's recall on them is 0.
We then apply SMOTE to bring every label up to the same number of samples, and recall rises accordingly.
The classification accuracy is still generally low, so we recode labels 1, 2, 3 as 0 (do not recommend) and labels 4, 5 as 1 (recommend) to raise accuracy, which makes it easier to decide whether to recommend a movie to the user (see the sketch after the SMOTE step below).
:::

Import the related modules:

```python=
# data splitting
from sklearn.model_selection import train_test_split
# model evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
# automatic standardization
from sklearn.preprocessing import StandardScaler
# SMOTE over-sampling
from imblearn.over_sampling import SMOTE
```

Standardize the data, split it into training and test sets, and apply SMOTE so that every label has the same number of samples:

```python=+
scaler = StandardScaler()
# split the dataset
# (database is the feature table used for classification; it is not constructed in this excerpt,
#  presumably the merged user/movie/rating table such as all_data)
X = database.drop('rating', axis=1)   # features
X = scaler.fit_transform(X)
y = database['rating']                # labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# SMOTE the training data so every label has the same number of samples
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)
```
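The label recoding described in the SVC box above (ratings 1–3 → 0, ratings 4–5 → 1) is not shown in the notebook; a minimal sketch, assuming the same `database` table, might look like this:

```python
# Hypothetical sketch: binarize the labels before training,
# 1-3 -> 0 (do not recommend), 4-5 -> 1 (recommend).
y_binary = (database['rating'] >= 4).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3)
X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)
```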
Define the SVC parameters, train the model, report the evaluation metrics, and run cross-validation with the dataset split into three folds:

```python=+
from sklearn import svm

clf = svm.SVC(kernel='poly', degree=3, coef0=1, C=5)
clf.fit(X_train, y_train)
clf_predictions = clf.predict(X_test)
clf_acc_score = metrics.accuracy_score(y_test, clf_predictions)
clf_f1_score = metrics.f1_score(y_test, clf_predictions, average='macro')
print('Test Accuracy score: ', clf_acc_score)
print('Test F1 score: ', clf_f1_score)
print('-----------------------------------------------------\n')
print('classification_report :\n')
print(classification_report(y_test, clf_predictions))
print('-----------------------------------------------------\n')
print('confusion_matrix :\n')
print(confusion_matrix(y_test, clf_predictions))
print('-----------------------------------------------------\n')

# cross_val_score
# cross-validation: split the whole dataset into 3 folds (cv=3),
# train on two folds and test on the third, rotating the test fold three times
clf_scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
print(clf_scores)
print('-----------------------------------------------------\n')
print('scores-mean\n')
print(clf_scores.mean())
```

![](https://i.imgur.com/6nS69XR.png)

## Comparison and Summary of the Recommendation Methods

### When the user has not watched any movies, recommendations can come from item-based similarity, the most-rated movies, or the highest overall ratings

Subjective usage notes: item-based similarity is built from other users' rating data, so if the new user turns out to resemble past users, it gives very good recommendations. The next-best choice is the highest overall rating. The last is the largest number of ratings, because a high rating count may reflect public discussion around a movie rather than genuine enjoyment of it.

### Once the user has started watching movies, user-based CF, SVC, or item-based CF combined with the user's history can be used

Subjective usage notes: when there are many movies, user-based and item-based CF can be mixed with the user's own history, weighting by the similarity of a random sample of users or of a random sample of movies, which makes the recommendations more varied. SVC can be applied to newly released movies to decide whether to recommend them: first predict recommend / do-not-recommend, and if the result is "recommend", run the 1–5 label model and pick a score threshold before actually recommending, to improve the precision of the recommendations.