[TOC]
# Recommender System Final Project
## Module Imports
Tool: Jupyter Notebook
```python
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
%matplotlib inline
# Data splitting
from sklearn.model_selection import train_test_split
# Model evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import cross_val_score
# Standardization
from sklearn.preprocessing import StandardScaler
# SMOTE oversampling
from imblearn.over_sampling import SMOTE
from sklearn import svm
```
## Data Description
### Loading the Data
```python=
# u.data: rating data
rating_header = ["user_id", "item_id", "rating", "timestamp"]
rating = pd.read_csv("u.data.data", sep = '\t', header = None, names=rating_header)
print(rating.head())
print('\n')
rating.info()
print("__________________________________________________________________________\n")
# u.item: movie data
movie_header = ["item_id", "title", "release_date", "video_release_date", "IMDb_URL",
"unknown", "Action", "Adventure", "Animation","Children's", "Comedy", "Crime",
"Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
"Romance", "Sci-Fi", "Thriller", "War", "Western"]
movies = pd.read_csv('u.item.item', sep = '|', header = None, encoding = 'latin1', names = movie_header)
print(movies.head())
print('\n')
movies.info()
print("__________________________________________________________________________\n")
# u.user: user data
user_header = ["user_id", "age", "gender", "occupation", "zip_code"]
user = pd.read_csv('u.user.user', sep = '|', header = None, encoding = 'latin1', names = user_header)
print(user.head())
print('\n')
user.info()
print("__________________________________________________________________________\n")
```
### Exploring the Data
#### Age
```python
user.hist('age')
```

#### Occupation
Students are the largest group.
```python
occupation_count = user[["user_id", "occupation"]].groupby("occupation", as_index=False).size()
plt.pie(occupation_count["size"], labels=occupation_count["occupation"])
plt.title("Occupation Distribution")
plt.axis("equal")
plt.show()
```

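#### Gender
There are more male users than female users.
```python
# Count users per gender; `user` still holds the raw 'F'/'M' labels at this point
user['gender'].value_counts().plot(kind='bar')
plt.title('Gender')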
```

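#### Correlation Analysis
##### Pearson
Preprocess the data first, then find the features whose absolute correlation with `rating` is above the median of that column.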
```python
# Merge users, ratings, and movies (the raw ratings are used for now, not the cleaned ones)
all_data = pd.merge(pd.merge(user, rating), movies)
# Drop string and unused columns on a separate frame,
# so all_data keeps 'title' and the 'F'/'M' labels needed by the later pivot tables
corr_data = all_data.drop(['title', 'video_release_date', 'release_date', 'IMDb_URL', 'unknown', 'timestamp', "zip_code"],axis=1)
# Occupation names, one per line
occupation = pd.read_csv("u.occupation.occupation", header = None)
# List of all occupations
occupation_list = occupation[0].tolist()
# Encode occupation as its index in the list and gender as 0/1
corr_data["occupation"] = corr_data["occupation"].map({name: i for i, name in enumerate(occupation_list)})
corr_data["gender"] = corr_data["gender"].map({'F': 0, 'M': 1})
corr_data["rating"] = corr_data["rating"].astype(int)
# Absolute Pearson correlation between every pair of columns
all_corr = corr_data.corr().abs()
label_name = 'rating'
# Keep the columns whose correlation with rating exceeds the median of that column
all_corr_higher_mean = all_corr > all_corr.median()
corr_list = all_corr_higher_mean[label_name]
corr_list = corr_list[corr_list == True].index.tolist()
higher_mean_corr_list = []
for higher_mean_name in corr_list:
    higher_mean_corr_list.append({'name': higher_mean_name ,'value':all_corr[label_name][higher_mean_name]})
higher_mean_corr_list.sort(key=lambda k: (k.get('value', 0)), reverse = True)
# Drop the first entry, which is rating's correlation with itself
higher_mean_corr_list = higher_mean_corr_list[1:]
higher_mean_corr_list
```

:::success
The four features most correlated with rating are the movie item, age, occupation, and movie genre.
:::
### Data Preprocessing
#### Converting Gender and Occupation to Numbers
```python
# Gender
users_num = user.copy()
# Map F/M to 0/1
users_num["gender"] = users_num["gender"].map({'F': 0, 'M': 1})
# Occupation names, one per line
occupation = pd.read_csv("u.occupation.occupation", header = None)
# List of all occupations
occupation_list = occupation[0].tolist()
# Map each occupation to its index in the list
users_num["occupation"] = users_num["occupation"].map({name: i for i, name in enumerate(occupation_list)})
```
#### Removing rating Outliers
```python
# Box plot of the ratings before and after cleaning the outliers
# df is an alias of rating, so the capping below also changes rating
df = rating
def box_plot(col_name):
    df.boxplot(column=[col_name])
    plt.show()
def clean_outlier(col_name):
    q1, q3 = df[col_name].quantile([0.25, 0.75])
    IQR = q3-q1
    lower_cap = q1 - 1.5*IQR
    upper_cap = q3 + 1.5*IQR
    # Values outside [lower_cap, upper_cap] are pulled back to the nearest cap
    df[col_name] = df[col_name].clip(lower=lower_cap, upper=upper_cap)
box_plot('rating')
clean_outlier('rating')
box_plot('rating')
# Save the cleaned ratings to "cleaned_rating.csv"
df.to_csv("cleaned_rating.csv", index=False, encoding="utf8")
```

:::success
A rating of 1 is an outlier.
:::
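
The capping rule used above can be tried on a small series. This is a minimal sketch: `cap_outliers` and `toy_ratings` are hypothetical names, and the values are made up for illustration rather than taken from `u.data`.

```python
import pandas as pd

def cap_outliers(values: pd.Series) -> pd.Series:
    """Clip values to the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Hypothetical toy ratings (floats, so a capped value such as 1.5 fits the dtype)
toy_ratings = pd.Series([1, 3, 3, 4, 4, 4, 4, 5], dtype=float)
capped = cap_outliers(toy_ratings)
print(capped.tolist())
```

Here Q1 is 3 and Q3 is 4, so the lower cap is 1.5 and the single rating of 1 is raised to 1.5. Note that the value is capped, not dropped, so the number of rows stays the same.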
#### Standardization
##### Standardizing rating
```python
# Standardize rating with the z-score
users_rating_scale = rating.copy()
# Mean of rating
mu = users_rating_scale['rating'].mean()
# Standard deviation of rating
std = users_rating_scale['rating'].std()
z_score_normalized = (users_rating_scale['rating'] - mu)/std
users_rating_scale['rating'] = z_score_normalized
# Save to rating_scaled.csv
users_rating_scale.to_csv("rating_scaled.csv", index=False, encoding="utf8")
users_rating_scale.head()
```
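
One detail worth knowing when comparing this manual z-score with `StandardScaler` (used later for SVC): pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below shows both on a hypothetical toy series; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Same formula as above: pandas uses the sample standard deviation (ddof=1)
z_sample = (toy - toy.mean()) / toy.std()

# Population standard deviation (ddof=0), the convention StandardScaler follows
z_population = (toy - toy.mean()) / toy.std(ddof=0)

print(z_sample.round(3).tolist())
print(z_population.round(3).tolist())
```

The two results differ only by a constant factor, which shrinks as the number of rows grows.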

### Aggregate Analysis
:::info
Questions we want to answer
:::
#### Mean and Median Rating of Each Movie by Gender
```python
# Mean and median rating of each movie, split by viewer gender
mean_rating_gender = all_data.pivot_table(values ='rating', index ='title', columns ='gender', aggfunc =['mean', 'median'])
# Show the first 10 rows
mean_rating_gender.head(10)
```

#### Top Five Movies by Male Mean Rating, with the Female Mean for Comparison
```python
# Mean rating of each movie by viewer gender
mean_rating_gender = all_data.pivot_table(values ='rating', index ='title', columns ='gender', aggfunc = 'mean')
# Top five by male mean rating
mean_rating_gender.sort_values(by='M', ascending = False).head()
```

:::success
Few of the movies that men rate highest are also rated highly by women.
:::
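#### Top Five Movies by Female Mean Rating, with the Male Mean for Comparison
```python
# Mean rating of each movie by viewer gender
mean_rating_gender = all_data.pivot_table(values ='rating', index ='title', columns ='gender', aggfunc = 'mean')
# Top five by female mean rating
mean_rating_gender.sort_values(by='F', ascending = False).head()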
```

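#### Movies with the Largest Gender Gap in Ratings
```python
# Top ten movies that women rate higher than men (diff = female mean - male mean)
mean_rating_gender['diff'] = mean_rating_gender.F-mean_rating_gender.M
mean_rating_gender.sort_values(by='diff', ascending = False).head(10)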
```

#### More Ratings Means More Popular
```python
# Popularity ranking (the more ratings, the more popular)
rating_movie = all_data.groupby('title').size()
# More than 100 ratings counts as popular
top_rating = rating_movie[rating_movie > 100]
# Top ten
top_10_rating = top_rating.sort_values(ascending =False)
top_10_rating.head(10)
```

#### Ratings of the Top 20 Highest-Rated Movies
```python
# Top 20 movies by mean rating
mean_rating = all_data.pivot_table(values = 'rating', index = 'title', aggfunc = 'mean')
top_20_highscore = mean_rating.sort_values(by='rating', ascending=False)
top_20_highscore.head(20)
```

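#### Number of Ratings of the Top 20 Highest-Rated Movies
```python
# Rating counts of the top 20: a high-scoring movie may have a small audience that rates it very highly
rating_movie[top_20_highscore.head(20).index]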
```

:::success
Some high-scoring movies have a small audience that nevertheless rates them very highly.
:::
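
One way to keep such movies from dominating a "highest-rated" list is to require a minimum number of ratings before ranking by mean. The following is a minimal sketch on made-up data: the titles, the values, and the `min_count` threshold are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy ratings; titles and values are made up
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B", "C", "C"],
    "rating": [4, 5, 4, 5, 3, 4],
})

# Mean rating and number of ratings per movie
stats = toy.groupby("title")["rating"].agg(["mean", "count"])

# Assumed threshold for the toy data; the popularity section above uses 100
min_count = 2
trusted = stats[stats["count"] >= min_count].sort_values("mean", ascending=False)
print(trusted)
```

Movie B has the highest mean but only one rating, so it is filtered out and A ranks first.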
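## Recommendation Methods
### User-Based
:::info
Use this method when the user already has a substantial viewing history.
:::
Find the 50 users most similar to the target user, weight the movies they like by that similarity, and sum the weighted ratings to produce the recommendations.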
```python=
'''
Measure how similar every other user is to the target user with the Pearson
correlation coefficient, weight the ratings of the 50 most similar users by
that similarity, and recommend the ten movies with the highest score.
'''
import numpy as np
import math
N_USERS = 943
N_MOVIES = 1682
def loadData():
    # Keep (user_id, item_id, rating) from each line and drop the timestamp
    data = []
    with open('u.data.data') as f:
        for line in f:
            if line.strip():
                data.append(list(map(int, line.split('\t')))[0:3])
    return data
def loadMovieName():
    # The title is the second '|'-separated field of each line
    name = []
    with open('u.item.item', encoding = 'ISO-8859-1') as f:
        for line in f:
            if line.strip():
                name.append(line.split('|')[1])
    return name
# Build the 943 x 1682 matrix: one row per user, one column per movie, 0 = not rated
def manageDate(data):
    outdata = [[0] * N_MOVIES for _ in range(N_USERS)]
    for h in data:
        outdata[h[0] - 1][h[1] - 1] = h[2]
    return outdata
# Mean of each list
def calcMean(x, y):
    n = len(x)
    return sum(x) / n, sum(y) / n
# Pearson correlation coefficient of two equal-length lists
def calcPearson(x, y):
    x_mean, y_mean = calcMean(x, y)
    sumTop = 0.0
    x_pow = 0.0
    y_pow = 0.0
    for i in range(len(x)):
        sumTop += (x[i] - x_mean) * (y[i] - y_mean)
        x_pow += math.pow(x[i] - x_mean, 2)
        y_pow += math.pow(y[i] - y_mean, 2)
    sumBottom = math.sqrt(x_pow * y_pow)
    # A constant vector has no defined correlation; treat it as 0
    if sumBottom == 0:
        return 0.0
    return sumTop / sumBottom
# Pearson correlation between user `num` and every user (including themselves)
def calcAttribute(dataSet, num):
    y = dataSet[num - 1]
    return [calcPearson(x, y) for x in dataSet]
# Weight the ratings of the 50 most similar users and return the 10 best unrated movies
def choseMovie(outdata, num):
    prr = calcAttribute(outdata, num)
    # Rank the other users by similarity, highest first (the target user is excluded)
    neighbors = [(i, p) for i, p in enumerate(prr) if i != num - 1]
    neighbors.sort(key=lambda pair: pair[1], reverse=True)
    movie_rank = [[i, 0] for i in range(N_MOVIES)]
    for rank, (neighbor, sim) in enumerate(neighbors[:50], start=1):
        print(rank, sim)
        for j in range(N_MOVIES):
            movie_rank[j][1] += outdata[neighbor][j] * sim / 50
    # Sort movies by score, highest first
    movie_rank.sort(key=lambda m: m[1], reverse=True)
    # Keep the ten best movies that the target user has not rated
    out_list = []
    for movie in movie_rank:
        if outdata[num - 1][movie[0]] == 0:
            out_list.append(movie)
            if len(out_list) == 10:
                break
    return out_list
def printMovie(out_list, name):
    print('Based on the data, we think you may like these movies: ')
    for movie_index, score in out_list:
        print(name[movie_index], "rank score:", score)
i_data = loadData()
name = loadMovieName()
out_data = manageDate(i_data)
# Target user_id (named target_user_id so it does not overwrite the `user` DataFrame)
target_user_id = 100
out_list = choseMovie(out_data, target_user_id)
printMovie(out_list, name)
```
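
The same idea can be checked quickly on a tiny matrix. The sketch below is not the code above: it is a vectorized variant using `np.corrcoef`, and the function name `recommend_user_based`, its parameters `k` and `n`, and the toy ratings are all assumptions made for illustration.

```python
import numpy as np

def recommend_user_based(ratings, target, k=2, n=2):
    """ratings: users x movies array with 0 for unrated; target: row index of the user."""
    # Pearson similarity between the target user and every user
    sims = np.corrcoef(ratings)[target].copy()
    # Never pick the target user as their own neighbour
    sims[target] = -np.inf
    neighbours = np.argsort(sims)[::-1][:k]
    # Similarity-weighted sum of the neighbours' ratings, averaged over k
    scores = sims[neighbours] @ ratings[neighbours] / k
    # Hide movies the target user has already rated
    scores[ratings[target] > 0] = -np.inf
    return np.argsort(scores)[::-1][:n]

# Hypothetical toy data: 4 users x 5 movies
toy = np.array([
    [5, 4, 0, 0, 1],
    [5, 5, 4, 0, 1],
    [4, 5, 0, 5, 2],
    [1, 0, 5, 4, 5],
], dtype=float)

top = recommend_user_based(toy, target=0)
print(top.tolist())
```

User 0 has not rated movies 2 and 3. Users 1 and 2 are the closest neighbours, and user 1 (the more similar of the two) rated movie 2, so movie 2 is ranked ahead of movie 3.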

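### Item-Based
:::info
When there is no data about the user, movies can be recommended purely from the cosine similarity between movies,
or
when the user already has a substantial viewing history, this method can make use of it.
:::
Take n movies the user likes, sum the similarity lists of those movies, and pick the top m. When the user likes more than n movies, n of them are picked at random, so the user is not recommended the same movies every time.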
```python=
import pandas as pd
import numpy as np
import random
import sys
from sklearn.metrics.pairwise import cosine_similarity
# Input parameters
# 1~943
target_user = 169
# 1~5
like_number_threshold = 3
# 15~25
compare_movie_threshold = 25
# 5~15
amount_of_recommended_movies = 10
# Load the data
data_header = ["user_id", "item_id", "rating", "timestamp"]
data_pd = pd.read_csv("u.data.data", sep = '\t', header = None, names=data_header)
movie_header = ["item_id", "title", "release_date", "video_release_date", "IMDb_URL",
"unknown", "Action", "Adventure", "Animation","Children's", "Comedy", "Crime",
"Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
"Romance", "Sci-Fi", "Thriller", "War", "Western"]
movies = pd.read_csv('u.item.item', sep = '|', header = None, encoding = 'latin1', names = movie_header)
print('Data loaded')
# Movies the target user rated above the threshold
condition1 = data_pd['user_id'] == target_user
condition2 = data_pd['rating'] > like_number_threshold
user_like_pd = data_pd[condition1 & condition2]
# Stop if the user has no liked movies at all
if len(user_like_pd) == 0:
    sys.exit('Stopped: the user has no liked movies')
def output_matrix(data_pd):
    # Build the item-by-user rating matrix
    user_id_max = data_pd['user_id'].max()
    item_id_max = data_pd['item_id'].max()
    rank_matrix = np.zeros((item_id_max, user_id_max))
    # IDs start at 1, so subtract 1 to get the matrix index
    rank_matrix[data_pd['item_id'].values - 1, data_pd['user_id'].values - 1] = data_pd['rating'].values
    print('Rating matrix filled')
    # Cosine similarity between every pair of movies
    movie_cos_sim = cosine_similarity(rank_matrix)
    print('Cosine similarity computed')
    return movie_cos_sim
movie_cos_sim = output_matrix(data_pd)
def sum_movie_cosine_similarity(user_like_pd, movie_cos_sim):
    pd_count_number = len(user_like_pd)
    print('Number of liked movies: ', pd_count_number)
    liked_item_ids = user_like_pd['item_id'].tolist()
    if pd_count_number < compare_movie_threshold:
        # Fewer liked movies than compare_movie_threshold: sum over all of them
        print('Summing similarity over all liked movies')
        chosen_item_ids = liked_item_ids
    else:
        # Otherwise pick compare_movie_threshold liked movies at random, without repeats
        print('Summing similarity over a random subset of liked movies')
        chosen_item_ids = random.sample(liked_item_ids, compare_movie_threshold)
    total_movie_cosine_similarity = np.zeros(movie_cos_sim.shape[0])
    for _item_id in chosen_item_ids:
        # Subtract 1 to get the row that belongs to this movie
        total_movie_cosine_similarity += movie_cos_sim[_item_id - 1]
    return pd.DataFrame(total_movie_cosine_similarity)
total_movie_cosine_similarity_pd = sum_movie_cosine_similarity(user_like_pd, movie_cos_sim)
print('Summed the similarity of the liked movies')
def movie_cos_without_user_like(user_like_pd, total_movie_cosine_similarity_pd):
    # Set the similarity of the liked movies to 0 (item_id - 1 is the row index)
    user_like_item_index = user_like_pd['item_id'].values - 1
    total_movie_cosine_similarity_pd.loc[user_like_item_index, 0] = 0
    # Keep only the movies with similarity above 0
    movie_similarity_pd = total_movie_cosine_similarity_pd[total_movie_cosine_similarity_pd[0] > 0]
    return movie_similarity_pd
clear_movie_cos = movie_cos_without_user_like(user_like_pd, total_movie_cosine_similarity_pd)
# Sort from highest to lowest
clear_movie_cos = clear_movie_cos.sort_values(by=0, ascending=False)
print('Liked movies set to 0; only values above 0 are kept')
top_movies = movies.filter(clear_movie_cos[0:amount_of_recommended_movies].index.tolist(), axis=0)
print('Selected the top amount_of_recommended_movies movies')
recommands = pd.merge(top_movies, clear_movie_cos, left_index=True, right_index=True)
recommands = recommands.rename(columns={0: 'cosine_similarity'})
recommands.filter(['cosine_similarity', 'title'])
```
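
To see the item-based steps in isolation, here is a minimal sketch on a tiny item-by-user matrix. It computes cosine similarity with plain NumPy instead of scikit-learn; the function names, the toy matrix, and the `liked_items` list are hypothetical and only meant to illustrate the approach.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Cosine similarity between every pair of rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows keep similarity 0 instead of dividing by 0
    unit = matrix / norms
    return unit @ unit.T

def recommend_item_based(item_user, liked_items, n=2):
    """Sum the similarity rows of the liked items and return the n best other items."""
    sim = cosine_similarity_rows(item_user)
    scores = sim[liked_items].sum(axis=0)
    scores[liked_items] = 0  # do not recommend what the user already likes
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order if scores[i] > 0][:n]

# Hypothetical toy data: 4 items x 3 users, 0 = not rated
toy = np.array([
    [5, 4, 0],
    [4, 5, 0],
    [0, 0, 5],
    [5, 0, 1],
], dtype=float)

print(recommend_item_based(toy, liked_items=[0]))
```

Item 1 was rated by the same users as item 0 and with similar scores, so it comes first; item 2 shares no raters with item 0 and is never recommended.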

### Item-Based, AI-Revised Version
```python=
import pandas as pd
import numpy as np
import random
import sys
from sklearn.metrics.pairwise import cosine_similarity
# Input parameters
# 1~943
target_user = 169
# 1~5
like_number_threshold = 3
# 15~25
compare_movie_threshold = 25
# 5~15
amount_of_recommended_movies = 10
# Load the data
data_header = ["user_id", "item_id", "rating", "timestamp"]
data_pd = pd.read_csv("u.data.data", sep = '\t', header = None, names=data_header)
movie_header = ["item_id", "title", "release_date", "video_release_date", "IMDb_URL",
"unknown", "Action", "Adventure", "Animation","Children's", "Comedy", "Crime",
"Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical", "Mystery",
"Romance", "Sci-Fi", "Thriller", "War", "Western"]
movies = pd.read_csv('u.item.item', sep = '|', header = None, encoding = 'latin1', names = movie_header)
print('Data loaded')
# Movies the target user rated above the threshold
condition1 = data_pd['user_id'] == target_user
condition2 = data_pd['rating'] > like_number_threshold
user_like_pd = data_pd[condition1 & condition2]
# Stop if the user has no liked movies at all
if user_like_pd.empty:
    sys.exit('Stopped: the user has no liked movies')
print('Liked movies selected', user_like_pd)
# Build the item-by-user rating matrix
user_id_max = data_pd['user_id'].max()
item_id_max = data_pd['item_id'].max()
rank_matrix = np.zeros((item_id_max, user_id_max))
# IDs start at 1, so subtract 1 to get the matrix index
rank_matrix[data_pd['item_id'].values - 1, data_pd['user_id'].values - 1] = data_pd['rating'].values
print('Rating matrix filled')
# Cosine similarity between every pair of movies
movie_cos_sim = cosine_similarity(rank_matrix)
print('Cosine similarity computed')
# Sum the similarity rows of the movies the user likes
pd_count_number = len(user_like_pd)
print('Number of liked movies: ', pd_count_number)
total_movie_cosine_similarity = np.zeros((1, item_id_max))
if pd_count_number < compare_movie_threshold:
    # Fewer liked movies than compare_movie_threshold: sum over all of them
    for i in range(pd_count_number):
        data = user_like_pd.iloc[i]
        total_movie_cosine_similarity += movie_cos_sim[data['item_id'] - 1]
else:
    # Otherwise pick compare_movie_threshold liked movies at random, without repeats
    random_index_list = random.sample(range(pd_count_number), compare_movie_threshold)
    for i in random_index_list:
        data = user_like_pd.iloc[i]
        total_movie_cosine_similarity += movie_cos_sim[data['item_id'] - 1]
print('Similarity summed')
# Set the liked movies to 0 so they are not recommended again (as in the version above)
total_movie_cosine_similarity[0, user_like_pd['item_id'].values - 1] = 0
# Sort by similarity and keep the top amount_of_recommended_movies movies
recommands = pd.DataFrame(total_movie_cosine_similarity[0], columns=['cosine_similarity'])
recommands = pd.merge(recommands, movies[['item_id']], left_index=True, right_index=True)
recommands = recommands.sort_values(by=['cosine_similarity'], ascending=False)
recommands = recommands.iloc[:amount_of_recommended_movies]
# Print the titles of the recommended movies
print('Recommended movies:')
for i in range(len(recommands)):
    data = recommands.iloc[i]
    print(movies[movies['item_id'] == data['item_id']]['title'].values[0])
```

### SVC
:::info
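SVC is first used as an ordinary multi-class classifier, but labels 1 and 2 have few samples, so the trained model's recall for them is 0.
SMOTE is then applied to bring every label up to the same number of samples, and recall rises accordingly.
Classification accuracy is still generally low afterwards, so labels 1, 2, 3 are mapped to 0 (do not recommend) and labels 4, 5 to 1 (recommend). This raises accuracy and makes it easier to decide whether to recommend a movie to the user.
:::
Import the required modules: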
```python=
# Data splitting
from sklearn.model_selection import train_test_split
# Model evaluation
from sklearn import metrics
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import cross_val_score
# Standardization
from sklearn.preprocessing import StandardScaler
# SMOTE oversampling
from imblearn.over_sampling import SMOTE
```
Standardize the data, split it, and apply SMOTE to the training set (to balance the labels):
```python=+
scaler = StandardScaler()
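# Split features and label (database is the prepared feature table; how it is built is not shown here)
X = database.drop('rating',axis=1)  # features
X = scaler.fit_transform(X)
y = database['rating']  # label
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3)
# SMOTE: oversample the training set so every label has the same number of samples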
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train ,y_train)
```
Define the SVC parameters, train the model, print the evaluation metrics, and run 3-fold cross-validation:
```python=+
from sklearn import svm
clf = svm.SVC(kernel='poly', degree=3, coef0=1, C=5)
clf.fit(X_train,y_train)
clf_predictions = clf.predict(X_test)
clf_acc_score=metrics.accuracy_score(y_test, clf_predictions)
clf_f1_score=metrics.f1_score(y_test, clf_predictions, average='macro')
print('Test Accuracy score: ', clf_acc_score)
print('Test F1 score: ', clf_f1_score)
print('-----------------------------------------------------\n')
print('classification_report :\n')
print(classification_report(y_test,clf_predictions))
print('-----------------------------------------------------\n')
print('confusion_matrix :\n')
print(confusion_matrix(y_test,clf_predictions))
print('-----------------------------------------------------\n')
# cross_val_score
# 3-fold cross-validation (cv=3): train on two folds, test on the third, and rotate the test fold three times
clf_scores = cross_val_score(clf,X,y,cv=3,scoring='accuracy')
print(clf_scores)
print('-----------------------------------------------------\n')
print('scores-mean\n')
print(clf_scores.mean())
```
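
The label conversion described in the info box at the top of this section (1, 2, 3 become 0; 4, 5 become 1) is not shown in the code above. A minimal sketch of that step could look like this; `to_recommend_label` and `toy_y` are hypothetical names, and in the pipeline above the function would be applied to `y` before the split.

```python
import pandas as pd

def to_recommend_label(ratings: pd.Series) -> pd.Series:
    """Map ratings 1-3 to 0 (do not recommend) and 4-5 to 1 (recommend)."""
    return (ratings >= 4).astype(int)

# Hypothetical toy labels covering every rating value
toy_y = pd.Series([1, 2, 3, 4, 5])
binary_y = to_recommend_label(toy_y)
print(binary_y.tolist())  # [0, 0, 0, 1, 1]
```

With only two classes, the `average='macro'` F1 score and the confusion matrix printed above remain valid without further changes.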

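## Comparison and Summary of the Recommenders
### When the user has not watched any movie: recommend with item-based similarity, the most-rated movies, or the highest-rated movies
Subjective usage notes:
Item-based similarity is built from other users' ratings, so if the new user turns out to resemble past users, it gives very good recommendations. The highest-rated list comes second. The most-rated list comes last, because a high rating count may reflect how much a movie was talked about rather than how much viewers liked it.
### When the user has started watching movies: recommend with user-based, SVC, or item-based combined with the user's history
Subjective usage notes:
When there are many movies, user-based and item-based (combined with the user's history) can be mixed: drawing on a random number of similar users, or a random number of similar movies, gives the user more varied recommendations.
SVC can be applied to newly released movies to decide whether to recommend them to the user. The computation can start with the recommend / do-not-recommend model; if the result is "recommend", the 1~5 label model can then be run, and a chosen rating threshold decides whether the movie is actually recommended, which improves recommendation precision.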
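
The two-stage SVC decision described above can be sketched as follows. This is an illustration only: `two_stage_recommend`, `min_rating`, and the stand-in models are hypothetical, and in practice the two callables would wrap the `predict` method of the binary and the 1~5 classifiers.

```python
def two_stage_recommend(features, binary_model, rating_model, min_rating=4):
    """Return True only if both stages agree the movie should be recommended."""
    # Stage 1: coarse recommend (1) / do-not-recommend (0) decision
    if binary_model(features) != 1:
        return False
    # Stage 2: the predicted 1~5 rating must reach the chosen threshold
    return rating_model(features) >= min_rating

# Stand-in models for illustration only
def always_recommend(features):
    return 1

def never_recommend(features):
    return 0

def predict_three(features):
    return 3

def predict_five(features):
    return 5

sample = [0.2, 0.7]  # hypothetical standardized feature vector
print(two_stage_recommend(sample, always_recommend, predict_three))  # False
print(two_stage_recommend(sample, always_recommend, predict_five))   # True
print(two_stage_recommend(sample, never_recommend, predict_five))    # False
```

Raising `min_rating` makes the second stage stricter, trading fewer recommendations for higher precision.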