---
title: KKBox's Music Recommendation (a concluded Kaggle competition)
---

# Introduction

## Research Purpose

The [KKBOX music recommendation](https://www.kaggle.com/c/kkbox-music-recommendation-challenge) dataset is provided by KKBOX, Asia's leading music streaming service, which holds a library of over 30 million tracks. Their current recommendation system uses collaborative filtering, applying matrix factorization and word embeddings. This project uses the Kaggle dataset to practice machine learning and deep learning, selecting one method to make predictions and recommendations.

>In this task, you will be asked to predict the chances of a user listening to a song repetitively after the first observable listening event within a time window was triggered. If there are recurring listening event(s) triggered within a month after the user’s very first observable listening event, its target is marked 1, and 0 otherwise in the training set. The same rule applies to the testing set. KKBOX provides a training data set consists of information of the first observable listening event for each unique user-song pair within a specific time duration. Metadata of each unique user and song pair is also provided. The use of public data to increase the level of accuracy of your prediction is encouraged. The train and the test data are selected from users listening history in a given time period. Note that this time period is chosen to be before the WSDM-KKBox Churn Prediction time period. The train and test sets are split based on time, and the split of public/private are based on unique user/song pairs.
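The labeling rule quoted above can be sketched in a few lines. This is a hypothetical illustration only (it assumes "within a month" means 30 days, and `label_target` is a name of my own choosing), not the competition's actual pipeline:

```python
# Hypothetical sketch of the target definition: for one user-song pair,
# target = 1 if any repeat listen occurs within 30 days of the first
# observable listening event, else 0. (30 days is my assumption for
# "a month"; the competition does not publish its exact cutoff logic.)
from datetime import datetime, timedelta

def label_target(listen_times):
    """listen_times: chronologically sorted datetimes of one user-song pair."""
    if len(listen_times) < 2:
        return 0  # no repeat listen at all
    first = listen_times[0]
    window_end = first + timedelta(days=30)
    return int(any(t <= window_end for t in listen_times[1:]))

# a repeat listen 19 days after the first event -> target 1
print(label_target([datetime(2017, 1, 1), datetime(2017, 1, 20)]))  # 1
```

A pair whose only repeat listen falls outside the 30-day window would get target 0.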
> (quoted from the Data Description of KKBOX's Music Recommendation Challenge)

## Research Goals

- Analyze the information
  - Whether the recommended songs are accurate
- Organize and analyze the data
  - Normalize the data, then use LightGBM for prediction

# Implementation

## Dataset Overview

```python=
# pandas and numpy
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')

# other
import string
import math
import missingno as msno

# data viz
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
%matplotlib inline

# sklearn
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_curve, roc_auc_score, accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler

# lightgbm
import lightgbm as lgbm
```

This project mainly uses scikit-learn as the learning toolkit, together with two visualization libraries, matplotlib and seaborn.

```python=
songs_df = pd.read_csv("/Users/yangzhelun/Desktop/kkboxRecommdation/input/songs.csv")
songs_extra_df = pd.read_csv("/Users/yangzhelun/Desktop/kkboxRecommdation/input/song_extra_info.csv")
members_df = pd.read_csv("/Users/yangzhelun/Desktop/kkboxRecommdation/input/members.csv")
train_df = pd.read_csv("/Users/yangzhelun/Desktop/kkboxRecommdation/input/train.csv", nrows=100000)

t_s = pd.merge(train_df, songs_df, on='song_id', how='left')
t_s_se = pd.merge(t_s, songs_extra_df, on='song_id', how='left')
songs = pd.merge(t_s_se, members_df, on='msno', how='left')

del songs_df, songs_extra_df, members_df, train_df, t_s, t_s_se
```

Next we load the data: song metadata (songs_df), extended song info (songs_extra_df), member data (members_df), and the training data (train_df), then merge them into one frame.

```python=
songs.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 0   msno                    100000 non-null  object
 1   song_id                 100000 non-null  object
 2   source_system_tab       99697 non-null   object
 3   source_screen_name      95727 non-null   object
 4   source_type             99805 non-null   object
 5   target                  100000 non-null  int64
 6   song_length             99996 non-null   float64
 7   genre_ids               98498 non-null   object
 8   artist_name             99996 non-null   object
 9   composer                78528 non-null   object
 10  lyricist                59309 non-null   object
 11  language                99996 non-null   float64
 12  name                    99991 non-null   object
 13  isrc                    91475 non-null   object
 14  city                    100000 non-null  int64
 15  bd                      100000 non-null  int64
 16  gender                  61328 non-null   object
 17  registered_via          100000 non-null  int64
 18  registration_init_time  100000 non-null  int64
 19  expiration_date         100000 non-null  int64
dtypes: float64(2), int64(6), object(12)
memory usage: 16.0+ MB
```

We inspect an overview of the data with `dataframe.info()`.

## Data Preprocessing

```python=
msno.matrix(songs)
```

![](https://i.imgur.com/Z3rSysi.png)

From this chart we can clearly see how the null values are distributed.

```python=
# replace null values with "unknown"
for i in songs.select_dtypes(include=['object']).columns:
    songs.loc[songs[i].isnull(), i] = 'unknown'
songs = songs.fillna(value=0)
```

We replace the null values in every object column with the string "unknown" (and the remaining numeric nulls with 0) to simplify later computation.

![](https://i.imgur.com/00ojJKo.png)

Then we check the distribution again.

```python=
# registration_init_time
# split the dates apart so they can be used in later computation
songs.registration_init_time = pd.to_datetime(songs.registration_init_time, format='%Y%m%d', errors='coerce')
songs['registration_init_time_year'] = songs['registration_init_time'].dt.year
songs['registration_init_time_month'] = songs['registration_init_time'].dt.month
songs['registration_init_time_day'] = songs['registration_init_time'].dt.day

# expiration_date
songs.expiration_date = pd.to_datetime(songs.expiration_date, format='%Y%m%d', errors='coerce')
songs['expiration_date_year'] = songs['expiration_date'].dt.year
songs['expiration_date_month'] = songs['expiration_date'].dt.month
songs['expiration_date_day'] = songs['expiration_date'].dt.day
```

![](https://i.imgur.com/1BnQSux.png)

```python=
# most models rely on numeric computation, so we use Label Encoding
# to convert the data to numeric form
label_encoder = LabelEncoder()
for i in songs.columns:
    songs[i] = label_encoder.fit_transform(songs[i])
songs.head()
```
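To see concretely what the encoding step does to a single column, here is a minimal pure-Python sketch that mimics the behavior of sklearn's `LabelEncoder` on toy genre values (illustrative data, not the actual KKBox columns; `label_encode` is a name of my own choosing):

```python
def label_encode(values):
    # map each distinct value to an integer, assigned in sorted order,
    # mimicking sklearn's LabelEncoder (classes_ are sorted unique values)
    classes = sorted(set(values))
    mapping = {v: i for i, v in enumerate(classes)}
    return [mapping[v] for v in values], classes

codes, classes = label_encode(['pop', 'rock', 'pop', 'unknown'])
print(codes)    # [0, 1, 0, 2]
print(classes)  # ['pop', 'rock', 'unknown']
```

Note that these integer codes are arbitrary category IDs, not ordered quantities; tree-based models such as LightGBM tolerate this reasonably well, which is one reason label encoding is a common shortcut here.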
Most models rely on numeric computation, so Label Encoding is used to convert the data into numeric form.

![](https://i.imgur.com/HpxFTvt.png)

```python=
# inspect the correlation coefficients
plt.figure(figsize=[15,10])
sns.heatmap(songs.corr())
plt.show()
```

![](https://i.imgur.com/3KBbHl5.png)

Visualization helps us understand the correlation coefficients between the features.

## Model Training

```python=
# split features and labels
X = songs.drop('target', axis=1)
y = songs.target

X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.25, random_state=0)
```

First we separate the features from the target, then use the train_test_split() function to split off a validation set from the training set.

```python=
train_set = lgbm.Dataset(X_train, Y_train)  # wrap the training data for LightGBM

params = {
    'objective': 'binary',      # objective function
    'boosting': 'gbdt',         # boosting type
    'learning_rate': 0.3,       # learning rate
    'verbose': 0,
    'num_leaves': 108,          # number of leaves per tree
    'bagging_fraction': 0.95,   # sample subsampling ratio for tree building
    'bagging_freq': 1,          # perform bagging every k iterations
    'bagging_seed': 1,
    'feature_fraction': 0.9,    # feature subsampling ratio for tree building
    'feature_fraction_seed': 1,
    'max_bin': 256,
    'max_depth': 10,            # limit tree depth to prevent overfitting
    'num_rounds': 200,
    'metric': 'auc'             # evaluation metric
}
%time model_f1 = lgbm.train(params, train_set=train_set, valid_sets=[train_set], verbose_eval=5)
```

After setting the parameters, we train the LightGBM model.

![](https://i.imgur.com/t3rTJfZ.png)

```python=
pred_test = model_f1.predict(X_val)

print('Saving Predictions')
sub = pd.DataFrame()
sub['id'] = Y_val.index
sub['target'] = pred_test
sub.to_csv('2st_submission.csv', index=False, float_format='%.5f')
```

We feed the validation data into the trained model, make predictions, and save them to a file.

```python=
from sklearn.metrics import mean_squared_error

# compute the root mean squared error between the true values and the predictions
print('The rmse of prediction is:', mean_squared_error(Y_val, pred_test) ** 0.5)
```

> The rmse of prediction is: 0.34108641248598415

Finally, we check the results.

```python=
sub.head()
```

![](https://i.imgur.com/4csSNj7.png)

# Results and Discussion

The most tedious part of the process was still the data preparation: you constantly have to think about how to process the data so that it benefits the later computation and model training. As for the choice of algorithm, I chose LightGBM because it combines the strengths of many algorithms, and compared with XGBoost it is more efficient in how it consumes and releases resources. However, the trained model only reached an accuracy of a little over 60%. Perhaps the earlier data processing was not thorough enough, or perhaps LightGBM was not the most suitable choice; this is a key point to address when revising the code or taking on a new topic later, and something I still need to improve on and overcome.

![](https://i.imgur.com/wRcOrtU.png)
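The "accuracy of a little over 60%" mentioned above comes from thresholding the model's predicted probabilities into 0/1 labels. A minimal sketch of that step, with toy numbers rather than the actual predictions (`accuracy_from_probs` and the 0.5 threshold are my own illustrative choices):

```python
def accuracy_from_probs(y_true, probs, threshold=0.5):
    # convert predicted probabilities to hard 0/1 labels, then
    # count the fraction that match the true labels
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(t == p for t, p in zip(y_true, preds))
    return correct / len(y_true)

acc = accuracy_from_probs([1, 0, 1, 0], [0.8, 0.4, 0.3, 0.6])
print(acc)  # 2 of 4 correct -> 0.5
```

Since LightGBM's `predict()` returns probabilities for a binary objective, the same thresholding would apply to `pred_test` before computing accuracy on `Y_val`.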