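# Predicting Game Review Scores from Basic Game Info and Sales Figures

## Foreword

I've been gaming for about 10 years. That's not a very long time, but over those ten years I've played all sorts of games. Some were stunning, some were disappointing, and some left me not knowing whether to laugh or cry... Everyone has their own yardstick for judging, from every angle, whether a game is a god-tier game or a trash game. I want to use a publicly available dataset to find the greatest common factor of those yardsticks and see what kind of game ends up being a good one.

## About the dataset

The dataset used this time is:

[Video Game Sales with Ratings | Kaggle](https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings)

Its sales figures come from a scrape of VGChartz, and its scores were scraped from the review aggregator Metacritic. It contains:

* Name - name of the game
* Platform - gaming platform
* Year_of_Release - year of release
* Genre - game genre
* Publisher - publisher
* NA_Sales - sales in North America
* EU_Sales - sales in Europe
* JP_Sales - sales in Japan
* Other_Sales - sales in other regions
* Global_Sales - worldwide sales
* Critic_Score - aggregate score compiled by Metacritic staff
* Critic_Count - number of critics behind Critic_Score
* User_Score - score given by Metacritic's subscribers
* User_Count - number of users who gave the User_Score
* Developer - the developer or team behind the game
* Rating - [ESRB](https://www.esrb.org/) rating

*P.S. Sales are in millions of units and scores are put on a 0 to 100 scale (the raw User_Score is on a 0 to 10 scale and is multiplied by 10 below). The "basic game info" in the title means fields such as release year, publisher, and developer.*

Note that the dataset description says the following:

> Context
>
> Motivated by Gregory Smith's web scrape of VGChartz [Video Games Sales](https://www.kaggle.com/datasets/gregorut/videogamesales), this data set simply extends the number of variables with another web scrape from [Metacritic](https://www.metacritic.com/browse/games/release-date/available). Unfortunately, there are missing observations as Metacritic only covers a subset of the platforms. Also, a game may not have all the observations of the additional variables discussed below.
>
> Complete cases are ~ 6,900

In short, this dataset may not cover every gaming platform, so the results may be incomplete. Also, not every game has all of the fields above, so the missing values need to be handled.

## Approach

Here I define a good game as a game with a high score. If I want to use this data to find good games, the most intuitive approach is probably kNN: the score is the target and the other columns are the data. I grade games according to the score bands that Metacritic publishes:

| Indication | Video games | Film, TV and music |
| -------- | -------- | -------- |
| Universal acclaim | 90-100 | 81-100 |
| Generally favorable reviews | 75-89 | 61-80 |
| Mixed or average reviews | 50-74 | 40-60 |
| Generally unfavorable reviews | 20-49 | 20-39 |
| Overwhelming dislike | 0-19 | 0-19 |

## Analyzing the data

Produce a heatmap and scatter plots

```
def heatmap_plot(data):
    # Correlation heatmap over the numeric columns only
    # (numeric_only needs pandas >= 1.5; older versions skip text columns on their own)
    plt.figure(figsize=(15, 5))
    plt.title('Heatmap Correlation and Impact Distribution of Critics and Users')
    sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='mako', fmt='.3f')
    # Scatter plots of each score against global sales
    sns.jointplot(x='Critic_Score', y='Global_Sales', data=data, kind='scatter')
    sns.jointplot(x='User_Score', y='Global_Sales', data=data, kind='scatter')
    plt.show()

# game_df is the cleaned DataFrame from the preprocessing section below
heatmap_plot(game_df)
```

Uh-oh, nothing seems to be related to anything! So old games aren't more fun after all?

![](https://i.imgur.com/tqOC2PR.png)

The critic score looks slightly related to sales, but only weakly (0.272)

![](https://i.imgur.com/cytiDfU.png)

The user score's relationship with sales is even weaker (0.098)

![](https://i.imgur.com/zDx1nk3.png)

So whether a game is good doesn't have much to do with its sales either?

### Summary

Don't gamers often say that such-and-such company is awful and only makes trash games, while such-and-such company is godlike?

**So first I'll test whether publisher and developer alone can predict whether a game is good.**

**Then I'll throw all of the data into kNN and see whether it can infer whether a game is good.**

## Preprocessing

### Steps

1. Load the dataset
2. Remove the unusable data
3. Look at the heatmap to observe how the columns relate
4. One-hot encode the text labels so kNN can consume them
5. Convert the scores into score indications
6. Extract the answers (targets) and the questions (features) for kNN

### Process

Load the packages

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
```

Load the dataset

```
game_df = pd.read_csv('Video_Games_Sales_as_at_22_Dec_2016.csv')
```

![](https://i.imgur.com/LEGjVn8.png)

Check the non-null count and dtype of each column

```
game_df.info()
```

![](https://i.imgur.com/DF87Saf.png)

Some User_Score entries are `tbd` (to be determined), so the column's dtype is object. Rows whose score is missing or `tbd` therefore have to go. While I'm at it, I also remove the extreme outlier and put the scores on a common 0-100 scale.

```
# Remove the extreme outlier (global sales above 70 million)
game_df.drop(game_df.loc[game_df['Global_Sales'] > 70].index, inplace=True)

# Drop every row that contains NaN (Name is set aside first)
game_name = game_df['Name']
game_df = game_df.drop(columns=['Name'])
game_df.dropna(inplace=True)

# Drop rows whose User_Score is 'tbd', then rescale User_Score from 0-10 to 0-100
game_df.drop(game_df[game_df.User_Score == 'tbd'].index, inplace=True)
game_df['User_Score'] = game_df['User_Score'].astype(float) * 10
```

![](https://i.imgur.com/kmxopkD.png)

Convert the scores into score indications

```
def score_to_indication(df, col):
    # Replace the 0-100 scores in df[col] with band labels, in place:
    # 1 = overwhelming dislike ... 5 = universal acclaim
    df[col] = np.where(df[col] < 20, 1, df[col])
    df[col] = np.where(df[col].between(20, 49), 2, df[col])
    df[col] = np.where(df[col].between(50, 74), 3, df[col])
    df[col] = np.where(df[col].between(75, 89), 4, df[col])
    df[col] = np.where(df[col] > 89, 5, df[col])

score_to_indication(game_df, 'Critic_Score')
score_to_indication(game_df, 'User_Score')
```

Pick out Publisher, Developer, Critic_Score, User_Score

```
gameP_df = game_df[['Publisher', 'Developer', 'Critic_Score', 'User_Score']]
```

![](https://i.imgur.com/0eCrofH.png)

One-hot encode the text labels

```
game_df = pd.get_dummies(game_df)
gameP_df = pd.get_dummies(gameP_df)
```

All columns

![](https://i.imgur.com/E97GfGR.png)

Publisher and Developer

![](https://i.imgur.com/781SRLF.png)

Extract Critic_Score (the answer)

```
game_target_critic = game_df['Critic_Score']
game_target_critic = game_target_critic.to_numpy()
```

![](https://i.imgur.com/kqEyDmd.png)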
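The score-to-band conversion earlier in this section chains five `np.where` calls with integer bounds. As an alternative, here is a minimal sketch using `pd.cut` with half-open bins, so a score that ever lands between two integer bounds (49.5, say) still gets a band. The helper name `to_band`, the constants, and the sample scores are all made up for illustration; this is not the notebook's code.

```python
import pandas as pd

# Lower edge of each band on the 0-100 scale; 101 closes the last bin so 100 is included.
BAND_EDGES = [0, 20, 50, 75, 90, 101]
BAND_LABELS = [1, 2, 3, 4, 5]  # 1 = overwhelming dislike ... 5 = universal acclaim


def to_band(scores: pd.Series) -> pd.Series:
    """Map 0-100 scores to band labels 1-5 using half-open bins [low, high)."""
    bands = pd.cut(scores, bins=BAND_EDGES, labels=BAND_LABELS, right=False)
    return bands.astype(int)


# Made-up sample scores that hit every band boundary.
sample = pd.Series([0, 19, 20, 49, 49.5, 50, 74, 75, 89, 90, 100])
print(to_band(sample).tolist())
```

Unlike the in-place version, this returns a new Series, so the original scores stay available for plotting.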
Extract User_Score (the answer)

```
game_target_user = game_df['User_Score']
game_target_user = game_target_user.to_numpy()
```

![](https://i.imgur.com/qaGpOFO.png)

Prepare everything other than the answers

```
# Features from all columns
game_data = game_df.drop(columns=['Critic_Score', 'User_Score'])
game_data = game_data.to_numpy()

# Features from Publisher and Developer only
gameP_data = gameP_df.drop(columns=['Critic_Score', 'User_Score'])
gameP_data = gameP_data.to_numpy()
```

![](https://i.imgur.com/AsSQ9EF.png)

## Training kNN

### Steps

1. Split the data
2. Test kNN with Manhattan distance
3. Test kNN with Euclidean distance
4. Test a decision tree classifier

### Process

Split the data into a training set and a test set

```
x_train, x_test, y_train, y_test = train_test_split(game_data, game_target_critic, test_size=0.2)
```

kNN with Manhattan distance

```
knn = KNeighborsClassifier(p=1)
knn.fit(x_train, y_train)
print(knn.predict(x_test))
print(y_test)
```

kNN with Euclidean distance

```
knn = KNeighborsClassifier(p=2)
knn.fit(x_train, y_train)
print(knn.predict(x_test))
print(y_test)
```

Decision tree classifier

```
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
print(tree.predict(x_test))
print(y_test)
```

I'll wrap all of this in a function so it's easy to reuse

```
def do_knn(data, target):
    x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

    models = [
        KNeighborsClassifier(p=1),   # kNN with Manhattan distance
        KNeighborsClassifier(p=2),   # kNN with Euclidean distance
        DecisionTreeClassifier(),    # decision tree classifier
    ]
    # Fit each model on the same split and report its accuracy
    for model in models:
        model.fit(x_train, y_train)
        print(model.predict(x_test))
        print(y_test)
        print('Training set: ', model.score(x_train, y_train))
        print('Test set: ', model.score(x_test, y_test))
```

### Using all of the data

#### Predicting Critic_Score

![](https://i.imgur.com/k949cax.png)

#### Predicting User_Score

![](https://i.imgur.com/Ap21BDU.png)

### Using only the publisher and developer data

#### Predicting Critic_Score

![](https://i.imgur.com/kimZIcm.png)

#### Predicting User_Score

![](https://i.imgur.com/sHRjphI.png)

### Summary

Using only the publisher and developer data really does turn out to be more accurate! That's wild! So maybe good companies really are more likely to put out good games? Still, the accuracy lands at roughly 50% to 60%, which isn't very high.
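To judge whether 50% to 60% accuracy means anything, it may help to compare it against a model that always guesses the most common band. Below is a minimal sketch using scikit-learn's `DummyClassifier`. The function name `majority_baseline_accuracy`, the fixed seed, the stratified split, and the toy labels are assumptions made for illustration; none of them come from the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split


def majority_baseline_accuracy(data, target, seed=0):
    """Test accuracy of a model that always predicts the most frequent training label."""
    x_train, x_test, y_train, y_test = train_test_split(
        data, target, test_size=0.2, random_state=seed, stratify=target
    )
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(x_train, y_train)
    return baseline.score(x_test, y_test)


# Made-up labels: 60% of the games sit in band 3, the rest in bands 4 and 2.
toy_target = np.array([3] * 60 + [4] * 25 + [2] * 15)
toy_data = np.zeros((len(toy_target), 1))  # the baseline ignores the features
print(majority_baseline_accuracy(toy_data, toy_target))
```

If the real band labels are as lopsided as these toy ones, a classifier has to beat this baseline before its accuracy says anything about publishers or developers. Calling it with `game_data` and `game_target_critic` would give the number to compare against.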
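~~I don't know whether I've misunderstood something or whether something really is off: shouldn't the decision tree's score on the training set be 1? Yet with only the publisher and developer data it isn't! That's strange. I'll debug it when I have time, because so far I haven't found the cause.~~

## Conclusion

Working through this dataset of game sales and scores shows that basic game info and sales figures alone make it hard to tell whether a game is good or bad, and the correlation with release year is lower still. So we can say that the claim internet trolls love to repeat, "old games were more fun", is not very credible. It may come from survivorship bias: the old games we still remember today are the classics among classics, but there was no shortage of trash games back then either. Just watch [敖廠長](https://www.youtube.com/channel/UCCkMW93Am1pLfk2nZFKAmbQ)'s videos and you'll see.

Surprisingly, another thing netizens like to say, "if so-and-so made it, it must be a gem", turns out to have perhaps a little credibility in these results, but only a little. After all, the trained accuracy isn't high enough, and "no one is a hero forever, and no one is a loser forever"! EA has released both [Battlefield](https://www.metacritic.com/game/pc/battlefield-2) and [Anthem](https://www.metacritic.com/game/pc/anthem); CD Projekt Red has released both [The Witcher](https://www.metacritic.com/game/pc/the-witcher-3-wild-hunt) and [CyberPunk2077](https://www.metacritic.com/game/pc/cyberpunk-2077). The same goes for individual studios: DICE could make [Battlefield 3](https://www.metacritic.com/game/pc/battlefield-3) and also [Battlefield 2042](https://www.metacritic.com/game/pc/battlefield-2042); Ubisoft Montreal could make [Assassin's Creed III](https://www.metacritic.com/game/pc/assassins-creed-iii) and also [Assassin's Creed Unity](https://www.metacritic.com/game/pc/assassins-creed-unity).

All I can say is that judging whether a game is good still takes more subjective indicators. No wonder YouTube keeps asking me how the video I just watched made me feel; data like that would probably make the judgment easier.

## Reflections

It has been a long time since I last worked with a dataset; the last time was in my freshman year. Many of the problems I ran into back then came up again this time: importing the data, turning labels into numbers, evaluating the training results, and so on. But now I can solve them with ease instead of being at a loss like I was as a freshman. I still had to look some methods up online, but I feel I've improved a great deal since last time. I understand the code more deeply, and setting up the environment is no longer a problem. For the final project I'll probably look for a different dataset, because the performance this one gave for my research question is honestly painful to look at. I hope I can come up with a decent topic by then.

Finally, here is [the Jupyter notebook for this post](https://o365nutc-my.sharepoint.com/:f:/g/personal/s1410832027_ad1_nutc_edu_tw/EvVAre2byC1DmDbyTLfzLD0BoMeEXMbpvghwGytjcza-QA?e=gdZqjp)

## References

* [Video Game Sales with Ratings | Kaggle](https://www.kaggle.com/datasets/rush4ratio/video-game-sales-with-ratings)
* [Video games sales - simple linear regression](https://www.kaggle.com/code/allyjung81/video-games-sales-simple-linear-regression)
* [Video games sales with score - EDA and Stat test](https://www.kaggle.com/code/artemsolomko/video-games-sales-with-score-eda-and-stat-test)
* [Metacritic](https://www.metacritic.com/browse/games/release-date/available)
* [Metacritic | Wikipedia (Chinese)](https://zh.wikipedia.org/wiki/Metacritic)
* [ESRB](https://www.esrb.org/)
* [ESRB | Wikipedia (Chinese)](https://zh.wikipedia.org/wiki/%E5%A8%9B%E6%A8%82%E8%BB%9F%E4%BB%B6%E5%88%86%E7%B4%9A%E5%A7%94%E5%93%A1%E6%9C%83)
* [jiaweichang | GitHub](https://github.com/jiaweichang/biography/blob/master/slides/11002/Data%20Mining/%5BL3%5D%20Classification%20Exercise.pdf)
* [API reference | pandas](https://pandas.pydata.org/docs/reference/index.html)
* [Beginner's Python notes #3: data preprocessing (label encoding, one-hot encoding) (in Chinese)](https://medium.com/@PatHuang/%E5%88%9D%E5%AD%B8python%E6%89%8B%E8%A8%98-3-%E8%B3%87%E6%96%99%E5%89%8D%E8%99%95%E7%90%86-label-encoding-one-hot-encoding-85c983d63f87)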