Statistics Final Project

---
tags: homework
---

# Statistics Final Project

> ID: 111550013
> Name: 施羿廷
> video: [link](https://youtu.be/UbpZvTQ9eMo)
> github repository: [link](https://github.com/konchinshih/vtuber-statistics)
> for better reading experience: [hackmd link](https://hackmd.io/@konchin/statistics-final-project)

## Motivation

**Identifying Factors Influencing Online Success**

The online content landscape is highly competitive, and success as a content creator on platforms like YouTube depends on a combination of factors. By examining the relationship between views, number of videos, number of subscribers, and involvement with a management company, this research aims to uncover any correlations or patterns that exist. This can help aspiring YouTubers and content creators better understand the factors that contribute to online success and guide their strategies accordingly.

## Questions

1. What's the relationship between the number of subscribers of Vtubers under a management company and that of individual Vtubers?
2. What's the relationship between the view rate of Vtubers under a management company and that of individual Vtubers?
3. What's the relationship between the view rate of Vtubers that have more subscribers and that of Vtubers that have fewer subscribers?

## Data collection process

1. Crawl the following website to collect the name, subscriber count, and view count of the top 2000 Vtubers: https://virtual-youtuber.userlocal.jp/document/ranking
    ![](https://hackmd.io/_uploads/S1PEPBvwn.png)
    > The crawling process.
2. Find the number of videos for each Vtuber by crawling YouTube.
    ![](https://hackmd.io/_uploads/ryotPBwD3.png)
    > Another crawling process.
3. Remove the bad data and convert the data into a .csv file.
    ![](https://hackmd.io/_uploads/SJyTvBPwh.png)
    > The original data after crawling the websites, stored in a .json file.

    ![](https://hackmd.io/_uploads/HJP1YBPvn.png =512x)
    > The data after removing failed records and converting into a .csv file.
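The cleaning step (step 3) could look roughly like the sketch below. It is not the code from the repository: the field names `name`, `fan`, `view`, and `video` and the rule "a record failed if a numeric field is missing or not positive" are assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical field names; the real crawler output may use different keys.
FIELDS = ["name", "fan", "view", "video"]


def is_valid(record):
    """Keep a record only if it has a name and positive integer counts."""
    return all(
        isinstance(record.get(key), int) and record[key] > 0
        for key in ("fan", "view", "video")
    ) and bool(record.get("name"))


def json_to_csv(json_text):
    """Drop failed records and return the remaining rows as CSV text."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        if is_valid(record):
            writer.writerow(record)
    return buffer.getvalue()


# Made-up records, only to show the filtering.
raw = json.dumps([
    {"name": "A", "fan": 1000, "view": 50000, "video": 25},
    {"name": "B", "fan": 2000, "view": None, "video": 10},  # failed crawl
    {"name": "C", "fan": 300, "view": 9000, "video": 0},    # no videos found
])
print(json_to_csv(raw))
```

Dropping records with zero videos also avoids a division by zero when the view rate is computed later.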
The implementation details can be found in the GitHub repository.

## Descriptive data analysis

### Question 1

![](https://hackmd.io/_uploads/SJI6LcPv3.png)
> The number of subscribers of individual Vtubers.

![](https://hackmd.io/_uploads/rkMT85vv3.png)
> The number of subscribers of Vtubers under a management company.

![](https://hackmd.io/_uploads/HJca89Pvn.png)
> The boxplot of the number of subscribers. The upper one is individual, the bottom one is under a management company.

According to the graphs, both distributions are highly skewed, and the Vtubers under a management company seem to have a higher median number of subscribers than the individuals.

### Question 2

Define the view rate as

$$\text{view rate}=\frac{\text{total views}}{\text{number of videos}\times\text{number of subscribers}}$$

For this question, I remove the top and bottom 5% so that extreme values do not affect the result too much.

![](https://hackmd.io/_uploads/Hyc2I9vvh.png)
> The view rate of individual Vtubers.

![](https://hackmd.io/_uploads/Skwh8qvwh.png)
> The view rate of Vtubers under a management company.

![](https://hackmd.io/_uploads/HkmnIcwDh.png)
> The boxplot of view rate. The upper one is individual, the bottom one is under a management company.

According to the graphs, both distributions are highly skewed, and the view rate of individual Vtubers seems to be higher.

### Question 3

For this question, I also remove the top and bottom 5%.

![](https://hackmd.io/_uploads/r10sI5wwn.png)
> The view rate of Vtubers whose number of subscribers is in the higher half of the data.

![](https://hackmd.io/_uploads/SJsjLqPvn.png)
> The view rate of Vtubers whose number of subscribers is in the lower half of the data.

![](https://hackmd.io/_uploads/rkNsIqwP3.png)
> The boxplot of view rate. The upper one is the higher half, the bottom one is the lower half.

According to the graphs, both distributions are skewed, and the view rate of the lower half seems to be higher.
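The view rate definition and the 5% trimming used in Questions 2 and 3 can be sketched in plain Python as follows. The helper names `view_rate` and `trim_middle` and the sample numbers are made up for illustration. The slicing mirrors the integer cut points `n*5//100` and `n*95//100` used in the test code of this report.

```python
def view_rate(total_views, n_videos, n_subscribers):
    """View rate as defined above: total views / (videos * subscribers)."""
    return total_views / (n_videos * n_subscribers)


def trim_middle(values, lower_pct=5, upper_pct=95):
    """Sort the values and keep those between the two percentile cut points."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n * lower_pct // 100 : n * upper_pct // 100]


# Hypothetical (views, videos, subscribers) records, for illustration only.
records = [(50_000, 25, 1_000), (9_000, 30, 300), (120_000, 40, 6_000)]
rates = [view_rate(*r) for r in records]
print(rates)  # [2.0, 1.0, 0.5]

# With 100 values, the 5 smallest and the 5 largest are dropped.
sample = list(range(1, 101))
kept = trim_middle(sample)
print(len(kept), kept[0], kept[-1])  # 90 6 95
```

Because the cut points use integer division, the number of dropped values on each side can differ by one when the sample size is not a multiple of 20.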
## Statistics test

### Question 1

$\mu_1:$ The mean number of subscribers of individual Vtubers.
$\mu_2:$ The mean number of subscribers of Vtubers under a management company.

**test 1-1**
$H_0:\mu_1=\mu_2$
$H_1:\mu_1<\mu_2$

**test 1-2**
$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$

Because the sample size is large enough ($\approx 1000$), the CLT applies, so I use a two-sample z-test.

```python=
# Assumed import: the `alternative='smaller'/'larger'` arguments match statsmodels' ztest.
from statsmodels.stats.weightstats import ztest

# `data` is the cleaned DataFrame loaded from the .csv file.
# Split into individual Vtubers and Vtubers under a management company.
ind = data.loc[data['isOffice'] == 0]
office = data.loc[data['isOffice'] == 1]

# test 1-1: H1 is mu_1 < mu_2
test11 = ztest(
    ind['fan'], office['fan'],
    alternative='smaller'
)
# test 1-2: H1 is mu_1 > mu_2
test12 = ztest(
    ind['fan'], office['fan'],
    alternative='larger'
)

print(test11)
print(test12)
```

Result:

```
(-12.38087818420684, 1.6584668527953008e-35)
(-12.38087818420684, 1.0)
```

Therefore, there is enough evidence to reject the null hypothesis in test 1-1: the mean number of subscribers of Vtubers under a management company is likely higher than that of individual Vtubers.

### Question 2

$\mu_1:$ The mean view rate of individual Vtubers.
$\mu_2:$ The mean view rate of Vtubers under a management company.

**test 2-1**
$H_0:\mu_1=\mu_2$
$H_1:\mu_1<\mu_2$

**test 2-2**
$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$

Because the sample size is large enough ($\approx 1000$), the CLT applies, so I use a two-sample z-test.

```python=
# Reuses `ind` and `office` from Question 1.
# Sort by view rate and keep the middle 90% (drop the top and bottom 5%).
ind = ind.sort_values(
    by=['viewRate']
).iloc[
    len(ind.index)*5//100:len(ind.index)*95//100
]
office = office.sort_values(
    by=['viewRate']
).iloc[
    len(office.index)*5//100:len(office.index)*95//100
]

# test 2-1: H1 is mu_1 < mu_2
test21 = ztest(
    ind['viewRate'], office['viewRate'],
    alternative='smaller'
)
# test 2-2: H1 is mu_1 > mu_2
test22 = ztest(
    ind['viewRate'], office['viewRate'],
    alternative='larger'
)

print(test21)
print(test22)
```

Result:

```
(9.462747420407817, 1.0)
(9.462747420407817, 1.499781456537576e-21)
```

Therefore, there is enough evidence to reject the null hypothesis in test 2-2: the mean view rate of individual Vtubers is likely higher than that of Vtubers under a management company.
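In each pair of results above, the two tests share one z statistic and their p-values add up to 1, because the "smaller" and "larger" alternatives use opposite tails of the same normal distribution. The sketch below shows this with a pooled-variance two-sample z statistic written with the standard library only. It is an illustration under stated assumptions: the helper names and samples are made up, and it has not been checked that this reproduces the library's numbers exactly.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(x1, x2):
    """Pooled-variance z statistic for the difference of two sample means."""
    n1, n2 = len(x1), len(x2)
    pooled_var = (
        (n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)
    ) / (n1 + n2 - 2)
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x1) - mean(x2)) / std_err


def one_sided_p_values(z):
    """Return (p-value for H1: mu_1 < mu_2, p-value for H1: mu_1 > mu_2)."""
    p_smaller = NormalDist().cdf(z)  # lower tail of the standard normal
    return p_smaller, 1 - p_smaller  # upper tail is the complement


# Made-up samples, only to show the mechanics.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [3.0, 4.0, 5.0, 6.0, 7.0]

z = two_sample_z(group_a, group_b)
p_smaller, p_larger = one_sided_p_values(z)
print(z, p_smaller, p_larger)
```

A strongly negative z makes the "smaller" p-value tiny and the "larger" p-value close to 1, which is the pattern seen in test 1-1 and test 1-2. A strongly positive z gives the reverse, as in test 2-1 and test 2-2.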
### Question 3

$\mu_1:$ The mean view rate of Vtubers whose number of subscribers is in the higher half of the data.
$\mu_2:$ The mean view rate of Vtubers whose number of subscribers is in the lower half of the data.

**test 3-1**
$H_0:\mu_1=\mu_2$
$H_1:\mu_1<\mu_2$

**test 3-2**
$H_0:\mu_1=\mu_2$
$H_1:\mu_1>\mu_2$

Because the sample size is large enough ($\approx 1000$), the CLT applies, so I use a two-sample z-test.

```python=
# Assumes the rows of `data` are in ranking order (most subscribers first)
# with a default integer index, so the first half is the higher half.
firstHalf = data.iloc[lambda x: x.index < len(data.index)//2]
# Sort by view rate and keep the middle 90%.
firstHalf = firstHalf.sort_values(
    by=['viewRate']
).iloc[
    len(firstHalf.index)*5//100:len(firstHalf.index)*95//100
]

secondHalf = data.iloc[lambda x: x.index >= len(data.index)//2]
secondHalf = secondHalf.sort_values(
    by=['viewRate']
).iloc[
    len(secondHalf.index)*5//100:len(secondHalf.index)*95//100
]

# test 3-1: H1 is mu_1 < mu_2
test31 = ztest(
    firstHalf['viewRate'], secondHalf['viewRate'],
    alternative='smaller'
)
# test 3-2: H1 is mu_1 > mu_2
test32 = ztest(
    firstHalf['viewRate'], secondHalf['viewRate'],
    alternative='larger'
)

print(test31)
print(test32)
```

Result:

```
(-3.7648012782218454, 8.33407943353823e-05)
(-3.7648012782218454, 0.9999166592056646)
```

Therefore, there is enough evidence to reject the null hypothesis in test 3-1: the mean view rate of Vtubers whose number of subscribers is in the lower half of the data is likely higher than that of Vtubers in the higher half.

## Discussion

**The data might not be random enough**

I couldn't find a website that lists all Vtubers. Even if I had found one, its data might no longer be maintained. So I decided to use the top 2000 Vtuber list. Although it doesn't include the Vtubers ranked below 2000, the top 2000 can represent the Vtubers that people actually watch.

**The data collection process might be banned by the website & the data might not be accurate**

On the first try, I forgot to set a user agent in my crawling program and was banned right away.
After some searching, I found a list of fake user agents on GitHub, and the crawling of the Vtuber ranking list went smoothly. However, YouTube blocked every user agent that did not correspond to the latest browser version. Even worse, the structure of its pages changes according to the browser you use and whether or not you are on a mobile platform. So I had no choice but to use my own user agent to keep the format of the YouTube pages from changing. To avoid being banned, I had to use a rather long interval between requests. I managed to get all the data I wanted, but it took a whole day. In the meantime, the number of views, videos, or subscribers of the Vtubers may have changed, so the result might not be that accurate.

## Conclusion

1. The mean number of subscribers of Vtubers under a management company is likely higher than that of individual Vtubers.
2. The mean view rate of individual Vtubers is likely higher than that of Vtubers under a management company.
3. The mean view rate of Vtubers whose number of subscribers is in the lower half of the data is likely higher than that of Vtubers in the higher half.