Course: DS-GA 1001 Introduction to Data Science
Authors: Junze Li (jl11390)
Di He (dh3171)
Xi Yang (xy2122)
Lining Zhang (lz2332)
Team Name: Michelin Cook
Project Duration: October 2020 - November 2020
User-generated content is becoming increasingly popular, especially among the younger generation. YouTube, the largest American video-sharing service, has over 2 billion monthly active users, and over 60% of businesses use it as a channel for marketing campaigns.[1] Similarly, as internet users in China have become more fluent in content production, vlogs are also on the rise there.[2]
Bilibili, a Tencent- and Alibaba-backed video platform, is now one of the most significant players among creator-driven content platforms. Founded in 2010, Bilibili began as an animation-oriented digital video sharing platform and is considered by some industry insiders to be the closest entity to the "YouTube of China". After ten years of expansion, Bilibili has gathered more than 1 million content creators and 200 million monthly active users.[3]
With the tremendous traffic Bilibili attracts, many businesses, especially small and medium vendors, have started commercial cooperation with famous vloggers, promoting products in the vloggers' videos.[4] Bilibili's operations team is also trying to identify promising vloggers and help them promote their videos, so as to build a more robust ecosystem and stronger user stickiness.
However, businesses only come after influential vloggers. Our team identifies the number of views as the most essential key performance indicator for vloggers. We believe independent vloggers and MCNs can boost their popularity if we can predict and explain the number of views of their upcoming videos. Hence, our team defines the problem as predicting the number of views for videos on Bilibili.
Informative and precise marketing is dominating the way businesses conduct marketing campaigns. Currently, there are at least four well-developed data analysis platforms specifically targeting Bilibili.[5] Vloggers, MCNs, advertisers and brands are their main customers, with monthly fees ranging from $50 to $1000. All these platforms offer sophisticated monitoring, searching and data dashboards in almost all dimensions. However, none of them provides any kind of prediction, neither on trends nor on potentially popular videos.
With an effective prediction model for the number of views, these platforms can offer additional premium functions that help advertisers evaluate the volume of exposure and inform vloggers which attributes to improve to boost the popularity of their videos. Not only can these platforms attract more paid users, they can also up-sell their existing customers to boost revenue. Of the 1 million vloggers on Bilibili, we assume the top 5%, i.e. 50 thousand, are the target customers who intend to improve their video performance. Along with over 2 thousand active MCNs and hundreds of brands, the estimated monthly revenue boost should exceed 600 thousand dollars for the whole market.
There are various aspects that can affect the number of views, such as the title, the number of followers the vlogger has, and even bias in the recommendation system. In general, we group influential factors into three categories: video attributes, vlogger attributes and other confounders. Although we are unable to capture all the confounders, we believe building a regression model over various video and vlogger attributes, with the number of views as the target variable, is a feasible and effective solution to the problem.
Given that a video can have quite a lot of features, in different forms such as text and numbers, machine learning is a great fit both for prediction and for interpretation. Our team hopes to yield predictive results through data collection, feature engineering, model selection and evaluation. Then we can use feature importances and visualization tools such as SHAP to interpret fluctuations in the results.
With respect to feasibility, our team limits the coverage of the project to predicting the number of views for food-related videos only. We also exclude the cover image and the video itself from our feature list, because image and video data are enormous in size and difficult to process. As we are targeting vloggers with large traffic, our team also decides to focus on those with considerable popularity.
Since our project idea is quite novel and we limit the scope to videos in the food category, there is no dataset or data source available online. Our team set up our own crawler using the Requests and BeautifulSoup packages to collect video and vlogger information through the ranking page. From Oct. 9 to Oct. 17, the crawler continuously collected links on the ranking list of the food category, which lead to the vloggers' personal pages.
After collecting 292 vloggers' personal page links, we then obtained the corresponding unique IDs, vlogger data, video links and video data by calling Bilibili APIs.
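As an illustration of the collection step, here is a minimal sketch of extracting vlogger IDs from a crawled ranking page. The real crawler fetched pages with Requests and parsed them with BeautifulSoup; a regex is used here only to keep the sketch dependency-free, and the `space.bilibili.com/<id>` link pattern is an assumption about the page markup.

```python
import re

def extract_vlogger_ids(html):
    """Pull unique vlogger IDs (the numeric part of personal-page links
    like //space.bilibili.com/123456) out of a ranking page's HTML.
    The link pattern is assumed, not taken from the project code."""
    ids = re.findall(r"space\.bilibili\.com/(\d+)", html)
    # Drop duplicates while preserving first-seen order.
    return list(dict.fromkeys(ids))
```

Deduplication matters here because the same vlogger can appear on the ranking list on multiple days of the crawl window.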
After crawling, we end up with 23450 videos and 292 vloggers. There are 61 columns in total, including 58 features, 2 unique IDs and the target variable. Detailed features and explanations are listed in 7.2.1 Raw data description. By limiting to authors who have appeared on the ranking list at least once, this approach guarantees the authors are influential enough.
Based on the distribution of the target variable "play", it can easily be seen that the target variable is highly skewed, with most of the data clustered at the low end of click volume. We also fit the data to a normal distribution, shown as the black line in the visualization below. Thus, a log transformation is performed on the target variable before modeling.
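The transformation can be sketched as follows; the play counts are illustrative, and `np.log1p` is chosen here so the transform stays defined even at zero plays.

```python
import numpy as np

# "play" is heavily right-skewed, so log-transform it before modeling.
# log1p(x) = log(1 + x) remains finite at x = 0.
plays = np.array([120.0, 3_500.0, 48_000.0, 1_040_090.0])
log_plays = np.log1p(plays)
```

After the transform, the spread of several orders of magnitude in raw plays collapses into a much narrower, more symmetric range.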
Numeric Feature Analysis
For numeric features, the correlations between each feature and the target variable "play" are explored through visualization. As shown in the picture below, most numeric features are positively correlated with the target variable (e.g. the number of Danmakus "video_review" is highly correlated with it). Also, some features are highly correlated with each other, such as "follower" and "likes" (0.86), so they should not both be included in a linear regression model due to multicollinearity.
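A minimal sketch of this correlation check on toy data; the 0.85 flagging threshold and the numbers are illustrative, not the project's.

```python
import pandas as pd

# Toy frame standing in for the numeric columns of the real dataset.
df = pd.DataFrame({
    "play":         [100.0, 400.0, 900.0, 1600.0, 2500.0],
    "video_review": [10.0, 41.0, 88.0, 161.0, 252.0],
    "follower":     [5.0, 6.0, 7.0, 8.0, 9.0],
})
corr = df.corr()
# Flag feature pairs correlated strongly enough to risk multicollinearity.
high_pairs = [
    (a, b)
    for a in corr.columns for b in corr.columns
    if a < b and abs(corr.loc[a, b]) > 0.85
]
```

On the real data, pairs such as "follower"/"likes" would surface in `high_pairs`, motivating dropping one of each pair before fitting a linear model.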
Categorical Feature Analysis
Video types are transformed into dummy variables. As shown in the picture below, we only include the most frequently appearing video types, which are related to food or life.
According to the visualization of the time distribution shown below, it is clear that more videos were created in later time periods. Thus, the dataset is split based on time, with relatively smaller windows for the validation and test sets.
The original dataset contains information on 23450 videos, with some missing values and outliers due to errors in data collection. Data cleaning involves several steps. The data types of some attributes are converted, including float to integer and timestamp to datetime. Videos with missing values are removed because of their rare occurrence. Outliers of the attribute "length" are deleted to limit their influence on the models; only videos within the 90th percentile of length are kept.
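The cleaning steps above can be sketched as follows; column names follow the raw data description, while the exact order of operations is an assumption.

```python
import pandas as pd

def clean_videos(df):
    """Cleaning sketch: drop rows with missing values, convert types,
    and keep only videos within the 90th percentile of "length"."""
    df = df.copy()
    df = df.dropna()                               # missing values are rare; drop them
    df["play"] = df["play"].astype(int)            # float -> integer
    df["created"] = pd.to_datetime(df["created"])  # timestamp -> datetime
    cutoff = df["length"].quantile(0.90)           # 90th percentile of length
    return df[df["length"] <= cutoff]
```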
It is assumed that the number of views of a video stabilizes one month after it is posted. Although the original dataset contains videos posted up to Oct. 1, 2020, videos posted between Sept. 1, 2020 and Oct. 1, 2020 still have an increasing number of views, so only videos posted before Sept. 1, 2020 are kept. Moreover, videos posted before Sept. 1, 2019 are also dropped, since there are not many of them and they have less connection with future videos, which may lead to poor estimates for those videos. Since we focus on food-related videos, only videos in the food and life categories are kept.
In addition, features irrelevant to our research are deleted, such as "bvid" and "author". The attribute "follower" is dropped because it is the vlogger's total number of followers at the time of scraping and cannot represent the number of followers when the video was posted. The attribute "likes" is dropped for the same reason.
As shown in the earlier Exploratory Data Analysis (EDA), the distribution of the target variable "play" is highly skewed, with most of the data clustered at the low end of click volume. Thus, a log transformation is performed on the target variable for the modeling part.
last_3month_play and last_3month_count
To extract more information from the vlogger's play history, we create two numeric features, "last_3month_play" and "last_3month_count", as indicators of historical click volume and posting frequency. We first group the dataset by the vlogger ("author") and then aggregate the data by mean or count in a 90-day window before the date the video was created. The resulting "last_3month_play" and "last_3month_count" vary across videos, even for the same vlogger.
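A sketch of this windowed aggregation with pandas. Column names follow the report; details such as excluding the video itself from its own window are assumptions.

```python
import pandas as pd

def add_history_features(df):
    """For each video, aggregate the same vlogger's videos posted in the
    90 days before it (exclusive of the video itself). Expects columns
    "author", "created" (datetime) and "log_play"."""
    df = df.sort_values("created").copy()
    plays, counts = [], []
    for _, row in df.iterrows():
        start = row["created"] - pd.Timedelta(days=90)
        mask = (
            (df["author"] == row["author"])
            & (df["created"] >= start)
            & (df["created"] < row["created"])
        )
        window = df.loc[mask, "log_play"]
        plays.append(window.mean() if len(window) else 0.0)
        counts.append(len(window))
    df["last_3month_play"] = plays
    df["last_3month_count"] = counts
    return df
```

Because the window slides with each video's post date, two videos by the same vlogger generally get different history features, exactly as described above.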
last_3month_comment and last_3month_review
We create two numeric features, "last_3month_comment" and "last_3month_review", as indicators of the historical numbers of video comments and reviews. The approach is similar to the above, and the corresponding "last_3month_comment" and "last_3month_review" vary across videos as well.
Dummy Variables for typeid
The video type feature "typeid" has 7 distinct values: "daily", "food", "food_country", "food_detect", "food_eval", "food_record", and "funny". We created dummy variables for this categorical feature and concatenated them into the dataset.
Dummy Variables for tag
Each video can have multiple tags, labeled by the author and viewers. Considering every tag that appears would make the representation highly sparse. Hence, only the top 20 most frequent tags are transformed into dummy variables: a video with a given tag has value 1 for that tag and 0 otherwise.
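A minimal sketch of this tag-dummy construction; the helper name and toy tags are illustrative.

```python
import pandas as pd
from collections import Counter

def tag_dummies(tag_lists, top_k=20):
    """Turn per-video tag lists into 0/1 columns for the top_k most
    frequent tags; all other tags are ignored to limit sparsity."""
    counts = Counter(t for tags in tag_lists for t in tags)
    top = [t for t, _ in counts.most_common(top_k)]
    return pd.DataFrame(
        [{t: int(t in tags) for t in top} for tags in tag_lists]
    )
```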
day_time and is_weekend
Based on the time when the video was created, we create features for "year", "month", "day", "hour", "weekday" (Monday to Sunday), and "number_of_week" (1 to 52). To further aggregate the time-related information, two binary features, "day_time" and "is_weekend", were created to capture differences between time periods. "day_time" is 1 for times from 6 a.m. to 6 p.m. and 0 otherwise; "is_weekend" is 1 for Saturday and Sunday and 0 otherwise.
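The two binary features can be derived as follows; treating "6 a.m. to 6 p.m." as the half-open interval [06:00, 18:00) is an assumption.

```python
import pandas as pd

def add_time_features(df):
    """Derive "day_time" and "is_weekend" from a datetime column
    "created", following the definitions in the text."""
    df = df.copy()
    df["hour"] = df["created"].dt.hour
    df["weekday"] = df["created"].dt.weekday          # Monday=0 ... Sunday=6
    df["day_time"] = df["hour"].between(6, 17).astype(int)   # 06:00-17:59
    df["is_weekend"] = (df["weekday"] >= 5).astype(int)      # Sat or Sun
    return df
```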
Video titles and descriptions are also important for our prediction, as they can affect video plays by attracting viewers' attention. Further, from a business perspective, vloggers have the most control over the content of titles and descriptions, so they can act on the suggestions derived from our feature-importance interpretations. To transform text into features, we use TF-IDF and TextRank to extract key words from titles and descriptions separately. To customize for Bilibili, special stop words are added to the standard Chinese stop words, and we define Bilibili-specific key words to extract more video-related information.
One concern with text features is their high dimensionality and sparsity. To keep our features interpretable, we did not use SVD to reduce dimensionality. Instead, the top 100 keywords are chosen directly from the TF-IDF and TextRank models separately. To alleviate sparsity, we use Word2Vec to find words similar to the chosen keywords and include those similar words in the counts. Finally, 145 key words are obtained for descriptions and 125 for titles.
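For intuition, here is a dependency-free mini TF-IDF keyword ranker over pre-tokenized documents. The project's actual pipeline used Chinese tokenization, custom stop words, TextRank and Word2Vec expansion, none of which is reproduced in this sketch.

```python
import math
from collections import Counter

def top_keywords(docs, k=5):
    """Rank terms by their best TF-IDF score across tokenized documents
    and return the top k. docs is a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df_counts = Counter(t for doc in docs for t in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, f in tf.items():
            idf = math.log(n / df_counts[term])
            # Keep each term's best score over all documents.
            scores[term] = max(scores[term], (f / len(doc)) * idf)
    return [t for t, _ in scores.most_common(k)]
```

Terms that appear in every document get an IDF of zero, so only distinctive words surface as keywords.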
The dataset contains videos posted between Sept. 15, 2019 and Sept. 1, 2020. We split the dataset into three disjoint sets along the time dimension to ensure there is no overlap between training, validation and test. Videos posted before July 1, 2020 form the training set; videos posted between July 1, 2020 and Aug. 1, 2020 form the validation set; and videos posted between Aug. 1, 2020 and Sept. 1, 2020 form the test set.
Selecting videos from the most recent period as the test set allows us to evaluate model performance on near-future data, which is highly consistent with the current situation. Cross-validation is not used, in order to prevent data leakage during validation. After model selection on the validation set, the final model is retrained on the combined training and validation sets and then evaluated on the test data to obtain the final result.
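The split above can be written directly as boolean masks on the post date:

```python
import pandas as pd

def time_split(df):
    """Time-based split described in the text: train before 2020-07-01,
    validation during July 2020, test during August 2020."""
    t1 = pd.Timestamp("2020-07-01")
    t2 = pd.Timestamp("2020-08-01")
    t3 = pd.Timestamp("2020-09-01")
    train = df[df["created"] < t1]
    valid = df[(df["created"] >= t1) & (df["created"] < t2)]
    test = df[(df["created"] >= t2) & (df["created"] < t3)]
    return train, valid, test
```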
After data cleaning and feature engineering, the final dataset contains 14820 videos and 311 features. Detailed features and explanations are listed in 7.2.2 Data description after data cleaning and feature engineering.
We first explore three models with different input features to compare their performance:
The last two settings are intended to check whether the model can take advantage of the information contained in titles and descriptions. However, the second setting performs no better than the first, and the last even has a negative R-Squared. Hence, we conclude that the signal contained in the text features is outweighed by the noise they carry. This may be caused by the limited number of videos we have and the text-analysis methods we use. Further, popular video topics change rapidly, which decreases the useful signal contained in current titles and descriptions.
As a result, the exploration suggests that text features bring no value to our model. Hence, we proceed with only the non-text features in the following parts.
From a business perspective, the click volume of a vlogger's upcoming video is mostly gauged by the plays of the vlogger's recent videos. Therefore, we directly use the average plays of a vlogger's last-three-month videos, "last_3month_play", as our baseline. To avoid infinite values during the log transformation, average click volumes that are originally zero are set to 1 in the calculation.
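A sketch of the baseline computation in log space; the play counts are illustrative.

```python
import numpy as np

# Baseline: predict a video's log-plays with the vlogger's recent average.
plays = np.array([1200.0, 5400.0, 300.0])    # true plays of new videos
avg_3m = np.array([1000.0, 4000.0, 0.0])     # last-3-month average plays
avg_3m = np.where(avg_3m == 0, 1.0, avg_3m)  # zero averages -> 1, avoiding log(0)
mse = np.mean((np.log(plays) - np.log(avg_3m)) ** 2)
```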
As our prediction is a regression problem, we first try some linear regression models, as they are simple, basic and easy to interpret. In particular, we use LASSO and Ridge. Considering that we have many features, we plan to use LASSO to down-select features, because LASSO uses l1-norm regularization. Ridge can prevent overfitting through l2-norm regularization, so we also explore Ridge as one of the basic models.
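For intuition, ridge regression has the closed form w = (X'X + alpha*I)^(-1) X'y, and tuning without cross-validation reduces to a loop over candidate alpha values scored on the fixed validation set. A dependency-free sketch follows; the project itself used library implementations, and this is not its code.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X'X + alpha*I)^(-1) X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def pick_alpha(X_tr, y_tr, X_va, y_va, alphas):
    """Choose alpha by MSE on a fixed validation set (no cross-validation,
    mirroring the time-based evaluation protocol)."""
    best_alpha, best_mse = None, np.inf
    for a in alphas:
        w = ridge_fit(X_tr, y_tr, a)
        val_mse = np.mean((X_va @ w - y_va) ** 2)
        if val_mse < best_mse:
            best_alpha, best_mse = a, val_mse
    return best_alpha, best_mse
```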
Because we have many features, more complex and flexible models could perform better. Therefore, in addition to a basic Decision Tree, we also tried several tree-based ensemble algorithms: Random Forests, LightGBM and XGBoost. The first is based on bagging; the latter two are based on boosting.
For a regression problem, there are several metrics we can use to evaluate model performance, such as mean absolute error (MAE), mean squared error (MSE), and the coefficient of determination (R-Squared). We chose MSE as our main evaluation metric for hyper-parameter tuning. For ease of communication, we also chose R-Squared as a second metric to better illustrate performance. R-Squared stands for the proportion of variation in the outcome that can be explained by the predictor variables.
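Both metrics are straightforward to state; a minimal sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the share of outcome variance
    explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
```

Note that a model predicting the mean of the target scores R-Squared = 0, and a model worse than that can go negative, as happened with our text-feature setting.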
We use the training set to train the model and the validation set to choose the optimal hyperparameters. This part illustrates the process with LightGBM as an example.
In step 1, we use the default parameters of LightGBM and tune "n_estimators". From the picture below, we find:
In step 2, to decrease the bias, we tune the LightGBM model from the following perspectives:
Method | Parameters |
---|---|
Reduce Complexity | "max_depth", "num_leaves", "min_child_samples", "min_child_weight" |
Add Randomness | "colsample_bytree", "subsample","subsample_freq" |
Increase Regularization | "reg_alpha", "reg_lambda" |
After tuning, we obtain the following results. We choose LightGBM as our final model, as it has the lowest validation MSE and the highest R-Squared.
Model | Default: MSE-Train | Default: MSE-Valid | Tuned: MSE-Train | Tuned: MSE-Valid | Tuned: R2-Valid |
---|---|---|---|---|---|
Baseline | 1.293 | 1.481 | - | - | - |
LASSO | 2.099 | 2.165 | 2.080 | 1.678 | 0.618 |
Ridge | 2.080 | 2.014 | 2.062 | 1.647 | 0.625 |
Decision Tree | 0.000 | 2.188 | 0.907 | 1.096 | 0.685 |
Random Forests | 0.112 | 1.075 | 0.764 | 1.023 | 0.701 |
LightGBM | 0.501 | 1.050 | 0.734 | 0.964 | 0.715 |
XGBoost | 0.184 | 1.181 | 0.849 | 1.033 | 0.711 |
Our final model inherits the parameters of the tuned LightGBM model and is trained on the combined training and validation sets. From the picture on the left, we find:
The impact of some features, such as "food_ratio" and "life_ratio", on the target variable is ambiguous. Therefore, we select some user-controllable features for deeper interpretation.
The pictures below show the SHAP values of "length" and "last_3month_count". Understanding their impact on plays could help vloggers better adjust the time and effort they devote to their videos. We find:
The pictures below show the SHAP values of "food_ratio" and "life_ratio". Understanding their impact on plays could suggest better ways for vloggers to allocate the proportions of video types they post. We find:
The picture below on the left shows the SHAP value of "hour". The picture on the right, from a Bilibili user analysis report [6], shows hourly trends in "videos posted" and "live comments" that can stand for Bilibili's traffic. Understanding the hour's impact on plays could help vloggers decide when to post. We find:
Now let's take a certain vlogger's videos as a demo of how to interpret performance. The left picture shows all videos of the chosen vlogger in our test set; the right picture shows the SHAP plot for one chosen video of that vlogger. It can help business users better understand how the prediction is derived and how to improve.
In summary, our model could:
The target users of our prediction model are advertisers and multi-channel network (MCN) companies, because they need to identify promising vloggers and cooperate with them on product promotion.
Once the predictive model is implemented in a production system, advertisers and MCN companies will reach out to promising vloggers at a very early stage, seeking potential opportunities for product promotion. This could adversely affect the ecosystem of Bilibili as an online video sharing platform, and the increasingly commercialized content of the vloggers' videos could in turn harm their popularity and diminish their value for product promotion.
In addition, our model is used only for prediction and may be biased, so our team cannot be held responsible for any results caused by its predictions.
In this project, video tags are selected as video attributes. Tags are added by users and closely reflect their descriptions of each video; if videos with a specific tag receive many hits, that kind of video is popular among viewers. In addition to attributes scraped directly from the website, several innovative features are created through feature engineering. Keywords extracted from titles and descriptions are created as video attributes; they tell us what kinds of titles and descriptions can attract more users to watch a video. As long as those attractive words are included in a video's title and description, the popularity of the video should increase to some extent. Time-related attributes, such as "day_time" and "is_weekend", help vloggers learn the best time to post videos. Moreover, SHAP is used in our project to explain the output of our models, which helps us better understand how features impact the target variable.
There are some potential problems that could be improved. First, the dataset is biased: only vloggers at the top of the ranking are included, and these most popular vloggers cannot represent all vloggers. Due to time constraints, the data scraped from the website is limited; in the future, more data can be scraped for training, including more vloggers and more videos per vlogger. The text data is also limited, which increases the difficulty of extracting useful keywords from video titles and descriptions. Since the data was scraped only once, we can only obtain video information as of scraping time. For example, the attribute "follower" is the vlogger's total number of followers at that moment, so videos by the same author share the same "follower" value, and the number of followers when each video was posted is unknown. Scraping the data repeatedly over time would solve this problem.
As for future work, we can expand the scope to videos in other categories instead of limiting it to food. A more effective text-processing algorithm could be used to extract more useful keywords. At the same time, images and audio from the videos could be added as important features to predict the number of views more precisely. Besides targeting different categories, we could also target individual vloggers and predict the number of views for different kinds of videos posted by a specific vlogger, helping each vlogger better understand their own strengths and weaknesses.
[1] 10 Youtube stats every marketer should know in 2020
[2] Reminder: Vlog trending in China
[3] Bilibili Stats and Facts (2020)
[4] How Do Vloggers Make Money?
[5] Four data analysis platforms specifically targeting Bilibili: FeiGua, Xiaoxiao data, NewRank and DongZhan
[6] Bilibili user analysis
Feature | Type | Explanation | Example |
---|---|---|---|
bvid | string | Unique id of video | BV1qy4y1C7fS |
author | string | Vlogger name | 拜托了小翔哥 |
follower | float | Vlogger number of followers | 2840710 |
likes | float | Vlogger total number of likes | 4995407 |
category[1-15] | string | Categories where the vlogger post | 美食 |
category[1-15]_count | int | Number of videos in the category | 26 |
title | string | Video title | 小伙网上学来的饺子馅秘方… |
description | String | Video description | 记得戴上耳机,深夜躲在被窝里… |
created | string | Video post time | 2020-10-15 17:53:29 |
length | int | Video length in seconds | 396 |
play | float | Video number of views | 1040090 |
comment | float | Video number of comments | 2188 |
typeid | int | Video category id | 22 |
video_review | float | Video number of Danmakus | 7488 |
tag[1-19] | string | Video tags, max number of tags is 19, usually less than 10 | 深夜食堂, 家常菜 |
Feature | Type | Explanation | Example |
---|---|---|---|
video type[1-7]: daily/food/food_country/food_detect/food_eval/food_record/funny | boolean | Video type; each video is classified into exactly one type | 1 |
food ratio/life ratio | float | Percentage of food category videos of each author | 0.451613 |
day_time/is_weekend | boolean | Whether video is posted in day-time/weekend | 1 |
weekday | int | Video post day | 6 |
hour | int | Video post hour | 17 |
created_iso_yr/created_mo/created_mo_wk/created_iso_day | int | Video post year/month/week/day | 2020/9/13/37 |
date | datetime | Video post date | 2020-09-13 |
length | int | Video length in seconds | 396 |
last_3month_play | float | Log mean of plays of the vlogger's videos posted in the 3 months before this video | 15.100171 |
last_3month_comment | float | Log mean of comment counts of the vlogger's videos posted in the 3 months before this video | 9.711419 |
last_3month_review | float | Log mean of review counts of the vlogger's videos posted in the 3 months before this video | 10.674799 |
last_3month_count | int | Number of the vlogger's videos posted in the 3 months before this video | 11 |
tag[1-19]: e.g 深夜食堂, 家常菜 | boolean | Video tags, max number of tags is 19, usually less than 10 | 1 |
similar_desc XX[145]: e.g. 牛排 (steak) | float | Text features extracted from video descriptions | 1.11 |
similar_title XX[125]: e.g. 好吃 (tasty) | float | Text features extracted from video titles | 0.1 |