Course: DS-GA 1001 Introduction to Data Science
Authors: Junze Li (jl11390)
Di He (dh3171)
Xi Yang (xy2122)
Lining Zhang (lz2332)
Team Name: Michelin Cook
Project Duration: October 2020 - November 2020
User-generated content is becoming increasingly popular, especially among the younger generation. YouTube, the largest American video-sharing service, has over 2 billion monthly active users, and over 60% of businesses use it as a channel for marketing campaigns.[1] Similarly, as internet users in China have become more fluent in content production, vlogs are also on the rise there.[2]
Bilibili, a Tencent- and Alibaba-backed video platform, is now one of the most significant players among creator-driven content platforms. Founded in 2010, Bilibili began as an animation-oriented digital video sharing platform and is considered by some industry insiders to be the closest entity to the "YouTube of China". After ten years of expansion, Bilibili has gathered more than 1 million content creators and 200 million monthly active users.[3]
With the tremendous traffic Bilibili attracts, many businesses, especially small and medium vendors, have started commercial cooperation with famous vloggers, promoting products in the vloggers' videos.[4] Bilibili's operations team is also trying to identify promising vloggers and help them promote their videos, so as to build a more robust ecosystem and stronger user stickiness.
However, businesses only come after influential vloggers. Our team identifies the number of views as the most essential key performance indicator for vloggers. We believe independent vloggers and MCNs can boost their popularity if we can predict and explain the number of views of their upcoming videos. Hence, our team defines the problem as predicting the number of views for videos on Bilibili.
Informative and precise marketing is dominating the way businesses conduct marketing campaigns. Currently, there are at least four well-developed data analysis platforms specifically targeting Bilibili.[5] Vloggers, MCNs, advertisers and brands are their main customers, with monthly fees ranging from $50 to $1000. All these platforms offer sophisticated monitoring, searching and data dashboards in almost all dimensions. However, none of them provides any kind of prediction, neither on trends nor on potentially popular videos.
With an effective prediction model for the number of views, these platforms can offer additional premium functions that help advertisers evaluate the volume of exposure and inform vloggers which attributes to improve to boost the popularity of their videos. Not only can these platforms attract more paid users, they can also up-sell their existing customers to boost revenue. Of the 1 million vloggers on Bilibili, we assume the top 5%, i.e. 50 thousand, are the target customers who intend to improve their video performance. Along with over 2 thousand active MCNs and hundreds of brands, the estimated monthly revenue boost should exceed 600 thousand dollars for the whole market.
There are various aspects that can affect the number of views, such as the title, the number of followers the vlogger has, and even bias in the recommendation system. In general, we group influential factors into three categories: video attributes, vlogger attributes and other confounders. Although we are unable to capture all the confounders, we believe building a regression model over various video and vlogger attributes, with the number of views as the target variable, is a feasible and effective solution to the problem.
Given that a video can have quite a lot of features, in different forms such as text and numbers, machine learning is a great fit both for prediction and for interpretation. Our team hopes to yield predictive results through data collection, feature engineering, model selection and evaluation. Then we can use feature importances and visualization tools such as SHAP to interpret fluctuations in the results.
With respect to feasibility, our team limits the coverage of the project to predicting the number of views for food-related videos only. We also exclude the cover image and the video itself from our feature list, because image and video data are enormous in size and difficult to process. As we are targeting vloggers with large traffic, our team also decides to focus on those with considerable popularity.
Since our project idea is quite novel and we limit the scope to videos in the food category, there is no dataset or data source available online. Our team set up our own crawler using the Requests and BeautifulSoup packages to collect video and vlogger information through the ranking page. From Oct. 9 to Oct. 17, the crawler continuously collected links on the ranking list of the food category, which lead to the vloggers' personal pages.
After collecting 292 vloggers' personal page links, we then obtained the corresponding unique IDs, vlogger data, video links and video data by calling Bilibili APIs.
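As an illustration of the collection step, here is a minimal sketch of extracting vlogger IDs from a crawled ranking page. The real crawler fetched pages with Requests and parsed them with BeautifulSoup; a regex is used here only to keep the sketch dependency-free, and the `space.bilibili.com/<id>` link pattern is an assumption about the page markup.

```python
import re

def extract_vlogger_ids(html):
    """Pull unique vlogger IDs (the numeric part of personal-page links
    like //space.bilibili.com/123456) out of a ranking page's HTML.
    The link pattern is assumed, not taken from the project code."""
    ids = re.findall(r"space\.bilibili\.com/(\d+)", html)
    # Drop duplicates while preserving first-seen order.
    return list(dict.fromkeys(ids))
```

Deduplication matters here because the same vlogger can appear on the ranking list on multiple days of the crawl window.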
After crawling, we end up with 23450 videos and 292 vloggers. There are 61 columns in total, including 58 features, 2 unique IDs and the target variable. Detailed features and explanations are listed in 7.2.1 Raw data description. By limiting to authors who have appeared on the ranking list at least once, this approach guarantees the authors are influential enough.
Based on the distribution of the target variable "play", it can easily be seen that the target variable is highly skewed, with most of the data clustered at the low end of click volume. We also fit the data to a normal distribution, shown as the black line in the visualization below. Thus, a log transformation is performed on the target variable before modeling.
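The transformation can be sketched as follows; the play counts are illustrative, and `np.log1p` is chosen here so the transform stays defined even at zero plays.

```python
import numpy as np

# "play" is heavily right-skewed, so log-transform it before modeling.
# log1p(x) = log(1 + x) remains finite at x = 0.
plays = np.array([120.0, 3_500.0, 48_000.0, 1_040_090.0])
log_plays = np.log1p(plays)
```

After the transform, the spread of several orders of magnitude in raw plays collapses into a much narrower, more symmetric range.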
Numeric Feature Analysis
For numeric features, the correlations between each feature and the target variable "play" are explored through visualization. As shown in the picture below, most numeric features are positively correlated with the target variable (e.g. the number of Danmakus "video_review" is highly correlated with it). Also, some features are highly correlated with each other, such as "follower" and "likes" (0.86), so they should not both be included in a linear regression model due to multicollinearity.
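A minimal sketch of this correlation check on toy data; the 0.85 flagging threshold and the numbers are illustrative, not the project's.

```python
import pandas as pd

# Toy frame standing in for the numeric columns of the real dataset.
df = pd.DataFrame({
    "play":         [100.0, 400.0, 900.0, 1600.0, 2500.0],
    "video_review": [10.0, 41.0, 88.0, 161.0, 252.0],
    "follower":     [5.0, 6.0, 7.0, 8.0, 9.0],
})
corr = df.corr()
# Flag feature pairs correlated strongly enough to risk multicollinearity.
high_pairs = [
    (a, b)
    for a in corr.columns for b in corr.columns
    if a < b and abs(corr.loc[a, b]) > 0.85
]
```

On the real data, pairs such as "follower"/"likes" would surface in `high_pairs`, motivating dropping one of each pair before fitting a linear model.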
Categorical Feature Analysis
Video types are transformed into dummy variables. As shown in the picture below, we only include the most frequently appearing video types, which are related to food or life.
According to the visualization of the time distribution shown below, it is clear that more videos were created in later time periods. Thus, the dataset is split based on time, with relatively smaller windows for the validation and test sets.
The original dataset contains information on 23450 videos, with some missing values and outliers due to errors in data collection. Data cleaning involves several steps. The data types of some attributes are converted, including float to integer and timestamp to datetime. Videos with missing values are removed because of their rare occurrence. Outliers of the attribute "length" are deleted to limit their influence on the models; only videos within the 90th percentile of length are kept.
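The cleaning steps above can be sketched as follows; column names follow the raw data description, while the exact order of operations is an assumption.

```python
import pandas as pd

def clean_videos(df):
    """Cleaning sketch: drop rows with missing values, convert types,
    and keep only videos within the 90th percentile of "length"."""
    df = df.copy()
    df = df.dropna()                               # missing values are rare; drop them
    df["play"] = df["play"].astype(int)            # float -> integer
    df["created"] = pd.to_datetime(df["created"])  # timestamp -> datetime
    cutoff = df["length"].quantile(0.90)           # 90th percentile of length
    return df[df["length"] <= cutoff]
```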
It is assumed that the number of views of a video stabilizes one month after it is posted. Although the original dataset contains videos posted up to Oct. 1, 2020, videos posted between Sept. 1, 2020 and Oct. 1, 2020 still have an increasing number of views, so only videos posted before Sept. 1, 2020 are kept. Moreover, videos posted before Sept. 1, 2019 are also dropped, since there are not many of them and they have less connection with future videos, which may lead to poor estimates for those videos. Since we focus on food-related videos, only videos in the food and life categories are kept.
In addition, features irrelevant to our research are deleted, such as "bvid" and "author". The attribute "follower" is dropped because it is the vlogger's total number of followers at the time of scraping and cannot represent the number of followers when the video was posted. The attribute "likes" is dropped for the same reason.
As shown in the earlier Exploratory Data Analysis (EDA), the distribution of the target variable "play" is highly skewed, with most of the data clustered at the low end of click volume. Thus, a log transformation is performed on the target variable for the modeling part.
last_3month_play and last_3month_count
To extract more information from the vlogger's play history, we create two numeric features, "last_3month_play" and "last_3month_count", as indicators of historical click volume and posting frequency. We first group the dataset by the vlogger ("author") and then aggregate the data by mean or count in a 90-day window before the date the video was created. The resulting "last_3month_play" and "last_3month_count" vary across videos, even for the same vlogger.
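A sketch of this windowed aggregation with pandas. Column names follow the report; details such as excluding the video itself from its own window are assumptions.

```python
import pandas as pd

def add_history_features(df):
    """For each video, aggregate the same vlogger's videos posted in the
    90 days before it (exclusive of the video itself). Expects columns
    "author", "created" (datetime) and "log_play"."""
    df = df.sort_values("created").copy()
    plays, counts = [], []
    for _, row in df.iterrows():
        start = row["created"] - pd.Timedelta(days=90)
        mask = (
            (df["author"] == row["author"])
            & (df["created"] >= start)
            & (df["created"] < row["created"])
        )
        window = df.loc[mask, "log_play"]
        plays.append(window.mean() if len(window) else 0.0)
        counts.append(len(window))
    df["last_3month_play"] = plays
    df["last_3month_count"] = counts
    return df
```

Because the window slides with each video's post date, two videos by the same vlogger generally get different history features, exactly as described above.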
last_3month_comment and last_3month_review
We create two numeric features, "last_3month_comment" and "last_3month_review", as indicators of the historical numbers of video comments and reviews. The approach is similar to the above, and the corresponding "last_3month_comment" and "last_3month_review" vary across videos as well.
Dummy Variables for typeid
The video type feature "typeid" has 7 distinct values: "daily", "food", "food_country", "food_detect", "food_eval", "food_record", and "funny". We created dummy variables for this categorical feature and concatenated them into the dataset.
Dummy Variables for tag
Each video can have multiple tags, labeled by the author and viewers. Considering every tag that appears would make the representation highly sparse. Hence, only the top 20 most frequent tags are transformed into dummy variables: a video with a given tag has value 1 for that tag and 0 otherwise.
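A minimal sketch of this tag-dummy construction; the helper name and toy tags are illustrative.

```python
import pandas as pd
from collections import Counter

def tag_dummies(tag_lists, top_k=20):
    """Turn per-video tag lists into 0/1 columns for the top_k most
    frequent tags; all other tags are ignored to limit sparsity."""
    counts = Counter(t for tags in tag_lists for t in tags)
    top = [t for t, _ in counts.most_common(top_k)]
    return pd.DataFrame(
        [{t: int(t in tags) for t in top} for tags in tag_lists]
    )
```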
day_time and is_weekend
Based on the time when the video was created, we create features for "year", "month", "day", "hour", "weekday" (Monday to Sunday), and "number_of_week" (1 to 52). To further aggregate the time-related information, two binary features, "day_time" and "is_weekend", were created to capture differences between time periods. "day_time" is 1 for times from 6 a.m. to 6 p.m. and 0 otherwise; "is_weekend" is 1 for Saturday and Sunday and 0 otherwise.
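The two binary features can be derived as follows; treating "6 a.m. to 6 p.m." as the half-open interval [06:00, 18:00) is an assumption.

```python
import pandas as pd

def add_time_features(df):
    """Derive "day_time" and "is_weekend" from a datetime column
    "created", following the definitions in the text."""
    df = df.copy()
    df["hour"] = df["created"].dt.hour
    df["weekday"] = df["created"].dt.weekday          # Monday=0 ... Sunday=6
    df["day_time"] = df["hour"].between(6, 17).astype(int)   # 06:00-17:59
    df["is_weekend"] = (df["weekday"] >= 5).astype(int)      # Sat or Sun
    return df
```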
Video titles and descriptions are also important for our prediction, as they can affect video plays by attracting viewers' attention. Further, from a business perspective, vloggers have the most control over the content of titles and descriptions, so they can act on the suggestions derived from our feature-importance interpretations. To transform text into features, we use TF-IDF and TextRank to extract key words from titles and descriptions separately. To customize for Bilibili, special stop words are added to the standard Chinese stop words, and we define Bilibili-specific key words to extract more video-related information.
One concern with text features is their high dimensionality and sparsity. To keep our features interpretable, we did not use SVD to reduce dimensionality. Instead, the top 100 keywords are chosen directly from the TF-IDF and TextRank models separately. To alleviate sparsity, we use Word2Vec to find words similar to the chosen keywords and include those similar words in the counts. Finally, 145 key words are obtained for descriptions and 125 for titles.
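For intuition, here is a dependency-free mini TF-IDF keyword ranker over pre-tokenized documents. The project's actual pipeline used Chinese tokenization, custom stop words, TextRank and Word2Vec expansion, none of which is reproduced in this sketch.

```python
import math
from collections import Counter

def top_keywords(docs, k=5):
    """Rank terms by their best TF-IDF score across tokenized documents
    and return the top k. docs is a list of token lists."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df_counts = Counter(t for doc in docs for t in set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for term, f in tf.items():
            idf = math.log(n / df_counts[term])
            # Keep each term's best score over all documents.
            scores[term] = max(scores[term], (f / len(doc)) * idf)
    return [t for t, _ in scores.most_common(k)]
```

Terms that appear in every document get an IDF of zero, so only distinctive words surface as keywords.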
The dataset contains videos posted between Sept. 15, 2019 and Sept. 1, 2020. We split the dataset into three disjoint sets along the time dimension to ensure there is no overlap between training, validation and test. Videos posted before July 1, 2020 form the training set; videos posted between July 1, 2020 and Aug. 1, 2020 form the validation set; and videos posted between Aug. 1, 2020 and Sept. 1, 2020 form the test set.
Selecting videos from the most recent period as the test set allows us to evaluate model performance on near-future data, which is highly consistent with the current situation. Cross-validation is not used, in order to prevent data leakage during validation. After model selection on the validation set, the final model is retrained on the combined training and validation sets and then evaluated on the test data to obtain the final result.
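The split above can be written directly as boolean masks on the post date:

```python
import pandas as pd

def time_split(df):
    """Time-based split described in the text: train before 2020-07-01,
    validation during July 2020, test during August 2020."""
    t1 = pd.Timestamp("2020-07-01")
    t2 = pd.Timestamp("2020-08-01")
    t3 = pd.Timestamp("2020-09-01")
    train = df[df["created"] < t1]
    valid = df[(df["created"] >= t1) & (df["created"] < t2)]
    test = df[(df["created"] >= t2) & (df["created"] < t3)]
    return train, valid, test
```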
After data cleaning and feature engineering, the final dataset contains 14820 videos and 311 features. Detailed features and explanations are listed in 7.2.2 Data description after data cleaning and feature engineering.
We first explore three models with different input features to compare their performance:
The last two settings are intended to check whether the model can take advantage of the information contained in titles and descriptions. However, the second setting performs no better than the first, and the last even has a negative R-Squared. Hence, we conclude that the signal contained in the text features is outweighed by the noise they carry. This may be caused by the limited number of videos we have and the text-analysis methods we use. Further, popular video topics change rapidly, which decreases the useful signal contained in current titles and descriptions.
As a result, the exploration suggests that text features bring no value to our model. Hence, we proceed with only the non-text features in the following parts.
From a business perspective, the click volume of a vlogger's upcoming video is mostly gauged by the plays of the vlogger's recent videos. Therefore, we directly use the average plays of a vlogger's last-three-month videos, "last_3month_play", as our baseline. To avoid infinite values during the log transformation, average click volumes that are originally zero are set to 1 in the calculation.
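A sketch of the baseline computation in log space; the play counts are illustrative.

```python
import numpy as np

# Baseline: predict a video's log-plays with the vlogger's recent average.
plays = np.array([1200.0, 5400.0, 300.0])    # true plays of new videos
avg_3m = np.array([1000.0, 4000.0, 0.0])     # last-3-month average plays
avg_3m = np.where(avg_3m == 0, 1.0, avg_3m)  # zero averages -> 1, avoiding log(0)
mse = np.mean((np.log(plays) - np.log(avg_3m)) ** 2)
```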
As our prediction is a regression problem, we first try some linear regression models, as they are simple, basic and easy to interpret. In particular, we use LASSO and Ridge. Considering that we have many features, we plan to use LASSO to down-select features, because LASSO uses l1-norm regularization. Ridge can prevent overfitting through l2-norm regularization, so we also explore Ridge as one of the basic models.
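For intuition, ridge regression has the closed form w = (X'X + alpha*I)^(-1) X'y, and tuning without cross-validation reduces to a loop over candidate alpha values scored on the fixed validation set. A dependency-free sketch follows; the project itself used library implementations, and this is not its code.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X'X + alpha*I)^(-1) X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def pick_alpha(X_tr, y_tr, X_va, y_va, alphas):
    """Choose alpha by MSE on a fixed validation set (no cross-validation,
    mirroring the time-based evaluation protocol)."""
    best_alpha, best_mse = None, np.inf
    for a in alphas:
        w = ridge_fit(X_tr, y_tr, a)
        val_mse = np.mean((X_va @ w - y_va) ** 2)
        if val_mse < best_mse:
            best_alpha, best_mse = a, val_mse
    return best_alpha, best_mse
```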
Because we have many features, more complex and flexible models could perform better. Therefore, in addition to a basic Decision Tree, we also tried several tree-based ensemble algorithms: Random Forests, LightGBM and XGBoost. The first is based on bagging; the latter two are based on boosting.
For a regression problem, there are several metrics we can use to evaluate model performance, such as mean absolute error (MAE), mean squared error (MSE), and the coefficient of determination (R-Squared). We chose MSE as our main evaluation metric for hyper-parameter tuning. For ease of communication, we also chose R-Squared as a second metric to better illustrate performance. R-Squared stands for the proportion of variation in the outcome that can be explained by the predictor variables.
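Both metrics are straightforward to state; a minimal sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the share of outcome variance
    explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
```

Note that a model predicting the mean of the target scores R-Squared = 0, and a model worse than that can go negative, as happened with our text-feature setting.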
We use the training set to train the model and the validation set to choose the optimal hyperparameters. This part illustrates the process with LightGBM as an example.
In step 1, we use the default parameters of LightGBM and tune "n_estimators". From the picture below, we find:
In step 2, to decrease the bias, we tune the LightGBM model from the following perspectives:
Method | Parameters |
---|---|
Reduce Complexity | "max_depth", "num_leaves", "min_child_samples", "min_child_weight" |
Add Randomness | "colsample_bytree", "subsample","subsample_freq" |
Increase Regularization | "reg_alpha", "reg_lambda" |
After tuning, we obtain the following results. We choose LightGBM as our final model, as it has the lowest validation MSE and the highest R-Squared.
Model | Default: MSE-Train | Default: MSE-Valid | Tuned: MSE-Train | Tuned: MSE-Valid | Tuned: R2-Valid |
---|---|---|---|---|---|
Baseline | 1.293 | 1.481 | - | - | - |
LASSO | 2.099 | 2.165 | 2.080 | 1.678 | 0.618 |
Ridge | 2.080 | 2.014 | 2.062 | 1.647 | 0.625 |
Decision Tree | 0.000 | 2.188 | 0.907 | 1.096 | 0.685 |
Random Forests | 0.112 | 1.075 | 0.764 | 1.023 | 0.701 |
LightGBM | 0.501 | 1.050 | 0.734 | 0.964 | 0.715 |
XGBoost | 0.184 | 1.181 | 0.849 | 1.033 | 0.711 |
Our final model inherits the parameters of the tuned LightGBM model and is trained on the combined training and validation sets. From the picture on the left, we find:
The impact of some features, such as "food_ratio" and "life_ratio", on the target variable is ambiguous. Therefore, we select some user-controllable features for deeper interpretation.
The pictures below show the SHAP values of "length" and "last_3month_count". Understanding their impact on plays could help vloggers better adjust the time and effort they devote to their videos. We find:
The pictures below show the SHAP values of "food_ratio" and "life_ratio". Understanding their impact on plays could suggest better ways for vloggers to allocate the proportions of video types they post. We find:
The picture below on the left shows the SHAP value of "hour". The picture on the right, from a Bilibili user analysis report [6], shows hourly trends in "videos posted" and "live comments" that can stand for Bilibili's traffic. Understanding the hour's impact on plays could help vloggers decide when to post. We find:
Now let's take a certain vlogger's videos as a demo of how to interpret performance. The left picture shows all videos of the chosen vlogger in our test set; the right picture shows the SHAP plot for one chosen video of that vlogger. It can help business users better understand how the prediction is derived and how to improve.
In summary, our model could:
The target users of our prediction model are advertisers and multi-channel network (MCN) companies, because they need to identify promising vloggers and cooperate with them on product promotion.
Once the predictive model is implemented in a production system, advertisers and MCN companies will reach out to promising vloggers at a very early stage, seeking potential opportunities for product promotion. This could adversely affect the ecosystem of Bilibili as an online video sharing platform, and the increasingly commercialized content of the vloggers' videos could in turn harm their popularity and diminish their value for product promotion.
In addition, our model is used only for prediction and may be biased, so our team cannot be held responsible for any results caused by its predictions.
In this project, video tags are selected as video attributes. Tags are added by users and closely reflect their descriptions of each video; if videos with a specific tag receive many hits, that kind of video is popular among viewers. In addition to attributes scraped directly from the website, several innovative features are created through feature engineering. Keywords extracted from titles and descriptions are created as video attributes; they tell us what kinds of titles and descriptions can attract more users to watch a video. As long as those attractive words are included in a video's title and description, the popularity of the video should increase to some extent. Time-related attributes, such as "day_time" and "is_weekend", help vloggers learn the best time to post videos. Moreover, SHAP is used in our project to explain the output of our models, which helps us better understand how features impact the target variable.
There are some potential problems that could be improved. First, the dataset is biased: only vloggers at the top of the ranking are included, and these most popular vloggers cannot represent all vloggers. Due to time constraints, the data scraped from the website is limited; in the future, more data can be scraped for training, including more vloggers and more videos per vlogger. The text data is also limited, which increases the difficulty of extracting useful keywords from video titles and descriptions. Since the data was scraped only once, we can only obtain video information as of scraping time. For example, the attribute "follower" is the vlogger's total number of followers at that moment, so videos by the same author share the same "follower" value, and the number of followers when each video was posted is unknown. Scraping the data repeatedly over time would solve this problem.
As for future work, we can expand the scope to videos in other categories instead of limiting it to food. A more effective text-processing algorithm could be used to extract more useful keywords. At the same time, images and audio from the videos could be added as important features to predict the number of views more precisely. Besides targeting different categories, we could also target individual vloggers and predict the number of views for different kinds of videos posted by a specific vlogger, helping each vlogger better understand their own strengths and weaknesses.
[1] 10 Youtube stats every marketer should know in 2020
[2] Reminder: Vlog trending in China
[3] Bilibili Stats and Facts (2020)
[4] How Do Vloggers Make Money?
[5] Four data analysis platforms specifically targeting Bilibili: FeiGua, Xiaoxiao data, NewRank and DongZhan
[6] Bilibili user analysis
Feature | Type | Explanation | Example |
---|---|---|---|
bvid | string | Unique id of video | BV1qy4y1C7fS |
author | string | Vlogger name | 拜托了小翔哥 |
follower | float | Vlogger number of followers | 2840710 |
likes | float | Vlogger total number of likes | 4995407 |
category[1-15] | string | Categories where the vlogger post | 美食 |
category[1-15]_count | int | Number of videos in the category | 26 |
title | string | Video title | 小伙网上学来的饺子馅秘方… |
description | String | Video description | 记得戴上耳机,深夜躲在被窝里… |
created | string | Video post time | 2020-10-15 17:53:29 |
length | int | Video length in seconds | 396 |
play | float | Video number of views | 1040090 |
comment | float | Video number of comments | 2188 |
typeid | int | Video category id | 22 |
video_review | float | Video number of Danmakus | 7488 |
tag[1-19] | string | Video tags, max number of tags is 19, usually less than 10 | 深夜食堂, 家常菜 |
Feature | Type | Explanation | Example |
---|---|---|---|
video type[1-7]: daily/food/food_country/food_detect/food_eval/food_record/funny | boolean | Video type; each video is classified into exactly one type | 1 |
food ratio/life ratio | float | Percentage of food category videos of each author | 0.451613 |
day_time/is_weekend | boolean | Whether video is posted in day-time/weekend | 1 |
weekday | int | Video post day | 6 |
hour | int | Video post hour | 17 |
created_iso_yr/created_mo/created_mo_wk/created_iso_day | int | Video post year/month/week/day | 2020/9/13/37 |
date | datetime | Video post date | 2020-09-13 |
length | int | Video length in seconds | 396 |
last_3month_play | float | Log mean of plays of the vlogger's videos posted in the 3 months before this video | 15.100171 |
last_3month_comment | float | Log mean of comment counts of the vlogger's videos posted in the 3 months before this video | 9.711419 |
last_3month_review | float | Log mean of review counts of the vlogger's videos posted in the 3 months before this video | 10.674799 |
last_3month_count | int | Number of the vlogger's videos posted in the 3 months before this video | 11 |
tag[1-19]: e.g 深夜食堂, 家常菜 | boolean | Video tags, max number of tags is 19, usually less than 10 | 1 |
similar_desc XX[145]: e.g. 牛排 (steak) | float | Text features extracted from video descriptions | 1.11 |
similar_title XX[125]: e.g. 好吃 (tasty) | float | Text features extracted from video titles | 0.1 |