# Assignment 4: Building a recommender with BigData Republic

## Introduction

We, Domantas Giržadas and Tijs Rozenbroek, took part in a hackathon organised by two Radboud University alumni from the company "BigData Republic". Our methods for trying to attain the highest accuracy possible are described in this blog post.

## Example method

We were provided with an example method of applying an Alternating Least Squares (ALS) method for collaborative filtering. However, the crucial sections of the process were left blank for us to fill in.

### Task 1: UDF for converting action scores

The provided dataframes contained quite a few features describing users' interactions with the available vacancies on the site. One of the most important data columns for this assignment was the `ecom_action` column in the `clicks` dataframe. It contained numbers representing the type of user interaction:

- Values **1** and **2** represent a **page view**,
- Values **3** and **5** represent a **started application procedure**,
- Value **6** represents a **finished application**.

However, this set of values is suboptimal, because there are five values representing only three categories. To solve this problem, we define a function that remaps the contents of this column to the desired values:

```scala
val convert = (col: Column) => {
  regexp_replace(regexp_replace(regexp_replace(col, "1|2", "<val1>"), "3|5", "<val2>"), "6", "<val3>")
}
```

*In this function, `<val*>` can be replaced with any desired value.*

Then we applied this function to the training and validation dataframes:

```scala
val df_mapped = df.withColumn("ecom_action_mapped", convert(col("ecom_action")).cast(IntegerType))
```

### Task 2: Building and fitting the ALS model

Now that we have our dataframes cleaned up, we can fit our model. To do so, we define a new ALS model from the Spark ML library and set the user, item and rating columns. The values for the other parameters are simply taken from [the example in the documentation](https://spark.apache.org/docs/2.2.0/ml-collaborative-filtering.html):

```scala
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("candidate_number")
  .setItemCol("vacancy_number")
  .setRatingCol("ecom_action_mapped")
```

However, in this case the ratings are not explicit: the values do not directly represent the candidates' opinions about the vacancies. Instead, these ratings are **implicitly** represented by the candidates' interactions with the vacancies. This means that we need to inform the model that the ratings are not explicit with the following parameter:

```scala
  .setImplicitPrefs(true)
```

After initialising the model, we can fit it on the cleaned-up training data:

```scala
val model = als.fit(clicks_train_mapped)
```

**Now we have a working basic ALS model!**
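For reference, a fitted `ALSModel` can already produce top-N recommendations per user out of the box. Below is a minimal, hedged sketch of what that looks like with the columns defined above; it is not necessarily how the provided notebook extracts its predictions.

```scala
// Hedged sketch: top-5 vacancy recommendations for every candidate from the fitted model.
// recommendForAllUsers returns one row per candidate_number with an array of
// (vacancy_number, rating) structs in a "recommendations" column.
val userRecs = model.recommendForAllUsers(5)
userRecs.show(5, truncate = false)
```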
However, this model predicts which vacancies each candidate would like. This approach is quite volatile, since individual vacancies always come and go, and our model would become obsolete within a very short period of time. That is why we need to consider a different approach. How about instead of predicting individual vacancies, we predict **which functions (vacancy categories) are the most relevant for the candidates**?

### Task 3: Indexing the function strings

In order to do that, we first need to group our data based on the function. To make this possible, let's convert the function names (strings) from the dataframes into function indices (numbers) with the following `StringIndexer`:

```scala
val indexer = new StringIndexer()
  .setInputCol("function_name")
  .setOutputCol("function_index")
  .setHandleInvalid("skip")
```

After indexing the vacancy categories, we can group the dataframe based on these indices:

```scala
val df_grouped = df_indexed.groupBy("candidate_number", "function_index")
```

We still want to keep the information about the users' interactions with these categories, so we aggregate the action scores by summing them up and storing the result in a new column `rating_sum`:

```scala
val df_summed = df_grouped.agg(sum("ecom_action_mapped").alias("rating_sum"))
```

## Hacking time

### Tweaking the ALS

The first step we took was tweaking the ALS. We tried many different values for its parameters; the values we tried and the corresponding results can be inspected below.

| RegParam value | SCORE |
|:--------------:|:----------:|
| 0.1 | 15.55% |
| 0.25 | 16.76% |
| **0.3** | **16.83%** |
| 0.35 | 16.67% |
| 0.5 | 16.62% |

As you can see from the accuracies above, the best value for the regularisation parameter of the ALS is 0.3.

| MaxIter value | SCORE |
|:-------------:|:----------:|
| 8 | 16.80% |
| **10** | **16.83%** |
| 15 | 16.74% |

The number of iterations makes only a small difference; increasing or decreasing it relative to the default of 10 does not improve the accuracy.

| Alpha value | SCORE |
|:-----------:|:----------:|
| 0.1 | 15.25% |
| 0.5 | 16.39% |
| 0.75 | 16.42% |
| **1.0** | **16.83%** |

As you can see, the alpha parameter is best kept at 1.0, which is the default value.

### Tweaking the action weights

The weights of the actions are one of the things that can be tweaked to maximise performance. The prespecified values that the actions are mapped to are 2 for page views, 4 for starting an application procedure and 6 for finishing an application. Below you can see the results of trying different values.

| "Click" weight | "Started application" weight | "Finished application" weight | SCORE |
|:--------------:|:----------------------------:|:-----------------------------:|:----------:|
| 1.0 | 4.0 | 12.0 | 16.56% |
| 2.0 | 4.0 | 6.0 | 16.83% |
| 2.0 | 5.0 | 7.0 | 16.53% |
| 2.0 | 5.0 | 9.0 | 16.81% |
| 2.0 | 4.0 | 12.0 | 16.72% |
| 2.0 | 5.0 | 12.0 | 16.58% |
| 2.0 | 6.0 | 16.0 | 16.79% |
| **2.0** | **8.0** | **20.0** | **16.90%** |
| 2.0 | 8.0 | 22.0 | 16.74% |
| 2.0 | 12.0 | 30.0 | 16.85% |

The action weights with the best accuracy are 2, 8 and 20 for page views, starting applications and finishing applications respectively. These are of course equivalent to the weights 1, 4 and 10, which are the action weights that we ended up using.

### Tweaking the feature weight

Another possible improvement is adjusting the weights of the features in the weighted sum before picking the final top 15 (a sketch of this weighted sum is shown right after the list below). The initial weights were:

- **1.0** for the `normalized_distance` feature (ND),
- **1.0** for the `normalized_prediction` feature (NP) and
- **0.5** for the `normalized_request_hour_wage` feature (NW).
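To make the role of these weights concrete, here is a minimal, hedged sketch of how such a weighted sum could be computed in Spark. The DataFrame name `candidate_vacancy_features` and the way the features are combined are assumptions based on the description above, not the notebook's actual code.

```scala
import org.apache.spark.sql.functions.{col, lit}

// Hedged sketch: combine the three normalised features into one ranking score.
// Names and the direction of each feature are assumptions; the provided notebook
// may combine them differently (e.g. distance may enter with an inverted sign).
val (ndWeight, npWeight, nwWeight) = (1.0, 1.0, 0.5)

val scored = candidate_vacancy_features.withColumn(
  "final_score",
  lit(ndWeight) * col("normalized_distance") +
  lit(npWeight) * col("normalized_prediction") +
  lit(nwWeight) * col("normalized_request_hour_wage")
)
// The top 15 vacancies per candidate would then be taken by sorting on final_score.
```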
These weights (together with the best values for the other parameters) resulted in a prediction with a score of **~16.9%**. Changing the values of NP and NW from the default only seemed to decrease the prediction score, while tweaking the ND weight showed a surprising trend:

| ND weight | NP weight | NW weight | SCORE |
|:---------:|:---------:|:---------:|:---------:|
| 1.0 | 1.0 | 0.5 | 16.9% |
| 5.0 | 1.0 | 0.5 | 19.2% |
| 7.0 | 1.0 | 0.5 | 19.4% |
| **15.0** | **1.0** | **0.5** | **19.5%** |

With the weights that show the best performance, basically only the distance to the vacancy is taken into account when forming the recommendation. While that works in this specific case, this method would not work in the real world, so we decided to keep the default values **(1.0, 1.0, 0.5)** for the feature weights.

### Filtering based on distance

One of the optimisations we implemented is filtering out vacancies from the predictions of the ALS based on the maximum travel distance that people specify in their profiles. At first we filtered using the following line of code:

```scala
val vacancies_distance_filtered = vacancies_distance.where($"distance" <= $"maximum_travel_distance")
```

This proved to be a good optimisation, but we got the idea that perhaps this filtering is too strict, so we subsequently tried multiplying the `maximum_travel_distance` by a few different values above one. This means that we do not treat the maximum travel distance that people specify in their profile as an absolute maximum, but instead allow a bit of a margin. After testing different values above one, we settled on 1.2, as this was the value that resulted in the best accuracy. The following line of code was used to do this:

```scala
val vacancies_distance_filtered = vacancies_distance.where($"distance" <= (lit(1.2) * $"maximum_travel_distance"))
```

This gave us a significant improvement over filtering without the 1.2 multiplier.

### Using more features

In addition to filtering based on the maximum distance specified in the candidates' profiles, we also tried using the minimum and maximum hours (`week_hours_min` and `week_hours_max` respectively), also specified in the profiles, in combination with the number of hours specified in a vacancy. These features were selected from the profiles DataFrame by changing the following line:

```scala
val profiles_limited = profiles.select("candidate_number", "cand_pc", "maximum_travel_distance")
```

to

```scala
val profiles_limited = profiles.select("candidate_number", "cand_pc", "maximum_travel_distance", "week_hours_min", "week_hours_max")
```
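For these extra columns to be usable in a filter, they have to end up in the same DataFrame as the vacancy's `week_hours`. In the provided pipeline this presumably happens in the join that produces `vacancies_distance`; the sketch below only illustrates that step, and the intermediate DataFrame name `vacancies_per_candidate` and the join key are our assumptions.

```scala
// Hedged sketch: join the extended profile columns onto the candidate-vacancy pairs,
// so that week_hours (from the vacancy) can be compared against week_hours_min and
// week_hours_max (from the profile). Names other than profiles_limited are assumptions.
val vacancies_distance = vacancies_per_candidate.join(profiles_limited, Seq("candidate_number"))
```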
Some of the ways of filtering using the minimum and maximum hours can be seen below.

Filtering only based on the minimum week hours:

```scala
val vacancies_distance_filtered = vacancies_distance.where(($"distance" <= (lit(1.2) * $"maximum_travel_distance")) && ($"week_hours" >= $"week_hours_min"))
```

Filtering based on both the minimum and maximum week hours:

```scala
val vacancies_distance_filtered = vacancies_distance.where(($"distance" <= (lit(1.2) * $"maximum_travel_distance")) && ($"week_hours" >= $"week_hours_min") && ($"week_hours" <= $"week_hours_max"))
```

Filtering based on both the minimum and maximum week hours, but multiplying the minimum and maximum to widen the margins:

```scala
val vacancies_distance_filtered = vacancies_distance.where(($"distance" <= (lit(1.2) * $"maximum_travel_distance")) && ($"week_hours" >= (lit(0.8) * $"week_hours_min")) && ($"week_hours" <= (lit(1.2) * $"week_hours_max")))
```

As mentioned, these are only **some** of the ways we tried to use the minimum and maximum hours features; in reality we tried many more combinations and multiplication factors. Unfortunately, every way we tried to use these two features decreased the accuracy.

After this, we tried using the candidates' wages, also included in the `profiles` DataFrame. Including `candidate_hour_wage` in `profiles_limited`:

```scala
val profiles_limited = profiles.select("candidate_number", "cand_pc", "maximum_travel_distance", "candidate_hour_wage")
```

We attempted to use this feature in the following way:

```scala
val vacancies_distance_filtered = vacancies_distance.where(($"distance" <= (lit(1.2) * $"maximum_travel_distance")) && ($"request_hour_wage" >= $"candidate_hour_wage"))
```

We again experimented with different multiplication factors to add a margin. Unfortunately, this feature too decreased the accuracy on the validation set.

### Taking more predictions from the ALS

The default number of function predictions per candidate taken from the ALS is 5. We found that increasing this number to 20 improves the accuracy significantly. The increased number of predictions can be thought of as reflecting a somewhat lower 'faith' we have in the performance of the ALS. See the line below for how we did this:

```scala
predictions = take_top_n_grouped(predictions, n=20, sortColumn="prediction", groupColumn="candidate_number")
```

### Wage null values

We use the `normalized_request_hour_wage` feature in the final weighted sum of the prediction. However, not all vacancies disclose the offered hourly wage on their page, so the dataframe ended up containing `null` values in the respective column. Since `x + null = null`, vacancies that have not disclosed their offered wage would get `null` as their final score in the recommender and would most likely not end up in the top 15 at all.

To tackle this problem, we need to replace the `null` values with a numeric value. The question is *what value do we replace them with?* To answer this question we can simply try a few different ones and see which gives the best performance. After conducting this test, we found that the best value for this replacement on the validation set is **0.7**. Conceptually, this value does not make sense, as a vacancy with no disclosed wage offer would then be treated the same as a vacancy offering 70% of the maximum wage in the whole dataframe. So, we decided to stick to an educated (and opinion-based) guess and set this value to **0.4**, since offers with an undisclosed wage seem more suspicious and less attractive than average to us.
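A minimal, hedged sketch of how this replacement could be done with Spark's `DataFrameNaFunctions`; the DataFrame name `predictions_features` is an assumption, and 0.4 is the value we settled on above:

```scala
// Hedged sketch: fill missing normalised wage values with the chosen constant (0.4),
// so that vacancies without a disclosed wage still receive a final score.
// The DataFrame name (predictions_features) is an assumption.
val predictions_filled = predictions_features.na.fill(0.4, Seq("normalized_request_hour_wage"))
```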
### Final results

##### The final accuracy on the validation set that our model achieved was: **~21.4%**

##### While on the test set, the accuracy of the model was: **~18.34%**