Recommendation System

These are my personal notes for the Machine Learning course by Stanford. Feel free to check the assignments.
Also, if you want to read my other notes, feel free to check them at my blog.

I) Introduction

A recommender system or a recommendation system is a subclass of information filtering system that seeks to predict the "rating" or "preference" a user would give to an item.

Problem Formulation: (Predicting Movie Ratings case)

Let's define:

$n_{u}$ = number of users.
$n_{m}$ = number of movies.
$r(i, j) = 1$ if user $j$ has rated movie $i$.
$y(i, j)$ is the rating given by user $j$ to movie $i$, defined only if $r(i, j) = 1$.
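As a rough illustration of this notation, the ratings can be stored as a matrix with one row per movie and one column per user. The sketch below assumes NumPy and uses `np.nan` to mark missing ratings; the names `Y` and `R` and the sample values are illustrative choices, not something prescribed by the course.

```python
import numpy as np

# Hypothetical ratings: rows are movies (i), columns are users (j).
# np.nan marks a movie that the user has not rated.
Y = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [4.0, np.nan, np.nan, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, 0.0],
])

# R[i, j] is True exactly when r(i, j) = 1.
R = ~np.isnan(Y)

# n_m = number of movies, n_u = number of users.
n_m, n_u = Y.shape

# Number of (movie, user) pairs for which y(i, j) is defined.
n_ratings = int(R.sum())
print(n_m, n_u, n_ratings)
```

Keeping the mask `R` separate from `Y` makes it easy to restrict any sum to the pairs where $r(i, j) = 1$.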

The objective of the recommender system is to use the ratings users have already given to predict the rating a user would give to an item they have not rated yet.

There are 2 methods to do so:

Content-based filtering.
Collaborative filtering.

II) Content-based filtering

Let's define:

$x^{(i)}$ = feature vector for movie $i$ (in our case, the movie type: romance, action ...).
$θ^{(j)}$ = parameter vector for user $j$ (user preferences: do they like romance more than action?).
$m^{(j)}$ = number of movies rated by user $j$.

The goal of this method is to predict a user's rating for a non-rated movie based on the movie's characteristics. For example, when a friend asks you for a book recommendation, it's pretty natural to ask what kinds of books they have read and liked. From there, you can think of a few titles whose characteristics (fantasy, sci-fi ...) are similar to the books they've read and liked in the past (their past book reviews).

So, in our case, it means that you need to have beforehand:

the movie characteristics, $x^{(i)}$.
the users' past movie ratings, $y$.

With these, you can find the user preferences, denoted $θ$. Once you have the user preferences, it becomes easy to predict the user's rating for a non-rated movie.

Suppose user $j$ has rated $m^{(j)}$ movies. Learning $θ^{(j)}$ can then be treated as a linear regression problem. So, to learn $θ^{(j)}$, we solve:

$$\min_{θ^{(j)}} \frac{1}{2} \sum_{i : r(i, j) = 1} {[(θ^{(j)})^{T} x^{(i)} - y(i, j)]}^{2} + \frac{λ}{2} \sum_{k = 1}^{n} (θ_{k}^{(j)})^{2}$$

To get the parameters for all our users, we do the following:

$$\min_{θ^{(1)}, \dots, θ^{(n_{u})}} \frac{1}{2} \sum_{j = 1}^{n_{u}} \sum_{i : r(i, j) = 1} {[(θ^{(j)})^{T} x^{(i)} - y(i, j)]}^{2} + \frac{λ}{2} \sum_{j = 1}^{n_{u}} \sum_{k = 1}^{n} (θ_{k}^{(j)})^{2}$$

The cost function is then:

$$J(θ^{(1)}, \dots, θ^{(n_{u})}) = \frac{1}{2} \sum_{j = 1}^{n_{u}} \sum_{i : r(i, j) = 1} {[(θ^{(j)})^{T} x^{(i)} - y(i, j)]}^{2} + \frac{λ}{2} \sum_{j = 1}^{n_{u}} \sum_{k = 1}^{n} (θ_{k}^{(j)})^{2}$$

The gradient descent update is then:

$$\begin{array}{ll} θ_{k}^{(j)} := θ_{k}^{(j)} - α \left( \sum_{i : r(i, j) = 1} [(θ^{(j)})^{T} x^{(i)} - y(i, j)] x_{k}^{(i)} \right) & \text{for } k = 0 \text{ (bias/intercept term)} \\ θ_{k}^{(j)} := θ_{k}^{(j)} - α \left( \sum_{i : r(i, j) = 1} [(θ^{(j)})^{T} x^{(i)} - y(i, j)] x_{k}^{(i)} + λ θ_{k}^{(j)} \right) & \text{otherwise} \end{array}$$
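The update rule for a single user can be sketched in NumPy as follows. This is a minimal illustration rather than the course's reference implementation: the function name `fit_user`, the feature values, and the hyperparameters (`lam`, `alpha`, `n_iters`) are all assumptions made for the example.

```python
import numpy as np

def fit_user(X, y, rated, lam=0.1, alpha=0.01, n_iters=2000):
    """Learn theta for one user with batch gradient descent.

    X     : (n_m, n + 1) movie features, column 0 is the bias feature (all ones)
    y     : (n_m,) ratings by this user (ignored where `rated` is False)
    rated : (n_m,) boolean mask, True where r(i, j) = 1
    """
    Xr, yr = X[rated], y[rated]          # keep only the rated movies
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        err = Xr @ theta - yr            # theta^T x^(i) - y(i, j)
        grad = Xr.T @ err                # sum over rated movies
        grad[1:] += lam * theta[1:]      # regularize every term except the bias
        theta -= alpha * grad
    return theta

# Hypothetical features: columns are [bias, romance, action].
X = np.array([
    [1.0, 0.9, 0.0],
    [1.0, 1.0, 0.1],
    [1.0, 0.1, 1.0],
    [1.0, 0.0, 0.9],
    [1.0, 1.0, 0.0],   # romance movie, not rated by this user
    [1.0, 0.0, 1.0],   # action movie, not rated by this user
])
y = np.array([5.0, 5.0, 0.0, 0.0, 0.0, 0.0])
rated = np.array([True, True, True, True, False, False])

theta = fit_user(X, y, rated)
pred_romance = X[4] @ theta   # predicted rating for the unrated romance movie
pred_action = X[5] @ theta    # predicted rating for the unrated action movie
print(pred_romance, pred_action)
```

Since this user rated the romance movies highly and the action movies poorly, the unrated romance movie should receive the higher prediction.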

Remark: The effectiveness of content-based recommendation depends on identifying the features $x^{(i)}$ properly, which is often not easy.

III) Collaborative filtering

Collaborative filtering has the intrinsic property of feature learning (it can learn by itself what features to use) which helps overcome drawbacks of content-based recommender systems.

Given the users' past movie ratings $y$ and the parameter vectors $θ$, the algorithm learns the values of the features $x^{(i)}$ by applying linear regression:

$$\min_{x^{(i)}} \frac{1}{2} \sum_{j : r(i, j) = 1} {[(θ^{(j)})^{T} x^{(i)} - y(i, j)]}^{2} + \frac{λ}{2} \sum_{k = 1}^{n} {(x_{k}^{(i)})}^{2}$$

Intuitively, given a movie's ratings by various users ($y$) and the user preferences $θ$, the collaborative filtering algorithm tries to find the features that best represent the movie, so that the squared error between the predicted and the actual ratings is minimized.

Since this is very similar to the linear regression problem, a regularization term is introduced to prevent overfitting of the learned features. Extending this, it is possible to learn the features for all the movies $i \in [1, n_{m}]$. Thus,

$$\min_{x^{(1)}, \dots, x^{(n_{m})}} \frac{1}{2} \sum_{i = 1}^{n_{m}} \sum_{j : r(i, j) = 1} {[(θ^{(j)})^{T} x^{(i)} - y(i, j)]}^{2} + \frac{λ}{2} \sum_{i = 1}^{n_{m}} \sum_{k = 1}^{n} {(x_{k}^{(i)})}^{2}$$

where the gradient descent update on the features is:

$$x_{k}^{(i)} := x_{k}^{(i)} - α \left( \sum_{j : r(i, j) = 1} [(θ^{(j)})^{T} x^{(i)} - y(i, j)] θ_{k}^{(j)} + λ x_{k}^{(i)} \right)$$

But it is also possible to solve for both $θ$ and $x$ simultaneously, using an update rule that is nothing but the combination of the earlier two update rules. Thus,

$$\begin{array}{ll} J(x^{(1)}, \dots, x^{(n_{m})}, θ^{(1)}, \dots, θ^{(n_{u})}) = & \frac{1}{2} \sum_{(i, j) : r(i, j) = 1} {[(θ^{(j)})^{T} x^{(i)} - y(i, j)]}^{2} \\ & + \frac{λ}{2} \sum_{i = 1}^{n_{m}} \sum_{k = 1}^{n} (x_{k}^{(i)})^{2} \\ & + \frac{λ}{2} \sum_{j = 1}^{n_{u}} \sum_{k = 1}^{n} (θ_{k}^{(j)})^{2} \end{array}$$

Here $\sum_{(i, j) : r(i, j) = 1}$ is equivalent to looping through all the data where $r(i, j) = 1$.

The minimization objective can be written as:

$$\min_{x^{(1)}, \dots, x^{(n_{m})}, θ^{(1)}, \dots, θ^{(n_{u})}} J(x^{(1)}, \dots, x^{(n_{m})}, θ^{(1)}, \dots, θ^{(n_{u})})$$
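The joint cost function translates fairly directly into vectorized NumPy. The sketch below assumes the movie features are stacked row-wise in a matrix `X` of shape $(n_m, n)$ and the user parameters in a matrix `Theta` of shape $(n_u, n)$; the function name `cofi_cost` and the tiny sample data are illustrative only.

```python
import numpy as np

def cofi_cost(X, Theta, Y, R, lam):
    """Collaborative filtering cost J.

    X     : (n_m, n) movie features, one row per movie
    Theta : (n_u, n) user parameters, one row per user
    Y     : (n_m, n_u) ratings (values where R is False are ignored)
    R     : (n_m, n_u) boolean mask, True where r(i, j) = 1
    """
    # Prediction error, forced to 0 wherever no rating exists.
    err = np.where(R, X @ Theta.T - Y, 0.0)
    squared_error = 0.5 * np.sum(err ** 2)
    regularization = 0.5 * lam * (np.sum(X ** 2) + np.sum(Theta ** 2))
    return squared_error + regularization

# Tiny hypothetical example: 2 movies, 2 users, n = 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
Theta = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([[2.0, np.nan], [0.0, 3.0]])
R = ~np.isnan(Y)

J = cofi_cost(X, Theta, Y, R, lam=1.0)
print(J)
```

Using `np.where` with the mask, rather than multiplying by `R`, keeps missing ratings stored as `np.nan` from leaking into the sum.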

Remark: Because the algorithm can learn the features by itself, the bias feature $x_{0} = 1$ and its matching parameter $θ_{0}$ have been removed, therefore $x \in \mathbb{R}^{n}$ and $θ \in \mathbb{R}^{n}$.

To summarize, the collaborative filtering algorithm has the following steps:

Initialize $x^{(1)}, \dots, x^{(n_{m})}, θ^{(1)}, \dots, θ^{(n_{u})}$ to small random values.
Minimize $J(x^{(1)}, \dots, x^{(n_{m})}, θ^{(1)}, \dots, θ^{(n_{u})})$ using gradient descent or any other advanced optimization algorithm. The update rules given below are obtained by taking the partial derivatives of $J$ with respect to each $x_{k}^{(i)}$ and $θ_{k}^{(j)}$:
$\begin{array}{ll} x_{k}^{(i)} & := x_{k}^{(i)} - α \left( \sum_{j : r(i, j) = 1} [(θ^{(j)})^{T} x^{(i)} - y(i, j)] θ_{k}^{(j)} + λ x_{k}^{(i)} \right) \\ θ_{k}^{(j)} & := θ_{k}^{(j)} - α \left( \sum_{i : r(i, j) = 1} [(θ^{(j)})^{T} x^{(i)} - y(i, j)] x_{k}^{(i)} + λ θ_{k}^{(j)} \right) \end{array}$
For a user with parameters $θ$ and a movie with learned features $x$, the predicted star rating is given by $θ^{T} x$.
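The three steps above can be sketched as a short NumPy routine. This is only an illustration under stated assumptions: plain batch gradient descent with a fixed learning rate, hypothetical names (`cofi_fit`, `X`, `Theta`), and arbitrary hyperparameters and sample ratings.

```python
import numpy as np

def cofi_fit(Y, R, n=3, lam=0.1, alpha=0.01, n_iters=5000, seed=0):
    """Learn movie features X (n_m, n) and user parameters Theta (n_u, n)."""
    rng = np.random.default_rng(seed)
    n_m, n_u = Y.shape
    # Step 1: small random initial values (this breaks the symmetry
    # that an all-zero initialization would have).
    X = 0.1 * rng.standard_normal((n_m, n))
    Theta = 0.1 * rng.standard_normal((n_u, n))
    # Step 2: simultaneous gradient descent on X and Theta.
    for _ in range(n_iters):
        err = np.where(R, X @ Theta.T - Y, 0.0)   # 0 where r(i, j) = 0
        X_grad = err @ Theta + lam * X
        Theta_grad = err.T @ X + lam * Theta
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta

# Hypothetical ratings: rows are movies, columns are users, nan = not rated.
Y = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [4.0, np.nan, np.nan, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, 0.0],
])
R = ~np.isnan(Y)

X, Theta = cofi_fit(Y, R)

# Step 3: predicted rating of movie i by user j is theta^(j) . x^(i).
Y_pred = X @ Theta.T
rmse = np.sqrt(np.mean((Y_pred[R] - Y[R]) ** 2))   # error on rated entries only
print(np.round(Y_pred, 2))
print(rmse)
```

Both gradients are computed from the same `err` before either matrix is modified, which is what makes the update simultaneous.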

Consequently, the matrix $Y$ of all predicted ratings of all movies by all users can be written as:

$$Y = \begin{bmatrix} (θ^{(1)})^{T} x^{(1)} & (θ^{(2)})^{T} x^{(1)} & \dots & (θ^{(n_{u})})^{T} x^{(1)} \\ (θ^{(1)})^{T} x^{(2)} & (θ^{(2)})^{T} x^{(2)} & \dots & (θ^{(n_{u})})^{T} x^{(2)} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ (θ^{(1)})^{T} x^{(n_{m})} & (θ^{(2)})^{T} x^{(n_{m})} & \dots & (θ^{(n_{u})})^{T} x^{(n_{m})} \end{bmatrix}$$

where $y(i, j)$ is the rating for movie $i$ by user $j$.
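In code, this whole matrix is a single matrix product. The snippet below is a small sketch that assumes the features are stacked row-wise in `X` and the user parameters row-wise in `Theta`; both names and the values are made up for the example.

```python
import numpy as np

# Hypothetical learned values with n = 2 features.
X = np.array([[1.0, 0.0],       # x^(1)
              [0.0, 1.0],       # x^(2)
              [1.0, 1.0]])      # x^(3)
Theta = np.array([[5.0, 0.0],   # theta^(1)
                  [0.0, 5.0]])  # theta^(2)

# Y_pred[i, j] = (theta^(j))^T x^(i): one row per movie, one column per user.
Y_pred = X @ Theta.T
print(Y_pred)
```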

IV) Implementation detail

How to handle the case where a user has not rated any movies?

Suppose user 5 (Eve) has not rated any movie. Her contribution to the squared-error term of our cost function is

$$\frac{1}{2} \sum_{i : r(i, 5) = 1} {[(θ^{(5)})^{T} x^{(i)} - y(i, 5)]}^{2} = 0$$

because the summation applies only to the movies the user has rated, and there are none. Thus, the only part of $J$ that depends on $θ^{(5)}$ is the regularization term:

$$\frac{λ}{2} \sum_{k = 1}^{n} (θ_{k}^{(5)})^{2}$$

When minimizing our cost function, we will find $θ^{(5)}$ equal to the zero vector, because the only term that could pull it away from 0 is itself equal to 0 (see above). Thus, every predicted rating for Eve will be equal to 0 ($(θ^{(5)})^{T} x^{(i)} = 0$), which does not seem intuitively correct.

To prevent this, we will do mean normalization. Here is an example:

$$Y = \begin{bmatrix} 5 & 5 & 0 & 0 \\ 4 & ? & ? & 0 \\ 0 & 0 & 5 & 4 \\ 0 & 0 & 5 & 0 \end{bmatrix} \quad μ = \begin{bmatrix} 2.5 \\ 2 \\ 2.25 \\ 1.25 \end{bmatrix}$$

where $μ_{i}$ is the mean of the ratings given to movie $i$:

$$μ_{i} = \frac{\sum_{j : r(i, j) = 1} Y_{i, j}}{\sum_{j} r(i, j)}$$

Then, let's define $Y^{'} = Y - μ$, subtracting $μ_{i}$ from every rated entry of row $i$:

$$Y^{'} = \begin{bmatrix} 2.5 & 2.5 & -2.5 & -2.5 \\ 2 & ? & ? & -2 \\ -2.25 & -2.25 & 2.75 & 1.75 \\ -1.25 & -1.25 & 3.75 & -1.25 \end{bmatrix}$$

We then learn $θ$ and $x$ from $Y^{'}$, and predict the rating of movie $i$ by user $j$ with $(θ^{(j)})^{T} x^{(i)} + μ_{i}$ instead of $(θ^{(j)})^{T} x^{(i)}$. This prediction applies to every pair $(i, j)$, without the condition $r(i, j) = 1$, so a user with no ratings ($θ^{(j)} = 0$) is predicted the mean rating $μ_{i}$ of each movie.
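Mean normalization can be sketched in a few lines of NumPy. The example assumes missing ratings are stored as `np.nan` and that every movie has at least one rating (otherwise its mean is undefined); the random matrix `X` is only a stand-in for learned features.

```python
import numpy as np

# Example ratings: rows are movies, columns are users, nan = not rated.
Y = np.array([
    [5.0, 5.0, 0.0, 0.0],
    [4.0, np.nan, np.nan, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 0.0, 5.0, 0.0],
])
R = ~np.isnan(Y)

# Per-movie mean over the rated entries only.
mu = np.nanmean(Y, axis=1)

# Subtract the movie mean from rated entries; unrated entries are set to 0
# (they are masked out of the cost anyway).
Y_norm = np.where(R, Y - mu[:, None], 0.0)

# A user with no ratings ends up with theta = 0 after training,
# so the prediction falls back to the movie means.
X = np.random.default_rng(0).standard_normal((4, 2))  # stand-in for learned features
theta_new_user = np.zeros(2)
pred_new_user = X @ theta_new_user + mu
print(mu)
print(pred_new_user)
```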

V) Content-based vs Collaborative filtering

A content-based recommendation engine works with existing user profiles. A profile holds information about a user and their taste, where taste is derived from the user's ratings of different items. Generally, when a user creates a profile, the recommendation engine runs a short survey to get initial information about the user and avoid the new-user problem.

In the recommendation process, the engine compares the items the user has already rated positively with the items they have not rated and looks for similarities. Items similar to the positively rated ones are recommended to the user. In this way, a content-based model recommends articles relevant to the user's taste and behavior. This model is efficient and personalized, yet it lacks something.

Let us understand this with an example. Assume there are four categories of news:

Politics
Sports
Entertainment
Technology

and there is a user A who has read articles related to Technology and Politics. The content-based recommendation engine will only recommend articles related to these categories and may never recommend anything in other categories as the user never viewed those articles before.

This problem can be solved using another variant of recommendation algorithm known as Collaborative Filtering.

The idea of collaborative filtering is to find users in a community who share the same tastes. If two users have rated the same, or almost the same, items in a similar way, then they have similar taste. Such users form a group, or so-called neighborhood. A user gets recommendations for items they have not rated before but that were positively rated by users in their neighborhood.

Collaborative filtering has basically 2 approaches:

User-Based Approach: In this approach, the items recommended to a user are based on how users of the same neighborhood, with whom they share common preferences, evaluated those items. If an article was positively rated by the neighborhood, it will be recommended to the user. The articles a user has already rated therefore play an important role in finding a group that shares their tastes.
Item-Based Approach: Assuming that users' tastes remain constant or change only slightly, similar articles form neighborhoods based on how users rated them. The system then generates recommendations from the articles in the neighborhood of those the user already prefers.

Let's try to understand the picture above. Say there are three users A, B and C.

In the user-based approach, users A and C are similar because both of them like Strawberry and Watermelon. User A also likes Grapes and Orange, so the user-based approach will recommend Grapes and Orange to user C.
In the item-based approach, Grapes and Watermelon form a neighborhood of similar items: items that are similar form a neighborhood irrespective of the users. So when user C likes Watermelon, the other item from the same neighborhood, i.e. Grapes, will be recommended by the item-based approach.
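The two approaches can be contrasted on this fruit example with a small NumPy sketch. Several things here are assumptions made for illustration: user B's likes (Grapes and Watermelon) are not given in the text, cosine similarity is just one possible similarity measure, and the neighborhood is reduced to the single most similar user or item.

```python
import numpy as np

users = ["A", "B", "C"]
items = ["Strawberry", "Watermelon", "Grapes", "Orange"]

# likes[u, k] = 1 if user u likes item k. Row B is a hypothetical choice.
likes = np.array([
    [1, 1, 1, 1],   # A
    [0, 1, 1, 0],   # B (assumed)
    [1, 1, 0, 0],   # C
], dtype=float)

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    return unit @ unit.T

c = users.index("C")
unseen = [k for k in range(len(items)) if likes[c, k] == 0]

# User-based: find the user most similar to C, recommend what they like.
user_sim = cosine_sim(likes)[c].copy()
user_sim[c] = -np.inf                       # exclude C itself
neighbor = int(np.argmax(user_sim))
user_based = [items[k] for k in unseen if likes[neighbor, k] == 1]

# Item-based: among items C has not liked, pick the one most similar
# to Watermelon (an item C likes), comparing columns of `likes`.
item_sim = cosine_sim(likes.T)
w = items.index("Watermelon")
item_based = items[max(unseen, key=lambda k: item_sim[w, k])]

print(users[neighbor], user_based, item_based)
```

With these assumed likes, the nearest neighbor of C is A, the user-based recommendation is Grapes and Orange, and the item-based recommendation is Grapes.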
