CS 335: Matrix Factorization for Movie Recommendations

Dan Sheldon

Movie Recommendations

	Gladiator	Silence of the Lambs	WALL-E	Toy Story
Alice	5	4	1
Bob		5		2
Carol				5
David			5	5
Eve	5	4

What movie should I recommend to Bob?
Will Carol like WALL-E?

Goal: Fill in entries of the “rating matrix”

Problem Setup

\(m\) = # users
\(n\) = # movies
\(r(i,j) =\) rating of user \(i\) for movie \(j\)
\(R = (r(i,j))\) the “rating matrix” (\(m \times n\))

We only get to see some of the entries of the rating matrix and want to fill in the rest.

Our data is a list of \(L\) ratings specified as follows:

\(i_k\): user index of \(k\)th rating
\(j_k\): movie index of \(k\)th rating
\(r(i_k, j_k)\): value of \(k\)th rating (1-5)

Example: in our original example we observed \(L=10\) ratings

\(k\)	\(i_k\)	\(j_k\)	\(r(i_k, j_k)\)	Comment
1	1	1	5	(Alice, Gladiator, 5)
2	1	2	4	(Alice, Silence, 4)
3	1	3	1	(Alice, WALL-E, 1)
4	2	2	5	(Bob, Silence, 5)
…	…	…	…	…
9	5	1	5	(Eve, Gladiator, 5)
10	5	2	4	(Eve, Silence, 4)
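The list-of-ratings format can be derived mechanically from the rating matrix. The following is a small sketch, assuming NumPy, `np.nan` as the marker for an unobserved entry, and a row-major ordering of \(k\) (which agrees with the rows shown in the table; the ordering of the elided rows is an assumption). The variable names `users`, `movies`, and `ratings` are illustrative and are not taken from the starter code.

```python
import numpy as np

# Rating matrix from the opening example; np.nan marks an unobserved entry.
# Rows: Alice, Bob, Carol, David, Eve.
# Columns: Gladiator, Silence of the Lambs, WALL-E, Toy Story.
R = np.array([
    [5,      4,      1,      np.nan],
    [np.nan, 5,      np.nan, 2],
    [np.nan, np.nan, np.nan, 5],
    [np.nan, np.nan, 5,      5],
    [5,      4,      np.nan, np.nan],
])

# Row-major scan over the observed entries (assumed ordering of k).
users, movies = np.nonzero(~np.isnan(R))
ratings = R[users, movies]

# The notes index users and movies from 1, NumPy from 0, so add 1 for display.
for k, (i, j, r) in enumerate(zip(users, movies, ratings), start=1):
    print(k, i + 1, j + 1, int(r))
```

Storing three parallel arrays instead of the full matrix keeps memory proportional to \(L\) rather than \(m \times n\), which matters when most entries are unobserved.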

Matrix Factorization Model

Assume each user has an unknown weight vector \(\mathbf{u}_i \in \mathbb{R}^d\) and each movie has an unknown weight vector \(\mathbf{v}_j \in \mathbb{R}^d\).

The predicted rating is

\[ h(i,j) = u_{i1}v_{j1} + u_{i2} v_{j2} + \ldots + u_{id} v_{jd} = \mathbf{u}_i^T \mathbf{v}_j \]

Interpretation

\(\mathbf{v}_j\) is a vector of features of movie \(j\)
\(\mathbf{u}_i\) is a vector of weights for user \(i\) that describes their preferences for different features
Example: feature 1 describes comedy vs. drama
- \(v_{j1}\) is negative \(\rightarrow\) movie \(j\) is a comedy
- \(v_{j1}\) is positive \(\rightarrow\) movie \(j\) is a drama
- \(u_{i1}\) is negative \(\rightarrow\) user \(i\) prefers comedies
- \(u_{i1}\) is positive \(\rightarrow\) user \(i\) prefers dramas
Example: feature 2 describes whether a movie is geared toward kids or adults

Unlike in previous problems, we don’t observe the features or the weights, so we need to learn both from the observed ratings.

Parameters

\(\mathbf{u}_i \in \mathbb{R}^d\), \(i=1,\ldots, m\)
\(\mathbf{v}_j \in \mathbb{R}^d\), \(j=1,\ldots, n\)

Learning problem

Find parameters such that \(h(i_k, j_k) = \mathbf{u}_{i_k}^T \mathbf{v}_{j_k} \approx r(i_k, j_k)\) for \(k = 1,\ldots,L\) (the training data) and take appropriate measures to not overfit.

Why is This Called Matrix Factorization?

Place the user weight vectors \(\mathbf{u}_i\) into the rows of a matrix \(U\) and the movie feature vectors \(\mathbf{v}_j\) into the rows of a matrix \(V\)

\[ U = \begin{bmatrix} -\mathbf{u}_1^T -\\ -\mathbf{u}_2^T -\\ \ldots \\ -\mathbf{u}_m^T -\\ \end{bmatrix} \in \mathbb{R}^{m \times d} \qquad V = \begin{bmatrix} -\mathbf{v}_1^T -\\ -\mathbf{v}_2^T -\\ \ldots \\ -\mathbf{v}_n^T -\\ \end{bmatrix} \in \mathbb{R}^{n \times d} \]
Consider the product \(U V^T\):

\[ \boxed{ \begin{array}{c} \\ U \\ \\ \end{array} } \boxed{ \begin{array}{c} \ \ \ V^T \ \ \ \end{array} } \]
It is easy to check that the \((i,j)\) entry of \(UV^T\) is equal to \(\mathbf{u}_i^T \mathbf{v}_j\), which is our prediction for the \((i,j)\) entry of \(R\)
In other words, our model is that \(R \approx U V^T\) (a factorization of \(R\))
We choose \(U\) and \(V\) to get good predictions for those entries of \(R\) that we can observe. As long as we don’t overfit, this gives us power to generalize to entries we don’t observe
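The claim that the \((i,j)\) entry of \(UV^T\) equals \(\mathbf{u}_i^T \mathbf{v}_j\) can be checked numerically. This is a minimal sketch with random matrices; the sizes (\(m=5\), \(n=4\), \(d=2\)) and the helper name `h` are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 4, 2            # users, movies, number of features (arbitrary here)

U = rng.normal(size=(m, d))  # row i holds u_i^T
V = rng.normal(size=(n, d))  # row j holds v_j^T

def h(i, j):
    """Predicted rating u_i^T v_j, using 0-based indices."""
    return U[i] @ V[j]

R_hat = U @ V.T              # all m * n predictions in one matrix product

print(R_hat.shape)
print(all(np.isclose(R_hat[i, j], h(i, j)) for i in range(m) for j in range(n)))
```

Note that `U @ V.T` has rank at most \(d\), so the model approximates \(R\) by a low-rank matrix.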

Your Job: Solve the Learning Problem

Formulate a squared error cost function for this model
Add regularization for both the user weight vectors \(\mathbf{u}_i\) and the movie feature vectors \(\mathbf{v}_j\)
Write down the partial derivatives of your cost function with respect to the entries of \(\mathbf{u}_i\) and \(\mathbf{v}_j\)
Plug the partial derivatives into stochastic gradient descent (SGD) and write down the update rule
Implement SGD
Tune parameters to get good performance on the validation set
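One possible way to carry out these steps is sketched below; it is not the intended solution and is not taken from the starter code. It assumes the cost \( J = \frac{1}{2}\sum_{k=1}^{L} (\mathbf{u}_{i_k}^T \mathbf{v}_{j_k} - r(i_k,j_k))^2 + \frac{\lambda}{2}\big(\sum_i \|\mathbf{u}_i\|^2 + \sum_j \|\mathbf{v}_j\|^2\big) \) and, for each sampled rating with error \(e = \mathbf{u}_i^T \mathbf{v}_j - r(i,j)\), the updates \(\mathbf{u}_i \leftarrow \mathbf{u}_i - \alpha(e\,\mathbf{v}_j + \lambda \mathbf{u}_i)\) and \(\mathbf{v}_j \leftarrow \mathbf{v}_j - \alpha(e\,\mathbf{u}_i + \lambda \mathbf{v}_j)\). The function name `sgd_mf`, the default values of `alpha`, `lam`, `d`, and `epochs`, the initialization scale, and the choice to apply the regularization term at every sampled rating are all assumptions.

```python
import numpy as np

def sgd_mf(users, movies, ratings, m, n, d=2, alpha=0.02, lam=0.01,
           epochs=1000, seed=0):
    """Fit U (m x d) and V (n x d) by SGD on regularized squared error."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(m, d))   # small random initialization (assumed)
    V = 0.1 * rng.normal(size=(n, d))
    for _ in range(epochs):
        # Visit the observed ratings in a fresh random order each epoch.
        for k in rng.permutation(len(ratings)):
            i, j = users[k], movies[k]
            err = U[i] @ V[j] - ratings[k]
            # Compute both gradients before updating either vector.
            grad_u = err * V[j] + lam * U[i]
            grad_v = err * U[i] + lam * V[j]
            U[i] -= alpha * grad_u
            V[j] -= alpha * grad_v
    return U, V

# The 10 observed ratings of the opening example, with 0-based indices.
users = np.array([0, 0, 0, 1, 1, 2, 3, 3, 4, 4])
movies = np.array([0, 1, 2, 1, 3, 3, 2, 3, 0, 1])
ratings = np.array([5, 4, 1, 5, 2, 5, 5, 5, 5, 4], dtype=float)

U, V = sgd_mf(users, movies, ratings, m=5, n=4)
pred = np.sum(U[users] * V[movies], axis=1)   # predictions on training entries
print(np.round(pred, 2))
print(np.round(U @ V.T, 2))                   # filled-in rating matrix
```

With only 10 ratings the model can fit the training data almost exactly, so the filled-in entries here say little about generalization; on the real data set, `alpha`, `lam`, `d`, and `epochs` would be chosen using the validation set.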

Logistics

Groups of up to 3—work with someone new
Any resources allowed: talk to me, other groups, etc.
Submit predictions on test set
Evaluation: root-mean-squared error (RMSE) on test set

\[ \text{RMSE} = \sqrt{\frac{1}{\text{# test}}\sum_{(i,j) \in \text{test set}} (h(i, j) - r(i,j))^2}\]
Worth 1/3 of regular homework

Members of winning team (lowest RMSE) get 1 free late day
Due date announced at end of class

RMSE	grade
<= 1.0	80%
<= 0.97	90%
<= 0.95	95%
<= 0.94	100%
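The RMSE used for evaluation can be computed in a few lines. This is a small sketch assuming NumPy; the function name `rmse` and the example numbers are illustrative only and do not come from the course data.

```python
import numpy as np

def rmse(predicted, actual):
    """Root-mean-squared error between predicted and actual ratings."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Three made-up predictions with errors of -1, 0, and +1.
print(rmse([4.0, 3.0, 5.0], [5, 3, 4]))
```

The same function can be applied to a held-out validation set while tuning, so that the test set is only used for the final predictions.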

Demo

Data and Code

Link to starter code and data

Further Reading

Matrix Factorization Techniques for Recommender Systems by Yehuda Koren, Robert Bell and Chris Volinsky

Authors were on the winning team of Netflix prize
Paper includes algorithms—but beware different notation

Model Extensions

Once you nail the matrix factorization model, here are some ideas to get even better performance.

Biases only baseline

A simpler model that helps introduce important ideas is the “biases” only model. This has an overall baseline score \(\mu\) and an offset (or “bias”) \(a_i\) for each user as well as a bias \(b_j\) for each movie. The model is:

\[ h(i,j) = \mu + a_i + b_j \]

For example

Suppose the overall average rating is \(\mu = 3.8\)
However, Alice loves movies, so her bias is \(a_1 = +0.4\)
Bob is hard to please, so his bias is \(a_2 = -0.7\)
Silence of the Lambs is hard to watch: \(b_2 = -0.4\)
Everyone loves WALL-E: \(b_3 = +1.3\)
Etc.
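Plugging the example values into \(h(i,j) = \mu + a_i + b_j\) gives concrete predictions. The sketch below uses only the numbers listed above; the dictionary layout and the function name `h` are illustrative choices.

```python
mu = 3.8                                            # overall average rating
a = {"Alice": 0.4, "Bob": -0.7}                     # user biases
b = {"Silence of the Lambs": -0.4, "WALL-E": 1.3}   # movie biases

def h(user, movie):
    """Biases-only prediction: mu + a_i + b_j."""
    return mu + a[user] + b[movie]

for user in a:
    for movie in b:
        print(f"{user:5s} {movie:20s} {h(user, movie):.1f}")
```

Alice's predicted rating for WALL-E is \(3.8 + 0.4 + 1.3 = 5.5\), which is above the 1-5 scale: the model itself does not constrain its output, so clipping predictions to the valid range is one option to consider.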

Parameters

\(\mu\)
\(a_i\), \(i=1,\ldots, m\)
\(b_j\), \(j=1,\ldots, n\)

To learn these parameters, write down the partial derivatives of the cost function with respect to \(\mu\), \(a_i\), and \(b_j\) and plug them into stochastic gradient descent.
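A possible SGD loop for the biases-only model is sketched below. It assumes a squared error cost with an L2 penalty on \(a_i\) and \(b_j\) but not on \(\mu\), and it initializes \(\mu\) to the mean training rating; both choices, along with the name `sgd_biases` and the default hyperparameters, are assumptions rather than part of the assignment. With error \(e = \mu + a_i + b_j - r(i,j)\), the partial derivatives are \(e\) for \(\mu\), \(e + \lambda a_i\) for \(a_i\), and \(e + \lambda b_j\) for \(b_j\).

```python
import numpy as np

def sgd_biases(users, movies, ratings, m, n, alpha=0.02, lam=0.01,
               epochs=200, seed=0):
    """Fit mu, a (length m), and b (length n) by SGD."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean(ratings))   # assumed starting point for the baseline
    a = np.zeros(m)
    b = np.zeros(n)
    for _ in range(epochs):
        for k in rng.permutation(len(ratings)):
            i, j = users[k], movies[k]
            err = mu + a[i] + b[j] - ratings[k]
            mu -= alpha * err
            a[i] -= alpha * (err + lam * a[i])
            b[j] -= alpha * (err + lam * b[j])
    return mu, a, b

# The 10 observed ratings of the opening example, with 0-based indices.
users = np.array([0, 0, 0, 1, 1, 2, 3, 3, 4, 4])
movies = np.array([0, 1, 2, 1, 3, 3, 2, 3, 0, 1])
ratings = np.array([5, 4, 1, 5, 2, 5, 5, 5, 5, 4], dtype=float)

mu, a, b = sgd_biases(users, movies, ratings, m=5, n=4)
pred = mu + a[users] + b[movies]
print(round(mu, 2), np.round(a, 2), np.round(b, 2))
```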

Matrix Factorization + Biases

The biases only model can be incorporated into the matrix factorization model to improve performance:

\[ h(i,j) = \mu + a_i + b_j + \mathbf{u}_i^T \mathbf{v}_j \]

Parameters

\(\mu\)
\(a_i\), \(i=1,\ldots, m\)
\(b_j\), \(j=1,\ldots, n\)
\(\mathbf{u}_i \in \mathbb{R}^d\), \(i=1,\ldots, m\)
\(\mathbf{v}_j \in \mathbb{R}^d\), \(j=1,\ldots, n\)

To learn these parameters, combine the partial derivatives from the basic matrix factorization model with those from the biases only model and update them all within stochastic gradient descent.
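A single SGD update for the combined model might look like the following sketch. It assumes the same regularized squared error cost as in the two simpler models, with no penalty on \(\mu\); the names `predict` and `sgd_step` and the default `alpha` and `lam` are hypothetical.

```python
import numpy as np

def predict(mu, a, b, U, V, i, j):
    """h(i, j) = mu + a_i + b_j + u_i^T v_j, using 0-based indices."""
    return mu + a[i] + b[j] + U[i] @ V[j]

def sgd_step(mu, a, b, U, V, i, j, r, alpha=0.02, lam=0.01):
    """One SGD update for one observed rating r of user i for movie j.

    a, b, U, and V are updated in place; the new mu is returned.
    """
    err = predict(mu, a, b, U, V, i, j) - r
    u_old = U[i].copy()                 # keep the old u_i for the v_j update
    mu -= alpha * err
    a[i] -= alpha * (err + lam * a[i])
    b[j] -= alpha * (err + lam * b[j])
    U[i] -= alpha * (err * V[j] + lam * U[i])
    V[j] -= alpha * (err * u_old + lam * V[j])
    return mu

# One update on the rating (Alice, WALL-E, 1) from the opening example.
rng = np.random.default_rng(0)
mu, a, b = 4.1, np.zeros(5), np.zeros(4)
U, V = 0.1 * rng.normal(size=(5, 2)), 0.1 * rng.normal(size=(4, 2))

err_before = predict(mu, a, b, U, V, 0, 2) - 1.0
mu = sgd_step(mu, a, b, U, V, 0, 2, 1.0)
err_after = predict(mu, a, b, U, V, 0, 2) - 1.0
print(abs(err_after) < abs(err_before))
```

Calling `sgd_step` inside a loop over shuffled ratings, as in the earlier models, gives the full training procedure.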