# Movie Recommendations

Gladiator Silence of the Lambs WALL-E Toy Story
Alice 5 4 1
Bob 5 2
Carol 5
David 5 5
Eve 5 4

What movie should I recommend to Bob?
Will Carol like WALL-E?

Goal: fill in the missing entries of the "rating matrix"

# Problem Setup

• $$m$$ = # users
• $$n$$ = # movies
• $$r(i,j) =$$ rating of user $$i$$ for movie $$j$$
• $$R = (r(i,j))$$ the "rating matrix" ($$m \times n$$)

We only get to see some of the entries of the rating matrix and want to fill in the rest.

Our data is a list of $$L$$ ratings specified as follows:

• $$i_k$$: user index of $$k$$th rating
• $$j_k$$: movie index of $$k$$th rating
• $$r(i_k, j_k)$$: value of $$k$$th rating (1-5)

Example: in our original example we observed $$L=10$$ ratings

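| $$k$$ | $$i_k$$ | $$j_k$$ | $$r(i_k, j_k)$$ | Comment |
|---|---|---|---|---|
| 1 | 1 | 1 | 5 | (Alice, Gladiator, 5) |
| 2 | 1 | 2 | 4 | (Alice, Silence, 4) |
| 3 | 1 | 3 | 1 | (Alice, WALL-E, 1) |
| 4 | 2 | 2 | 5 | (Bob, Silence, 5) |
| ... | ... | ... | ... | ... |
| 9 | 5 | 1 | 5 | (Eve, Gladiator, 5) |
| 10 | 5 | 2 | 4 | (Eve, Silence, 4) |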
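One way to hold this data in code is as a list of `(i, j, r)` triples, which can be scattered into a partially filled matrix. The sketch below uses only the six rows that are written out in the table (the elided rows are left out), and the choice of `NaN` to mark unobserved entries is an assumption for illustration.

```python
import numpy as np

# Only the six rows listed explicitly in the table; the elided rows are omitted.
# Indices are 1-based here, matching the table.
ratings = [
    (1, 1, 5),  # (Alice, Gladiator, 5)
    (1, 2, 4),  # (Alice, Silence, 4)
    (1, 3, 1),  # (Alice, WALL-E, 1)
    (2, 2, 5),  # (Bob, Silence, 5)
    (5, 1, 5),  # (Eve, Gladiator, 5)
    (5, 2, 4),  # (Eve, Silence, 4)
]

m, n = 5, 4  # number of users and movies in the running example

# NaN marks entries of the rating matrix that we have not observed.
R = np.full((m, n), np.nan)
for i, j, r in ratings:
    R[i - 1, j - 1] = r  # shift to 0-based array indices

print(R)
```

For large data sets the triple list is usually the more practical representation, since almost all entries of $$R$$ are unobserved.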

# Matrix Factorization Model

Assume each user has an unknown weight vector $$\mathbf{u}_i \in \mathbb{R}^d$$ and each movie has an unknown weight vector $$\mathbf{v}_j \in \mathbb{R}^d$$.

The predicted rating is

$h(i,j) = u_{i1}v_{j1} + u_{i2} v_{j2} + \ldots + u_{id} v_{jd} = \mathbf{u}_i^T \mathbf{v}_j$
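As a quick check of the formula, the sketch below computes the prediction both as an explicit sum and as an inner product. The vectors are hypothetical values chosen for illustration, with $$d = 2$$.

```python
import numpy as np

d = 2
u_i = np.array([1.5, -0.5])  # hypothetical weight vector for user i
v_j = np.array([2.0, 1.0])   # hypothetical weight vector for movie j

# Explicit sum u_{i1} v_{j1} + ... + u_{id} v_{jd}
h_sum = sum(u_i[t] * v_j[t] for t in range(d))

# The same quantity written as the inner product u_i^T v_j
h_dot = u_i @ v_j

print(h_sum, h_dot)  # both are 2.5
```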

### Interpretation

• $$\mathbf{v}_j$$ is a vector of features of movie $$j$$
• $$\mathbf{u}_i$$ is a vector of weights for user $$i$$ that describes their preferences for different features

• Example: feature 1 describes comedy vs. drama
• $$v_{j1}$$ is negative $$\rightarrow$$ movie $$j$$ is a comedy
• $$v_{j1}$$ is positive $$\rightarrow$$ movie $$j$$ is a drama
• $$u_{i1}$$ is negative $$\rightarrow$$ user $$i$$ prefers comedies
• $$u_{i1}$$ is positive $$\rightarrow$$ user $$i$$ prefers dramas
• Example: feature 2 describes whether a movie is geared toward kids or adults
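To make the sign convention concrete, here is a tiny sketch with a single feature ($$d = 1$$). The names and numbers are made up for illustration. When the user weight and the movie feature have the same sign, the product is positive and pushes the predicted rating up; opposite signs push it down.

```python
import numpy as np

# Hypothetical one-feature example: negative = comedy, positive = drama.
movies = {"comedy": np.array([-1.0]), "drama": np.array([1.0])}
users = {"comedy fan": np.array([-2.0]), "drama fan": np.array([2.0])}

for user_name, u in users.items():
    for movie_name, v in movies.items():
        # Contribution of this feature to the predicted rating.
        print(f"{user_name} / {movie_name}: {u @ v:+.1f}")
```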

Unlike in previous problems, we observe neither the features nor the weights, and need to learn both from the observed ratings.

### Parameters

• $$\mathbf{u}_i \in \mathbb{R}^d$$, $$i=1,\ldots, m$$
• $$\mathbf{v}_j \in \mathbb{R}^d$$, $$j=1,\ldots, n$$

### Learning problem

Find parameters such that $$h(i_k, j_k) = \mathbf{u}_{i_k}^T \mathbf{v}_{j_k} \approx r(i_k, j_k)$$ for $$k = 1,\ldots,L$$ (the training data), while taking appropriate measures to avoid overfitting.

# Why is This Called Matrix Factorization?

• Place the user weight vectors $$\mathbf{u}_i$$ into the rows of a matrix $$U$$ and the movie feature vectors $$\mathbf{v}_j$$ into the rows of a matrix $$V$$

$U = \begin{bmatrix} -\mathbf{u}_1^T -\\ -\mathbf{u}_2^T -\\ \ldots \\ -\mathbf{u}_m^T -\\ \end{bmatrix} \in \mathbb{R}^{m \times d} \qquad V = \begin{bmatrix} -\mathbf{v}_1^T -\\ -\mathbf{v}_2^T -\\ \ldots \\ -\mathbf{v}_n^T -\\ \end{bmatrix} \in \mathbb{R}^{n \times d}$

• Consider the product $$U V^T$$:

$\boxed{ \begin{array}{c} \\ U \\ \\ \end{array} } \boxed{ \begin{array}{c} \ \ \ V^T \ \ \ \end{array} }$

• It is easy to check that the $$(i,j)$$ entry of $$UV^T$$ is equal to $$\mathbf{u}_i^T \mathbf{v}_j$$, which is our prediction for the $$(i,j)$$ entry of $$R$$

• In other words, our model is that $$R \approx U V^T$$ (a factorization of $$R$$)

• We choose $$U$$ and $$V$$ to get good predictions for those entries of $$R$$ that we can observe. As long as we don't overfit, this gives us power to generalize to entries we don't observe
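The claim about the $$(i,j)$$ entry of $$UV^T$$ can be checked numerically. The sketch below fills $$U$$ and $$V$$ with random values (the sizes follow the running example, and $$d = 2$$ is an arbitrary choice) and compares one entry of the product against the corresponding inner product.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 4, 2            # sizes from the running example; d is arbitrary
U = rng.normal(size=(m, d))  # row i holds u_i^T
V = rng.normal(size=(n, d))  # row j holds v_j^T

H = U @ V.T                  # all m * n predictions at once, shape (m, n)

i, j = 1, 2                  # 0-based indices, chosen arbitrarily
assert np.isclose(H[i, j], U[i] @ V[j])
print(H.shape)
```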

# Your Job: Solve the Learning Problem

• Formulate a squared error cost function for this model
• Add regularization for both the user weight vectors $$\mathbf{u}_i$$ and the movie feature vectors $$\mathbf{v}_j$$
• Write down the partial derivatives of your cost function with respect to the entries of $$\mathbf{u}_i$$ and $$\mathbf{v}_j$$
• Plug the partial derivatives into stochastic gradient descent (SGD) and write down the update rule
• Implement SGD
• Tune the hyperparameters (for example $$d$$, the step size, and the regularization strength) to get good performance on the validation set
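For orientation, here is one possible sketch of these steps, not a reference solution. It assumes a per-rating cost of $$\tfrac{1}{2}(\mathbf{u}_i^T \mathbf{v}_j - r(i,j))^2 + \tfrac{\lambda}{2}(\|\mathbf{u}_i\|^2 + \|\mathbf{v}_j\|^2)$$. The function name `sgd_mf`, the step size `alpha`, the regularization strength `lam`, the initialization scale, and all default values are assumptions chosen for illustration.

```python
import numpy as np

def sgd_mf(ratings, m, n, d=2, alpha=0.01, lam=0.1, epochs=200, seed=0):
    """Fit U (m x d) and V (n x d) by SGD on regularized squared error.

    ratings: list of (i, j, r) triples with 0-based indices.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(m, d))  # small random initialization
    V = 0.1 * rng.normal(size=(n, d))
    for _ in range(epochs):
        # Visit the observed ratings in a fresh random order each epoch.
        for k in rng.permutation(len(ratings)):
            i, j, r = ratings[k]
            err = U[i] @ V[j] - r              # h(i, j) - r(i, j)
            grad_u = err * V[j] + lam * U[i]   # gradient w.r.t. u_i
            grad_v = err * U[i] + lam * V[j]   # gradient w.r.t. v_j
            U[i] -= alpha * grad_u
            V[j] -= alpha * grad_v
    return U, V

# The six ratings written out in the example table, shifted to 0-based indices.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (0, 2, 1.0),
           (1, 1, 5.0), (4, 0, 5.0), (4, 1, 4.0)]
U, V = sgd_mf(ratings, m=5, n=4)
train_rmse = np.sqrt(np.mean([(U[i] @ V[j] - r) ** 2 for i, j, r in ratings]))
print(f"training RMSE: {train_rmse:.3f}")
```

Both gradients are computed before either vector is updated, so the update for $$\mathbf{v}_j$$ uses the old value of $$\mathbf{u}_i$$. A low training error on its own says nothing about overfitting; that is what the validation set is for.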

## Logistics

• Groups of up to 3; work with someone new
• Any resources allowed: talk to me, other groups, etc.
• Submit predictions on test set
• Evaluation: root-mean-square error (RMSE) on test set

$\text{RMSE} = \sqrt{\frac{1}{\text{# test}}\sum_{(i,j) \in \text{test set}} (h(i, j) - r(i,j))^2}$

• Worth 1/3 of regular homework

| Test RMSE | Score |
|---|---|
| <= 1.0 | 80% |
| <= 0.97 | 90% |
| <= 0.95 | 95% |
| <= 0.94 | 100% |

• Members of winning team (lowest RMSE) get 1 free late day

• Due date announced at end of class
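The RMSE used for evaluation can be computed in a few lines. The sketch below is a direct transcription of the formula above; the function name `rmse` and the example numbers are made up for illustration.

```python
import numpy as np

def rmse(predictions, targets):
    """Root-mean-square error between predicted and true ratings."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.sqrt(np.mean((predictions - targets) ** 2))

# Hypothetical predictions and true ratings for three test pairs.
# The errors are -1, 0, and +1, so the result is sqrt(2/3), about 0.816.
print(rmse([4.0, 3.0, 5.0], [5.0, 3.0, 4.0]))
```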

# Data and Code

Link to starter code and data

Matrix Factorization Techniques for Recommender Systems by Yehuda Koren, Robert Bell and Chris Volinsky

• Authors were on the winning team of Netflix prize

• Paper includes algorithms, but beware of the different notation

# Model Extensions

Once you nail the matrix factorization model, here are some ideas to get even better performance.

## Biases-only baseline

A simpler model that helps introduce important ideas is the "biases-only" model. It has an overall baseline score $$\mu$$, an offset (or "bias") $$a_i$$ for each user, and a bias $$b_j$$ for each movie. The model is:

$h(i,j) = \mu + a_i + b_j$

For example

• Suppose the overall average rating is $$\mu = 3.8$$
• However, Alice loves movies, so her bias is $$a_1 = +0.4$$
• Bob is hard to please, so his bias is $$a_2 = -0.7$$
• Silence of the Lambs is hard to watch: $$b_2 = -0.4$$
• Everyone loves WALL-E: $$b_3 = +1.3$$
• Etc.
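Plugging the example numbers into $$h(i,j) = \mu + a_i + b_j$$ gives the predictions below. This is a small sketch using only the values listed above; the helper name `predict` is made up.

```python
mu = 3.8                                           # overall average rating
a = {"Alice": 0.4, "Bob": -0.7}                    # user biases from the example
b = {"Silence of the Lambs": -0.4, "WALL-E": 1.3}  # movie biases from the example

def predict(user, movie):
    """Biases-only prediction h(i, j) = mu + a_i + b_j."""
    return mu + a[user] + b[movie]

for user in a:
    for movie in b:
        print(f"{user:5s} {movie:20s} {predict(user, movie):.1f}")
```

For instance, Bob on Silence of the Lambs gives $$3.8 - 0.7 - 0.4 = 2.7$$, while Alice on WALL-E gives $$3.8 + 0.4 + 1.3 = 5.5$$. Nothing in the model as written keeps predictions inside the 1-5 rating scale.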

### Parameters

• $$\mu$$
• $$a_i$$, $$i=1,\ldots, m$$
• $$b_j$$, $$j=1,\ldots, n$$

To learn these parameters, write down the partial derivatives of the cost function with respect to $$\mu$$, $$a_i$$, and $$b_j$$ and plug them into stochastic gradient descent.
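As a sketch of that step, assume a per-rating cost made of squared error plus an L2 penalty on the biases. The penalty, its strength $$\lambda$$, the step size $$\alpha$$, and the shorthand $$e$$ for the prediction error are assumptions, not something fixed by the model above. For a single observed rating of user $$i$$ for movie $$j$$:

$e = \mu + a_i + b_j - r(i,j), \qquad J = \tfrac{1}{2} e^2 + \tfrac{\lambda}{2}\left(a_i^2 + b_j^2\right)$

$\frac{\partial J}{\partial \mu} = e, \qquad \frac{\partial J}{\partial a_i} = e + \lambda a_i, \qquad \frac{\partial J}{\partial b_j} = e + \lambda b_j$

Under these assumptions, one SGD step on that rating would be:

$\mu \leftarrow \mu - \alpha e, \qquad a_i \leftarrow a_i - \alpha (e + \lambda a_i), \qquad b_j \leftarrow b_j - \alpha (e + \lambda b_j)$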

## Matrix Factorization + Biases

The biases-only model can be incorporated into the matrix factorization model to improve performance:

$h(i,j) = \mu + a_i + b_j + \mathbf{u}_i^T \mathbf{v}_j$

### Parameters

• $$\mu$$
• $$a_i$$, $$i=1,\ldots, m$$
• $$b_j$$, $$j=1,\ldots, n$$
• $$\mathbf{u}_i \in \mathbb{R}^d$$, $$i=1,\ldots, m$$
• $$\mathbf{v}_j \in \mathbb{R}^d$$, $$j=1,\ldots, n$$

To learn these parameters, combine the partial derivatives from the basic matrix factorization model with those from the biases-only model and update them all within stochastic gradient descent.
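A single combined update might look like the sketch below. It assumes squared error with an L2 penalty on everything except $$\mu$$; the function name `sgd_step`, the dictionary layout of the parameters, the step size `alpha`, and the regularization strength `lam` are all hypothetical choices for illustration.

```python
import numpy as np

def sgd_step(params, i, j, r, alpha=0.01, lam=0.1):
    """One in-place SGD update for h(i,j) = mu + a_i + b_j + u_i^T v_j.

    Returns the prediction error measured before the update.
    """
    a, b, U, V = params["a"], params["b"], params["U"], params["V"]
    err = params["mu"] + a[i] + b[j] + U[i] @ V[j] - r
    u_old = U[i].copy()  # keep the old u_i for the v_j update
    params["mu"] -= alpha * err
    a[i] -= alpha * (err + lam * a[i])
    b[j] -= alpha * (err + lam * b[j])
    U[i] -= alpha * (err * V[j] + lam * U[i])
    V[j] -= alpha * (err * u_old + lam * V[j])
    return err

rng = np.random.default_rng(0)
m, n, d = 5, 4, 2  # sizes from the running example; d is arbitrary
params = {
    "mu": 0.0,
    "a": np.zeros(m),
    "b": np.zeros(n),
    "U": 0.1 * rng.normal(size=(m, d)),
    "V": 0.1 * rng.normal(size=(n, d)),
}

# Repeatedly update on one rating (user 0, movie 0, rating 5) and watch the error shrink.
errors = [sgd_step(params, i=0, j=0, r=5.0) for _ in range(100)]
print(abs(errors[0]), abs(errors[-1]))
```

In a full training loop this step would be applied to every observed rating in random order, as in plain matrix factorization.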