Dan Sheldon

| | Gladiator | Silence of the Lambs | WALL-E | Toy Story |
---|---|---|---|---|

Alice | 5 | 4 | 1 | |

Bob | 5 | 2 | ||

Carol | 5 | |||

David | 5 | 5 | ||

Eve | 5 | 4 |

What movie should I recommend to Bob?

Will Carol like WALL-E?

**Goal**: Fill in entries of the “rating matrix”

- \(m\) = # users
- \(n\) = # movies
- \(r(i,j) =\) rating of user \(i\) for movie \(j\)
- \(R = (r(i,j))\) the “rating matrix” (\(m \times n\))

We only get to see some of the entries of the rating matrix and want to fill in the rest.

Our data is a list of \(L\) ratings specified as follows:

- \(i_k\): user index of \(k\)th rating
- \(j_k\): movie index of \(k\)th rating
- \(r(i_k, j_k)\): value of \(k\)th rating (1-5)

**Example**: in our original example we observed \(L=10\) ratings

\(k\) | \(i_k\) | \(j_k\) | \(r(i_k, j_k)\) | Comment |
---|---|---|---|---|

1 | 1 | 1 | 5 | (Alice, Gladiator, 5) |

2 | 1 | 2 | 4 | (Alice, Silence, 4) |

3 | 1 | 3 | 1 | (Alice, WALL-E, 1) |

4 | 2 | 2 | 5 | (Bob, Silence, 5) |

… | … | … | … | |

9 | 5 | 1 | 5 | (Eve, Gladiator, 5) |

10 | 5 | 2 | 4 | (Eve, Silence, 4) |
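One possible way to hold this data in code is as three parallel arrays, one entry per observed rating. The sketch below is only an illustration using NumPy: the array names are assumptions, indices are shifted to 0-based, and it contains only the six rows displayed in the table (the elided rows are left out rather than guessed).

```python
import numpy as np

# Only the rows displayed in the table (k = 1..4, 9, 10); elided rows are omitted.
# Indices are converted from the 1-based table to 0-based array indices.
user_idx = np.array([1, 1, 1, 2, 5, 5]) - 1            # i_k
movie_idx = np.array([1, 2, 3, 2, 1, 2]) - 1           # j_k
ratings = np.array([5, 4, 1, 5, 5, 4], dtype=float)    # r(i_k, j_k)

m, n = 5, 4  # 5 users and 4 movies in the example

# Dense view of the rating matrix; NaN marks entries not listed above
# (either truly unobserved or among the elided rows).
R = np.full((m, n), np.nan)
R[user_idx, movie_idx] = ratings
```

Keeping the list form is usually more convenient for learning, since the algorithm only ever loops over observed ratings.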

Assume each user has an unknown weight vector \(\mathbf{u}_i \in \mathbb{R}^d\) and each movie has an unknown weight vector \(\mathbf{v}_j \in \mathbb{R}^d\).

The **predicted rating** is

\[ h(i,j) = u_{i1}v_{j1} + u_{i2} v_{j2} + \ldots + u_{id} v_{jd} = \mathbf{u}_i^T \mathbf{v}_j \]

- \(\mathbf{v}_j\) is a vector of features of movie \(j\)
- \(\mathbf{u}_i\) is a vector of weights for user \(i\) that describe that user’s preferences for different features

**Example**: feature 1 describes comedy vs. drama

- \(v_{j1}\) is negative \(\rightarrow\) movie \(j\) is a comedy
- \(v_{j1}\) is positive \(\rightarrow\) movie \(j\) is a drama
- \(u_{i1}\) is negative \(\rightarrow\) user \(i\) prefers comedies
- \(u_{i1}\) is positive \(\rightarrow\) user \(i\) prefers dramas

**Example**: feature 2 describes whether a movie is geared toward kids or adults
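To see how the signs interact, here is a small worked example with \(d = 2\). The numbers are made up purely for illustration, and taking positive values of feature 2 to mean "geared toward adults" is an assumed convention.

\[ \mathbf{u}_i = \begin{bmatrix} -1.0 \\ 0.5 \end{bmatrix}, \quad \mathbf{v}_j = \begin{bmatrix} -2.0 \\ 1.0 \end{bmatrix} \quad \Rightarrow \quad h(i,j) = (-1.0)(-2.0) + (0.5)(1.0) = 2.5 \]

User \(i\) prefers comedies and movie \(j\) is a comedy, so the first term contributes \(+2.0\). Flipping the sign of either \(u_{i1}\) or \(v_{j1}\) would make that term contribute \(-2.0\) instead.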

Unlike previous problems we don’t observe the features *or* weights, and need to learn them both from the observed ratings.

- \(\mathbf{u}_i \in \mathbb{R}^d\), \(i=1,\ldots, m\)
- \(\mathbf{v}_j \in \mathbb{R}^d\), \(j=1,\ldots, n\)

Find parameters such that \(h(i_k, j_k) = \mathbf{u}_{i_k}^T \mathbf{v}_{j_k} \approx r(i_k, j_k)\) for \(k = 1,\ldots,L\) (the training data) and take appropriate measures to not overfit.

Place the user weight vectors \(\mathbf{u}_i\) into the rows of a matrix \(U\) and the movie feature vectors \(\mathbf{v}_j\) into the rows of a matrix \(V\)

\[ U = \begin{bmatrix} -\mathbf{u}_1^T -\\ -\mathbf{u}_2^T -\\ \ldots \\ -\mathbf{u}_m^T -\\ \end{bmatrix} \in \mathbb{R}^{m \times d} \qquad V = \begin{bmatrix} -\mathbf{v}_1^T -\\ -\mathbf{v}_2^T -\\ \ldots \\ -\mathbf{v}_n^T -\\ \end{bmatrix} \in \mathbb{R}^{n \times d} \]

Consider the product \(U V^T\):

\[ \boxed{ \begin{array}{c} \\ U \\ \\ \end{array} } \boxed{ \begin{array}{c} \ \ \ V^T \ \ \ \end{array} } \]

It is easy to check that the \((i,j)\) entry of \(UV^T\) is equal to \(\mathbf{u}_i^T \mathbf{v}_j\), which is our prediction for the \((i,j)\) entry of \(R\).
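This claim about the entries of \(UV^T\) can also be checked numerically. The following is a minimal sketch using NumPy with randomly generated \(U\) and \(V\); the sizes and the chosen indices are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 4, 2              # sizes chosen only for illustration
U = rng.normal(size=(m, d))    # row i holds u_i^T
V = rng.normal(size=(n, d))    # row j holds v_j^T

H = U @ V.T                    # m x n matrix of all predicted ratings

i, j = 2, 3                    # any valid (user, movie) pair works
entry_matches = np.isclose(H[i, j], U[i] @ V[j])
```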

In other words, our model is that \(R \approx U V^T\) (a **factorization** of \(R\)).

We choose \(U\) and \(V\) to get good predictions for those entries of \(R\) that we can observe. As long as we don’t overfit, this gives us power to generalize to entries we don’t observe.

- Formulate a squared error cost function for this model
- Add regularization for both the user weight vectors \(\mathbf{u}_i\) and the movie feature vectors \(\mathbf{v}_j\)
- Write down the partial derivatives of your cost function with respect to the entries of \(\mathbf{u}_i\) and \(\mathbf{v}_j\)
- Plug the partial derivatives into stochastic gradient descent (SGD) and write down the update rule
- Implement SGD
- Tune parameters to get good performance on the validation set
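As one possible way to carry out the steps above, assume the regularized squared-error cost

\[ J = \frac{1}{2}\sum_{k=1}^{L} \left(\mathbf{u}_{i_k}^T \mathbf{v}_{j_k} - r(i_k, j_k)\right)^2 + \frac{\lambda}{2}\left(\sum_{i=1}^{m} \|\mathbf{u}_i\|^2 + \sum_{j=1}^{n} \|\mathbf{v}_j\|^2\right) \]

Processing one rating at a time, and applying the regularization only to the two vectors touched by that rating (a common simplification, and an assumption here), gives the updates \(\mathbf{u}_i \leftarrow \mathbf{u}_i - \alpha\left((h(i,j) - r(i,j))\mathbf{v}_j + \lambda \mathbf{u}_i\right)\) and \(\mathbf{v}_j \leftarrow \mathbf{v}_j - \alpha\left((h(i,j) - r(i,j))\mathbf{u}_i + \lambda \mathbf{v}_j\right)\). The sketch below is not a reference solution: the step size \(\alpha\), the regularization strength \(\lambda\), the initialization scale, and all function and argument names are assumptions.

```python
import numpy as np

def sgd_matrix_factorization(user_idx, movie_idx, ratings, m, n, d=2,
                             alpha=0.01, lam=0.1, epochs=100, seed=0):
    """SGD sketch for R ~ U V^T using only the observed ratings."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.normal(size=(m, d))   # small random initialization (assumed)
    V = 0.1 * rng.normal(size=(n, d))
    L = len(ratings)
    for _ in range(epochs):
        for k in rng.permutation(L):    # visit the ratings in random order
            i, j = user_idx[k], movie_idx[k]
            err = U[i] @ V[j] - ratings[k]      # h(i,j) - r(i,j)
            # Both gradients use the values from before this step.
            grad_u = err * V[j] + lam * U[i]
            grad_v = err * U[i] + lam * V[j]
            U[i] -= alpha * grad_u
            V[j] -= alpha * grad_v
    return U, V
```

The values of `d`, `alpha`, `lam`, and `epochs` are exactly the kind of parameters to tune on the validation set.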

- Groups of up to 3: work with someone **new**
- Any resources allowed: talk to me, other groups, etc.
- Submit predictions on test set

Evaluation: root-mean-squared error (RMSE) on test set

\[ \text{RMSE} = \sqrt{\frac{1}{\text{# test}}\sum_{(i,j) \in \text{test set}} (h(i, j) - r(i,j))^2}\]
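The RMSE formula translates directly into a few lines of NumPy. This is a minimal sketch; the function name and the convention of passing predictions and true ratings as two equal-length sequences are assumptions.

```python
import numpy as np

def rmse(predictions, targets):
    """Root-mean-squared error between predicted and true ratings."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))
```

The same function can be applied to a held-out validation set while tuning, since test-set ratings are not available to you.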

Worth 1/3 of regular homework

RMSE | Grade |
---|---|
<= 1.0 | 80% |
<= 0.97 | 90% |
<= 0.95 | 95% |
<= 0.94 | 100% |

Members of winning team (lowest RMSE) get 1 free late day

Due date announced at end of class

Matrix Factorization Techniques for Recommender Systems by Yehuda Koren, Robert Bell and Chris Volinsky

Authors were on the winning team of Netflix prize

Paper includes algorithms, but beware of the different notation

Once you nail the matrix factorization model, here are some ideas to get even better performance.

A simpler model that helps introduce important ideas is the “biases” only model. This has an overall baseline score \(\mu\) and an offset (or “bias”) \(a_i\) for each user as well as a bias \(b_j\) for each movie. The model is:

\[ h(i,j) = \mu + a_i + b_j \]

For example

- Suppose the overall average rating is \(\mu = 3.8\)
- However, Alice loves movies, so her bias is \(a_1 = +0.4\)
- Bob is hard to please, so his bias is \(a_2 = -0.7\)
- Silence of the Lambs is hard to watch: \(b_2 = -0.4\)
- Everyone loves WALL-E: \(b_3 = +1.3\)
- Etc.
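Plugging the example values above into the model gives, for instance, Bob's predicted rating for WALL-E and Alice's predicted rating for Silence of the Lambs:

\[ h(2,3) = \mu + a_2 + b_3 = 3.8 - 0.7 + 1.3 = 4.4 \qquad h(1,2) = \mu + a_1 + b_2 = 3.8 + 0.4 - 0.4 = 3.8 \]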

- \(\mu\)
- \(a_i\), \(i=1,\ldots, m\)
- \(b_j\), \(j=1,\ldots, n\)

To learn these parameters, write down the partial derivatives of the cost function with respect to \(\mu\), \(a_i\), and \(b_j\) and plug them into stochastic gradient descent.
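A minimal sketch of this procedure follows, assuming a squared-error cost with optional L2 regularization on the biases \(a_i\) and \(b_j\) (but not on \(\mu\)). With that cost, the partial derivative of a single rating's error term with respect to each of \(\mu\), \(a_i\), and \(b_j\) is just the prediction error \(h(i,j) - r(i,j)\), plus the regularization term where it applies. The function name, step size, and zero initialization are assumptions.

```python
import numpy as np

def sgd_biases_only(user_idx, movie_idx, ratings, m, n,
                    alpha=0.01, lam=0.0, epochs=100, seed=0):
    """SGD sketch for the model h(i,j) = mu + a_i + b_j."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    a = np.zeros(m)   # one bias per user
    b = np.zeros(n)   # one bias per movie
    for _ in range(epochs):
        for k in rng.permutation(len(ratings)):
            i, j = user_idx[k], movie_idx[k]
            err = mu + a[i] + b[j] - ratings[k]   # h(i,j) - r(i,j)
            mu -= alpha * err
            a[i] -= alpha * (err + lam * a[i])
            b[j] -= alpha * (err + lam * b[j])
    return mu, a, b
```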

The biases only model can be incorporated into the matrix factorization model to improve performance:

\[ h(i,j) = \mu + a_i + b_j + \mathbf{u}_i^T \mathbf{v}_j \]

- \(\mu\)
- \(a_i\), \(i=1,\ldots, m\)
- \(b_j\), \(j=1,\ldots, n\)
- \(\mathbf{u}_i \in \mathbb{R}^d\), \(i=1,\ldots, m\)
- \(\mathbf{v}_j \in \mathbb{R}^d\), \(j=1,\ldots, n\)

To learn these parameters, combine the partial derivatives from the basic matrix factorization model with those from the biases only model and update them all within stochastic gradient descent.
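One SGD step for the combined model might look like the sketch below. It assumes the same regularized squared-error cost as before, with regularization applied to \(a_i\), \(b_j\), \(\mathbf{u}_i\), and \(\mathbf{v}_j\) but not to \(\mu\). Storing the parameters in a dictionary, and the names `predict` and `sgd_step`, are illustrative choices rather than anything prescribed here.

```python
import numpy as np

def predict(p, i, j):
    """h(i,j) = mu + a_i + b_j + u_i^T v_j."""
    return p["mu"] + p["a"][i] + p["b"][j] + p["U"][i] @ p["V"][j]

def sgd_step(p, i, j, r, alpha=0.01, lam=0.1):
    """Update all parameters in place for one observed rating r(i,j)."""
    err = predict(p, i, j) - r        # shared by every partial derivative
    u_old = p["U"][i].copy()          # keep u_i from before the update
    p["mu"] -= alpha * err
    p["a"][i] -= alpha * (err + lam * p["a"][i])
    p["b"][j] -= alpha * (err + lam * p["b"][j])
    p["U"][i] -= alpha * (err * p["V"][j] + lam * p["U"][i])
    p["V"][j] -= alpha * (err * u_old + lam * p["V"][j])
```

A full training loop would call `sgd_step` once per observed rating, in random order, for several passes over the training data.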