Dan Sheldon
 | Gladiator | Silence of the Lambs | WALL-E | Toy Story |
---|---|---|---|---|
Alice | 5 | 4 | 1 | |
Bob | | 5 | 2 | |
Carol | 5 | | | |
David | 5 | 5 | | |
Eve | 5 | 4 | | |
What movie should I recommend to Bob?
Will Carol like WALL-E?
Goal: Fill in entries of the “rating matrix”
We only get to see some of the entries of the rating matrix and want to fill in the rest.
Our data is a list of \(L\) ratings; the \(k\)th rating is a triple \((i_k, j_k, r(i_k, j_k))\) giving a user index, a movie index, and the rating value.
Example: in the ratings table above we observed \(L=10\) ratings
\(k\) | \(i_k\) | \(j_k\) | \(r(i_k, j_k)\) | Comment |
---|---|---|---|---|
1 | 1 | 1 | 5 | (Alice, Gladiator, 5) |
2 | 1 | 2 | 4 | (Alice, Silence, 4) |
3 | 1 | 3 | 1 | (Alice, WALL-E, 1) |
4 | 2 | 2 | 5 | (Bob, Silence, 5) |
… | … | … | … | |
9 | 5 | 1 | 5 | (Eve, Gladiator, 5) |
10 | 5 | 2 | 4 | (Eve, Silence, 4) |
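As a concrete sketch, the observed ratings above can be stored as a plain Python list of \((i_k, j_k, r)\) triples (the middle entries \(k = 5,\ldots,8\), elided in the table, are not reproduced here):

```python
# The observed ratings as (i_k, j_k, r(i_k, j_k)) triples, with users
# and movies indexed from 1 as in the table above.
ratings = [
    (1, 1, 5),  # (Alice, Gladiator, 5)
    (1, 2, 4),  # (Alice, Silence, 4)
    (1, 3, 1),  # (Alice, WALL-E, 1)
    (2, 2, 5),  # (Bob, Silence, 5)
    # ... k = 5..8 ...
    (5, 1, 5),  # (Eve, Gladiator, 5)
    (5, 2, 4),  # (Eve, Silence, 4)
]
```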
Assume each user has an unknown weight vector \(\mathbf{u}_i \in \mathbb{R}^d\) and each movie has an unknown weight vector \(\mathbf{v}_j \in \mathbb{R}^d\).
The predicted rating is
\[ h(i,j) = u_{i1}v_{j1} + u_{i2} v_{j2} + \ldots + u_{id} v_{jd} = \mathbf{u}_i^T \mathbf{v}_j \]
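A tiny numerical sketch of this prediction with \(d = 2\); the weight values here are made up purely for illustration:

```python
import numpy as np

u_i = np.array([0.5, 2.0])   # hypothetical user weight vector
v_j = np.array([4.0, 1.0])   # hypothetical movie weight vector

# h(i, j) = u_i1*v_j1 + u_i2*v_j2 = 0.5*4.0 + 2.0*1.0
h_ij = u_i @ v_j
print(h_ij)  # 4.0
```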
\(\mathbf{u}_i\) is a vector of weights for user \(i\) that describes their preferences for different features, and \(\mathbf{v}_j\) is a vector of weights describing how strongly movie \(j\) exhibits each feature
Example: feature 2 describes whether a movie is geared toward kids or adults
Unlike previous problems we don’t observe the features or weights, and need to learn them both from the observed ratings.
Find parameters such that \(h(i_k, j_k) = \mathbf{u}_{i_k}^T \mathbf{v}_{j_k} \approx r(i_k, j_k)\) for \(k = 1,\ldots,L\) (the training data) and take appropriate measures to not overfit.
Place the user weight vectors \(\mathbf{u}_i\) into the rows of a matrix \(U\) and the movie feature vectors \(\mathbf{v}_j\) into the rows of a matrix \(V\)
\[ U = \begin{bmatrix} -\mathbf{u}_1^T -\\ -\mathbf{u}_2^T -\\ \vdots \\ -\mathbf{u}_m^T -\\ \end{bmatrix} \in \mathbb{R}^{m \times d} \qquad V = \begin{bmatrix} -\mathbf{v}_1^T -\\ -\mathbf{v}_2^T -\\ \vdots \\ -\mathbf{v}_n^T -\\ \end{bmatrix} \in \mathbb{R}^{n \times d} \]
Consider the product \(U V^T\):
It is easy to check that the \((i,j)\) entry of \(UV^T\) is equal to \(\mathbf{u}_i^T \mathbf{v}_j\), which is our prediction for the \((i,j)\) entry of \(R\)
In other words, our model is that \(R \approx U V^T\) (a factorization of \(R\))
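A quick numerical check of this claim, using randomly generated \(U\) and \(V\) (the sizes are chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 4, 3            # users, movies, features
U = rng.normal(size=(m, d))  # rows are user vectors u_i
V = rng.normal(size=(n, d))  # rows are movie vectors v_j

R_hat = U @ V.T              # m x n matrix of predicted ratings

# Entry (i, j) of U V^T equals u_i^T v_j for every i, j.
for i in range(m):
    for j in range(n):
        assert np.isclose(R_hat[i, j], U[i] @ V[j])
```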
We choose \(U\) and \(V\) to get good predictions for those entries of \(R\) that we can observe. As long as we don’t overfit, this gives us power to generalize to entries we don’t observe
Evaluation: root-mean-squared error (RMSE) on test set
\[ \text{RMSE} = \sqrt{\frac{1}{\text{# test}}\sum_{(i,j) \in \text{test set}} (h(i, j) - r(i,j))^2}\]
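A small helper sketching this computation (the function name `rmse` and the parallel-array interface are our own choices):

```python
import numpy as np

def rmse(predictions, ratings):
    """RMSE between predicted and actual ratings (parallel arrays)."""
    predictions = np.asarray(predictions, dtype=float)
    ratings = np.asarray(ratings, dtype=float)
    return np.sqrt(np.mean((predictions - ratings) ** 2))

print(rmse([3.0, 4.0], [4.0, 5.0]))  # 1.0
```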
Worth 1/3 of regular homework
RMSE | grade |
---|---|
<= 1.0 | 80% |
<= 0.97 | 90% |
<= 0.95 | 95% |
<= 0.94 | 100% |
Members of winning team (lowest RMSE) get 1 free late day
Due date announced at end of class
*Matrix Factorization Techniques for Recommender Systems* by Yehuda Koren, Robert Bell, and Chris Volinsky
Authors were on the winning team of the Netflix Prize
Paper includes algorithms—but beware different notation
Once you nail the matrix factorization model, here are some ideas to get even better performance.
A simpler model that helps introduce important ideas is the “biases-only” model. This has an overall baseline score \(\mu\), an offset (or “bias”) \(a_i\) for each user, and a bias \(b_j\) for each movie. The model is:
\[ h(i,j) = \mu + a_i + b_j \]
For example, if the overall average rating is \(\mu = 3.5\), Alice tends to rate \(0.5\) above average (\(a_1 = 0.5\)), and Gladiator tends to be rated \(1\) above average (\(b_1 = 1\)), then the predicted rating is \(h(1,1) = 3.5 + 0.5 + 1 = 5\).
To learn these parameters, write down the partial derivatives of the cost function with respect to \(\mu\), \(a_i\), and \(b_j\) and plug them into stochastic gradient descent.
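A minimal sketch of that procedure, assuming a squared-error cost \((h(i,j) - r(i,j))^2\) per rating (the cost function is not spelled out above) and folding constant factors into the step size:

```python
import numpy as np

def sgd_biases(ratings, m, n, step=0.05, epochs=1000):
    """Fit h(i,j) = mu + a[i] + b[j] by SGD on squared error.

    ratings: list of (i, j, r) triples with 0-based indices.
    For this cost, d/d mu = d/d a_i = d/d b_j = 2 * (h - r);
    the factor 2 is absorbed into the step size.
    """
    mu, a, b = 0.0, np.zeros(m), np.zeros(n)
    for _ in range(epochs):
        for i, j, r in ratings:
            err = (mu + a[i] + b[j]) - r
            mu -= step * err
            a[i] -= step * err
            b[j] -= step * err
    return mu, a, b
```

For instance, fitting two users and two movies with `sgd_biases([(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 1, 2)], m=2, n=2)` drives the training error toward zero, since those ratings happen to be exactly representable by biases alone.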
The biases only model can be incorporated into the matrix factorization model to improve performance:
\[ h(i,j) = \mu + a_i + b_j + \mathbf{u}_i^T \mathbf{v}_j \]
To learn these parameters, combine the partial derivatives from the basic matrix factorization model with those from the biases only model and update them all within stochastic gradient descent.
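A sketch of the combined update, again assuming a squared-error cost and omitting the regularization that is needed in practice to avoid overfitting:

```python
import numpy as np

def sgd_mf(ratings, m, n, d=2, step=0.05, epochs=2000, seed=0):
    """Fit h(i,j) = mu + a[i] + b[j] + u_i . v_j by SGD on squared error.

    ratings: list of (i, j, r) triples with 0-based indices.
    Regularization is omitted for brevity; constant factors in the
    gradients are absorbed into the step size.
    """
    rng = np.random.default_rng(seed)
    mu, a, b = 0.0, np.zeros(m), np.zeros(n)
    U = 0.1 * rng.normal(size=(m, d))  # small random start for user vectors
    V = 0.1 * rng.normal(size=(n, d))  # small random start for movie vectors
    for _ in range(epochs):
        for i, j, r in ratings:
            err = (mu + a[i] + b[j] + U[i] @ V[j]) - r
            mu -= step * err
            a[i] -= step * err
            b[j] -= step * err
            # Simultaneous update: right-hand sides use the old U[i], V[j].
            U[i], V[j] = U[i] - step * err * V[j], V[j] - step * err * U[i]
    return mu, a, b, U, V
```

On the six ratings visible in the table above (re-indexed from 0), this drives the training error close to zero; generalization to unseen entries is what the held-out RMSE measures.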