Given explicit ratings that users have assigned to movies, we can use collaborative filtering to recommend movies that they haven’t watched yet but that other users with similar interests have rated highly. The dataset that I will be using is a subset of the movie ratings data from the Netflix Prize, which can be found at www.netflixprize.com.
The files have the following format: MovieID, CustomerID, and Rating. The rating takes one of five integer values, from 1 to 5. For some reason, the ratings are rounded to the nearest integer. I’ve given movies ratings of three and a half stars on Netflix, and keeping those half-star values would likely be more useful than rounding them to the nearest integer.
A common way to represent data in recommendation systems is in the form of a utility matrix, such as the one below (taken from Chapter 9 of “Mining of Massive Datasets”):
The rows represent users and the columns represent movies. Most user-movie pairs are blank, meaning that the user has not rated the movie. Such sparsity is typical, and in practice the matrix would be even sparser, with a typical user rating only a small fraction of all available movies.
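To make the utility matrix concrete, here is a minimal sketch that builds one from rating triples in the MovieID, CustomerID, Rating order described earlier. The function names and the sample triples are made up for illustration and are not taken from my actual code or from the Netflix data. A dict of dicts stores only the ratings that exist, so the blanks cost nothing.

```python
from collections import defaultdict

def build_utility_matrix(triples):
    """Map each customer to a {movie_id: rating} dict; absent keys are the blanks."""
    utility = defaultdict(dict)
    for movie_id, customer_id, rating in triples:
        utility[customer_id][movie_id] = rating
    return dict(utility)

def sparsity(utility, n_movies):
    """Fraction of user-movie cells that hold no rating."""
    filled = sum(len(ratings) for ratings in utility.values())
    return 1.0 - filled / (len(utility) * n_movies)

# Hypothetical sample triples in (MovieID, CustomerID, Rating) order.
sample = [(1, "A", 4), (2, "A", 5), (1, "B", 3), (3, "C", 2)]
utility = build_utility_matrix(sample)
print(utility["A"])          # ratings given by customer "A"
print(sparsity(utility, 3))  # share of blank cells for 3 users x 3 movies
```

Even in this toy example more than half of the cells are blank, which is why a sparse representation is usually a reasonable choice.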
To recommend items, we use the notion of similarity between items and between users. Commonly used similarity measures include Jaccard similarity, cosine similarity, and the Pearson correlation coefficient. When we have explicit ratings (as we do here), the Pearson coefficient is generally the most suitable of the three, because it centers each user’s ratings around that user’s mean and so accounts for users who rate consistently high or low.
We will use the Pearson correlation coefficient here. The following equations describe how we would compute this similarity, and the predicted rating of a user on a movie they haven’t watched yet:
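As a rough illustration of how such a similarity and prediction can be computed, here is a minimal sketch of user-based collaborative filtering with Pearson correlation. It is one common formulation and not necessarily identical to my implementation. It assumes that the utility matrix is a dict of {user: {movie: rating}}, that each user's mean is taken over all of that user's ratings, and that the prediction is the user's mean plus a similarity-weighted average of the neighbours' mean-centered ratings. The function names are hypothetical.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies that both users have rated."""
    common = set(ratings_a) & set(ratings_b)
    if not common:
        return 0.0
    # Assumption: each mean is taken over the user's full rating set.
    mean_a = sum(ratings_a.values()) / len(ratings_a)
    mean_b = sum(ratings_b.values()) / len(ratings_b)
    num = sum((ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in common)
    den_a = sqrt(sum((ratings_a[m] - mean_a) ** 2 for m in common))
    den_b = sqrt(sum((ratings_b[m] - mean_b) ** 2 for m in common))
    if den_a == 0 or den_b == 0:
        return 0.0  # no variation on the common movies, so no usable signal
    return num / (den_a * den_b)

def predict(utility, user, movie):
    """User's mean plus the weighted average of neighbours' centered ratings."""
    mean_u = sum(utility[user].values()) / len(utility[user])
    num = den = 0.0
    for other, ratings in utility.items():
        if other == user or movie not in ratings:
            continue
        w = pearson(utility[user], ratings)
        mean_o = sum(ratings.values()) / len(ratings)
        num += w * (ratings[movie] - mean_o)
        den += abs(w)
    # Fall back to the user's own mean when no neighbour rated the movie.
    return mean_u if den == 0 else mean_u + num / den

# Hypothetical toy data: user "B" agrees with "A" and has also rated movie 3.
toy = {"A": {1: 5, 2: 1}, "B": {1: 5, 2: 1, 3: 4}}
print(predict(toy, "A", 3))
```

A real implementation would usually restrict the sum to the most similar neighbours and clip the prediction to the 1 to 5 range. Both steps are left out here to keep the sketch short.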
And that’s it! I used the above equations to code up an implementation of collaborative filtering in Python. On the test set I have, I get an RMSE (Root Mean Squared Error) of 0.89 and an MAE (Mean Absolute Error) of 0.69.
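For reference, the two error metrics can be computed as in the sketch below. The function names and the sample numbers are illustrative only and are unrelated to the results reported above.

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def mae(predicted, actual):
    """Mean absolute error: the average size of a miss, in stars."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions and held-out ratings.
predicted = [3, 4, 5]
actual = [3, 5, 3]
print(rmse(predicted, actual), mae(predicted, actual))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same predictions.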
Collaborative Filtering for Pandora
Imagine that you have a dataset from Pandora, Spotify or YouTube that records which songs/videos each user has played, but contains no explicit ratings. Does the algorithm from the previous section work?
The answer is no, because the Pearson correlation coefficient is not suited to implicit ratings. Here the utility matrix does not hold explicit ratings, but rather 1s and 0s that signify whether or not a user has played a song. However, we can still use collaborative filtering through the use of Jaccard similarity, and the pseudocode is below:
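One way to turn this idea into code is sketched below. It assumes the play history is a dict of {user: set of songs}, and it scores each unplayed song by the summed Jaccard similarity of the k most similar users who played it. The function names, the parameters k and n, and the sample data are all hypothetical, and other scoring schemes are equally reasonable.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def recommend(plays, user, k=2, n=3):
    """Return up to n unplayed songs, ranked by the k nearest neighbours' votes."""
    neighbours = sorted(
        ((jaccard(plays[user], songs), other)
         for other, songs in plays.items() if other != user),
        reverse=True,
    )[:k]
    scores = {}
    for sim, other in neighbours:
        for song in plays[other] - plays[user]:
            scores[song] = scores.get(song, 0.0) + sim
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda s: (-scores[s], s))[:n]

# Hypothetical play history.
plays = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "d"},
    "u4": {"x"},
}
print(recommend(plays, "u1"))
```

Here u2 shares two of three songs with u1 and u3 shares one of three, so the song that only u2 played is ranked ahead of the song that only u3 played.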
Using the pseudocode above, we can make song/video recommendations to users. Collaborative filtering is widely used in services such as Reddit, YouTube, Pandora, and Spotify.
You can find my code and the data here.