Collaborative Filtering on Netflix Data to Predict User Ratings of Movies

Given explicit ratings that users have assigned to movies they have watched, we can use collaborative filtering to recommend movies which they haven’t watched yet but which have been rated highly by other users with similar interests. The dataset that I will be using is a subset of the movie ratings data from the Netflix Prize, which can be found at

The subset I am using is located here. The README file from the original set of Netflix files is also included to comply with the terms of use for this data.

Each record in the files has three fields: MovieID, CustomerID, and Rating. The rating is an integer from 1 to 5. For some reason, the ratings are rounded to the nearest integer. I’ve given movies three and a half stars on Netflix, and keeping such half-star ratings would likely be more useful than rounding them.

A common way to represent data in recommendation systems is in the form of a utility matrix, such as the one below (taken from Chapter 9 of “Mining of Massive Datasets”):


The rows represent users and the columns represent the movies. Most user-movie pairs have blanks, meaning that the user has not rated the movie. This is quite common and in practice the matrix would be even sparser, with a typical user rating only a small fraction of all available movies.
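As a minimal sketch of this representation, the (MovieID, CustomerID, Rating) records can be loaded into a nested dictionary that stores only the ratings that exist, so blanks cost nothing. The function name and the sample records are hypothetical and used only for illustration; they are not taken from the Netflix files.

```python
from collections import defaultdict


def build_utility_matrix(records):
    """Map each user to {movie_id: rating}; unrated pairs are simply absent."""
    utility = defaultdict(dict)
    for movie_id, customer_id, rating in records:
        utility[customer_id][movie_id] = float(rating)
    return dict(utility)


# Hypothetical sample records in (MovieID, CustomerID, Rating) order.
sample = [(1, 101, 4), (2, 101, 5), (1, 102, 3), (3, 102, 2)]
utility = build_utility_matrix(sample)
```

Each row of the utility matrix is then `utility[customer_id]`, and a missing key plays the role of a blank cell.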

To recommend items, we rely on a notion of similarity between users or between items. Commonly used similarity measures include Jaccard similarity, cosine similarity, and the Pearson correlation coefficient. When we have explicit ratings (as we do here), the Pearson coefficient is often the preferred choice among the three, because it subtracts each user’s mean rating and so compensates for users who rate consistently high or low.

We will use the Pearson correlation coefficient here. The following equations describe how we would compute this similarity, and the predicted rating of a user on a movie he/she hasn’t watched yet:
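In one common textbook form (an assumption here, since the exact normalization varies between implementations), write $r_{u,i}$ for the rating of user $u$ on movie $i$, $\bar{r}_u$ for the mean of all ratings given by user $u$, $I_{uv}$ for the set of movies rated by both $u$ and $v$, and $N(u,i)$ for the set of other users who have rated movie $i$. The similarity $w_{u,v}$ and the predicted rating $\hat{r}_{u,i}$ are then:

```latex
w_{u,v} = \frac{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)(r_{v,i} - \bar{r}_v)}
               {\sqrt{\sum_{i \in I_{uv}} (r_{u,i} - \bar{r}_u)^2}\,
                \sqrt{\sum_{i \in I_{uv}} (r_{v,i} - \bar{r}_v)^2}}

\hat{r}_{u,i} = \bar{r}_u +
    \frac{\sum_{v \in N(u,i)} w_{u,v}\,(r_{v,i} - \bar{r}_v)}
         {\sum_{v \in N(u,i)} \lvert w_{u,v} \rvert}
```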



And that’s it! I used the above equations to code up an implementation of collaborative filtering in Python. On the test set I have, I get an RMSE (Root Mean Squared Error) of 0.89 and an MAE (Mean Absolute Error) of 0.69.
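The following is a minimal sketch of how such an implementation could look; it is not the code behind the numbers above. It assumes the utility matrix is a dict of the form {user: {movie: rating}}, takes each user's mean over all of their ratings, and uses every other user who rated the movie as a neighbour. All function names are hypothetical.

```python
import math


def mean_rating(ratings):
    """Mean of all ratings a single user has given."""
    return sum(ratings.values()) / len(ratings)


def pearson(ratings_u, ratings_v):
    """Pearson-style correlation over co-rated movies; 0.0 when undefined."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    mean_u, mean_v = mean_rating(ratings_u), mean_rating(ratings_v)
    num = sum((ratings_u[m] - mean_u) * (ratings_v[m] - mean_v) for m in common)
    den_u = math.sqrt(sum((ratings_u[m] - mean_u) ** 2 for m in common))
    den_v = math.sqrt(sum((ratings_v[m] - mean_v) ** 2 for m in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)


def predict(utility, user, movie):
    """User's mean plus the similarity-weighted deviations of other raters."""
    mean_u = mean_rating(utility[user])
    num = den = 0.0
    for other, ratings in utility.items():
        if other == user or movie not in ratings:
            continue
        w = pearson(utility[user], ratings)
        num += w * (ratings[movie] - mean_rating(ratings))
        den += abs(w)
    # Fall back to the user's own mean when no neighbour carries any weight.
    return mean_u if den == 0 else mean_u + num / den


def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)
```

Note that this sketch does not clip predictions to the 1 to 5 range, so a prediction can land slightly outside it; clipping before computing the errors is a reasonable refinement.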

Collaborative Filtering for Pandora

Imagine that you have a dataset from Pandora, Spotify, or YouTube that records which songs or videos each user has played, but contains no explicit ratings. Does the algorithm from the previous section work?

The answer is no, because the Pearson correlation coefficient is not well suited to implicit ratings. Here the utility matrix holds not explicit ratings but 1’s and 0’s that signify whether or not a user has played a song, so there are no rating levels to compare against a user’s mean. However, we can still use collaborative filtering through the use of Jaccard similarity, and the pseudocode is below:

Using the pseudocode above, we can make song/video recommendations to users. Collaborative filtering is widely used in services such as Reddit, YouTube, Pandora, and Spotify.
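As a minimal sketch of Jaccard-based collaborative filtering (one plausible variant, and not necessarily the same procedure as the pseudocode referred to above), each user can be represented by the set of songs they have played. Unplayed songs are then scored by summing the Jaccard similarity of every other user who played them. The function names, the scoring rule, and the sample data are assumptions made for illustration.

```python
def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union; 0.0 if both are empty."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def recommend(plays, user, top_n=3):
    """Rank songs the user has not played by summed similarity of users who did."""
    scores = {}
    for other, songs in plays.items():
        if other == user:
            continue
        sim = jaccard(plays[user], songs)
        if sim == 0.0:
            continue  # users with nothing in common contribute no evidence
        for song in songs - plays[user]:
            scores[song] = scores.get(song, 0.0) + sim
    # Highest score first; ties broken by song name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [song for song, _ in ranked[:top_n]]


# Hypothetical play histories: user -> set of played song IDs.
plays = {
    "ann": {"s1", "s2", "s3"},
    "bob": {"s1", "s2", "s4"},
    "cat": {"s5"},
}
```

With this sample data, `recommend(plays, "ann")` suggests only the song played by the one user whose history overlaps with Ann's.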


You can find my code and the data here.


One Reply to “Collaborative Filtering on Netflix Data to Predict User Ratings of Movies”

  1. Very interesting tool
    It’s more relevant for India, as a large number of net-savvy youngsters watch movies in multiplexes in cities and towns and would like to watch movies rated well by genuine moviegoers rather than depending on so-called critics, whose ratings are mostly biased.

