Discussing the Netflix Prize

After my last blog post, I was contacted by a journalist who wanted to discuss the effects of the Netflix prize. It seems that now that the competition is winding to an end, one of the real questions that emerges is whether it was worth it. Below, I’m pasting part of my side of the dialogue; other blogs are posting similar discussions, and I’m curious as to what any of you fellow researchers may have to say.

(Disclaimer) A Comment on My Comments

What the leaders have achieved is remarkable; it’s a huge feat and they deserve the prize. They have also accomplished the 10% improvement while consistently sharing their insights through a number of great publications – I comment without intending to detract from the difficulty of the problem or the great work they have done over the past 2+ years.

Since the contest is in its last-call stage, I can only comment on the algorithms that have been disclosed so far – I’m assuming that the leaders haven’t radically changed their methods since winning the 2007 and 2008 progress prizes. Because the qualifying team is a combination of leading teams, it is likely that they have blended the results each team was getting in order to make it over the final hurdle.

Will the 10% make a difference to the customers?

Well, it is unlikely that Netflix will be able to use any leading solution from the competition in its entirety. Here’s one example why: many algorithms have been leveraging the so-called ‘temporal effects’ to improve their predictions. Basically, they were looking at how ratings vary according to when they were entered (for example, you may be inclined to rate movies lower on Wednesdays than on Fridays). The hidden test set that competitors had to predict included who (the user id), what (the movie id) and when (the date) each rating was input – only the rating itself needed to be predicted – so using the temporal effects provided a benefit. However, a real system does not know the ‘when’ (or even the ‘what’) in advance the way competitors did – so unless Netflix is prepared to update its recommendations every day, these temporal effects cannot be used.
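
To make the idea concrete, here is a minimal sketch of a temporal baseline predictor – toy, made-up data and a deliberately simple per-day bias, purely for illustration and not the competitors’ actual models:

```python
# A toy 'temporal effects' baseline: global mean + user bias + movie bias
# + a per-(user, day) bias. The rating log below is made up for illustration.
from collections import defaultdict
from statistics import mean

ratings = [
    # (user_id, movie_id, date, rating) -- hypothetical data
    ("u1", "m1", "2006-03-01", 4.0),
    ("u1", "m2", "2006-03-01", 5.0),
    ("u1", "m3", "2006-03-08", 2.0),
    ("u2", "m1", "2006-03-03", 3.0),
]

global_mean = mean(r for _, _, _, r in ratings)

user_dev = defaultdict(list)      # how a user deviates from the global mean
movie_dev = defaultdict(list)     # how a movie deviates from the global mean
user_day_dev = defaultdict(list)  # how a user deviates on a particular day
for u, m, d, r in ratings:
    user_dev[u].append(r - global_mean)
    movie_dev[m].append(r - global_mean)
    user_day_dev[(u, d)].append(r - global_mean)

def predict(user, movie, date):
    """Predict a rating; the last term is only usable if the date of the
    rating is known in advance, as it was in the hidden test set."""
    b_u = mean(user_dev[user]) if user in user_dev else 0.0
    b_m = mean(movie_dev[movie]) if movie in movie_dev else 0.0
    b_t = mean(user_day_dev[(user, date)]) if (user, date) in user_day_dev else 0.0
    return global_mean + b_u + b_m + b_t

print(predict("u1", "m1", "2006-03-01"))  # the day bias is available
print(predict("u1", "m1", "2006-03-15"))  # a future date: no day bias to use
```

The point is the last term: the per-day bias only helps if you know when the rating will be entered, which the hidden test set supplied but a live system would have to guess.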

There is also a long-standing controversy in the field as to whether RMSE is a good measure, in terms of whether lower error actually translates into better recommendations. Some researchers have argued that optimising purely for accuracy can actually hurt recommender systems; others take the more conservative view that accuracy is only one dimension of a good recommendation. Overall, the difficulty of researching collaborative filtering is that we are trying to design algorithms that make people “happy” by giving them good recommendations – but this is a very (very, very) subjective quality.
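
For reference, RMSE (root mean squared error) is just the square root of the average squared gap between predicted and actual ratings; a quick sketch:

```python
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Predictions that are off by one star on two of four ratings:
print(rmse([3.0, 4.0, 2.0, 5.0], [4.0, 4.0, 3.0, 5.0]))  # ~0.71
```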

In summary: the customers may see a difference – perhaps if only because they know that Netflix has invested over $1 million in improving their algorithm, perhaps because Netflix will include some of the insights gained in the past years – but it is not as simple as taking the algorithm the team has produced and letting it loose on all its customers.

Do you think it’s quite surprising that an improvement of 10% has even proved possible?

No – mainly because of what I said above (the teams use lots of information about, and smart methods to reason about, what they are predicting). They have also been receiving feedback on how well their methods have been performing in all the submissions prior to the one that had more than 10% improvement – and have been tuning according to that feedback. Unfortunately, there are still differences between the setup of the competition and how a deployed system will work.

This is related to some recent work presented at UMAP: I think the key insight of that work is that people exhibit natural unreliability in their ratings (which made it reasonable to question whether 10% was achievable at all) – but the algorithms that researchers have been designing assume that the numbers people input are the ground truth.

What other factors do you think are important in offering a rental prediction?

There are quite a few I can think of – essentially similar to the reasons why people recommend things to each other. Some ideas have been floating around:

  • Diversity: if you rate Lord of the Rings, are you being recommended Lord of the Rings II and III? Are you being recommended the same movies week after week? Lack of diversity can definitely lead to boring recommendations. (Some unpublished/in-progress work I’ve been doing relates to this; a small sketch of one way to re-rank for diversity follows this list.)
  • Novelty: How can new (good) movies be spotted and recommended – before everyone rates them?
  • How can new users be given good recommendations before they have rated tons of movies? (Sometimes called the ‘cold-start’ problem)
  • Popularity: There is some work that shows that recommender systems tend to recommend popular things – how can niche content be recommended to the right people?
  • Serendipity: How can users be given ‘surprising’ recommendations – recommendations that lead them to discover and like new movies that they may otherwise have been averse to?
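
As promised above, here is a minimal sketch of one way diversity could be folded into the final ranking – hypothetical movies, genre tags and predicted ratings, and a deliberately crude similarity penalty, just to make the idea concrete:

```python
# A toy, MMR-style re-ranking: trade predicted rating against similarity to
# the items already selected. All data below is made up for illustration.
candidates = {
    # movie: (predicted_rating, genre_tags)
    "LOTR: The Two Towers": (4.8, {"fantasy", "adventure"}),
    "LOTR: Return of the King": (4.7, {"fantasy", "adventure"}),
    "Amelie": (4.2, {"comedy", "romance"}),
    "City of God": (4.1, {"crime", "drama"}),
}

def jaccard(a, b):
    """Overlap between two tag sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def diverse_top_n(cands, n=3, penalty=1.0):
    """Greedily pick items with a high predicted rating but a low similarity
    to the items already picked."""
    remaining = dict(cands)
    picked = []
    while remaining and len(picked) < n:
        def score(movie):
            rating, genres = remaining[movie]
            sim = max((jaccard(genres, cands[p][1]) for p in picked), default=0.0)
            return rating - penalty * sim
        best = max(remaining, key=score)
        picked.append(best)
        del remaining[best]
    return picked

print(diverse_top_n(candidates))
# ['LOTR: The Two Towers', 'Amelie', 'City of God'] rather than two LOTR sequels
```

The same skeleton could be pointed at popularity (penalise items everyone is already watching) or serendipity (reward items far from a user’s usual genres) – the open question is how to evaluate whether any of this actually makes users happier.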

Update: The article is now available on New Scientist.

7 Responses to “Discussing the Netflix Prize”

  1. A Netflixer says:

    Good post, but quite wrong on the time-effects frontier.
    Accounting for time effects, which explain much of the past data variability, is a particularly strong type of data cleaning.
    This in turn leads to more refined estimation of user/movie characteristics, which reflect the true longer-term signal without interference from transient fluctuations. Such better estimation brings better predictions of future behavior, independently of the exact time point in the future.
    For example, if we observe many high ratings by a user on the same day, we would like to discount those, as they merely reflect a particularly good mood of the user on that day, which is kind of a “noisy fluctuation”.

  2. Neal Lathia says:

    Thanks for your comment!

    As you point out, if we observe many high ratings by a user on the same day, we may want to discount them in the future. Similarly, to be more accurate on that day, we may also want to modify our predictions to reflect this (positive) change in behaviour.

    My point was that since the test set includes dates, it was possible to make the latter kind of modifications to predictions. I did not mean to imply that the technique cannot be used to cleanse the data after it has been input.

  3. Mike says:

    Interesting points about diversity, popularity and serendipity. Perhaps those issues can be addressed not by making more accurate predictions, but by using the predictions differently? For example, don’t show me the ten films I would have rated highest – show me the ten for which my rating would have most exceeded the average rating for that film.

    But as you pointed out, if we want to move beyond accuracy we need new metrics. Are we interested in whether recommended items end up in the shopping basket (Amazon) or the harder to quantify question of whether good recommendations keep people interested (Netflix)? Is there an inherent value in introducing people to niche content, or does it make economic sense to allow a few popular items to become ever more popular?

  4. Neal Lathia says:

    If you look at the comments under the New Scientist article (link above), the readers are having a similar discussion – about how recommender systems are good “if you are looking for more of the same” [but for] “exploration and experimentation, however, it will still be up to friends, and chance, to find new gems.”

    Unfortunately, I think it is not only a question of new metrics – the data itself doesn’t differentiate between a 5* for something good that was more of the same and a 5* for something good that is serendipitous. Time to get the users more involved in evaluations?

  5. I would say that the two things co-exist – 1) make popular items more popular; 2) serve niche markets (the long tail)

  6. > Time to get the users more involved in evaluations?
    I can see that opening up a research stream ;-)

  7. Oscar Celma says:

    Hi Neal & co.

    Regarding the 10.05% and the (still hidden) winning algorithm: if I were Netflix I would go for a “simple” (let’s leave it like this for now) SVD- or NMF-like approach that scores, say, 9.70%, but is ten times faster to implement and runs twice as fast (I just made up the numbers, but I guess you know what I mean).

    So, there are tons of other factors, as already mentioned here and in the New Scientist, that are important when creating a whole recommender system.

    Furthermore, I’m pretty sure that adding up the expenses (salaries, machines, etc.) of the winning teams over these ~3 years clearly exceeds $1,000,000! Well, at least the institutions involved can get their money back :-)

    Last but not least, I never understood why it is so exciting (apart from the pri$$$e) to focus on predicting a value in the range [0..5] while forgetting about (as you mention): novelty, diversity and serendipity.

    In the music recommendation domain, users can listen to the recommendations, so here I’d like to have:
    - a link to youtube (or any other site) with the preview of the movie,
    - a link to IMDB
    - a link to Rotten Tomatoes
    - etc.
    This way I can figure out whether the recommendation was worth it or not.

    Well, I think that’s enough rambling for a sunny Saturday morning, isn’t it?
    Looking forward to more CHI-oriented papers about recommendations :-)

    Cheers, Oscar