By now, the news has spread around that team BellKor has won this year’s Netflix progress prize ($50,000) by achieving an 8.43% improvement over Cinematch (the grand prize is less than 2% away). Their current solution is described here; and perhaps the most interesting thing about it is the first sentence. “Our final solution (RMSE=0.8712) consists of blending 107 individual results.” It is also interesting to note that the second place on the leader board is team KorBell; I assume that this is because Netflix has restricted each team to one submission per day.
A natural question to ask, therefore, (other than the one about how many teams may have multiple names and can/are trying and infer what the qualifying ratings actually are) is that perhaps this race for accuracy is developing methods that are perfectly suitable for the qualifying data but not necessarily for the rest! It becomes a problem of overfitting. To quote wikipedia, a brute-force approach aimed at accuracy could develop a method that “reduces or destroys the ability of the model to generalize beyond the fitting data!” In other words, once they unleash the winning algorithm on the rest of their data, will they maintain the 10% improvement over Cinematch?
My work in recent weeks has been following up on a previous paper, by exploring the (lack of) information that a lonely RMSE or MAE can give us about how well collaborative filtering is performing: we know nothing about how much the predictions are dispersed around the mean, how error evolves over time, and are not considering a number of other aspects that should be close to our heart. More coming up on that soon. But in the mean time, I made my first submission to the Netflix prize site to see how well the Java random number generator would perform. My incredible predictions were made using nextInt(5)+1. I achieved an RMSE of 1.93, and hopefully no team has performed worse than me.
Just out of curiosity, I got RMSE 1.92 on the probe set using the same technique; I haven’t read anywhere about the extent to which the probe set offers a good idea as to how well qualifying performance will be. Further predictions on the probe set, based on a random number between 3 and 5, or (nextDouble()*2) + 3, (since rating distribution is skewed towards the positive end in these datasets) improved my losing streak to RMSE 1.31. Lastly, simply returning the average rating for each movie gets RMSE 1.13. So if anyone out there is doing this well with crazy matrix operations, you might want to rethink your strategy