Archive for the ‘dataset’ Category

Social Graph API by Google

Monday, February 4th, 2008

The new Social Graph API “makes information about the public connections between people on the Web easily available and useful. You can make it easy for users to bring their existing social connections into a new website and as a result, users will spend less time rebuilding their social networks and more time giving your app the love it deserves”.

API docs & Google’s post.

Spam Dataset

Monday, January 21st, 2008

WEBSPAM-UK2007 “is a large collection of annotated spam/nonspam hosts labeled by a group of volunteers. The base data is a set of 105,896,555 pages in 114,529 hosts in the .UK domain downloaded by the Laboratory of Web Algorithmics of the University of Milano. The assessment was done by a group of volunteers.

For the purpose of the Web Spam Challenge 2008, the labels are being released in two sets. SET1, containing roughly 2/3 of the assessed hosts will be given for training, while SET2 containing the remaining 1/3, will be held for testing. More information about the Web Spam Challenge 2008, co-located with AIRWeb 2008 will be available soon” here and here.

Netflix Prize dataset de-anonymised

Wednesday, December 19th, 2007

Two researchers at the University of Texas have de-anonymised (re-nymised? nymified?) the Netflix Prize dataset.

Netflix: Winners and Losers

Friday, December 14th, 2007

By now, the news has spread that team BellKor has won this year’s Netflix progress prize ($50,000) by achieving an 8.43% improvement over Cinematch (the grand prize is less than 2% away). Their current solution is described here, and perhaps the most interesting thing about it is the first sentence: “Our final solution (RMSE=0.8712) consists of blending 107 individual results.” It is also interesting to note that second place on the leader board belongs to team KorBell; I assume that this is because Netflix has restricted each team to one submission per day.

A natural question to ask, therefore (other than how many teams have multiple names and can, or are trying to, infer what the qualifying ratings actually are), is whether this race for accuracy is producing methods that suit the qualifying data perfectly but not necessarily the rest. It becomes a problem of overfitting. To quote Wikipedia, a brute-force approach aimed at accuracy could develop a method that “reduces or destroys the ability of the model to generalize beyond the fitting data!” In other words, once they unleash the winning algorithm on the rest of their data, will they maintain the 10% improvement over Cinematch?

My work in recent weeks has been following up on a previous paper by exploring the (lack of) information that a lonely RMSE or MAE can give us about how well collaborative filtering is performing: we know nothing about how widely the predictions are dispersed around the mean or how error evolves over time, and we are not considering a number of other aspects that should be close to our hearts. More coming up on that soon. But in the meantime, I made my first submission to the Netflix prize site to see how well the Java random number generator would perform. My incredible predictions were made using nextInt(5)+1. I achieved an RMSE of 1.93, and hopefully no team has performed worse than me.
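For anyone who wants to repeat the experiment, here is a minimal sketch of the random baseline. It is not the code behind the submission: the class name RandomBaseline, the helper names, and the handful of ratings in main are made up for illustration, and real Netflix data would have to be loaded separately.

```java
import java.util.Random;

// Sketch only: the ratings in main are invented, not Netflix data.
class RandomBaseline {

    // Root mean squared error between predictions and true ratings.
    static double rmse(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquared = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double error = predicted[i] - actual[i];
            sumSquared += error * error;
        }
        return Math.sqrt(sumSquared / actual.length);
    }

    // One uniformly random integer rating in 1..5 per item, i.e. nextInt(5) + 1.
    static double[] randomPredictions(int count, Random random) {
        double[] predictions = new double[count];
        for (int i = 0; i < count; i++) {
            predictions[i] = random.nextInt(5) + 1;
        }
        return predictions;
    }

    public static void main(String[] args) {
        double[] actual = {4, 5, 3, 1, 4, 2, 5, 3}; // hypothetical ratings
        double[] predicted = randomPredictions(actual.length, new Random(42));
        System.out.println("RMSE of random guesses: " + rmse(predicted, actual));
    }
}
```

With only eight made-up ratings the printed number means nothing; the point is just how little code a score on the leader board requires.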

Just out of curiosity, I got RMSE 1.92 on the probe set using the same technique; I haven’t read anywhere how well performance on the probe set predicts performance on the qualifying set. Further predictions on the probe set, based on a random number between 3 and 5, or (nextDouble()*2) + 3 (since the rating distribution is skewed towards the positive end in these datasets), improved my losing streak to RMSE 1.31. Lastly, simply returning the average rating for each movie gets RMSE 1.13. So if anyone out there is doing this well with crazy matrix operations, you might want to rethink your strategy :)
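The movie-average baseline is almost as short. The sketch below is an illustration under assumptions, not the code that produced the 1.13: the class MovieAverageBaseline, its method names, the parallel-array layout, and the fallback value for movies with no training ratings are all hypothetical, and the tiny arrays in main are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: data layout and names are assumptions made for illustration.
class MovieAverageBaseline {

    // Mean rating per movie id, computed from parallel arrays of training data.
    static Map<Integer, Double> movieMeans(int[] movieIds, double[] ratings) {
        Map<Integer, double[]> sums = new HashMap<>(); // movie id -> {sum, count}
        for (int i = 0; i < movieIds.length; i++) {
            double[] acc = sums.computeIfAbsent(movieIds[i], id -> new double[2]);
            acc[0] += ratings[i];
            acc[1] += 1;
        }
        Map<Integer, Double> means = new HashMap<>();
        for (Map.Entry<Integer, double[]> entry : sums.entrySet()) {
            means.put(entry.getKey(), entry.getValue()[0] / entry.getValue()[1]);
        }
        return means;
    }

    // RMSE when each rating is predicted as its movie's mean;
    // movies without a mean are predicted as 'fallback'.
    static double rmseOfMeans(Map<Integer, Double> means, double fallback,
                              int[] movieIds, double[] ratings) {
        double sumSquared = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            double predicted = means.getOrDefault(movieIds[i], fallback);
            double error = predicted - ratings[i];
            sumSquared += error * error;
        }
        return Math.sqrt(sumSquared / ratings.length);
    }

    public static void main(String[] args) {
        int[] trainMovies = {1, 1, 2};          // hypothetical training data
        double[] trainRatings = {2, 4, 5};
        int[] probeMovies = {1, 2, 3};          // hypothetical held-out data
        double[] probeRatings = {3, 4, 3};
        Map<Integer, Double> means = movieMeans(trainMovies, trainRatings);
        System.out.println(rmseOfMeans(means, 3.0, probeMovies, probeRatings));
    }
}
```

Whether the averages should include the probe ratings themselves is a choice this sketch leaves to the caller, since it only sees whatever training arrays it is handed.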

Visualization tool for data-sets

Friday, November 16th, 2007

This is something I found while looking for social networking datasets:
“Many Eyes” has a range of different datasets (e.g. Facebook, Second Life, etc.) and also provides users with different styles of visualization. I guess it is quite useful when you want to visualize a small dataset (a fraction of the traces) without having to code the visualization yourself.

Measurement and Analysis of Online Social Networks

Wednesday, November 14th, 2007

The first study to examine multiple online social networks at scale has been presented at IMC. The paper analyzes “data gathered from four popular online social networks: Flickr, YouTube, LiveJournal, and Orkut”.


  • “the indegree of user nodes tends to match the outdegree;
  • the networks contain a densely connected core of high-degree nodes;
  • this core links small groups of strongly clustered, low-degree nodes at the fringes of the network”.
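
To get a feel for the first of these properties on a dataset of your own, the degree counts are easy to compute from a directed edge list. The sketch below is not the paper's methodology: the class DegreeStats, its method names, and the int[][] edge-list layout are assumptions made for illustration. It also computes link reciprocity (the fraction of links whose reverse link exists), which is one simple reason indegree and outdegree would track each other.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: each edge is {from, to}; names and layout are hypothetical.
class DegreeStats {

    // Counts how often each node appears in one column of the edge list:
    // column 0 gives outdegree, column 1 gives indegree.
    private static Map<Integer, Integer> countColumn(int[][] edges, int column) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int[] edge : edges) {
            counts.merge(edge[column], 1, Integer::sum);
        }
        return counts;
    }

    static Map<Integer, Integer> outDegrees(int[][] edges) {
        return countColumn(edges, 0);
    }

    static Map<Integer, Integer> inDegrees(int[][] edges) {
        return countColumn(edges, 1);
    }

    // Packs a directed link (from, to) into one long so it can be stored in a set.
    private static long key(int from, int to) {
        return ((long) from << 32) | (to & 0xffffffffL);
    }

    // Fraction of distinct directed links (u, v) for which (v, u) is also present.
    // A self-loop counts as reciprocated.
    static double reciprocity(int[][] edges) {
        Set<Long> links = new HashSet<>();
        for (int[] edge : edges) {
            links.add(key(edge[0], edge[1]));
        }
        if (links.isEmpty()) {
            return 0.0;
        }
        int reciprocated = 0;
        for (long link : links) {
            int from = (int) (link >> 32);
            int to = (int) link;
            if (links.contains(key(to, from))) {
                reciprocated++;
            }
        }
        return (double) reciprocated / links.size();
    }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {2, 1}, {1, 3}}; // hypothetical links
        System.out.println("out: " + outDegrees(edges));
        System.out.println("in: " + inDegrees(edges));
        System.out.println("reciprocity: " + reciprocity(edges));
    }
}
```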

Implications for information dissemination and search

  • “The existence of a small, well-connected core implies that information seeded via a core node will rapidly spread through the entire network.”
  • “Similarly, searches that proceed along social network links will quickly reach the core. This suggests that simple unstructured search algorithms could be designed if the core users were to store some state about other users.”

Implications for trust
“In a social network, the underlying user graph can potentially be used as a means to infer some level of trust in an unknown user, to check the validity of a public key certificate, and to classify potential spam”.

  • “The tight core coupled with link reciprocity implies that users in the core appear on a large number of short paths. Thus, if malicious users are able to penetrate the core, they can skew many trust paths (or appear highly trustworthy to a large fraction of the network).”
  • “However, these two properties also lead to small path lengths and many disjoint paths, so the trust inference algorithms should be adjusted to account for this observation. In particular, given our data, an unknown user should be highly trusted only if multiple short disjoint paths to the user can be discovered.”
  • “The correlation in link degrees implies that users in the fringe will not be highly trusted unless they form direct links to other users. The “social” aspect of these networks is self-reinforcing: in order to be trusted, one must make many “friends”, and create many links that will slowly pull the user into the core.”

Creating animations of Bluetooth network traces

Thursday, September 6th, 2007

Recently I have been working with Dimitris Moustakas to create a set of animations from Bluetooth network traces. The animations are based on a dataset managed by and downloaded from CRAWDAD. The data was originally collected as part of the Reality Mining Project at MIT, in which 100 participants were given Bluetooth-enabled mobile phones and encouraged to carry them around over the course of the 2004-2005 academic year. Special software on the phones recorded Bluetooth connections between devices. Specifically, if a mobile phone was within range of another and a connection was established, then the start and end times of the connection and the device identification number were logged. The visualisations were created using SoNIA, a Java-based social network animation tool.
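
As a rough illustration of what such a log looks like once loaded, here is a minimal sketch of a contact record and two queries an animation needs: how long each device was seen in total, and which devices were in range at a given moment. Everything here is an assumption made for illustration: the class ContactLog, the field names, and the use of seconds are hypothetical and do not reproduce the actual CRAWDAD file format or SoNIA's input format.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: field names and units are hypothetical, not the dataset's format.
class ContactLog {

    // One logged connection as seen by a single phone.
    static final class Contact {
        final long start;   // connection start, in seconds
        final long end;     // connection end, in seconds
        final int deviceId; // identification number of the other device

        Contact(long start, long end, int deviceId) {
            this.start = start;
            this.end = end;
            this.deviceId = deviceId;
        }
    }

    // Total connected time per device id, summed over all logged contacts.
    static Map<Integer, Long> totalSecondsPerDevice(List<Contact> log) {
        Map<Integer, Long> totals = new HashMap<>();
        for (Contact contact : log) {
            totals.merge(contact.deviceId, contact.end - contact.start, Long::sum);
        }
        return totals;
    }

    // Device ids with a connection open at 'time' (start inclusive, end exclusive),
    // i.e. the neighbours to draw in one animation frame.
    static Set<Integer> devicesInRangeAt(List<Contact> log, long time) {
        Set<Integer> devices = new TreeSet<>();
        for (Contact contact : log) {
            if (contact.start <= time && time < contact.end) {
                devices.add(contact.deviceId);
            }
        }
        return devices;
    }

    public static void main(String[] args) {
        List<Contact> log = new ArrayList<>();
        log.add(new Contact(0, 10, 7));   // hypothetical contacts
        log.add(new Contact(20, 25, 7));
        log.add(new Contact(5, 8, 9));
        System.out.println(totalSecondsPerDevice(log));
        System.out.println(devicesInRangeAt(log, 6));
    }
}
```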

Mobility traces

Monday, August 6th, 2007

From Vassilis: “As part of the Cityware project, we have released a Facebook application that provides an interface to Bluetooth mobility traces. Our application allows users to explore who they come in contact with, how often, and for what duration.

Additionally, we have released a small software utility that allows users to capture their own traces and upload them to our servers.

The application is freely available for anyone to use, provided you have a Facebook account”.

To register, please visit this.