Archive for the ‘evaluation’ Category

Evaluating search engines: blindsearch

Thursday, June 11th, 2009

BlindSearch is a simple and neat site that collects ‘objective’ opinions on search quality by showing query results from Google, Yahoo and Bing side by side without identifying which is which and inviting you to select the best.

Yahoo Datasets (Webscope)

Tuesday, May 5th, 2009

Yahoo has recently made publicly available a huge catalog of datasets (data on ratings, language, graphs, and advertising).

Temporal Collaborative Filtering

Tuesday, April 28th, 2009

As part of my recent work on collaborative filtering (CF), I’ve been examining the role that time plays in recommender systems. To date, the most notable use of temporal information (if you’re familiar with the Netflix prize) is that researchers are using timestamps to inch their way closer to the million-dollar reward. The idea is to use how user ratings vary according to, for example, the day of the week on which they were entered, in order to better predict the probe (and, more importantly, the qualifying) datasets. My only criticism here is that once the million dollars has been won, nobody is going to implement and deploy this aspect of the algorithm (unless you are prepared to update your recommendations every day?), since, in practice, we do not know when users are going to rate items.
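To make the idea concrete, here is a minimal sketch (made-up ratings, and a hypothetical `predict` function – not any Netflix team’s actual model) of capturing a global day-of-week bias from timestamped ratings:

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical ratings: (user, item, rating, unix timestamp)
ratings = [
    ("u1", "i1", 4.0, 1230768000),
    ("u1", "i2", 3.0, 1231200000),
    ("u2", "i1", 5.0, 1231372800),
    ("u2", "i3", 2.0, 1231459200),
]

global_mean = sum(r for _, _, r, _ in ratings) / len(ratings)

# Accumulate each rating's deviation from the global mean,
# bucketed by day of week (0 = Monday ... 6 = Sunday)
day_sums = defaultdict(float)
day_counts = defaultdict(int)
for _, _, r, ts in ratings:
    day = datetime.fromtimestamp(ts, tz=timezone.utc).weekday()
    day_sums[day] += r - global_mean
    day_counts[day] += 1

day_bias = {d: day_sums[d] / day_counts[d] for d in day_sums}

def predict(ts):
    """Baseline prediction adjusted by the day-of-week bias (if known)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).weekday()
    return global_mean + day_bias.get(day, 0.0)
```

Richer models add per-user and per-item time-varying terms, but even this global day effect illustrates how timestamps can shave error off a static baseline – and why it is useless at deploy time if you don’t know when the rating will happen.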

In the context of my work, I’ve been looking at two areas: the effect of time on (1) the similarity between users (RecSys ’08), and (2) the recommender system itself. Here’s a brief summary of (2): (more…)

Mapping Social Networks (with APIs)

Monday, April 6th, 2009

There seem to be many reasons why people connect online. For example, on Twitter, I have connected to friends, colleagues, family, people I have met at conferences (or simply know through their work), and a couple of celebrities (like Tom Waits). These few reasons are only a partial list of why two people may connect on a social network; of course, understanding why people connect to each other would give insight into suggesting new connections for people to make… (more…)

Similarity Graphs

Thursday, February 26th, 2009

The idea of reasoning about content to recommend as a similarity graph is quite widespread. Broadly speaking, you can start by drawing a set of circles (for users) on the left and a set of circles (for “items” – songs, movies…) on the right; when users rate/listen to/etc. items, you draw an arrow from the corresponding left circle to the right circle (i.e. a bipartite graph). What collaborative filtering algorithms can do is project this two-sided graph into two equivalent representations, where users are linked to other users, and items are linked to other items, based on how similar they are.
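As a rough sketch (hypothetical listening data; Jaccard is just one of several similarity choices), the projection from the bipartite graph to an item-item graph might look like:

```python
from itertools import combinations

# Hypothetical bipartite graph: user -> set of items they rated/listened to
edges = {
    "alice": {"song_a", "song_b"},
    "bob":   {"song_a", "song_b", "song_c"},
    "carol": {"song_c"},
}

# Invert the edges to get item -> set of users
item_users = {}
for user, items in edges.items():
    for item in items:
        item_users.setdefault(item, set()).add(user)

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

# Project onto an item-item graph, weighting each edge by how much
# the two items' audiences overlap
item_graph = {}
for i, j in combinations(sorted(item_users), 2):
    w = jaccard(item_users[i], item_users[j])
    if w > 0:
        item_graph[(i, j)] = w
```

The symmetric user-user projection works the same way, starting from user sets of items instead. Whether these edge weights reflect *real* similarity between the content is exactly the question the post raises.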

There are a bunch of places where this kind of abstraction has been used; for example, Oscar Celma used graphs to navigate users when discovering music in the long-tail. Paul Lamere posted graphs made with the EchoNest API on his blog. I’ve also dabbled in this area a bit, but not using music listening data; I was using (the more traditional) MovieLens and Netflix datasets. The question that comes to mind when reading about techniques that operate on the graph, though, is: are the underlying graphs real representations of similarity between content? What if the graphs are wrong? (more…)

Crowdsourcing User Studies With Mechanical Turk

Tuesday, February 10th, 2009

We just finished our reading session of “Crowdsourcing User Studies With Mechanical Turk” (pdf). Very interesting paper. A few hand-written notes on the types of tasks we would run on MechTurk.

DeviceAnywhere
Friday, January 9th, 2009

I don’t know how it works, but it seems to be a nice mobile testbed. Does anyone know more?

“An award-winning, ground-breaking product designed by Mobile Complete, DeviceAnywhere™ provides developers real-time interaction with handsets that are connected to live global networks. Built on Mobile Complete’s innovative device interaction technology, Direct-To-Device™, DeviceAnywhere enables you to connect to and control mobile devices around the world – using just the Internet. Through DeviceAnywhere’s original, non-simulated, real-time platform, you can remotely press buttons, view LCD displays, listen to ringers and tones, and play videos … just as if you were holding the device in your hands!”

Media Slant

Friday, November 28th, 2008

Say that you have to answer this research question: “Does the market discourage biased reporting (media slant)? Or does the market encourage it?” Matthew Gentzkow and Jesse Shapiro, two economists at the University of Chicago’s business school, set out to test exactly this, and The Economist reported on their research in just two pages. Gentzkow and Shapiro’s methodology is smart and imaginative! Here you go:


Solving Research Problems

Tuesday, September 9th, 2008

There’s an interesting post on techcrunch based on a comment by Google’s Marissa Mayer, who apparently said that search is “90-95% solved.” Regardless of the context in which this comment was made, the post’s response is that the problem is not solved: it then outlines a number of areas where information retrieval has yet to succeed. However, (reminiscent of a short question that appeared in the panels of RecSys 2007) there are other questions to ask: when is ANY research question “solved?” What does it mean for a problem (like information retrieval or collaborative filtering) to be “solved?”


Flickr Places

Friday, August 29th, 2008

Flickr Places “is a method of exploring Flickr with geo-specific pages. The page shows the most interesting photos for a location (iconic photos they call them), the most recent and common tags for the photos and the most prolific photo groups. It creates a separate page for each geographic location with a unique human-readable URL. Places go down to the city level so San Francisco, Seattle, and London will each have their own page and unique URL. In time they will go deeper. Places will be accessible via the Flickr API.” More here and here.


This project could also yield data useful for evaluation!

Dataset and R code for our paper on genres/artists affinity

Tuesday, August 19th, 2008

Justin Donaldson and I have a paper at ISMIR entitled “Uncovering affinity of artists to multiple genres from social behaviour data”. The paper details a project we worked on for the past year or so involving popular music listening activity from a pool of MusicStrands users.

We provide not only the paper, but also the dataset and the code used in our analysis. All of this is available at the website we have set up for the project:

The main contribution of the project is an analysis and illustration of genres as “fuzzy sets” rather than Boolean labels. Through a co-occurrence analysis of hundreds of thousands of user playlists, a frequency-based “affinity” metric is formed between artists and genres. This affinity metric is a more detailed expression of the style of a given artist’s music. Awareness of predominant genres is a basic part of any person’s understanding of the vast corpus of popular music; however, genres are typically used as Boolean categorical labels, i.e. an artist is understood to be associated with only one genre.
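A minimal sketch of such a co-occurrence affinity (toy playlists and a hypothetical `affinity` function – not the exact metric from the paper) could look like:

```python
from collections import Counter, defaultdict

# Hypothetical data: each playlist is a list of (artist, primary_genre) tracks
playlists = [
    [("Radiohead", "rock"), ("Aphex Twin", "electronic")],
    [("Radiohead", "rock"), ("Muse", "rock")],
    [("Radiohead", "rock"), ("Boards of Canada", "electronic")],
]

# Count, for each artist, how often it shares a playlist with each genre
cooc = defaultdict(Counter)
for pl in playlists:
    for artist, _ in pl:
        for other, genre in pl:
            if other != artist:
                cooc[artist][genre] += 1

def affinity(artist):
    """Normalize co-occurrence counts into a fuzzy genre-membership vector."""
    counts = cooc[artist]
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}
```

Here an artist’s affinity vector sums to one across genres, so instead of a single Boolean label you get graded memberships – the “fuzzy set” view the paper argues for.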

By expressing a connection to multiple genres through our affinity metric, a more detailed picture of the artist emerges. We give many more examples on the website, so be sure to check it out.


Claudio Baccigalupo

Evaluating Mobile Solutions – WWW’08 to the rescue

Tuesday, July 8th, 2008

To evaluate new mobile content discovery approaches, one needs to understand:

1) What mobile users query for:

2) How interests are distributed across mobile users (who befriend each other):

Debating the Long Tail

Sunday, June 29th, 2008

I’ve just read an article by Anita Elberse titled “Should we invest in the long tail?”, published in the Harvard Business Review (no direct link; google for it). Based on what appears to be a very rigorous and extensive study, the author reports conclusions that seem to run counter to what is stated in Chris Anderson’s famous book “The Long Tail”.


Evaluating our smart algorithms

Saturday, March 15th, 2008

Many of us are designing smart algorithms and are often supposed to evaluate them by carrying out well-designed user studies.

Problem: Those studies are expensive and, consequently, we tend to trade off between sample size, time requirements, and monetary costs.

Proposal (by PARC researchers): to collect user measurements from micro-task markets (such as Amazon’s Mechanical Turk). Here is their blog post (which comments on their upcoming CHI paper titled “Crowdsourcing User Studies With Mechanical Turk“).

Note: Often, users are irrational.