Archive for December, 2007

Redefining Information Overload

Thursday, December 27th, 2007

The other day I was sitting at Gatwick airport waiting for my flight home to Italy to spend Christmas with my family. I got my flight with Easyjet- and when I bought the ticket online I was also able to sign up to one of their new, free text-messaging services:

  • Some of the texts were very helpful: the morning of my flight I received a text with my flight details and confirmation number, information that I may usually scribble on a piece of paper or the back of my hand. Result: no paper, and clean hands (happier parents?)
  • Some of the texts could have made us of some location information: a text said (in a nice way) “go to your gate” … umm, should I reply to the computer and tell it I’m already there?
  • Other texts were interesting, but I didn’t need them: “Use this text to get 0% commission on currency exchange.” I have some Euros in my pocket. Can you send me this text again when I do need Euros? (Maybe I’ll tell you when?)
  • Other texts were just useless. “Go to shop X and get Y% discount with this text.” I won’t say what the shop is, let’s just leave it at the fact that its contents don’t quite fit my profile (specifically gender). Why do you keep interupting me from the book I was reading to give me this useless advertisement? My only current solution is to unsubscribe- but I’ll lose all the information I liked then! (more…)

Call for PhD thesis on Trust Management

Friday, December 21st, 2007

ERCIM STM WG launched an award for the best Ph.D. Thesis on Security and Trust Management (discussed in 2007) (pdf, doc)

Netflix Prize dataset de-anonymised

Wednesday, December 19th, 2007

Two researchers at the University of Texas have de-anonymised (re-nymised? nymified?) the Netflix Prize dataset.

Workshop on Online Social Networks

Tuesday, December 18th, 2007

There was a workshop at on the impact of social networks on telecommunication networks. Here is the (interesting) .

New satellite navigation system may save firefighters

Monday, December 17th, 2007

“A new tracking system to pinpoint people inside smoked-filled buildings has been developed in a move that should slash the risks faced by firefighters. French aerospace company Thales said on Wednesday its Indoor Positioning System (IPS) was aimed initially at helping fire services, although it could also be used by the police and armed forces. A Thales spokeswoman said the new system was based on a new kind of radio signal, called Ultra Wide Band, designed for very short range and high data-rate links”. Source

Netflix: Winners and Losers

Friday, December 14th, 2007

By now, the news has spread around that team BellKor has won this year’s Netflix progress prize ($50,000) by achieving an 8.43% improvement over Cinematch (the grand prize is less than 2% away). Their current solution is described here; and perhaps the most interesting thing about it is the first sentence. “Our final solution (RMSE=0.8712) consists of blending 107 individual results.” It is also interesting to note that the second place on the leader board is team KorBell; I assume that this is because Netflix has restricted each team to one submission per day.

A natural question to ask, therefore, (other than the one about how many teams may have multiple names and can/are trying and infer what the qualifying ratings actually are) is that perhaps this race for accuracy is developing methods that are perfectly suitable for the qualifying data but not necessarily for the rest! It becomes a problem of overfitting. To quote wikipedia, a brute-force approach aimed at accuracy could develop a method that “reduces or destroys the ability of the model to generalize beyond the fitting data!” In other words, once they unleash the winning algorithm on the rest of their data, will they maintain the 10% improvement over Cinematch?

My work in recent weeks has been following up on a previous paper, by exploring the (lack of) information that a lonely RMSE or MAE can give us about how well collaborative filtering is performing: we know nothing about how much the predictions are dispersed around the mean, how error evolves over time, and are not considering a number of other aspects that should be close to our heart. More coming up on that soon. But in the mean time, I made my first submission to the Netflix prize site to see how well the Java random number generator would perform. My incredible predictions were made using nextInt(5)+1. I achieved an RMSE of 1.93, and hopefully no team has performed worse than me.

Just out of curiosity, I got RMSE 1.92 on the probe set using the same technique; I haven’t read anywhere about the extent to which the probe set offers a good idea as to how well qualifying performance will be. Further predictions on the probe set, based on a random number between 3 and 5, or (nextDouble()*2) + 3, (since rating distribution is skewed towards the positive end in these datasets) improved my losing streak to RMSE 1.31. Lastly, simply returning the average rating for each movie gets RMSE 1.13. So if anyone out there is doing this well with crazy matrix operations, you might want to rethink your strategy :)

A Hunger for Books (Not Blogs)

Monday, December 10th, 2007

I was recently reading Dorris Lessing’s Nobel Prize for Literature acceptance speech (full text here). It’s an excellent read, a story of storytelling, that recalls her experiences in Africa and the influence of books on a writer. I strongly recommend all to read it. However, as inspirational as it is, there is also a strong feeling of cynicism towards the culture heralded on by the technology revolution, and some points worth thinking about. Here is a short quote:

“We are in a fragmenting culture, where our certainties of even a few decades ago are questioned and where it is common for young men and women, who have had years of education, to know nothing of the world, to have read nothing, knowing only some speciality or other, for instance, computers.

What has happened to us is an amazing invention – computers and the internet and TV. It is a revolution. This is not the first revolution the human race has dealt with. The printing revolution, which did not take place in a matter of a few decades, but took much longer, transformed our minds and ways of thinking. A foolhardy lot, we accepted it all, as we always do, never asked: “What is going to happen to us now, with this invention of print?” In the same way, we never thought to ask, “How will our lives, our way of thinking, be changed by the internet, which has seduced a whole generation with its inanities so that even quite reasonable people will confess that, once they are hooked, it is hard to cut free, and they may find a whole day has passed in blogging etc?”

I found it strange how she describes the printing revolution as something good, by allowing “voices unheard” to wield their talent, but the internet revolution as something meaningless, fragmenting, and wasteful. It seems to either imply that publishers have been given the divine gift of knowing what is good to publish, or that people (are dumb, and) lack the collective knowledge to find what is worth reading. Is there no such thing as collective wisdom? Does a change of medium naturally imply a change in content and quality? Perhaps her words reflect Toffler’s predictions, or it is impossible for her to find any quality in the chaotic community that we call the web?

After all, her message came to me by means of blogs and online news. Any thoughts?

ICDM 2007

Wednesday, December 5th, 2007

I attended ICDM (a data mining conference) this year. Since I cannot comment on all the papers I’ve found interesting, here is the full program and my comments on very few papers follow ;-)

6 Full papers

1) Improving Text Classification by Using Encyclopedia Knowledge

Existing methods for classifying text do not work well. That is partly because there are many terms that are (semantically) related but do not co-occur in the same documents. To capture the relationships among those terms, one should use a thesaurus. Pu Wand et al. built a huge thesaurus from Wikipedia and showed that classification benefits from its use.

2) Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
The most common form of collaborative filtering consists of three major steps:
(1) data normalization, (2) neighbour selection, and (3) determination of interpolation weights. Bell and Koren showed that different ways of carrying out the 2nd step do not impact the predictive accuracy. They then revisited the remaining two steps – they revisited:
+ the 1st step by removing 10 “global effects” that cause substantial data variability and mask fundamental relationships between ratings,
+ and the 3rd step by computing interpolation weights as a global solution to an optimization problem.
By using these revisions, they considerably improved predictive accuracy, so much so that they won the Netflix Progress Prize.

3) Lightweight Distributed Trust Propagation
Soon individuals will be able to share digital content (eg, photos, videos) using their portable devices in a fully distributed way (without relying on any server). We presented a way with which portable devices distributely select content from reputable sources (as opposed to previous work that focuses on centralized solutions).

4) Analyzing and Detecting Review Spam
Here Jindal and Liu proposed an effective way for detecting spam of product reviews.

5) Co-Ranking Authors and Documents in a Heterogeneous Network
Existing ways of ranking network nodes (eg, PageRank) work on homogeneous networks (networks whose nodes represent the same kind of entity, eg, nodes of a citation network usually represent publications). But most networks are heterogeneous (eg, a citation network may well have nodes that are either publications or authors). To rank nodes of heterogeneous networks, Zhou et al. proposed a way that couples random walks. In a citation network, this translates into two random walks that separately rank authors and publications (rankings of publications and their authors depend on each other in a mutually reinforcing way).

6) Temporal analysis of semantic graphs using ASALSAN
Say we have a large dataset of emails that employees of a company (eg, of Enron) have exchanged. To make sense of that dataset, we may represent it as a (person x person) matrix and decompose that matrix to learn latent features. Decompositions (eg, SVD) usually work on a two-dimensional matrix. But say that we also know WHEN emails have been sent. That is, we have a three-dimensional matrix – (person x person x time) matrix. Bader et al. showed how to decompose 3-dimensional matrices.

1 Short paper
1) Trend Motif: A Graph Mining Approach for Analysis of Dynamic Complex Networks
Jin et al. proposed a way of mining complex networks whose edges have weights that change over time. More specifically, they extract temporal trends – trends of how weights change over time.

2 Workshop Papers
1) Aspect Summarization from Blogsphere for Social Study
Researchers have been able to classify sentiments of blog posts (eg, whether posts contain positive or negative reviews). Chang and Tsai built a system that marks a step forward – the ability to extract opinions from blog posts. In their evaluation, they showed how their system is able to extract pro and con arguments about abortion and gay marriage from real blog posts.

2) SOPS: Stock Prediction using Web Sentiment
To predict stock values, traditional solutions solely rely on past stock performance. To make more informed predictions, Sehgal and Song built a system that scans financial message boards, extracts sentiments expressed in them, and then learns the correlation between sentiment and stock performance. Upon what it learns, the system makes predictions that are more accurate than those of traditional methods.

3 Invited Talks (here are some excerpts from the speakers’ abstracts)
1) Bricolage: Data at Play (pdf)
There are a number of recently created websites (eg, Swivel, Many Eyes, Data 360) that enable people to collaboratively post, visualize, curate and discuss data. These sites “take the premise that communal, free-form play with data can bring to the surface new ideas, new connections, and new kinds of social discourse and understanding. Joseph M. Hellerstein focused on opportunities for data mining technologies to facilitate, inspire, and take advantage of communal play with data.

2) Learning from Society (htm)
Ian Witten illustrated how learning from society in the Web 2.0 world will provide some of the value-added functions of the librarians who have traditionally connected users with the information they need. He also stressed the importance of designing functions that are “open” to the world in contract to the unfathomable “black box” that conceals the inner workings of today’s search engines.

3) Tensor Decompositions and Data Mining (pdf)
Matrix decompositions (such as SVD) are useful (eg, for ranking web pages, for recognizing faces) but are restricted to two-way tabular data. In many cases, it is more natural to arrange data into an N-way hyperrectangle and decompose it by using what is called a tensor decomposition. Tamara Kolda discussed several examples of tensor decompositions being used for hyperlink analysis for web search, computer vision, bibliometric analysis, cross-language document clustering, and dynamic network traffic analysis.

Trust, Mobisys, and more

Wednesday, December 5th, 2007

I’ve just finished to give a presentation. I talked about old stuff – TRULLO (pdf, post) and distributed trust propagation (pdf, post). So I recycled old slides – only the first 20 slides (below) were brand new ;-) Thanks to Neal and Elisa!


Modelling Conflict. 6th of Dec ’07

Wednesday, December 5th, 2007

Where: 57-58 De Morgan House – Russell Square. Central London.
When: 5:00pm – 7.00 pm
Who: Professor Timothy Hackworth & Philip Treleaven (Computer Science UCL)

What: Computational Science has already had an immense impact on the life sciences. We now try to assess its effectiveness on social and political modelling and, in particular, in thwarting terrorism (pdf).

Filtering spam depending on your reputation (on the amount of spam you typically receive)

Tuesday, December 4th, 2007

Abaca has recently proposed an effective way of filtering spam emails. It is called receiver reputation.

It relies on this fact
One can group receivers by the amount of spam they receive on a daily basis. Say we consider 5 groups. “People in Group 1 receive, on average, 90% spam. Group 2 receives 70% spam, Group 3 receives 50% spam, Group 4 receives 30% spam, and Group 5 receives 10% spam.”

How it works
Messages are classified whether they are spam or not depending on the receiver of the message, “rather than where the message is FROM or what it CONTAINS“. “Essentially, if the message is sent to users who typically receive a high percentage of spam, the message is more likely to be spam. However, if the message is sent to users who typically receive a low percentage of spam, the message is more likely to be legitimate. Combining the reputations of all recipients of a particular message, therefore, is equivalent to combining those users’ rating power to estimate the legitimacy of the sender and the message”

What about new users?
“The system can be bootstrapped from an empty database with just 2 users (someone who gets a lot of spam and someone who gets a lot of ham). … The system was initially seeded with just two users: a person who receives virtually all spam and a person who receives virtually all legitimate mail. The statistics of a third user was then approximated using the ratings established by the first two users. The fourth user was added with that user’s statistics approximated by the first three users, etc.”
“The amazing thing is no human is required to read or rate any email; the system gets smarter on it’s own without any human intervention”