Archive for the ‘paper’ Category

WorldWide Buzz

Wednesday, April 2nd, 2008

A new paper, written while the author was an intern at Microsoft, analyses “the largest social network analyzed to date.” Here is the abstract:

We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numbers of people, rather than the actions and characteristics of individuals. The dataset contains summary properties of 30 billion conversations among 240 million people. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We report on multiple aspects of the dataset and synthesized graph. We find that the graph is well-connected and robust to node removal. We investigate on a planetary-scale the oft-cited report that people are separated by “six degrees of separation” and find that the average path length among Messenger users is 6.6. We also find that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than conversations with the same gender.
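The 6.6 figure is an average over shortest-path lengths; on a 180-million-node graph this cannot be computed exactly for all pairs, so such numbers are typically estimated by sampling source nodes and running a breadth-first search from each. Here is a minimal sketch of that estimation idea (not the authors’ actual machinery), with a small-world toy graph standing in for the Messenger network:

```python
import random
import networkx as nx

def estimate_avg_path_length(G, n_samples=100, seed=42):
    """Estimate mean shortest-path length by BFS from sampled sources."""
    rng = random.Random(seed)
    sources = rng.sample(list(G.nodes), min(n_samples, G.number_of_nodes()))
    total, count = 0, 0
    for s in sources:
        # In an unweighted graph, BFS yields shortest-path lengths
        for target, dist in nx.single_source_shortest_path_length(G, s).items():
            if target != s:
                total += dist
                count += 1
    return total / count if count else float("nan")

# Toy stand-in for the Messenger graph: a small-world network
G = nx.connected_watts_strogatz_graph(10_000, k=10, p=0.1, seed=1)
print(estimate_avg_path_length(G))
```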

YouTube Behavior

Wednesday, April 2nd, 2008

I read an interesting paper, “Identifying User Behavior in Online Social Networks,” by Marcelo Maia, Jussara Almeida and Virgílio Almeida. It was presented yesterday at the First International Workshop on Social Networks (co-located with EuroSys 2008).

The paper uses an interesting dataset: a social network based on user subscriptions on YouTube. In other words, if I subscribe to your video uploads, then I link to you in the network. Here is a very brief summary. How can users be classified according to different behaviors? An answer to this question would help specialists design their sites according to the target audience; however, trying to identify groups of similarly-behaved users based on individual attributes does not produce useful results. So what can be done? More informative traits can be used: the social-interaction attributes. For example, consider the subscription network of YouTube: using each user’s in-degree (people who subscribe to that user’s content), out-degree (number of subscriptions), and reciprocity (mutual subscriptions), as well as the number of uploads, watches, and channel views, allows user behavior to be classified into five groups. The three main groups are the content producers, the consumers, and the mixed producers/consumers. The other two are the old, possibly inactive users and those with a small degree but a high clustering coefficient (the cliques).

The users in this dataset were classified using k-means, which typically relies on a pre-defined value of k. Another interesting contribution is a method that finds what k to use, based on properly balancing inter- and intra-cluster distances (details in the paper). Of course, just like needing to specify k, a more general weakness of these techniques seems to be that you need to know what you are looking for before you can find any structure. In other words, if the authors had decided to cluster based on different social-interaction attributes (or social-network graph properties), might their results have been remarkably different?
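As a rough illustration of the pipeline, here is a minimal sketch assuming scikit-learn, with synthetic data in place of the real attributes; the criterion for choosing k below is a simple intra/inter distance ratio standing in for the paper’s more careful method:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative social-interaction attributes per user (synthetic data):
# in-degree, out-degree, reciprocity, uploads, watches, channel views
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.lognormal(size=(1000, 6)))

def choose_k(X, k_range=range(2, 11)):
    """Pick k by balancing intra-cluster cohesion against inter-cluster
    separation; a stand-in for the paper's criterion."""
    best_k, best_score = None, np.inf
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        intra = km.inertia_ / len(X)                 # mean squared within-cluster distance
        centers = km.cluster_centers_
        dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
        inter = dists[np.triu_indices(k, 1)].mean()  # mean between-center distance
        score = intra / inter                        # smaller is better
        if score < best_score:
            best_k, best_score = k, score
    return best_k

k = choose_k(X)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```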

There are a lot of other interesting papers that use YouTube datasets, including this one that looks at how content popularity on the site fluctuates.

Algorithm finds the network – for genes or the Internet

Thursday, March 20th, 2008

Maybe someone is interested:

“In a recent paper in Physical Review E 77:016104 (2008), Weixiong Zhang, Ph.D., Washington University associate professor of computer science and engineering and of genetics, along with his Ph.D. student, Jianhua Ruan, published an algorithm to automatically identify communities and their subtle structures in various networks.”
(news article).

Franco

Filtering By Trust

Thursday, February 28th, 2008

A lot of the posts on this blog discuss the idea of computers using trust in different scenarios. Since starting as a research student I have read a lot of good papers about trust, and I still find the idea slightly strange: is trust something that can actually be quantified? Are these computers actually reasoning about trust? (What is trust?)

Trust, however, seems to have a place in our research. It gives us a metaphor to describe how interactions between “strangers” should happen, a platform to build applications that feed off these ideas, and a language to describe and analyse the emergent properties that we observe in these systems. So, even if a trust relationship may be a concept that cannot be boiled down to a deterministic protocol, it gives researchers a model of the world they are exploring.

I decided to apply this metaphor to collaborative filtering, an algorithm that aims to generate recommendations by assigning neighbours to each user. Usually this assignment is done by computing similarity, which has its drawbacks: how similar are two people who have never rated items in common? What is the best way of measuring similarity? Applying the metaphor of trust, instead, aims at capturing the value that each neighbour adds to a user’s recommendations over time, and value is derived not only from agreement, but also from the ability a neighbour has to provide information. While similarity-based algorithms use neighbour ratings as-is, interactions based on trust acknowledge that information received from others is subject to interpretation, as there may be a strong semantic distance between the ways two users apply ratings to items.
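To make the metaphor concrete, here is a minimal sketch of the general idea, not the paper’s algorithm: each neighbour’s trust starts at a neutral value and is nudged up or down according to how valuable their ratings turn out to be, and predictions are trust-weighted rather than similarity-weighted. All names and the update rule are illustrative.

```python
from collections import defaultdict

class TrustCF:
    """Minimal sketch of trust-weighted collaborative filtering.
    Not the paper's exact algorithm: trust here simply grows when a
    neighbour supplies a rating and that rating proves accurate."""

    def __init__(self):
        self.ratings = defaultdict(dict)          # user -> {item: rating}
        self.trust = defaultdict(lambda: defaultdict(lambda: 0.5))

    def add_rating(self, user, item, rating):
        self.ratings[user][item] = rating

    def predict(self, user, item):
        # Trust-weighted average over neighbours who rated the item
        num = den = 0.0
        for other, their in self.ratings.items():
            if other != user and item in their:
                t = self.trust[user][other]
                num += t * their[item]
                den += t
        return num / den if den else None

    def update_trust(self, user, item, actual, lr=0.1):
        # Reward neighbours whose rating was close to the user's own;
        # value comes both from agreement and from having information at all
        for other, their in self.ratings.items():
            if other != user and item in their:
                error = abs(their[item] - actual) / 4.0   # ratings on a 1-5 scale
                target = 1.0 - error
                t = self.trust[user][other]
                self.trust[user][other] = t + lr * (target - t)

cf = TrustCF()
cf.add_rating("alice", "film1", 5)
cf.add_rating("bob", "film1", 2)
print(cf.predict("carol", "film1"))          # 3.5 with neutral trust everywhere
cf.update_trust("carol", "film1", actual=5)  # alice's trust rises, bob's falls
print(cf.predict("carol", "film2"))          # None: nobody has rated film2
```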

Trust is a richer concept than similarity: it favours neighbours who can give you information about the items you seek, and it offers a means of learning to interpret the information received from them. Evaluating this technique shows that it improves on the basic similarity-driven algorithms, in terms of both accuracy and coverage: modeling the problem as an instance of a trust-management scenario seems to offer an escape from the traditional pitfalls of similarity-based approaches. I leave all the details to the paper, “Trust-based Collaborative Filtering,” which is due to appear in IFIPTM 2008 (thanks to Daniele for his great comments and feedback).

Benefits of Social Groups

Tuesday, January 22nd, 2008

There is an interesting article in Nature about group formation: the interaction between predators and prey (usually modelled as random mixing) changes drastically when the predators and prey start forming social groups. Prey benefit from grouping because it reduces the chances that predators will come across them, and predators benefit by being able to attack in numbers when they do encounter prey. Naturally, the question to ask is to what extent this could be true in other scenarios (forming trust communities?), and whether stability can arise from being super-selective about who we interact with (thanks to Steve for the article).

ICDM 2007

Wednesday, December 5th, 2007

I attended ICDM (a data mining conference) this year. Since I cannot comment on all the papers I found interesting, here is the full program; my comments on just a few papers follow ;-)

6 Full papers

1) Improving Text Classification by Using Encyclopedia Knowledge

Existing methods for classifying text do not work well, partly because many terms are (semantically) related but do not co-occur in the same documents. To capture the relationships among those terms, one should use a thesaurus. Pu Wang et al. built a huge thesaurus from Wikipedia and showed that classification benefits from its use.
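The mechanism is easy to sketch: expand each document’s bag of words with thesaurus neighbours before training a classifier, so that documents sharing no literal terms can still overlap. The toy thesaurus and weights below are purely illustrative (the paper derives its thesaurus automatically from Wikipedia):

```python
# Hypothetical mini-thesaurus mapping a term to semantically related terms
THESAURUS = {
    "puck": ["hockey", "nhl"],
    "goalie": ["hockey", "soccer"],
}

def expand(tokens, weight=0.5):
    """Augment a bag of words with related terms so that documents that
    never share literal terms can still match. Weights are illustrative."""
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0.0) + 1.0
        for related in THESAURUS.get(tok, []):
            counts[related] = counts.get(related, 0.0) + weight
    return counts

print(expand(["puck", "goalie"]))
# {'puck': 1.0, 'hockey': 1.0, 'nhl': 0.5, 'goalie': 1.0, 'soccer': 0.5}
```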

2) Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
The most common form of collaborative filtering consists of three major steps:
(1) data normalization, (2) neighbour selection, and (3) determination of interpolation weights. Bell and Koren showed that different ways of carrying out the second step do not impact predictive accuracy. They then revisited the remaining two steps:
+ the 1st step by removing 10 “global effects” that cause substantial data variability and mask fundamental relationships between ratings,
+ and the 3rd step by computing interpolation weights as a global solution to an optimization problem.
By using these revisions, they considerably improved predictive accuracy, so much so that they won the Netflix Progress Prize.
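To give a flavour of the third step, here is a minimal sketch of deriving interpolation weights jointly, as the solution of a regularised least-squares problem over items the target user and the neighbours have co-rated. This is only the general shape of the idea; the paper’s formulation handles missing ratings and shrinkage far more carefully:

```python
import numpy as np

def interpolation_weights(R_neighbors, r_target):
    """Jointly derive interpolation weights for a set of neighbours by
    solving a least-squares problem over past co-rated items.

    R_neighbors: (items x neighbours) matrix of the neighbours' ratings
    r_target:    (items,) vector of the target user's ratings on those items
    A ridge term keeps the system well-posed.
    """
    A = R_neighbors
    gram = A.T @ A + 0.1 * np.eye(A.shape[1])   # regularised Gram matrix
    return np.linalg.solve(gram, A.T @ r_target)

# Toy usage: predict from three neighbours' ratings of a new item
R = np.array([[4., 3., 5.], [2., 2., 3.], [5., 4., 5.], [3., 3., 4.]])
r = np.array([4., 2., 5., 3.])
w = interpolation_weights(R, r)
new_item_neighbor_ratings = np.array([4., 3., 5.])
print(new_item_neighbor_ratings @ w)   # trust-the-math prediction for the new item
```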

3) Lightweight Distributed Trust Propagation
Soon individuals will be able to share digital content (eg, photos, videos) using their portable devices in a fully distributed way (without relying on any server). We presented a way for portable devices to select content from reputable sources in a distributed fashion (as opposed to previous work, which focuses on centralized solutions).

4) Analyzing and Detecting Review Spam
Here Jindal and Liu proposed an effective way of detecting spam in product reviews.

5) Co-Ranking Authors and Documents in a Heterogeneous Network
Existing ways of ranking network nodes (eg, PageRank) work on homogeneous networks (networks whose nodes represent the same kind of entity, eg, nodes of a citation network usually represent publications). But most networks are heterogeneous (eg, a citation network may well have nodes that are either publications or authors). To rank nodes of heterogeneous networks, Zhou et al. proposed a way that couples random walks. In a citation network, this translates into two random walks that separately rank authors and publications (rankings of publications and their authors depend on each other in a mutually reinforcing way).
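A minimal sketch of the coupled-walk idea (parameter names and the coupling scheme are illustrative, not the paper’s exact framework): two PageRank-style walks run on the document and author graphs, and a bipartite authorship matrix lets each ranking feed the other.

```python
import numpy as np

def co_rank(D, A, B, alpha=0.85, lam=0.5, n_iter=100):
    """Couple two random walks via a bipartite authorship matrix.

    D: (docs x docs) column-stochastic citation matrix
    A: (authors x authors) column-stochastic co-author matrix
    B: (docs x authors) authorship indicator
    """
    n_d, n_a = D.shape[0], A.shape[0]
    # Column-normalise the coupling matrix in both directions
    Bda = B / np.maximum(B.sum(axis=0, keepdims=True), 1)      # author -> docs
    Bad = B.T / np.maximum(B.sum(axis=1, keepdims=True).T, 1)  # doc -> authors
    d = np.full(n_d, 1 / n_d)
    a = np.full(n_a, 1 / n_a)
    for _ in range(n_iter):
        # Each walk mixes its own graph with the ranking of the other side
        d = (1 - alpha) / n_d + alpha * ((1 - lam) * D @ d + lam * Bda @ a)
        a = (1 - alpha) / n_a + alpha * ((1 - lam) * A @ a + lam * Bad @ d)
        d, a = d / d.sum(), a / a.sum()
    return d, a

# Toy usage: 3 documents, 2 authors
D = np.array([[0., 1., 0.], [0.5, 0., 1.], [0.5, 0., 0.]])
A = np.array([[0., 1.], [1., 0.]])
B = np.array([[1., 0.], [1., 1.], [0., 1.]])
d, a = co_rank(D, A, B)
```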

6) Temporal analysis of semantic graphs using ASALSAN
Say we have a large dataset of emails that employees of a company (eg, of Enron) have exchanged. To make sense of that dataset, we may represent it as a (person x person) matrix and decompose that matrix to learn latent features. Decompositions (eg, SVD) usually work on a two-dimensional matrix. But say that we also know WHEN emails have been sent. That is, we have a three-dimensional matrix – (person x person x time) matrix. Bader et al. showed how to decompose 3-dimensional matrices.
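ASALSAN itself fits a three-way DEDICOM model; to show what decomposing a three-way array means in the simplest case, here is a hedged sketch of the related CP/PARAFAC decomposition via alternating least squares in NumPy (not the paper’s algorithm):

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: row (i*J + j) is A[i] * B[j]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def cp_als(X, rank, n_iter=50, seed=0):
    """Fit X[i,j,k] ~ sum_r A[i,r] * B[j,r] * C[k,r] by alternating
    least squares (CP/PARAFAC)."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode-n unfoldings of the tensor
    X0 = X.reshape(I, J * K)
    X1 = X.transpose(1, 0, 2).reshape(J, I * K)
    X2 = X.transpose(2, 0, 1).reshape(K, I * J)
    for _ in range(n_iter):
        # Each factor is the least-squares solution with the others fixed
        A = X0 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X1 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X2 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Toy (person x person x time) tensor of monthly email counts
rng = np.random.default_rng(1)
X = rng.random((20, 20, 12))
A, B, C = cp_als(X, rank=3)
```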

1 Short paper
1) Trend Motif: A Graph Mining Approach for Analysis of Dynamic Complex Networks
Jin et al. proposed a way of mining complex networks whose edges have weights that change over time. More specifically, they extract temporal trends – trends of how weights change over time.

2 Workshop Papers
1) Aspect Summarization from Blogsphere for Social Study
Researchers have been able to classify sentiments of blog posts (eg, whether posts contain positive or negative reviews). Chang and Tsai built a system that marks a step forward – the ability to extract opinions from blog posts. In their evaluation, they showed how their system is able to extract pro and con arguments about abortion and gay marriage from real blog posts.

2) SOPS: Stock Prediction using Web Sentiment
To predict stock values, traditional solutions solely rely on past stock performance. To make more informed predictions, Sehgal and Song built a system that scans financial message boards, extracts sentiments expressed in them, and then learns the correlation between sentiment and stock performance. Upon what it learns, the system makes predictions that are more accurate than those of traditional methods.
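The final learning step can be sketched in a few lines: regress next-day returns on a daily sentiment score. Everything below is synthetic and illustrative, not SOPS itself; the real system first has to classify message-board posts to produce the sentiment series.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical daily data: board sentiment score vs stock return
rng = np.random.default_rng(0)
sentiment = rng.normal(size=(250, 1))   # one year of trading days
returns = 0.3 * sentiment[:, 0] + rng.normal(scale=0.5, size=250)

# Learn the sentiment-return correlation, then predict the NEXT day's return
model = LinearRegression().fit(sentiment[:-1], returns[1:])
print(model.coef_, model.score(sentiment[:-1], returns[1:]))
```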

3 Invited Talks (here are some excerpts from the speakers’ abstracts)
1) Bricolage: Data at Play (pdf)
There are a number of recently created websites (eg, Swivel, Many Eyes, Data 360) that enable people to collaboratively post, visualize, curate and discuss data. These sites “take the premise that communal, free-form play with data can bring to the surface new ideas, new connections, and new kinds of social discourse and understanding.” Joseph M. Hellerstein focused on opportunities for data mining technologies to facilitate, inspire, and take advantage of communal play with data.

2) Learning from Society (htm)
Ian Witten illustrated how learning from society in the Web 2.0 world will provide some of the value-added functions of the librarians who have traditionally connected users with the information they need. He also stressed the importance of designing functions that are “open” to the world, in contrast to the unfathomable “black box” that conceals the inner workings of today’s search engines.

3) Tensor Decompositions and Data Mining (pdf)
Matrix decompositions (such as SVD) are useful (eg, for ranking web pages, for recognizing faces) but are restricted to two-way tabular data. In many cases, it is more natural to arrange data into an N-way hyperrectangle and decompose it by using what is called a tensor decomposition. Tamara Kolda discussed several examples of tensor decompositions being used for hyperlink analysis for web search, computer vision, bibliometric analysis, cross-language document clustering, and dynamic network traffic analysis.

Whatsoever u say, i trust u because u r my friend

Thursday, November 29th, 2007

Say that in the near future you will be able to post a question on social network sites. You will get many different answers (whose quality may vary). It would be nice if you could get a list of answers ranked by quality.

Problem: how to rank answers? That’s a cool problem that is disappointingly hard to solve.

Some would argue for using social networks – people close to you should be trusted, and they can surely answer any type of question. For me, that’s hard to believe. That is probably because my friends have diverging interests (a few know about technology or finance, a handful of them know about literature, and many about architecture and design – I have a weakness for creative people). And they also think differently – I’m fortunate enough to have a few friends who are left-brainers, while most of them go for the “right” side. If many people’s experience matches mine, then we could deem those solutions oversimplified at best.

So how to go about the initial problem? For a start, I would acknowledge that:
a) “X befriends Y” and “X trusts Y” are two totally different concepts. I’m overemphasizing by saying “totally”: after all, there may be a correlation between the two concepts; but it is difficult to buy into the causation arrow “I befriend you” -> “I trust you”. Therefore, that distinction may be important (if not crucial), and we usually underemphasize it “for simplicity’s sake”.
b) “X trusts Y” has little meaning. Since trust is context dependent (1, 2, 3), one needs to specify for what X trusts Y. X may trust Y for academic tips but not for real-world issues ;-). So a better way could be “X trusts Y for doing Z”, and that Z would be crucial.
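Point (b) is essentially a claim about the right data structure: the key should be the triple (X, Y, Z), not the pair (X, Y). A minimal sketch, with illustrative names:

```python
class ContextualTrust:
    """Trust statements of the form 'X trusts Y for doing Z'.
    A context-free (X, Y) edge is deliberately not representable."""

    def __init__(self):
        self._t = {}   # (truster, trustee, context) -> value in [0, 1]

    def state(self, x, y, z, value):
        self._t[(x, y, z)] = value

    def query(self, x, y, z):
        # No cross-context leakage: trust for 'academic tips' says
        # nothing about trust for 'real-world issues'
        return self._t.get((x, y, z), 0.0)

t = ContextualTrust()
t.state("me", "colleague", "academic tips", 0.9)
print(t.query("me", "colleague", "academic tips"))      # 0.9
print(t.query("me", "colleague", "real-world issues"))  # 0.0
```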

Given these two points, I really like this recent paper. The authors separate social networks and webs of trust (which they call vote-on networks), and they are planning to build around context-specific webs of trust. Great work!

Measurement and Analysis of Online Social Networks

Wednesday, November 14th, 2007

The first study to examine multiple online social networks at scale has been presented at IMC. The paper analyzes “data gathered from four popular online social networks: Flickr, YouTube, LiveJournal, and Orkut”.

Findings

  • “the indegree of user nodes tends to match the outdegree;
  • the networks contain a densely connected core of high-degree nodes;
  • this core links small groups of strongly clustered, low-degree nodes at the fringes of the network”.

Implications on info dissemination and search

  • “The existence of a small, well-connected core implies that information seeded via a core node will rapidly spread through the entire network.”
  • “Similarly, searches that proceed along social network links will quickly reach the core. This suggests that simple unstructured search algorithms could be designed if the core users were to store some state about other users.”

Implications on trust
“In a social network, the underlying user graph can potentially be used as a means to infer some level of trust in an unknown user, to check the validity of a public key certificate, and to classify potential spam”.

  • “The tight core coupled with link reciprocity implies that users in the core appear on a large number of short paths. Thus, if malicious users are able to penetrate the core, they can skew many trust paths (or appear highly trustworthy to a large fraction of the network).”
  • “However, these two properties also lead to small path lengths and many disjoint paths, so the trust inference algorithms should be adjusted to account for this observation. In particular, given our data, an unknown user should be highly trusted only if multiple short disjoint paths to the user can be discovered.”
  • “The correlation in link degrees implies that users in the fringe will not be highly trusted unless they form direct links to other users. The “social” aspect of these networks is self-reinforcing: in order to be trusted, one must make many “friends”, and create many links that will slowly pull the user into the core.”
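The second implication suggests a concrete check that is easy to sketch with networkx: trust an unknown user only if several short, node-disjoint paths to them exist. Thresholds below are illustrative, and node_disjoint_paths returns a maximum disjoint set rather than the shortest paths, so the length filter is only approximate:

```python
import networkx as nx

def multi_path_trust(G, source, target, max_len=4, min_paths=2):
    """Trust an unknown user only if several short, node-disjoint
    paths to them exist; thresholds are illustrative."""
    try:
        paths = nx.node_disjoint_paths(G, source, target)
        # Keep only paths of at most max_len hops
        short = [p for p in paths if len(p) - 1 <= max_len]
    except nx.NetworkXNoPath:
        return False
    return len(short) >= min_paths

G = nx.karate_club_graph()
print(multi_path_trust(G, 0, 33))   # the two hubs: many short disjoint paths
```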

A World Wide Web Without Walls

Tuesday, November 13th, 2007

This paper by MIT folks puts forward W5 – A World Wide Web Without Walls: “This paper imagines a very different Web ecosystem, in which users retain control of their data and developers can justify their existence without hoarding that data”.

Compare that to what Ross Anderson wrote in 2002: “Information security is about money and power; it’s about who gets to read, write, or run which file. The economics of information goods and services markets is dominated by the establishment and defence of monopolies, the manipulation of switching costs, and the control of interoperability.”

Given the sad reality, W5 will have to rely not only on sound tech design but also on good luck.

To be Levy, or not to be: that is the question

Tuesday, November 13th, 2007

Last month, Science published this paper: “Our results question the strength of the empirical evidence for biological Levy flights.”

People have been developing theoretical models assuming Levy flights for twenty years now.

But, no worries, seems to suggest this paper currently discussed at HotNets: “We conduct a statistical study of human mobility… Our data reveals statistical features similar to those in what physicists have long called Levy random walks (or Levy walks).”

Physicists have long called them so, but they have apparently just stopped. How much more confusing can it get?

P.S. In the Science paper, consider also the allure of Box 1 exotically titled “When is a power law not a power law?”

Lightweight Distributed Trust Propagation

Wednesday, October 31st, 2007

I have just finished presenting our work at ICDM. Here are the slides (also in ppt).

Follow the trend or make a difference

Wednesday, October 3rd, 2007

Fang Wu and Bernardo Huberman have just published a very interesting paper in which they analyzed the temporal evolution of opinions about news and products posted on the Internet. (more…)

Collaborative Filtering is Strange

Friday, September 21st, 2007

I just submitted a paper that includes some very strange results that I got when playing around with different collaborative filtering techniques on the MovieLens dataset. The work was a direct follow-up to the new similarity measure I wrote about in my previous post on privacy in distributed recommender systems, and begins by reshaping the way we think about (and try to visualise) collaborative filtering. (more…)

Poolcasting

Tuesday, September 18th, 2007

Claudio (from IIIA and sponsored by MyStrands) gave a very interesting talk about poolcasting (pdf of his slides). Poolcasting is a web radio in which individuals may join different channels. Those subscribed to a channel listen to the same stream of songs. The problem is how to select the songs for that stream; Claudio addressed it by combining the preferences of a channel’s listeners using Case-Based Reasoning.
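As a naive stand-in for the selection step (the real system uses Case-Based Reasoning over the listeners’ libraries), one could simply pick the candidate song that maximises the channel’s average preference while avoiding recent repeats:

```python
def next_song(candidates, listeners_prefs, recently_played):
    """Pick the candidate that maximises average preference across the
    channel's listeners, skipping recently played songs. A toy heuristic,
    not poolcasting's CBR approach."""
    def channel_score(song):
        scores = [prefs.get(song, 0.0) for prefs in listeners_prefs]
        return sum(scores) / len(scores)
    playable = [s for s in candidates if s not in recently_played]
    return max(playable, key=channel_score) if playable else None

listeners = [{"songA": 0.9, "songB": 0.2}, {"songA": 0.4, "songC": 0.8}]
print(next_song(["songA", "songB", "songC"], listeners, recently_played={"songC"}))
# songA: its channel average (0.65) beats songB's (0.1); songC was just played
```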

The same approach may be used for mobile music. A bar may decide to play songs depending on the preferences of its customers (and preferences may be elicited from the playlists that customers store on their mp3 players or mobile phones).

A couple of questions that might be of interest to some of us: what if listeners do not share many songs in their playlists? Would it be possible to factor listeners’ reputation (trust) into deciding which songs to play?

Trust Bootstrapping: TRULLO @ Mobiquitous

Thursday, September 6th, 2007

At Mobiquitous 2007, we presented TRULLO. Here are the slides (some animations do not work properly in slideshare, sorry). A brief description follows.

Situation: Using mobile devices, such as smart phones, people may create and distribute different types of digital content (e.g., photos, videos). One of the problems is that digital content, being easy to create and replicate, may well swamp users rather than inform them. To avoid that, users may run trust models on their mobile devices. A trust model is a piece of software that keeps track of who provides quality content and who does not.

Problem: Devices should be able to set their initial trust in other devices. One way of doing so is for devices to learn from their own past experiences. To see how, consider the following quotes about human trust: “We may initially trust or not trust those involved on our projects based on past experience”, and “If your boyfriend is unfaithful, you won’t initially trust the next man you date” :-) Algorithms that model human trust on pervasive devices, one might say, ought to do the same thing – they should assign their initial trust based on “similar” past experiences.

Existing Solutions: Existing solutions usually require an ontology with which they decide which past experiences are similar; in so doing, they require both that the same ontology is shared by all users (which is hardly the case in reality) and that users agree on that ontology for good (ie, the ontology is not supposed to change over time) :-(

Proposal: TRULLO gathers ratings of past experiences in a matrix, learns statistical “features” from that matrix, and combines those features to set initial trust values. It works quite well in a simulated antique market, and its implementation is reasonably fast on a Nokia mobile phone.
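As a hedged illustration of the “learn features from a ratings matrix” step, here is a sketch using non-negative matrix factorization from scikit-learn; the data, the choice of NMF, and the way features are combined are all illustrative rather than TRULLO’s exact technique:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical ratings of past experiences: rows are experiences in old
# contexts, columns are observable properties of those experiences
rng = np.random.default_rng(0)
R = rng.random((30, 8))

# Learn a small number of latent "features" from the ratings matrix
model = NMF(n_components=3, init="nndsvda", max_iter=500)
W = model.fit_transform(R)   # experience x feature weights
H = model.components_        # feature x property patterns

# Initial trust for a new, unseen context could then be set by combining
# the learned features according to how relevant each is to that context
relevance = np.array([0.7, 0.2, 0.1])   # illustrative weights
initial_trust_profile = relevance @ H
print(initial_trust_profile)
```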

Future: TRULLO does not work if one does not have past experiences. That is why we will propose a distributed trust propagation algorithm (pdf).