ICDM 2007

I attended ICDM (a data mining conference) this year. Since I cannot comment on all the papers I found interesting, here is the full program, followed by my comments on just a few papers ;-)

6 Full papers

1) Improving Text Classification by Using Encyclopedia Knowledge

Existing methods for classifying text do not work well. That is partly because many terms are (semantically) related but do not co-occur in the same documents. To capture the relationships among those terms, one should use a thesaurus. Pu Wang et al. built a huge thesaurus from Wikipedia and showed that classification benefits from its use.

2) Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights
The most common form of collaborative filtering consists of three major steps: (1) data normalization, (2) neighbour selection, and (3) determination of interpolation weights. Bell and Koren showed that different ways of carrying out the 2nd step do not impact the predictive accuracy. They then revisited the remaining two steps:
+ the 1st step by removing 10 “global effects” that cause substantial data variability and mask fundamental relationships between ratings,
+ and the 3rd step by computing interpolation weights as a global solution to an optimization problem.
By using these revisions, they considerably improved predictive accuracy, so much so that they won the Netflix Progress Prize.
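To make the 3rd step more concrete, here is a minimal sketch of computing all interpolation weights jointly as the solution of a least-squares problem. It is not Bell and Koren's actual algorithm: it assumes a tiny, dense, already normalized rating matrix with no missing entries, and the function names and toy ratings are made up for illustration.

```python
import numpy as np

def interpolation_weights(R, u, i, neighbors):
    """Jointly solve for the weights of `neighbors` when predicting item i for user u.

    R is a dense (users x items) array of normalized ratings.
    Assumption for this sketch: no rating is missing.
    """
    # Learn from every user except the one we are predicting for.
    others = [v for v in range(R.shape[0]) if v != u]
    A = R[np.ix_(others, neighbors)]   # their ratings of the neighbour items
    b = R[others, i]                   # their ratings of the target item
    # Least squares: find w minimizing ||A w - b||^2 over all weights at once.
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def predict(R, u, i, neighbors):
    """Predict user u's rating of item i as a weighted sum of the neighbour ratings."""
    w = interpolation_weights(R, u, i, neighbors)
    return float(R[u, neighbors] @ w)

# Toy ratings (made up): item 0 is always the average of items 1 and 2.
R = np.array([[1.5, 1.0, 2.0],
              [1.5, 2.0, 1.0],
              [4.0, 3.0, 5.0],
              [2.0, 4.0, 0.0]])
weights = interpolation_weights(R, u=0, i=0, neighbors=[1, 2])
estimate = predict(R, u=0, i=0, neighbors=[1, 2])
```

The point of solving for the weights together, rather than setting each one from a pairwise similarity, is that the weights can then account for how the neighbours relate to each other.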

3) Lightweight Distributed Trust Propagation
Soon individuals will be able to share digital content (eg, photos, videos) using their portable devices in a fully distributed way (without relying on any server). We presented a way for portable devices to select content from reputable sources in a distributed fashion (as opposed to previous work, which focuses on centralized solutions).

4) Analyzing and Detecting Review Spam
Here Jindal and Liu proposed an effective way of detecting spam among product reviews.

5) Co-Ranking Authors and Documents in a Heterogeneous Network
Existing ways of ranking network nodes (eg, PageRank) work on homogeneous networks (networks whose nodes represent the same kind of entity, eg, nodes of a citation network usually represent publications). But most networks are heterogeneous (eg, a citation network may well have nodes that are either publications or authors). To rank nodes of heterogeneous networks, Zhou et al. proposed a way that couples random walks. In a citation network, this translates into two random walks that separately rank authors and publications (rankings of publications and their authors depend on each other in a mutually reinforcing way).
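The mutual reinforcement can be illustrated with a much simplified sketch. This is not Zhou et al.'s formulation: it assumes only a citation matrix and an authorship matrix, runs a PageRank-style walk over citations, and mixes in the scores of each document's authors. The function name, the `coupling` and `damping` parameters, and the toy network are all hypothetical.

```python
import numpy as np

def row_stochastic(M):
    """Normalize rows to sum to 1; all-zero rows become uniform."""
    M = M.astype(float)
    sums = M.sum(axis=1, keepdims=True)
    uniform = np.full_like(M, 1.0 / M.shape[1])
    return np.where(sums > 0, M / np.where(sums > 0, sums, 1.0), uniform)

def co_rank(C, A, coupling=0.5, damping=0.85, iters=100):
    """C[i, j] = 1 if document i cites document j; A[a, d] = 1 if author a wrote document d."""
    n_docs = C.shape[0]
    P_cite = row_stochastic(C)     # document -> documents it cites
    P_ad = row_stochastic(A)       # author -> documents they wrote
    P_da = row_stochastic(A.T)     # document -> its authors
    docs = np.full(n_docs, 1.0 / n_docs)
    for _ in range(iters):
        # Authors inherit score from their documents ...
        authors = docs @ P_da
        # ... and documents get score from citations and from their authors.
        walk = damping * (docs @ P_cite) + (1 - damping) / n_docs
        docs = (1 - coupling) * walk + coupling * (authors @ P_ad)
    authors = docs @ P_da
    return docs, authors

# Toy network (made up): documents 1 and 2 both cite document 0.
C = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 0, 0]])
# Author 0 wrote document 0; author 1 wrote documents 1 and 2.
A = np.array([[1, 0, 0],
              [0, 1, 1]])
doc_scores, author_scores = co_rank(C, A)
```

In this toy network the cited document and its author end up with the highest scores, which is the kind of coupling the paper exploits in a more principled way.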

6) Temporal analysis of semantic graphs using ASALSAN
Say we have a large dataset of emails that employees of a company (eg, of Enron) have exchanged. To make sense of that dataset, we may represent it as a (person x person) matrix and decompose that matrix to learn latent features. Decompositions (eg, SVD) usually work on a two-dimensional matrix. But say that we also know WHEN emails have been sent. That is, we have a three-dimensional (person x person x time) matrix. Bader et al. showed how to decompose such three-dimensional matrices.
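To show what the (person x person x time) representation looks like in practice, here is a small sketch that builds such an array from an email log. The names and the log itself are invented, and the sketch covers only the data layout, not ASALSAN.

```python
import numpy as np

# Hypothetical email log: (sender, receiver, time period).
emails = [("ann", "bob", 0), ("ann", "bob", 0), ("bob", "cat", 0),
          ("cat", "ann", 1), ("ann", "bob", 1)]

people = sorted({p for s, r, _ in emails for p in (s, r)})
index = {p: k for k, p in enumerate(people)}
n_periods = 1 + max(t for _, _, t in emails)

# X[i, j, t] = number of emails person i sent to person j in period t.
X = np.zeros((len(people), len(people), n_periods))
for s, r, t in emails:
    X[index[s], index[r], t] += 1

# Summing over time gives back the usual (person x person) matrix,
# which an ordinary SVD can handle, at the cost of losing the WHEN.
flat = X.sum(axis=2)
U, S, Vt = np.linalg.svd(flat)
```

Collapsing the time axis is exactly the information loss that a three-way decomposition avoids.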

1 Short paper
1) Trend Motif: A Graph Mining Approach for Analysis of Dynamic Complex Networks
Jin et al. proposed a way of mining complex networks whose edges have weights that change over time. More specifically, they extract temporal trends – trends of how weights change over time.

2 Workshop Papers
1) Aspect Summarization from Blogsphere for Social Study
Researchers have been able to classify sentiments of blog posts (eg, whether posts contain positive or negative reviews). Chang and Tsai built a system that marks a step forward – the ability to extract opinions from blog posts. In their evaluation, they showed how their system is able to extract pro and con arguments about abortion and gay marriage from real blog posts.

2) SOPS: Stock Prediction using Web Sentiment
To predict stock values, traditional solutions rely solely on past stock performance. To make more informed predictions, Sehgal and Song built a system that scans financial message boards, extracts sentiments expressed in them, and then learns the correlation between sentiment and stock performance. Based on what it learns, the system makes predictions that are more accurate than those of traditional methods.

3 Invited Talks (here are some excerpts from the speakers’ abstracts)
1) Bricolage: Data at Play (pdf)
There are a number of recently created websites (eg, Swivel, Many Eyes, Data 360) that enable people to collaboratively post, visualize, curate and discuss data. These sites “take the premise that communal, free-form play with data can bring to the surface new ideas, new connections, and new kinds of social discourse and understanding.” Joseph M. Hellerstein focused on opportunities for data mining technologies to facilitate, inspire, and take advantage of communal play with data.

2) Learning from Society (htm)
Ian Witten illustrated how learning from society in the Web 2.0 world will provide some of the value-added functions of the librarians who have traditionally connected users with the information they need. He also stressed the importance of designing functions that are “open” to the world in contrast to the unfathomable “black box” that conceals the inner workings of today’s search engines.

3) Tensor Decompositions and Data Mining (pdf)
Matrix decompositions (such as SVD) are useful (eg, for ranking web pages, for recognizing faces) but are restricted to two-way tabular data. In many cases, it is more natural to arrange data into an N-way hyperrectangle and decompose it by using what is called a tensor decomposition. Tamara Kolda discussed several examples of tensor decompositions being used for hyperlink analysis for web search, computer vision, bibliometric analysis, cross-language document clustering, and dynamic network traffic analysis.
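As one concrete instance of a tensor decomposition, CP (CANDECOMP/PARAFAC) approximates a 3-way array by a sum of rank-one terms, each the outer product of three vectors. The sketch below fits it with plain alternating least squares in NumPy. It is a generic illustration, not taken from the talk, and it leaves out normalization, convergence checks and anything needed for large or sparse data. The function names and the toy tensor are made up.

```python
import numpy as np

def cp_als(X, rank, iters=50, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] by alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.random((I, rank))
    B = rng.random((J, rank))
    C = rng.random((K, rank))
    for _ in range(iters):
        # Update one factor at a time while holding the other two fixed.
        A = np.einsum('ijk,jr,kr->ir', X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = np.einsum('ijk,ir,kr->jr', X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = np.einsum('ijk,ir,jr->kr', X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the 3-way array from its three factor matrices."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# Toy tensor (made up): an exact rank-one array built from three vectors.
a = np.array([1.0, 2.0])
b = np.array([1.0, 0.5, 2.0])
c = np.array([3.0, 1.0])
X = np.einsum('i,j,k->ijk', a, b, c)

A, B, C = cp_als(X, rank=1)
error = np.linalg.norm(X - reconstruct(A, B, C))
```

Each update is an ordinary least-squares solve, which is why the procedure reads like a three-way analogue of fitting a low-rank matrix factorization.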
