Archive for the ‘spam’ Category


Monday, June 29th, 2009

The program of SocialCom is out. My picks:

  • Deriving Expertise Profiles From Tags
  • Ranking Comments on the Social Web
  • Structure of Heterogeneous Networks
  • Online User Activities Discovery based on Time Dependent Data
  • Evaluating the Impact of Attacks In Collaborative Tagging Environments
  • Community Computing: Comparisons between Rural and Urban Societies using Mobile Phone Data

Spam Dataset

Monday, January 21st, 2008

“WEBSPAM-UK2007 is a large collection of annotated spam/nonspam hosts labeled by a group of volunteers. The base data is a set of 105,896,555 pages in 114,529 hosts in the .UK domain downloaded by the Laboratory of Web Algorithmics of the University of Milano. The assessment was done by a group of volunteers.

“For the purpose of the Web Spam Challenge 2008, the labels are being released in two sets. SET1, containing roughly 2/3 of the assessed hosts, will be given for training, while SET2, containing the remaining 1/3, will be held for testing. More information about the Web Spam Challenge 2008, co-located with AIRWeb 2008, will be available soon” here and here.

Redefining Information Overload

Thursday, December 27th, 2007

The other day I was sitting at Gatwick airport waiting for my flight home to Italy to spend Christmas with my family. I was flying with Easyjet, and when I bought the ticket online I was also able to sign up to one of their new, free text-messaging services:

  • Some of the texts were very helpful: the morning of my flight I received a text with my flight details and confirmation number, information that I may usually scribble on a piece of paper or the back of my hand. Result: no paper, and clean hands (happier parents?)
  • Some of the texts could have made use of some location information: a text said (in a nice way) “go to your gate” … umm, should I reply to the computer and tell it I’m already there?
  • Other texts were interesting, but I didn’t need them: “Use this text to get 0% commission on currency exchange.” I have some Euros in my pocket. Can you send me this text again when I do need Euros? (Maybe I’ll tell you when?)
  • Other texts were just useless. “Go to shop X and get Y% discount with this text.” I won’t say what the shop is, let’s just leave it at the fact that its contents don’t quite fit my profile (specifically gender). Why do you keep interrupting me from the book I was reading to give me this useless advertisement? My only current solution is to unsubscribe, but then I’ll lose all the information I liked!

Filtering spam depending on your reputation (on the amount of spam you typically receive)

Tuesday, December 4th, 2007

Abaca has recently proposed an effective way of filtering spam emails. It is called receiver reputation.

It relies on this fact
One can group receivers by the amount of spam they receive on a daily basis. Say we consider 5 groups. “People in Group 1 receive, on average, 90% spam. Group 2 receives 70% spam, Group 3 receives 50% spam, Group 4 receives 30% spam, and Group 5 receives 10% spam.”
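
To make the grouping concrete, here is a minimal Python sketch. The five group averages come from the quote above; everything else (the function names `spam_rate` and `assign_group`, and the nearest-average assignment rule) is my own assumption for illustration, not something Abaca has described.

```python
# Average spam share per group, as quoted in the post.
GROUP_SPAM_RATE = {1: 0.90, 2: 0.70, 3: 0.50, 4: 0.30, 5: 0.10}


def spam_rate(spam_count, total_count):
    """Fraction of a receiver's mail that is spam."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return spam_count / total_count


def assign_group(rate):
    """Pick the group whose average spam share is closest to `rate`.

    Nearest-average assignment is an assumption made for this sketch.
    """
    return min(GROUP_SPAM_RATE, key=lambda g: abs(GROUP_SPAM_RATE[g] - rate))


if __name__ == "__main__":
    # A receiver with 45 spam messages out of 50 lands in Group 1.
    print(assign_group(spam_rate(45, 50)))
```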

How it works
Messages are classified as spam or not depending on the receiver of the message, “rather than where the message is FROM or what it CONTAINS”. “Essentially, if the message is sent to users who typically receive a high percentage of spam, the message is more likely to be spam. However, if the message is sent to users who typically receive a low percentage of spam, the message is more likely to be legitimate. Combining the reputations of all recipients of a particular message, therefore, is equivalent to combining those users’ rating power to estimate the legitimacy of the sender and the message.”
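
The quote does not say how the recipients' reputations are combined, so the following is only a sketch of one plausible choice: treat each recipient's spam rate as independent evidence and add them up in log-odds space (a naive-Bayes-style combination with a neutral prior). The function name `combined_spam_probability` and the clamping constant `eps` are hypothetical; this is not Abaca's published formula.

```python
import math


def combined_spam_probability(recipient_spam_rates, eps=1e-6):
    """Combine per-recipient spam rates into one spam probability.

    Assumption: recipients are independent and the prior is 50/50,
    so the log-odds of the individual rates simply add up.
    """
    if not recipient_spam_rates:
        raise ValueError("need at least one recipient")
    log_odds = 0.0
    for rate in recipient_spam_rates:
        # Clamp to (0, 1) so that log() never sees 0 or infinity.
        r = min(max(rate, eps), 1.0 - eps)
        log_odds += math.log(r / (1.0 - r))
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


if __name__ == "__main__":
    # Sent only to heavy spam receivers: likely spam.
    print(combined_spam_probability([0.9, 0.9, 0.7]))
    # Sent only to users who rarely get spam: likely legitimate.
    print(combined_spam_probability([0.1, 0.3]))
```

With this rule a message addressed to one Group 1 user and one Group 5 user comes out at roughly even odds, because the two reputations cancel each other.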

What about new users?
“The system can be bootstrapped from an empty database with just 2 users (someone who gets a lot of spam and someone who gets a lot of ham). … The system was initially seeded with just two users: a person who receives virtually all spam and a person who receives virtually all legitimate mail. The statistics of a third user was then approximated using the ratings established by the first two users. The fourth user was added with that user’s statistics approximated by the first three users, etc.”
“The amazing thing is no human is required to read or rate any email; the system gets smarter on its own without any human intervention.”
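
The bootstrapping described above can be sketched roughly as follows. This is a toy model under my own assumptions: a message is just a set of recipient names, a message's score is the plain average of the known co-recipients' spam rates, and a new user's rate is the average score of the messages they received. The names `estimate_new_user_rate` and `bootstrap`, and the user names in the example, are hypothetical.

```python
def estimate_new_user_rate(new_user, messages, known_rates):
    """Approximate a new user's spam rate from already-rated co-recipients."""
    scores = []
    for recipients in messages:
        if new_user not in recipients:
            continue
        known = [known_rates[u] for u in recipients if u in known_rates]
        if known:
            # Score the message by averaging the known recipients' rates.
            scores.append(sum(known) / len(known))
    if not scores:
        return None  # no overlap with rated users yet
    return sum(scores) / len(scores)


def bootstrap(seed_rates, messages, new_users):
    """Add users one at a time, each rated using everyone added before."""
    rates = dict(seed_rates)
    for user in new_users:
        estimate = estimate_new_user_rate(user, messages, rates)
        if estimate is not None:
            rates[user] = estimate
    return rates


if __name__ == "__main__":
    # Two seed users: one gets virtually all spam, one virtually all ham.
    seeds = {"spamtrap": 0.99, "clean": 0.01}
    messages = [
        {"spamtrap", "carol"},
        {"spamtrap", "carol"},
        {"clean", "carol"},
        {"clean", "dave"},
        {"carol", "dave"},
    ]
    print(bootstrap(seeds, messages, ["carol", "dave"]))
```

Note that no message is ever read or labeled by hand here: every new rate is derived from who else received the same mail, which is the point the quote makes.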