Spam Dataset

"WEBSPAM-UK2007" is a large collection of hosts labeled as spam or nonspam by a group of volunteers. The base data is a set of 105,896,555 pages in 114,529 hosts in the .UK domain, downloaded by the Laboratory of Web Algorithmics of the University of Milano.

For the purpose of the Web Spam Challenge 2008, the labels are being released in two sets. SET1, containing roughly 2/3 of the assessed hosts, will be given for training, while SET2, containing the remaining 1/3, will be held for testing. More information about the Web Spam Challenge 2008, co-located with AIRWeb 2008, will be available soon here and here.

