- MovieLens. There are currently two datasets available from http://www.grouplens.org. The first one consists of 100,000 ratings for 1682 movies by 943 users, subdivided into five disjoint subsets. The second one consists of approximately 1 million ratings for 3900 movies by 6040 users. These are the “standard” datasets that many recommendation system papers use in their evaluation.
- Jester. This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users; it differs from the other datasets in having far fewer rateable items.
- Netflix Prize. Netflix released an anonymised version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies.
- Book-Crossing dataset. This dataset is from the Book-Crossing community, and contains 278,858 users providing 1,149,780 ratings about 271,379 books.
- Last.fm, collected by Oscar Celma. It contains <user, artist, plays> tuples.
As far as I have seen, these datasets can be freely distributed, though the people who collected the data would like to be notified if you use them for your research. See their pages for further information. These are datasets that Neal has come across during his work on recommendation systems.
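Most of the rating datasets above boil down to (user, item, rating) records, so a first pass usually looks the same whichever one you pick. Below is a minimal sketch, assuming a tab-separated layout of user id, item id, rating and timestamp (the layout I understand the 100,000-rating MovieLens file to use; check the README of the dataset you download). The sample rows and the function names `load_ratings` and `mean_rating_per_item` are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows in the assumed layout: user_id, item_id, rating, timestamp,
# separated by tabs. These values are invented for the example.
SAMPLE = "1\t10\t4\t881250949\n1\t20\t2\t881250950\n2\t10\t5\t881250951\n"


def load_ratings(handle):
    """Parse tab-separated rating rows into (user_id, item_id, rating) tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [(int(u), int(i), float(r)) for u, i, r, _ts in reader]


def mean_rating_per_item(ratings):
    """Return {item_id: mean rating} for a list of (user, item, rating) tuples."""
    totals = defaultdict(lambda: [0.0, 0])  # item_id -> [sum of ratings, count]
    for _user, item, rating in ratings:
        totals[item][0] += rating
        totals[item][1] += 1
    return {item: total / count for item, (total, count) in totals.items()}


# Swap io.StringIO(SAMPLE) for open("path/to/ratings/file") with a real dataset.
ratings = load_ratings(io.StringIO(SAMPLE))
item_means = mean_rating_per_item(ratings)
```

With a real file, the same two functions give a quick sanity check on the number of ratings and the per-item averages before any recommender is trained.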
Webs of Trust:
- Advogato. Advogato is a community discussion board for free software developers. Under Advogato’s trust metric, each user has a single (global) trust value computed by composing the ratings given by users. There are three possible ratings: apprentice, journeyer, and master. Global trust is used to control access to the discussion board: ‘apprentices’ can only post comments, whereas ‘journeyers’ and ‘masters’ are able to post both stories and comments. Daniele has used the Advogato dataset for distributed trust propagation (pdf).
- This page provides a summary of some general statistics about skitter data collected from CAIDA’s macroscopic topology project. As the project page states, the gathered data:
- characterize macroscopic connectivity and performance of the Internet
- allow various topological and geographical representations at multiple levels of aggregation granularity
- provide a valuable input for empirically-based modelling of the Internet behavior and properties
- CRAWDAD: A Community Resource for Archiving Wireless Data At Dartmouth
Others (external links):
- Data for Data Mining. Links to data about: Hunt and Kill Terrorists; Geology; Astronomy; Motion Capture; Biomedical; e-Commerce (Amazon web services); Financial; Meteorological; Climate and environmental data; Motion (WhaleNet); Blog; OCR; Collaborative Filtering (Netflix); Text and XML (Yahoo’s research data sets; Enron Email DataSet; Reuter Corpora; Feedster; Blogdigger; Pubsub); Web Graphs; Sounds; Voice; Web 2.0.
- Data for Machine Learning (@ UCI). Links to data about: [to be completed]
- Stated User Opinions: This paper studied the temporal evolution of:
- the online reviews of the 48,000 best selling books at Amazon (in which “a user observes the average rating of a book when she visits a book page (usually shown at the top, right under the title). If she decides to review a book, she is required to write a short paragraph of review in addition to a simple star rating.”)
- thousands of political resolutions voted on at Essembly (“a website that lets its users post and vote on political resolves by selecting one of the four choices: ‘agree’, ‘lean agree’, ‘lean against’, and ‘against’. A user does not see the voting results until she submits her own vote. When a user posts a new resolve, she is required to vote on it.”)
- many arbitrary opinions offered for voting on Jyte (“a website that allows its users to make any claim they wish and let the community vote on it at no cost. Each claim is ranked by a positive button and a negative button and the numbers of total positive and negative votes are shown on the face of the buttons. Each user sees the numbers, makes up her mind, and submits her vote by clicking on one of the two buttons.”)
- SIENA: Social nets datasets.
- Web Spam: datasets.
- TrustLet: a number of datasets for trust and social network research.
- 6 Influential Datasets That Changed the Way We Think.
- Reuters dataset & 20newsgroups – text categorization
- SELECTLab Data (CMU) – sensor data.
- Quality of Web Service (QWS) data. 2,507 Web services and their QWS measurements.
- getting theinfo: maps, politics, music
- Gelman’s book: speed dating, …
- CAOS @CMU: social networks.
- Luis von offensive words, en-esp lexicon, ESP game
- Harvard social science data & Dataverse
- MP votes
- Network Datasets on Digg:
- World Bank API: http://developer.worldbank.org/
- Data.gov is Live: Access US Federal Data
- Yahoo Datasets (webscope)
- Data Dumps
- Every corporation listed on a U.S. stock exchange is required to file annual reports with the SEC. These reports are called “10-K forms”. The CorpWatch API uses automated parsers to extract the subsidiary relationship information from Exhibit 21 of companies’ 10-K filings with the SEC and provides a free, well-structured interface for programs to query and process the data.
- ZipPostalCodes.com offers a quick and easy way to get current, verified zip and postal codes with latitude and longitude coordinates from all around the world (UK Post Code Distance Calculator using PHP/MySQL).
- data fetching capabilities (YQL tables) that make it easier to get data from usgov and kiva (the micro-lender)
- flickr tagged pics (with location info?) http://press.liacs.nl/mirflickr/
- geonames provides a free localization web service (python interface).
- 30 Resources to Find the Data You Need
- Data for countries: world bank
- Data for cities: data.gov | datasf.org | openplans.org | open311.org | socialcompact.org | app-contest-for-san-francisco
- NYC data (description). NYC Building Energy consumption.
- Twitter: haiti earthquake
- Microsoft Learning to Rank Datasets – thousands of
- hilary mason dataset suggestions
- Amazon Public datasets
- time series dataset
- all guardian data log