Archive for the ‘dataset’ Category

Energy Star ratings of city government buildings

Friday, December 28th, 2012

As NYC measures energy performance in buildings, some interesting results emerge (NYT article, data).

New Data

Thursday, October 1st, 2009
  • 30 Resources to Find the Data You Need
  • New Reality Mining Data Available. From Nathan Eagle: “I am currently releasing the full Reality Mining dataset. It’s got loads of additional information – especially related to survey responses (friendships, recent illness, satisfaction, etc). The new ReadMe has a complete description. If you’d like access, just drop me an email. As I’m now involved in other projects, I haven’t had much time to look at this new data – so have at it.”

Netflix Prize – Round 2

Monday, September 21st, 2009

The Netflix Prize winners have been announced, as well as the next $1 million competition. From here:

“The new challenge focuses on predicting the movie preferences of people who rarely or never rate the movies they rent. This will be deduced from more than 100 million data points, including information about renters’ ages, genders, ZIP codes, genre ratings and previously chosen movies.

Instead of a single $1 million prize, this new challenge will be split into one $500,000 award to the team judged to be leading after six months and an additional $500,000 to the team in the lead at the 18-month mark, when the contest is wrapped up.”

Interestingly, our previous discussion on the viability of the winner’s results now has an answer. From here:

The team’s 10 percent achievement will not be immediately incorporated into Netflix.com, said Neil Hunt, chief product officer.

“There are several hundred algorithms that contribute to the overall 10 percent improvement – all blended together,” Hunt said. “In order to make the computation feasible to generate the kinds of volumes of predictions that we needed for a real system – we’ve selected just a small number – two or three of those algorithms for direct implementation.”

Yahoo Datasets (Webscope)

Tuesday, May 5th, 2009

Yahoo has recently made publicly available a huge catalog of datasets (data on ratings, language, graphs, and advertising).

Rich RDF Data for building Music RecSys (by BBC Music)

Friday, March 27th, 2009

The new BBC Music website was launched yesterday – a lot of RDF data. For example:

More in this post. Up for building a recommender system from this data?

Similarity Graphs

Thursday, February 26th, 2009

The idea of reasoning about content to recommend as a similarity graph is quite widespread. Broadly speaking, you can start by drawing a set of circles (for users) on the left and a set of circles (for “items” – songs, movies, …) on the right; when users rate/listen to/etc. items, you draw an arrow from the corresponding left circle to the right circle (i.e., you build a bipartite graph). What collaborative filtering algorithms can do is project the two-sided graph onto two one-sided representations, where users are linked to other users, and items are linked to other items, based on how similar they are.
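To make the projection step concrete, here is a minimal sketch. It assumes binary user-item edges and uses cosine similarity between neighbour sets as the link weight; that choice, the function name `project`, and the toy listening data are all illustrative assumptions rather than what any particular collaborative filtering system does.

```python
from itertools import combinations
from math import sqrt


def project(edges):
    """Project a bipartite (left, right) edge list onto its left-hand nodes.

    Two left nodes are linked when they share at least one right node; the
    link weight is the cosine similarity of their (binary) neighbour sets.
    """
    neighbours = {}
    for left, right in edges:
        neighbours.setdefault(left, set()).add(right)

    graph = {}
    for a, b in combinations(sorted(neighbours), 2):
        shared = len(neighbours[a] & neighbours[b])
        if shared:
            graph[(a, b)] = shared / sqrt(len(neighbours[a]) * len(neighbours[b]))
    return graph


# Toy data (made up): who listened to what.
edges = [
    ("alice", "song1"), ("alice", "song2"),
    ("bob", "song2"), ("bob", "song3"),
    ("carol", "song3"),
]

user_graph = project(edges)                                  # users linked to users
item_graph = project((item, user) for user, item in edges)   # items linked to items
```

Swapping the two columns of the edge list is all it takes to get the item-item graph from the same function, which is why the two projections are usually discussed together.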

There are a bunch of places where this kind of abstraction has been used; for example, Oscar Celma used graphs to help users navigate when discovering music in the long tail. Paul Lamere posted graphs made with the EchoNest API on his blog. I’ve also dabbled in this area a bit, but not using music listening data; I was using (the more traditional) MovieLens and Netflix datasets. The question that comes to mind when reading about techniques that operate on the graph, though, is: are the underlying graphs real representations of similarity between content? What if the graphs are wrong? (more…)

Get Facebook data

Tuesday, February 10th, 2009

by Alvin Chin: I believe the best way to get at Facebook data is to take a subset and get consent from people within a Facebook group. …

(more…)

Ubicomp 2008

Monday, September 29th, 2008

Many blogs have been covering Ubicomp and, a couple of days ago, I promised to write down my own coverage. Here you go ;-)

The first day I attended the Automated Journeys workshop organized by Arianna Bassoli (who gave a talk at UCL a while back), Johanna Brewer (whose recent work has been covered here; for more, check her blog), and Alex Taylor. The workshop’s format was not traditional. As part of the workshop, we went out and had lunch :-), and, while doing so, we observed how people in Seoul use technologies. Then, we came back and, through group discussions and hands-on design brainstorming sessions, we produced 4 envisagements that critically reflected on technological futures. It was very engaging! I hope other workshops will replicate/mutate this format. I wish I could have attended at least two of the other workshops on offer: Ubiquitous Systems Evaluation, partly organized by Chris Kray (I am in debt to him, and he knows why ;-)), and Devices that Alter Perception, partly organized by Carson Reynolds.

At Ubicomp, the speakers did not suffer from powerpoint karaoke syndrome, and their slides were generally well-designed – less text, more images. That is largely because the Ubicomp community is made up of design-conscious (CHI) researchers. A few talks are already available on SlideShare.

Here are a few papers I personally found intriguing because of their algorithms, their evaluation, or their interesting ideas. At the end of this post, I’ll point to a few datasets that have been used and can be of interest ;-)

1. Algorithms

Navigate Like a Cabbie: Probabilistic Reasoning from Observed Context-Aware Behavior. Brian D. Ziebart showed a new way of making route predictions. He used a probabilistic model presented at AAAI, “Maximum Entropy Inverse Reinforcement Learning”. Interestingly, he showed that the model works on data that is noisy and imperfect.

Pedestrian Localisation for Indoor Environments. Oliver Woodman proposed a way of tracking people indoors. Oliver and Robert showed how to combine a foot-mounted unit, a building model, and a particle filter to track people in a building. They experimentally showed that users can be effectively tracked within 1m without knowing their initial positions. Great results! It’s a paper well worth reading!

Discovery of Activity Patterns using Topic Models. Bernt Schiele presented a new method for recognizing a person’s activities from wearable sensors. This method adapts probabilistic topic models and has been shown to recognize daily routines without user annotation. One of Bernt’s students had an interesting poster on detecting location transition using sensor data (pdf).
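To give a flavour of why a building model helps when the starting position is unknown, here is a toy one-dimensional sketch, not the method from the paper. It assumes a hypothetical 10 m corridor, treats each step from the foot-mounted unit as a signed stride with Gaussian noise, and simply discards particles that would walk through an end wall; the function name `step` and all numbers are made up for illustration.

```python
import random


def step(particles, stride, corridor_length, rng, noise=0.1):
    """Advance every particle by one noisy stride and enforce the walls.

    Particles that would leave the corridor are impossible under the
    building model, so they are dropped and the survivors are resampled
    (with replacement) back up to the original particle count.
    """
    moved = [p + stride + rng.gauss(0.0, noise) for p in particles]
    alive = [p for p in moved if 0.0 <= p <= corridor_length]
    if not alive:  # every hypothesis was ruled out: start over
        return [rng.uniform(0.0, corridor_length) for _ in particles]
    return [rng.choice(alive) for _ in particles]


rng = random.Random(0)
corridor_length = 10.0  # hypothetical corridor, in metres
# The start position is unknown, so spread the particles uniformly.
particles = [rng.uniform(0.0, corridor_length) for _ in range(2000)]

true_position = 2.0  # ground truth, used only to check the estimate
for stride in [1.0] * 7 + [-1.0] * 8 + [1.0] * 8:
    true_position += stride
    particles = step(particles, stride, corridor_length, rng)

estimate = sum(particles) / len(particles)
```

Walking up and down the corridor is what does the work here: each time the walker approaches a wall, the hypotheses that started too far along are eliminated, so the particle cloud shrinks around the true position even though nothing ever measured it directly.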

2. Evaluation

A couple of papers (including the great work done by Matthew Lee) used a method called the Wizard of Oz evaluation. The general idea is to simulate those parts of the system (e.g., speech recognition) that require the most development effort, or to assess the suitability of your interface (see “Wizard of Oz studies – why and how” (pdf) for more).

Flowers or a Robot Army? Encouraging Awareness & Activity with Personal, Mobile Displays by Sunny Consolvo et al. They designed a system that makes it possible for mobile users to self-monitor their physical activities and conducted a well-designed 3-month field experiment.

Reflecting on the Invisible: Understanding End-User Perceptions of Ubiquitous Computing (pdf). Erika Shehan Poole detailed end-user perceptions of RFID technology using an interesting qualitative method that combines structured interviews and photo elicitation exercises. Erika and her co-authors show that, by using this method, one is able to uncover perceptions that are often difficult for study participants to verbalize. One of her findings: many people believed that RFID can be used to remotely track the location of tagged objects, people, or animals!

3. Interesting Ideas

Bookisheet: Bendable Device for Browsing Content Using the Metaphor of Leafing Through the Pages. Trash your mouse. Jun-ichiro Watanabe presented a VERY promising interface (a book made of two thin plastic sheets and bend sensors) with which a user can easily scroll digital content such as photos. The user does so by simply bending one side of the sheet or the other.

Towards the Automated Social Analysis of Situated Speech Data. To automatically understand individual and group behavior, Danny Wyatt et al. recorded the conversational dynamics of 24 people over 6 months. They did so using privacy-sensitive techniques. With this type of study, researchers may well gain broad sociological insights.

The Potential for Location-Aware Power Management. Robert Harle showed how to dynamically optimize the energy consumption of an office. Very interesting problem-driven research!


Accessible Contextual Information for Urban Orientation. Jason Stewart presented a prototype of a location-based service with which mobile users share content (see their project’s website).

Enhanced Shopping: A Dynamic Map in a Retail Store. Alexander Meschtscherjakov presented a prototype for mobile phones that displays customer activities (e.g., customer flow) inside a shopping mall.

Spyn: Augmenting Knitting to Support Storytelling and Reflection (pdf). Daniela K. Rosner’s presentation was masterfully designed! She walked us through her experience of designing Spyn – a system for knitters to record, playback, and share information involved in the creation of their hand-knit artifacts. She showed how her system enriches the knitter’s craft.

Picture This! Film assembly using toy gestures. Cati Vaucelle (who keeps a cool blog) presented a new input device embedded in children’s toys for video composition. As they play with the toys to act out a story, children conduct film assembly.

4. Datasets

Understanding Mobility Based on GPS Data by Yu Zheng et al. used GPS logs of 65 people over 10 months (the largest dataset in the community!) to evaluate a new way of inferring people’s motion modes from their GPS logs.

Accurate Activity Recognition in a Home Setting (pdf) by Tim van Kasteren et al. used 28 days of sensor data about one person at home and corresponding annotations of his activities (e.g., toileting, showering, etc.) to evaluate a new method for recognizing activities from sensor data.

Discovery of Activity Patterns using Topic Models by Tam Huynh et al. used 16 days of sensor data from a man who was carrying 2 wearable sensors to test their method for automatically recognizing activities (e.g., dinner, commuting, lunch, office work) from sensor data.

On Using Existing Time-Use Study Data for Ubiquitous Computing Applications by Kurt Partridge and Philippe Golle showed how to use data (e.g., people’s activities and locations) that has been collected by governments and commercial institutions to evaluate ubicomp systems.

The Potential for Location-Aware Power Management by Robert Harle tested his proposed strategies for dynamically optimizing the energy consumption of an office on location data of 40 people in a 50-room office building over 60 working days.

(ubicomp2008)

Flickr Places

Friday, August 29th, 2008


Flickr Places “is a method of exploring Flickr with geo-specific pages. The page shows the most interesting photos for a location (iconic photos they call them), the most recent and common tags for the photos and the most prolific photo groups. It creates a separate page for each geographic location with a unique human-readable URL. Places go down to the city level so San Francisco, Seattle, and London will each have their own page and unique URL. In time they will go deeper. Places will be accessible via the Flickr API.” More here and here. Data useful for evaluation could come out of this project!

Dataset and R code for our paper on genres/artists affinity

Tuesday, August 19th, 2008

Justin Donaldson and I have a paper at ISMIR entitled “Uncovering affinity of artists to multiple genres from social behaviour data”. The paper details a project we worked on for the past year or so involving popular music listening activity from a pool of MusicStrands users.

We provide not only the paper, but also the dataset and the code used in our analysis. All of this is available at the website we have set up for the project: http://labs.strands.com/music/affinity/

The main contribution of the project is an analysis and illustration of genres as “fuzzy sets” rather than Boolean labels. Through a co-occurrence analysis of hundreds of thousands of user playlists, a frequency-based “affinity” metric is formed between artists and genres. This affinity metric is a more detailed expression of the style of a given artist’s music. The idea and awareness of predominant genres are a trivial part of any person’s understanding of the vast corpus of popular music. However, genres are typically used as Boolean categorical labels; that is, an artist is understood to be associated with only one given genre.

By expressing a connection to multiple genres through our affinity metric, a more detailed picture of the artist emerges. We give a lot more examples on the website, so be sure to check it out: http://labs.strands.com/music/affinity/
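For readers who want a feel for what a frequency-based affinity could look like, here is a rough sketch. It is not the formula from the paper: it assumes each artist starts with a single Boolean genre label and defines an artist’s affinity to a genre as the share of its playlist co-occurrences that involve artists carrying that label. The function name `genre_affinity`, the artist names, and the playlists are all made up.

```python
from collections import Counter, defaultdict
from itertools import permutations


def genre_affinity(playlists, genre_of):
    """Return, per artist, the fraction of co-occurring artists in each genre.

    playlists: list of lists of artist names.
    genre_of:  dict mapping each artist to its single (Boolean) genre label.
    """
    counts = defaultdict(Counter)
    for playlist in playlists:
        # Count every ordered pair of distinct artists sharing a playlist.
        for artist, other in permutations(set(playlist), 2):
            counts[artist][genre_of[other]] += 1
    return {
        artist: {genre: n / sum(c.values()) for genre, n in c.items()}
        for artist, c in counts.items()
    }


# Toy data (hypothetical artists and labels).
genre_of = {"artist_a": "rock", "artist_b": "rock",
            "artist_c": "jazz", "artist_d": "jazz"}
playlists = [
    ["artist_a", "artist_b", "artist_c"],
    ["artist_a", "artist_c"],
    ["artist_a", "artist_d"],
]

affinity = genre_affinity(playlists, genre_of)
```

In this toy data artist_a is labelled rock, yet three of its four co-occurrences are with jazz artists, so its affinity vector leans towards jazz; that graded membership is the sense in which a genre behaves like a fuzzy set.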

 

Claudio Baccigalupo

Evaluating Mobile Solutions – WWW’08 to the rescue

Tuesday, July 8th, 2008

To evaluate new mobile content discovery approaches, one needs to understand:

1) What mobile users query for:

2) How interests distribute across mobile users (who befriend each other):

Location-enabled Mobile Browser

Tuesday, June 3rd, 2008

“Christian Becker’s presentation about DBpedia mobile (pdf): a location-enabled linked data browser for mobile devices, giving you nearby sights and detailed descriptions, restaurants, hotels, etc. We chatted a bit after the workshop with Alexandre and Christian about adding Last.fm events to the DBtune exporter to also display nearby gigs (with optional filtering based on your foaf:interests, of course :-) )”

ECML PKDD Discovery Challenge 2008

Friday, May 9th, 2008

The ECML PKDD Discovery Challenge has been announced. It includes two sub-challenges, both related to the social bookmarking system Bibsonomy:

  1. Spam Detection in Social Bookmarking Systems
  2. Tag Recommendation in Social Bookmarking Systems

See the link for all the details!

WorldWide Buzz

Wednesday, April 2nd, 2008

A new technical report, written while the author was an intern at Microsoft, analyses “the largest social network analyzed to date.” Here is the abstract:

We present a study of anonymized data capturing a month of high-level communication activities within the whole of the Microsoft Messenger instant-messaging system. We examine characteristics and patterns that emerge from the collective dynamics of large numbers of people, rather than the actions and characteristics of individuals. The dataset contains summary properties of 30 billion conversations among 240 million people. From the data, we construct a communication graph with 180 million nodes and 1.3 billion undirected edges, creating the largest social network constructed and analyzed to date. We report on multiple aspects of the dataset and synthesized graph. We find that the graph is well-connected and robust to node removal. We investigate on a planetary-scale the oft-cited report that people are separated by “six degrees of separation” and find that the average path length among Messenger users is 6.6. We also find that people tend to communicate more with each other when they have similar age, language, and location, and that cross-gender conversations are both more frequent and of longer duration than conversations with the same gender.
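The “average path length” quoted in the abstract is the mean shortest-path distance over connected pairs of nodes. A small sketch of that quantity on a toy graph follows; the breadth-first search from every node shown here is only practical for small graphs, and how the authors computed it on 180 million nodes is not described in this post, so treat the function `average_path_length` and the example graph as illustrative assumptions.

```python
from collections import deque


def average_path_length(adjacency):
    """Mean shortest-path length over all ordered pairs of connected nodes.

    adjacency: dict mapping each node to an iterable of its neighbours
    (undirected graph, so every edge appears in both directions).
    """
    total = 0
    pairs = 0
    for source in adjacency:
        # Breadth-first search gives hop counts from this source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable nodes other than the source
    return total / pairs if pairs else 0.0


# Toy chain of four people: a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg = average_path_length(chain)
```

On a graph the size of the Messenger network one would presumably estimate this from a sample of source nodes rather than running the search from every node.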

Trustlet

Friday, March 28th, 2008

When browsing around the blogs I read, I came across trustlet: a wiki site dedicated to sharing scientific research on trust metrics in social networks. It includes an excellent list of conferences/workshops that deal with trust, and an incredible list of links to datasets – ranging from Wikipedia, email networks, and blog networks, to the much sought-after Advogato and Epinions datasets. Much more than we have done on our own dataset page!

What a gold mine!