Last week a paper from the University of Texas titled “How To Break Anonymity of the Netflix Prize Dataset” was Slashdotted. In the ensuing discussion, parallels were drawn to the release of AOL search data, and the question was raised whether this would finally put the kibosh on any future release of user data for research purposes. The results are very interesting but do not quite point to such a drastic conclusion. In particular, the notion of “anonymity” used in this paper, while satisfactory from a mathematical point of view, is not consistent with the working definition most users operate under.
To recap: in 2006 Netflix announced a one-million-dollar prize to improve its recommendation service. The problem is deceptively simple to state: given past movie ratings from a user, suggest other movies he/she will enjoy. This is a standard collaborative-filtering and machine-learning problem, and the solutions depend on access to massive amounts of training data. More data allows the algorithms to understand user tastes at a very nuanced level and improve their predictions. Netflix was up to the challenge, releasing a very large data set containing 100 million ratings from half a million users, almost one out of every eight customers at the time. This data set was “anonymized” according to the Netflix definition: customers were identified by numbers only, with no names, no personally identifiable information, and not even demographic data such as age, gender, or education, which may actually have been useful for predictions.
On the surface, the new paper shows that users can in fact be identified from this stripped-down data; this is the interpretation that fueled the Slashdot speculation. Reality is more complex. The main quantitative result of the paper is that the movie ratings of an individual are highly unique: no two people have closely matching ratings. (The data set is “sparse,” to use the proper language.) In fact, the ratings are so unique that two people are unlikely to agree on their ratings for even a handful of movies. This effect is even more pronounced when the movies are obscure; knowing that a user watched an obscure Ingmar Bergman movie sets them apart from the crowd.
Looked at another way: suppose there is a large source of movie ratings, as in the Netflix prize data set. If the goal is to locate a particular user whose identity has been masked, getting hold of just a few of their ratings from another source will be enough. Even among millions of users, there is unlikely to be a second person with the exact same tastes. In some ways this is intuitive, but the important contribution of the paper is quantifying the effect and calculating exactly how much data is required for unique identification with high confidence. The answer: fewer than a dozen movie ratings, and fewer still if the movies are obscure or if the dates the user watched the movies are also available. (Indirectly, this information is in the Netflix data set, as the date the user provided each rating.)
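The identification step can be sketched in a few lines of Python. This is a toy illustration with made-up data, not the paper's actual algorithm (which scores fuzzy matches and tolerates noise): it simply shows why, in a sparse ratings table, a handful of (movie, rating) pairs known from an outside source is usually enough to single out one record.

```python
# "Anonymized" data set: customer number -> {movie title: star rating}.
# All names and numbers here are invented for illustration.
ratings = {
    1234: {"Fanny and Alexander": 5, "The Seventh Seal": 4, "Shrek": 2},
    1235: {"Shrek": 5, "Finding Nemo": 4, "Cars": 4},
    1236: {"The Seventh Seal": 5, "Shrek": 3, "Cars": 2},
}

def candidates(known, dataset):
    """Return customer numbers whose records agree with every known rating."""
    return [uid for uid, profile in dataset.items()
            if all(profile.get(title) == stars
                   for title, stars in known.items())]

# Just two ratings, one of them for an obscure film, isolate user 1234:
print(candidates({"Fanny and Alexander": 5, "Shrek": 2}, ratings))  # [1234]
```

With only the popular "Shrek" rating the attacker gets several candidates; adding a single obscure title collapses the candidate set to one, which is exactly the sparsity effect the paper quantifies.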
“From another source” is the critical caveat above. In the paper this is referred to as the auxiliary data source. What the paper demonstrates is that the Netflix data set can be linked to another data set. Linking can be a serious privacy problem: it allows aggregating information from different databases. If a database existed where personally identifiable information was stored along with a small number of movie ratings, each record could be matched against the Netflix data. That is the catch: there is no such database. As the auxiliary source for mounting the attack, the paper uses the Internet Movie Database (IMDB), where users volunteer movie reviews. That allows correlating the reviews a user posted on IMDB with all of that user's other ratings in the Netflix data. As for the typical user profile on IMDB, the registration page asks for email address, gender, year of birth, zip code and country, all of them volunteered by the user and subject to no validation. This is the “90210” phenomenon: any time data entry is mandatory without good reason and there is no way to enforce accuracy, the service provider ends up with many people named “John Smith” living in Hollywood, zip code 90210, CA. That means the effect of linking the IMDB and Netflix data is to discover that a user nicknamed Mickey Mouse is a fan of Disney movies. Unless the user volunteered more data to IMDB, the correlation does not get us any closer to that subscriber's offline identity. All of the arguments about inferring sensitive information from movie ratings still hold (for example, political affiliation based on responses to a controversial documentary), but the data is now associated with user “MickeyMouse” instead of user #1234 in the original source. A step closer to identification? Perhaps. Fully identified? Not even close.
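The linking step itself can be sketched the same way, again with invented data: a pseudonymous profile carrying a few public reviews is matched against the anonymized table, which exposes the rest of that user's rating history. The point of the sketch is what it does not deliver: the enriched record is attached to a nickname, not to an offline identity.

```python
# Hypothetical data throughout. netflix: customer number -> full history.
netflix = {
    1234: {"Fanny and Alexander": 5, "The Seventh Seal": 4, "Shrek": 2},
    1235: {"Shrek": 5, "Finding Nemo": 4, "Cars": 4},
}

# imdb: nickname -> publicly posted reviews (volunteered, unverified).
imdb = {
    "MickeyMouse": {"Fanny and Alexander": 5, "Shrek": 2},
}

def link(public_profiles, anonymized):
    """Attach each pseudonym to the unique anonymized record consistent with it."""
    linked = {}
    for nick, reviews in public_profiles.items():
        matches = [uid for uid, hist in anonymized.items()
                   if all(hist.get(t) == r for t, r in reviews.items())]
        if len(matches) == 1:          # unique match: aggregate the data
            linked[nick] = anonymized[matches[0]]
    return linked

print(link(imdb, netflix))
# "MickeyMouse" now carries the full rating history -- but it is still
# just a nickname, not a subscriber's real-world identity.
```

The aggregation is real (two previously hidden ratings are now tied to the profile), yet everything learned hangs off whatever identity data the user chose to volunteer.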
By itself, the Netflix data set is not dangerous. That is a sharp contrast from the earlier AOL search data disaster: search history is sufficient, on its own, to uniquely identify users, because individuals enter private information such as their legal name or address into search queries. More importantly, there is no limit to what search logs can contain: queries for health conditions could be used to infer medical data, searches for news may suggest professional interests, and so on.