Scraping, or how to weaken authentication systems

The current issue of Wired is running an article on “scraping,” or recovering data from other online services. It tries to paint a balanced picture of why large providers including Craigslist have been highly ambivalent about the practice, welcoming the increased attention/relevance but also agonizing over the increased load on the system, as well as lost revenue opportunities when the data is monetized by a free-loader. (In the case of Craigslist, the website that mined/reformatted listings was shut out because it featured Google AdSense, violating the prohibition against commercial use of data.)

One point the article glossed over is the distinction between scraping public vs. private data. Many websites do not require any type of authentication before data can be retrieved. Craigslist is an example: posting a classified may require login but viewing the listings does not. By contrast, scraping address-book contacts from an email provider such as Hotmail is not possible unless authorized by the user. The way Facebook and other invasive websites accomplish this is by asking the user for their credentials and then logging in as that user behind the scenes to access personal data.

This is a very bad idea for many reasons, explained elsewhere as well, all of which boil down to the observation that sharing a credential with a 3rd party weakens the identity management system. Hotmail passwords (more precisely, Windows Live ID passwords, since that is the single sign-on solution used by MSFT properties) are intended to be known only to WLID and the user. Having any other entity in possession of this information is nothing more than unnecessary attack surface. To pick on the Facebook example used in the article: did Facebook delete that credential after importing the user’s contacts from Live Mail/Yahoo/Gmail etc.? Or did it save a copy for future scraping excursions? Did it make a well-intentioned attempt to delete it but instead end up writing it to log files replicated around the world, visible for any employee to see?

There is no way to know, and that is the problem. In defense of Facebook, part of the problem is that the protocols required to “do the right thing” for security did not exist until recently. Importing contacts is an authorization problem: grant Facebook access to data stored about the user by a 3rd party such as Yahoo. There is a deceptively simple solution: give Facebook the password and it can “become” the user, accessing any information it needs. As well as information it did not need: contents of email messages, RSS feeds on the Live homepage, roaming favorites, the Xbox Live account, travel itineraries at Expedia and in the future even personal files stored in the cloud. And it need not stop at simply importing information: it can also delete contacts, spam your friends with advertisements that appear to originate from you, and post enthusiastic, ghost-written endorsements of Facebook to your Spaces blog. The damage potential is open-ended by virtue of Passport/Live ID being a multi-site authentication system, making it the worst-case scenario in case Facebook proves malicious or, more likely, incompetent, in keeping with Robert Heinlein’s principle of not attributing to malice what incompetence can explain. There is no reason to suspect Facebook is doing any of this but there is no way to know either. Most online services do not expose transaction history to users; it’s not possible to check if another entity capable of acting as your Doppelganger has been rummaging around your personal data.

In other words, sharing the password violates the principle of least privilege: it may solve the immediate problem but it grants the 3rd party unchecked authority greatly exceeding what is justifiable. This confusion between authentication and authorization is everywhere. In order to authorize access, it is not necessary for the other party to be able to authenticate as you. (That is the end result of sharing the password, and also of other schemes such as constrained delegation, where a more limited type of impersonation occurs without the password being shared.) OAuth is a new protocol designed to address this problem. It’s built around the idea of one service asking a user for permission to access his/her data stored by another service. The data custodian is still responsible for the permissions and the UI for granting/revoking them, and the requesting site authenticates as itself instead of “cloaking” itself in user credentials. It remains to be seen whether OAuth will succeed in replacing other proprietary solutions along the same lines.
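
To make the authentication vs. authorization distinction concrete, here is a minimal sketch in Python of the delegation idea described above. It is not the OAuth wire protocol (no signatures, redirects or request tokens); the Custodian class, the scope strings and the site names are all hypothetical and chosen only for illustration. The point is that the requesting site holds a revocable token tied to its own identity and to a narrow permission, never the user's password.

```python
import secrets


class Custodian:
    """Toy data custodian (think: a webmail provider). Hypothetical, not a real API."""

    def __init__(self):
        # token -> (user, requesting site, frozenset of granted scopes)
        self._grants = {}

    def grant(self, user, site, scopes):
        # Assumed to run only after the user approves the request
        # in the custodian's own UI; the password never leaves the custodian.
        token = secrets.token_urlsafe(16)
        self._grants[token] = (user, site, frozenset(scopes))
        return token

    def revoke(self, token):
        # The user can withdraw access without changing their password.
        self._grants.pop(token, None)

    def check(self, token, site, scope):
        # The requesting site identifies as itself; the token only says
        # what that particular site may do on the user's behalf.
        grant = self._grants.get(token)
        if grant is None:
            return False
        _user, granted_site, scopes = grant
        return site == granted_site and scope in scopes
```

With this model, a token granted to "facebook.example" for "contacts.read" passes check() for reading contacts, fails for any other scope such as "mail.read", fails when presented by a different site, and fails for everything once revoked. A shared password offers none of those limits.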

