[Final piece of the series, see first and second posts.]
The greater challenge with trying to create unlinkable user identifiers on the web is the ease of linking them online. Standard models of “linking” assume that the sites the user visited get together offline, long after the user has visited both of them, and try to ascertain whether two activity sequences they observed belong to the same user. It is relatively easy in this model to come up with ways of assigning identifiers to users that are deterministic, unique to each site and computationally difficult to link even when multiple sites collude.
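As a rough illustration of what such an assignment could look like, here is a minimal sketch that derives per-site identifiers with a keyed hash. The use of HMAC-SHA256, the `pairwise_id` function, the secret and the `.example` site names are all assumptions made for this example, not a description of any particular deployed system.

```python
import hashlib
import hmac


def pairwise_id(master_secret: bytes, user: str, site: str) -> str:
    """Derive an identifier for `user` that is stable for one site only.

    Assumption: an identity provider holds `master_secret` and never
    shares it with the sites that receive the identifiers.
    """
    # Length-prefix the user name so ("ab", "c") and ("a", "bc")
    # can never produce the same message.
    message = f"{len(user)}:{user}|{site}".encode("utf-8")
    digest = hmac.new(master_secret, message, hashlib.sha256).hexdigest()
    return digest[:16]  # truncated only to keep the example readable


# Hypothetical key and site names, for illustration only.
secret = b"hypothetical-identity-provider-key"
movie_id = pairwise_id(secret, "alice", "movies.example")
book_id = pairwise_id(secret, "alice", "books.example")
```

The same user gets the same identifier every time she returns to a given site, but without the key the two sites have no way to tell that `movie_id` and `book_id` came from the same person by comparing the values offline.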
Problem is, websites are not constrained to this simplistic attack model. Even today user tracking on the web involves a type of collusion enabled by one of the elementary assumptions in HTML: any website is free to include content from any other website. That is by design. Any website can include an image, a frame, a video or song from another website. That means the web browser will automatically follow hypertext links crafted by one site and pointing to another website. That is a problem for privacy, because a link can encode arbitrary information.
Consider an authentication system that assigns cryptographically unlinkable identifiers to users. The movie rental website knows this user as #123 while the bookstore knows her as #456. Enterprising marketing teams at these websites decide they want to collude and link user information. The end goal is that the bookstore learns the user’s movie preferences and the video store gets an idea about her library, in the hopes they can create personalized offers. This is going to be a tall order when users are offline, because there is no unique identifier to key off. (Assuming we suspend disbelief: in reality credit card numbers or shipping addresses are the fly in the ointment, as explained in the second part.) Instead they must capitalize on that window of opportunity when the user is online and logged into both sites.
Every page on the bookstore website has an image or other embedded content pointing to the movie rental site, and vice versa. Using transparent 1×1 images is customary for this purpose but such attempts at stealth are not required. The link for the embedded content contains the pairwise-unique identifier for the user observed by one site. When the user follows that link, they are going to be communicating two identifiers:
- The first one is implicitly encoded in the link crafted by the sender, say #123. This is the identifier observed by the originating site and it will be encoded in the URL, or in another part of the request such as form fields.
- The second identifier is explicitly asserted in the authentication protocol used by the destination. This is the identifier associated with the destination, say #456.
At this point the destination site has enough information to link the two: user #123 at the movie rental site is the same person as user #456 over here at the bookstore. Once that association is made, databases can be joined offline: everything about her book purchases can be joined against everything known about her tastes in bad 1980s cinema.
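The exchange can be sketched in a few lines. This is only an illustration under stated assumptions: the `pixel.gif` path, the `src` and `uid` query parameters, the `.example` host names and both function names are invented here, and a real site would read the authenticated identifier from its own session or cookie handling rather than take it as an argument.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_tracking_url(destination: str, origin_site: str, origin_id: str) -> str:
    """Originating site: craft the embedded-content URL carrying its own ID."""
    # Hypothetical parameter names; any part of the request would do.
    query = urlencode({"src": origin_site, "uid": origin_id})
    return f"https://{destination}/pixel.gif?{query}"


def link_identities(request_url: str, authenticated_id: str) -> tuple:
    """Destination site: pair the ID found in the URL with the ID from
    the visitor's own login session at the destination."""
    params = parse_qs(urlparse(request_url).query)
    return (params["src"][0], params["uid"][0], authenticated_id)


# The movie site embeds this URL in its pages for user #123.
url = build_tracking_url("books.example", "movies.example", "123")

# The browser fetches it while logged in to the bookstore as #456.
link = link_identities(url, authenticated_id="456")
```

Nothing in this flow requires breaking the scheme that generated the identifiers. The browser delivers both values to the same server in a single request.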
Granted this attack has some significant limitations: the user must be authenticated at both sites simultaneously, or at least have some persistent identifier (such as a cookie) stored on both sides that encodes their identity. This turns out not to be a significant limitation, since users do authenticate to multiple sites in a single browser session, and in any case they need to fall into this trap just once for the permanent linkage to be created. A bigger problem is that linkage is limited to a pair of websites only. If 10 websites need to collude, there are 45 pairs of identifiers to sort out, so this approach implemented naively would not scale.
Fortunately or unfortunately, depending on the perspective, having each pair of websites in the conspiracy link identifiers is not necessary. A much simpler solution is to designate a single tracking agent against which everyone else’s identifiers are linked. Every website embeds content from this one site, which observes and stores all of the identity pairs observed together.
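To make the scaling difference concrete, here is a small sketch comparing the two arrangements. The function names are made up for this example.

```python
from math import comb


def pairwise_links(sites: int) -> int:
    """Relationships needed when every pair of sites links IDs directly."""
    return comb(sites, 2)  # n * (n - 1) / 2


def hub_links(sites: int) -> int:
    """Relationships needed when every site reports to one tracking agent."""
    return sites


ten_pairwise = pairwise_links(10)  # 45
ten_hub = hub_links(10)            # 10
```

The pairwise count grows quadratically with the number of colluding sites, while the hub arrangement grows linearly.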
In the real world of course such tracking agents go by a more mundane name: advertising network. Display advertising networks have the unique benefit that they in fact appear as embedded content on any number of websites, by design. Any time a user is authenticated to multiple sites and these sites contain third-party content hosted by the network, there is an opportunity to link the two identities together. In fact explicit authentication to the advertising network is not required. Even a weak, temporary identity such as a session cookie works: any time more than one external ID is observed in the same session, that creates a permanent record. If the network observes #123 and #456 appearing in one session today, and sees #456 and #987 in an independent session tomorrow, the conclusion is that all three identities are linked.
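That transitive step is ordinary set merging. The sketch below uses a union-find structure to show how separate sessions collapse into one profile. The `IdentityLinker` class and its method names are hypothetical, and nothing here is meant to describe how any actual network stores its data.

```python
class IdentityLinker:
    """Groups external identifiers that were ever seen in the same session."""

    def __init__(self):
        self.parent = {}  # identifier -> representative of its group

    def find(self, ident):
        """Return the representative of the group containing `ident`."""
        self.parent.setdefault(ident, ident)
        while self.parent[ident] != ident:
            # Path halving: point at the grandparent to keep chains short.
            self.parent[ident] = self.parent[self.parent[ident]]
            ident = self.parent[ident]
        return ident

    def observe_session(self, identifiers):
        """Record that all `identifiers` appeared in one browser session."""
        ids = list(identifiers)
        if not ids:
            return
        self.find(ids[0])  # register the ID even if it appears alone
        for other in ids[1:]:
            self.parent[self.find(other)] = self.find(ids[0])

    def same_user(self, a, b):
        return self.find(a) == self.find(b)


linker = IdentityLinker()
linker.observe_session(["#123", "#456"])  # today
linker.observe_session(["#456", "#987"])  # tomorrow, independent session
linked = linker.same_user("#123", "#987")
```

After the two sessions, `#123` and `#987` fall into the same group even though they were never observed together.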
What this suggests is that until automatic loading of embedded content on pages is controlled better, unlinkable identities will be facing an uphill battle against one of the basic design principles behind the web.