The HTTP Referer [sic] header has become something of a favorite villain in web privacy controversies. Misspelled with a single “r” for historical reasons, this header is an interesting example of how features can have unintended side effects. Introduced in HTTP/1.0, the first version published by the IETF, it was intended as a diagnostics mechanism:
This allows a server to generate lists of back-links to resources for interest, logging, optimized caching, etc. It also allows obsolete or mistyped links to be traced for maintenance.
When a user clicks on a link from one site to visit another, her web browser transmits a hint to the effect that “I followed a link from this other page to arrive here.” In the early vision of the web as a small, friendly place populated by academic researchers, one could imagine a web administrator reaching out to another to thank them for bringing new users to their site. Or they could politely inform their counterpart that a page they linked to has been moved or deleted, and suggest that the outdated link be corrected. (As an aside, this is also the solution to an imagined problem that troubles Jaron Lanier in Who Owns the Future? In his critique of the web as a platform for enabling exploitation of content creators, the author cites the unidirectional nature of hyperlinks as the root cause of an imbalance of power. Google profits by sending users to other websites where the real content of interest is located, but the authors who painstakingly created that content in the first place do not share in the economic gains. Lanier incorrectly assumed this is because sites cannot identify the “source” to be credited for bringing users. But this is exactly what the Referer header does. Incidentally, Tim Berners-Lee originally envisioned hyperlinks as bidirectional, and Referer can be viewed as a way to approximate that approach.)
The web today is not exactly the collegial, friendly community of 1993. Trying to fix every broken incoming link by tracking down the authors would be a lost cause. Yet there are still benefits to knowing where traffic originated. Contemporary business models for websites depend heavily on monetizing traffic indirectly, for example by advertising or mining user data. Scaling that effort in turn involves running various campaigns to generate traffic and bring in more “eyeballs,” in industry parlance. Knowing where that traffic originated can help the website optimize its customer acquisition plans. For example, it can distinguish between users clicking on simple text ads on Google versus rich banner ads or social media mentions. The same argument applies to embedded content. When one page includes an image, video or other web content provided by another site, the latter learns the identity of the first party using its content.
As is often the case in policy issues, one person’s brilliant marketing idea is another’s privacy nightmare. What made the Referer seem like a good idea in 1993 is exactly the same reason it poses a privacy problem: it allows sites to learn users’ navigation patterns. To be clear, this header alone is not enough for tracking. Building up such a profile also takes cookies and the reuse of the same third-party content across multiple sites. Much to the dismay of privacy advocates, that scenario arises quite frequently in the context of advertising networks. For example, DoubleClick (acquired by Google in 2008) provides banner ads for tens of thousands of websites. These ads are included by the publisher (the original website the user visited), which embeds third-party content hosted on DoubleClick servers in its pages. When a web browser renders that publisher page, requests are made to DoubleClick with a Referer header bearing the address of the publisher. (As we will see in the second half of this post, this is not the only way for DoubleClick to find out the originating party.) DoubleClick maintains a long-lived third-party cookie for identifying visitors across different sessions. Each time a new Referer is encountered for an ad impression associated with that particular cookie, the ad network can make a note of the website the user happened to be visiting. Multiply this by thousands of websites embedding banner ads, and you get a comprehensive picture of one user’s web surfing behavior, indexed by the unique identifier in that cookie.
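The profile-building step described above can be sketched in a few lines of Python. This is a toy model, not how any real ad network is implemented: the AdServer class, the uid-N cookie format and the example hostnames are all hypothetical, and requests are reduced to a plain dictionary of headers.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class AdServer:
    """Toy third-party ad server (hypothetical) that records where each cookie was seen."""

    def __init__(self):
        self._next_id = 0
        # cookie id -> publisher hostnames, in the order first seen
        self.profiles = defaultdict(list)

    def handle_ad_request(self, headers):
        """Process one ad request and return the cookie id the browser should keep."""
        cookie_id = headers.get("Cookie")
        if cookie_id is None:
            # First contact: issue a long-lived identifier (made-up format).
            self._next_id += 1
            cookie_id = f"uid-{self._next_id}"
        referer = headers.get("Referer")
        if referer:
            host = urlsplit(referer).hostname
            # Note each new publisher site observed for this cookie.
            if host and host not in self.profiles[cookie_id]:
                self.profiles[cookie_id].append(host)
        return cookie_id


server = AdServer()
cookie = server.handle_ad_request({"Referer": "https://news.example/politics/story"})
cookie = server.handle_ad_request(
    {"Cookie": cookie, "Referer": "https://health.example/symptoms?q=flu"}
)
cookie = server.handle_ad_request(
    {"Cookie": cookie, "Referer": "https://news.example/sports"}
)
print(server.profiles[cookie])  # ['news.example', 'health.example']
```

Neither ingredient suffices alone in this sketch: without the cookie, the three requests cannot be tied to one visitor, and without the Referer, the server has nothing to record about where the visitor was.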
There is a different type of information disclosure that the Referer header can introduce, one not intended by either the origin or the destination website. This happens when secret information used for access control is encoded in the query-string portion of the URL. For example, it could be an authentication token, password-reset code or similar secret. Consider what happens when such a page embeds content from a third-party website, such as an image. When fetching that resource, the browser sends a Referer header containing the complete URL (minus the fragment identifier) of the current page. Because that includes sensitive data carried in the query string, the third-party website is now able to impersonate the user or otherwise access private user information stored at the originating site. This type of vulnerability is called a referrer leak. The same outcome happens with a delay if the user clicks on a link from that page to navigate to an external site. There are a couple of ways to mitigate this risk. Using POST instead of GET keeps sensitive form parameters in the request body instead of in the URL. Another option is to redirect back to the current page with the sensitive query-string parameters removed. This only works if these parameters can be stashed someplace else, such as in a cookie, since they are typically an integral part of the flow.
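To make the leak and the redirect mitigation concrete, here is a minimal Python sketch. It assumes a browser that sends the full page URL as the Referer, as described above. The parameter names in SENSITIVE_PARAMS, the helper functions and the bank.example URL are all made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical names of query parameters that carry secrets.
SENSITIVE_PARAMS = {"token", "reset_code"}


def referer_for(page_url):
    """Return the full-URL Referer value: the page URL minus its fragment."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def strip_sensitive(page_url):
    """Return (clean_url, secrets): the redirect target and the values to stash elsewhere."""
    parts = urlsplit(page_url)
    kept, secrets = [], {}
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in SENSITIVE_PARAMS:
            secrets[name] = value  # e.g. move into a cookie or server-side session
        else:
            kept.append((name, value))
    clean_url = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
    return clean_url, secrets


page = "https://bank.example/reset?user=alice&token=s3cr3t#form"
leaked = referer_for(page)  # what a third-party image host would see
clean_url, secrets = strip_sensitive(page)
safe = referer_for(clean_url)  # what it sees after the redirect

print(leaked)  # https://bank.example/reset?user=alice&token=s3cr3t
print(safe)    # https://bank.example/reset?user=alice
```

The page must not load any third-party content before the redirect happens, otherwise the original URL has already been sent. The secrets dictionary stands in for whatever storage the site uses to carry the token through the rest of the flow.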
But given broader privacy concerns with Referer, does it make sense to deprecate this header altogether? In the second part of this post, we will look at some attempts at doing that and argue they are fundamentally incapable of addressing the privacy problem.