Last week a local Portland station drew attention to Nordstrom piloting a program to track shoppers in-store using unique identifiers from their smartphones. An intriguing quote from an article on StorefrontBacktalk covering the same story:
“To be precise, the MAC addresses of those shoppers are not being stored by Euclid; instead, a hashed version of those MAC addresses is being stored.”
The subtext of this statement is an article of faith about hashing: that replacing sensitive information with its hash magically assuages the privacy concerns associated with collecting personally identifiable information. In the case of Nordstrom and Euclid, the data in question is the unique hardware identifier of the wireless network adapter present in most smartphones. While exact details of the hashing process are not given in the article or, for that matter, on the Euclid website, some very general arguments can be advanced to the effect that hashing MAC addresses is unlikely to help in this case.
Quick detour into cryptography: a cryptographic hash function (hash function for short) is a mathematical abstraction designed to be easy to compute forward but difficult to invert. That is: given some input message M of any size– could be as short as an email address or as large as an MP3 file– we can compute a concise digest of that message quickly by applying a prescribed algorithm. But given such a fingerprint that came out of a computation where we were not privy to the original input, it should be very difficult– in other words, require inordinate amounts of computing power– to run the function backwards and come up with a message that could have been used as starting point to produce that fingerprint. (For completeness, there are additional requirements around pair-wise collision resistance, but these will not come into play.)
Storing a hashed version of sensitive information looks like a privacy win. Instead of storing the MAC address of a shopper “00:87:44:D3:50:A4” we run that through a well-known hash function such as SHA1 and store the output: e266b50d6a98dafc962e9b7724092304170a3b8a. That may look like merely replacing one sequence of indecipherable symbols with another, but it hides information present in the original. For example, MAC addresses have internal structure, with the first three bytes assigned to the hardware manufacturer. By looking up those bytes in a public registry, it is possible to determine the manufacturer. This alone can help distinguish users by the type of phone they are carrying, since all units of a given model usually have the same type of wireless adapter. A good hash function wipes out such information. There is no correlation between the first three bytes of the MAC and the output. Such simple mappings have been scrambled. The second benefit is that hashing can make it more difficult to link one observation (“user with hashed MAC address X walked into the store at 2:48PM”) against others involving the same person from other data sources (such as “user with MAC address 008744D350A4 has Twitter handle of @alice”) by removing the common identifier that ties these records together.
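To make the first benefit concrete, here is a minimal sketch of hashing a MAC address with SHA1 using Python's standard hashlib module. The helper name hash_mac and the choice to strip separators and lowercase the address before hashing are assumptions for illustration; the article does not say how Euclid encodes its input, and a different encoding yields a different digest.

```python
import hashlib

def hash_mac(mac: str) -> str:
    """Return the SHA-1 hex digest of a MAC address.

    Assumption: the address is normalized to 12 lowercase hex digits
    (separators removed) before hashing. This encoding is a guess.
    """
    normalized = mac.replace(":", "").replace("-", "").lower()
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()

# Two hypothetical adapters from the same manufacturer (prefix 00:87:44),
# differing only in the last byte.
a = hash_mac("00:87:44:D3:50:A4")
b = hash_mac("00:87:44:D3:50:A5")
print(a)
print(b)
```

The two inputs share their first five bytes, yet the digests should look unrelated, which is the sense in which the manufacturer prefix is "scrambled".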
There are two problems with this line of argument:
1. “One-way” is a computational notion. It only means there is no efficient algorithm to invert the function, to go from observed hash back to an input that generated it. This does not preclude very inefficient options, such as hashing all possible inputs to find the right one. Whether that is feasible depends on the number of candidates.
2. To prevent linking across different datasets based on a unique identifier such as MAC address, everyone has to adopt hashing. (Not only that, but use incompatible hash functions on purpose. Otherwise if everyone picked an unmodified function such as SHA1, then “SHA1 of MAC address” becomes the new de facto unique identifier for correlation.) It is not possible for one data owner to unilaterally prevent future linking by hashing their own records.
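The second problem can be sketched in a few lines. The two datasets below are made up, and the per-owner keyed hash (HMAC-SHA1 with a secret key) is only one possible way to make hash functions "incompatible on purpose"; nothing here reflects what Euclid or anyone else actually does.

```python
import hashlib
import hmac

def sha1_mac(mac: str) -> str:
    """Plain SHA-1 of a MAC normalized to lowercase hex (assumed encoding)."""
    return hashlib.sha1(mac.lower().encode("ascii")).hexdigest()

def keyed_mac(key: bytes, mac: str) -> str:
    """HMAC-SHA1 of a MAC under a per-owner secret key (hypothetical scheme)."""
    return hmac.new(key, mac.lower().encode("ascii"), hashlib.sha1).hexdigest()

# Hypothetical records held by two unrelated parties.
store_visits = {sha1_mac("008744D350A4"): "walked into the store at 2:48PM"}
social_profiles = {sha1_mac("008744d350a4"): "@alice"}

# Both parties used the same unmodified function, so the digests join directly.
linked = {h: (store_visits[h], social_profiles[h])
          for h in store_visits.keys() & social_profiles.keys()}
print(linked)

# With different secret keys, the same address no longer yields a common value.
store_id = keyed_mac(b"store-secret", "008744D350A4")
social_id = keyed_mac(b"social-secret", "008744D350A4")
print(store_id == social_id)
```

The join succeeds because "SHA1 of MAC address" is just as unique as the MAC address itself. The keyed variant breaks the join only if every data owner adopts it and keeps its key private.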
The Nordstrom scenario is an example of the first problem. If it proves “easy” to recover original MAC addresses from hashed versions, the benefits vanish.
To take an extreme case where hashing clearly does not help: consider a health records database with one particularly sensitive column. This column can take two values: zero or one, depending on whether the patient tested negative or positive for performance-enhancing substances. Replacing “0” with the hash of 0 and “1” with the hash of 1 does absolutely nothing to improve the privacy of these records, for any choice of hash function. There are just too few possibilities: anyone who has access to the records and knows the hash function can try both zero and one to uncover the status of any subject. (In fact, for such an extreme case one need not even know the hash function, so mixing a secret key into the process does not help either. If there is any a priori information about the expected percentage of cheaters, simply looking at how often each of the two values occurs will suffice. For example, if we assume optimistically that crooks are in the minority, then the hash value that appears less often must be the one corresponding to a positive test.)
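Both attacks on the two-valued column fit in a short sketch. The test results are invented, the function names are hypothetical, and HMAC-SHA1 stands in for "mixing a secret key into the process"; the frequency attack never touches the key.

```python
import hashlib
import hmac
from collections import Counter

def plain_hash(value: str) -> str:
    return hashlib.sha1(value.encode("ascii")).hexdigest()

def reverse_by_guessing(digest: str) -> str:
    """Attack 1: the hash function is known, so just try both inputs."""
    for candidate in ("0", "1"):
        if plain_hash(candidate) == digest:
            return candidate
    raise ValueError("digest does not match either candidate")

def keyed_hash(key: bytes, value: str) -> str:
    return hmac.new(key, value.encode("ascii"), hashlib.sha1).hexdigest()

def rarest_digest(column: list) -> str:
    """Attack 2: with no key, assume the less frequent digest means 'positive'."""
    counts = Counter(column)
    return min(counts, key=counts.get)

# Made-up data: nine negative results and one positive, stored as keyed hashes.
secret_key = b"unknown-to-the-attacker"
results = ["0"] * 9 + ["1"]
column = [keyed_hash(secret_key, r) for r in results]

rare = rarest_digest(column)
flagged = [i for i, d in enumerate(column) if d == rare]
print(reverse_by_guessing(plain_hash("1")))
print(flagged)
```

The second attack depends entirely on the assumption that positives are the minority; if the split were close to even, counting alone would not settle which digest is which.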
MAC addresses have a lot more than two possible values: roughly 281 trillion.** That may seem intractable, but the attacker is helped by the surprising efficiency of common hash functions and the speed at which they run on modern hardware, an evolution driven in large part by research on password cracking. (In fact, since passwords are typically stored in salted, hashed form, that they can be recovered at all is a rebuttal to naïvely equating hashing with privacy.) Using the popular cryptographic hash function SHA1 as an example, a single high-end GPU can grind through one billion hashes per second. Cluster together three dozen such processors, or better yet rent them from Amazon, and every possible MAC can be compared against a given mystery hash in a couple of hours. (It should be emphasized that we do not know what hashing algorithm Euclid is using. But this is an example where a perfectly reasonable choice employed in many security applications fails to provide privacy.)
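The back-of-the-envelope arithmetic behind these figures, using only the numbers quoted above (a 48-bit address space, one billion SHA1 hashes per second per GPU, thirty-six GPUs). The function name is made up for this sketch, and real throughput will vary with hardware.

```python
TOTAL_MACS = 2 ** 48               # every possible 48-bit MAC address
HASHES_PER_GPU_PER_SEC = 10 ** 9   # figure quoted in the text
GPUS = 36                          # "three dozen such processors"

def exhaustive_search_hours(candidates: int, gpus: int,
                            rate: int = HASHES_PER_GPU_PER_SEC) -> float:
    """Hours needed to hash every candidate once at the given rate."""
    return candidates / (gpus * rate) / 3600

single = exhaustive_search_hours(TOTAL_MACS, 1)
cluster = exhaustive_search_hours(TOTAL_MACS, GPUS)

print(f"{TOTAL_MACS:,} candidates")
print(f"1 GPU: {single:.1f} hours")
print(f"{GPUS} GPUs: {cluster:.1f} hours")
```

That works out to a bit more than three days on one GPU and a little over two hours on the cluster, for the worst case where the whole space must be searched.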
The picture gets worse when considering attacks against a large number of users. Spending hours of computing time to recover a single MAC address may not seem economically viable for data mining purposes. But the marginal cost of inverting one more hash drops rapidly, thanks to more efficient cryptographic attacks using time-memory tradeoffs. These call for a sizable upfront pre-computation phase to build a massive table, which can be used later to crack individual hashes much faster than exhaustive search. The algorithm in effect takes up more storage space but reduces the time of each search. Refinements of this idea underlie the rainbow-table approach used for cracking Windows passwords. In other words, the cost of recovering MAC addresses does not scale linearly in the number of users. Bulk deanonymization is only slightly more expensive than going after a handful of individuals.
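The simplest form of this tradeoff is a full lookup table: hash every candidate once, store the results, and answer each later query with a dictionary lookup. The sketch below does this for a toy space of 65,536 addresses sharing a hypothetical 8-hex-digit prefix. It is not a rainbow table, which stores far less by keeping only the endpoints of hash chains; it only illustrates why the per-hash cost collapses after the upfront work.

```python
import hashlib

def sha1_hex(mac: str) -> str:
    return hashlib.sha1(mac.encode("ascii")).hexdigest()

def build_table(prefix: str) -> dict:
    """Map digest -> MAC for every 12-hex-digit address starting with prefix.

    Assumes MACs are hashed as 12 lowercase hex digits and that the
    prefix leaves at least one digit free.
    """
    free_digits = 12 - len(prefix)
    if free_digits < 1:
        raise ValueError("prefix must be shorter than 12 hex digits")
    table = {}
    for n in range(16 ** free_digits):
        mac = prefix + format(n, f"0{free_digits}x")
        table[sha1_hex(mac)] = mac
    return table

# One-time cost: 16**4 = 65,536 hashes for this toy prefix.
table = build_table("008744d3")

# Each later lookup is a single dictionary access, however many users there are.
observed = [sha1_hex("008744d350a4"), sha1_hex("008744d30001")]
recovered = [table.get(d) for d in observed]
print(recovered)
```

A table covering all 2^48 addresses would be far too large to hold this way, which is exactly the problem that chain-based refinements address by trading some lookup time back for storage.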
Bottom line: hashing does not magically anonymize personally identifiable information. In this case, there may not be much difference in privacy between storing MAC addresses and storing hashes. Without additional context, there is little reason to take comfort in a blanket statement to the effect that sensitive data is hashed.
CP
** In reality, the hierarchical assignment of MAC ranges to different hardware manufacturers, combined with the fact that only certain models appear in smartphones, greatly reduces the range of possibilities. Here we assume the worst-case scenario.
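A rough sense of how much the manufacturer prefix shrinks the search: each three-byte prefix leaves only 2^24 possible addresses. The count of 100 relevant prefixes below is purely hypothetical, chosen for illustration, and the hash rate is the one-billion-per-second figure used earlier.

```python
ADDRESSES_PER_PREFIX = 2 ** 24     # three free bytes after the manufacturer prefix
FULL_SPACE = 2 ** 48
HASHES_PER_SEC = 10 ** 9           # single GPU, figure quoted in the post

def reduced_space(num_prefixes: int) -> int:
    """Candidates left if only num_prefixes manufacturer prefixes are plausible."""
    return num_prefixes * ADDRESSES_PER_PREFIX

# Hypothetical: suppose 100 prefixes cover the adapters found in smartphones.
candidates = reduced_space(100)
seconds = candidates / HASHES_PER_SEC
shrink_factor = FULL_SPACE / candidates

print(f"{candidates:,} candidates, about {seconds:.2f} s on one GPU")
print(f"search space is about {shrink_factor:,.0f} times smaller")
```

Under that assumption the search drops from hours on a cluster to under two seconds on a single GPU.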