Continuing revelations from The Guardian and Washington Post about the extent of US and British surveillance over Internet communications are once again raising questions around privacy and cloud computing. This is not the first time critics have argued that storing ever greater amounts of data with remote services is a step backwards. But in previous instances the underlying issues were often quirks of regulation, such as ECPA setting a lower bar for access to stored communications.
There is a more fundamental reason cloud services amplify privacy risks, even if they do not necessarily create them in the first place. The ideal model from a commercial perspective calls for hoarding user data and having free rein over processing that data internally. That freedom enables services built on intelligent ways of crunching information to generate new value. The ideal model from the perspective of user privacy calls for minimizing data collection and keeping the provider at arm’s length from having direct access to data. Models optimizing for privacy stand at a significant disadvantage in economic terms.
It comes down to a distinction between two different uses of the cloud: storage and processing. Storage is a commodity. Processing is not. Storage can easily be made privacy-friendly; processing cannot.
Let’s take a simple example: backing up files to protect against data loss. Hardware failures happen, disks crash, sometimes entire systems are stolen in residential burglaries. Having each person protect against this individually is unwieldy: imagine backing up your data regularly on drives and locking them away in bank vaults. There are clear economies of scale from centralizing that into commercial services, offering users the option to have their data remotely backed up over the network to the cloud, which is shorthand for distributing it across data centers that may be located anywhere in the world. This enables another use case: since the information uploaded is always accessible from anywhere with a network connection (unlike offline backups stored in a vault), consumers also enjoy the benefit of mobility.
The critical question: can the provider offering this service read uploaded data? In principle there is no need. All of the information can be encrypted by the user before getting uploaded to the cloud, using cryptographic keys that are only known to the owner of that information. If the user experiences data loss and wants to restore the files lost, they download the encrypted bits from the cloud, then use those keys to decrypt locally to recover the original information.
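The flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any real backup product: `CloudStore` is a hypothetical stand-in for the provider, and the cipher is a keystream derived from HMAC-SHA256 with an integrity tag, chosen only so the sketch runs on the standard library. A real client would use a vetted authenticated cipher such as AES-GCM from an established library.

```python
import hashlib
import hmac
import secrets

NONCE_LEN = 16
TAG_LEN = 32  # size of an HMAC-SHA256 tag


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudorandom bytes from key and nonce (illustrative only)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest())
        counter += 1
    return bytes(out[:length])


def encrypt(enc_key: bytes, mac_key: bytes, plaintext: bytes) -> bytes:
    """Encrypt locally; the result is the only thing the provider ever receives."""
    nonce = secrets.token_bytes(NONCE_LEN)
    stream = _keystream(enc_key, nonce, len(plaintext))
    body = bytes(p ^ s for p, s in zip(plaintext, stream))
    tag = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def decrypt(enc_key: bytes, mac_key: bytes, blob: bytes) -> bytes:
    """Verify the tag, then decrypt locally after download."""
    nonce, body, tag = blob[:NONCE_LEN], blob[NONCE_LEN:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(mac_key, nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("blob was modified, or the wrong key was supplied")
    stream = _keystream(enc_key, nonce, len(body))
    return bytes(c ^ s for c, s in zip(body, stream))


class CloudStore:
    """Hypothetical provider: stores opaque blobs and never sees any key."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def upload(self, name: str, blob: bytes) -> None:
        self._blobs[name] = blob

    def download(self, name: str) -> bytes:
        return self._blobs[name]


if __name__ == "__main__":
    # Keys are generated and kept on the user's machine only.
    enc_key, mac_key = secrets.token_bytes(32), secrets.token_bytes(32)
    store = CloudStore()
    store.upload("photos.tar", encrypt(enc_key, mac_key, b"family photos"))
    # After a disk crash: download the ciphertext and decrypt it locally.
    print(decrypt(enc_key, mac_key, store.download("photos.tar")))
```

The provider's side of this sketch has no code path that touches a key, which is the whole point: losing `enc_key` and `mac_key` means losing the backup, and no one at the service can recover it.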
The problem is that few services operate this way. There is one design/engineering reason and one business reason for that. To get the design argument out of the way: “users cannot be trusted to manage their own encryption keys,” the critics charge. “If they lose access to keys, we cannot be the ones to tell them all their data is gone.” (One amusing manifestation of this is a type of design where data is encrypted but the encryption keys are escrowed with the service provider, in other words, useless window dressing.) Aside from the problem of patronizing users, this argument overlooks the fact that many fielded systems work exactly that way. BitLocker disk encryption technology in Windows makes it very clear that loss of keys means loss of access to encrypted volumes. It nudges and cajoles the user into printing a hard copy of recovery keys for safekeeping to forestall that outcome.
The deeper problem is economic.
Remote storage is a mature technology with limited room for innovation. It has a simple, one-dimensional competitive model based on price per gigabyte of capacity. Some variability can be thrown in by an escalating feature arms race: Does it support sharing? Can files be accessed from mobile devices? Is syncing automated? Yet it is difficult to distinguish a service based on these features because at the end of the day most providers have converged on similar paradigms with comparable feature sets. Cloud storage appears as a local folder or drive, with items uploaded by simply dragging icons into that location, using the familiar GUI metaphor. Once feature sets have reached equilibrium and all the check-boxes are marked in the comparison table, the result is a race to the bottom in pricing between interchangeable services, competing on offering the most storage at the lowest cost. That price will quickly converge to zero.
It is much easier to compete on clever ways of processing uploaded data, “adding value” in the IT lexicon. For example, scan all the documents for keywords and offer full-text search. Index all of the photographs, sort them by location and time taken, identify faces for tagging. Allow publishing those images for all the world to admire, or limited sharing with groups of friends. Given a spreadsheet modeling financial data, keep it updated with stock quotes in real time. Sometimes the processing is for the benefit of the provider: scan email messages for keywords to display relevant advertising. The extreme case is “pure” cloud computing, where all processing of the data, including creating and editing the documents, is done by interacting with the service online.
Such intelligence built into the hosted service is a sustainable advantage that can continue differentiating the business over time. The price of storage continues to drop thanks to Moore’s law but that has an equalizing effect. Not only does a rising tide lift all boats, it wipes out any temporary advantage. Even if one provider temporarily builds a cheaper or more efficient storage system using custom in-house designs, less skilled competitors will sooner or later close the gap when they purchase the next generation of hardware off the shelf. (Better yet, they can outsource storage requirements to an existing cloud platform such as Amazon S3 or Azure to benefit from its economies of scale.) By contrast, better algorithms for crunching customer data and producing new information provide a competitive edge that is more difficult to replicate. Improvements in hardware alone do not help achieve parity. Nor can these capabilities be purchased from a third party as part of a standardized offering.
To summarize: economic pressures on cloud services create strong incentives for amassing customer data in ways that can be readily processed. It is not an appealing proposition to become glorified disk drives in the cloud storing opaque blocks of encrypted information the service has no visibility into. That incentive structure means that cloud services will continue to concentrate information security risk in the short term. User data may well be protected in transit and even in storage; websites boast of impressive-sounding data practices such as “military grade AES 256-bit encryption.” At the end of the day, the business model depends on being able to recover the original data and operate on it. Regardless of how many layers of encryption exist, somewhere some machine controlled by the service provider has the ability to reconstruct the data. That means so can other people: disgruntled employees, foreign governments such as China conducting industrial espionage, as well as overreaching surveillance programs from US intelligence.
** Is there a middle ground? Some processing can be done on encrypted data. From the early days of cryptography, researchers noted that some encryption schemes had a useful property: given encrypted versions of unknown values, it was sometimes possible to compute a function of the original values, such as their sum or product, without decrypting them. But these remained parlor tricks. It was not until more recently that the Holy Grail of the field, fully homomorphic encryption (FHE), was first constructed, allowing the computation of arbitrary functions on encrypted data, in theory. The operative keyword remains “theory”: these schemes are extremely inefficient in the general case. For some important use cases such as searching over encrypted text, more specialized, practical implementations exist. To date no major cloud service has attempted to commercialize that model.
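To make the "sum or product" property concrete, here is a minimal Python sketch of two classic partially homomorphic schemes. The choice of schemes is an assumption for illustration, since the text above does not name any: textbook (unpadded) RSA, where multiplying ciphertexts multiplies the plaintexts, and Paillier, where multiplying ciphertexts adds the plaintexts. The primes are toy values far too small for real use, and the `provider_*` function names are hypothetical.

```python
import math
import secrets

# Toy primes, chosen only so the arithmetic is easy to follow. Not secure.
P, Q = 61, 53
N = P * Q
PHI = (P - 1) * (Q - 1)

# --- Textbook RSA (no padding): homomorphic with respect to multiplication ---
E = 17
D = pow(E, -1, PHI)  # private exponent via modular inverse (Python 3.8+)


def rsa_encrypt(m: int) -> int:
    return pow(m, E, N)


def rsa_decrypt(c: int) -> int:
    return pow(c, D, N)


def provider_multiply(c1: int, c2: int) -> int:
    """Run by the provider: needs only ciphertexts and the public modulus."""
    return (c1 * c2) % N


# --- Paillier: homomorphic with respect to addition ---
N2 = N * N
LAM = PHI // math.gcd(P - 1, Q - 1)  # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)


def paillier_encrypt(m: int) -> int:
    # Pick a random r that is invertible modulo N.
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def paillier_decrypt(c: int) -> int:
    return ((pow(c, LAM, N2) - 1) // N * MU) % N


def provider_add(c1: int, c2: int) -> int:
    """Run by the provider: multiplying ciphertexts adds the hidden values."""
    return (c1 * c2) % N2


if __name__ == "__main__":
    a, b = 7, 12  # values known only to the user

    product_ct = provider_multiply(rsa_encrypt(a), rsa_encrypt(b))
    print(rsa_decrypt(product_ct) == a * b)

    sum_ct = provider_add(paillier_encrypt(a), paillier_encrypt(b))
    print(paillier_decrypt(sum_ct) == a + b)
```

Each scheme supports only one operation, and results are correct only while they stay below the modulus. Supporting both addition and multiplication on the same ciphertexts, which is what arbitrary computation requires, is the step that fully homomorphic encryption adds.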