Notions of privacy in cloud computing
Private computation in the cloud has been around—in concept, if not in actual implementation—for almost as long as cloud computing itself, or “grid computing” as its early predecessors were known by a distinctly industrial-sounding moniker. From the outset, concerns about data security have been one of the primary obstacles to outsourcing computing infrastructure to third parties. What happens to proprietary company information when it is sitting on servers owned by somebody else? Can the cloud provider be trusted not to peek at customer data or tamper with the operation of services that tenants run inside the virtual environment? Can the IaaS provider guarantee that some rogue employee will not help themselves to the confidential data uploaded there? What protections exist when government agencies with creative interpretations of the Fourth Amendment show up at the door? Initially cloud providers were quick to brush aside these concerns, or at best respond with appeals to brand authority and credentials (ISO 27001 certified, PCI compliant, etc.). Once customers proved skeptical and demanded actual evidence of operational security, special cases were crafted: Amazon provides a dedicated cloud for its government customers, presumably with improved controls and isolated from the unwashed masses running less sensitive applications in their own VMs.
Meanwhile the research community has welcomed the emerging questions as the starting point for a research agenda around computing on encrypted data. These schemes take a drastic step by placing no trust in cloud providers, positing that they will only receive encrypted data which they can not decrypt, not even temporarily (an important distinction that critically fails for many existing systems, as we will see). The question becomes whether cloud services can still perform meaningful operations on encrypted data, such as searching for keywords or number-crunching, producing results which can only be decrypted by the original owner. That approach holds a lot of promise, provided it can be implemented efficiently: it preserves the advantage of cloud computing (leasing CPU cycles, RAM and disk space from someone else on demand) while maintaining confidentiality of the data processed.
Between these ends of the spectrum, private computation in the cloud appears to be caught between a rock and a hard place:
- Stuff that can not possibly work: impassioned self-declarations of honesty/competence by the vendor
- Stuff that does not yet work: promising research in the academic literature around special cases, without a feasible solution for the general case
Solutions in the first category effectively boil down to the premise: “trust us, we will not peek at your data.” Some of these are transparently non-technical in nature: for example, warrant canaries are an attempt to work around the gag orders accompanying national security letters (NSLs) by using the absence of a statement to hint at some incursion by law enforcement. Others attempt to bury those critical trust assumptions in layers of complex technology.
Box & enterprise key-management
Take for instance Box’s enterprise key management (EKM). On paper this attempts to address a legitimate scenario discussed in earlier posts: storing data in the cloud encrypted such that the cloud provider can not read the data even if it wanted to. This is a far cry from how popular cloud storage providers operate today: by default Google Drive, Microsoft OneDrive and Dropbox have full access to customer data. Sure, they may store that data encrypted within their own data centers, a capability hilariously touted as some ground-breaking privacy feature or competitive advantage. In reality such encryption is only there to protect against risks internal to the cloud service provider: rogue employees, theft of hardware from data centers, etc. At the end of the day that layer of encryption is fully reversible by the hosting service, without any cooperation required from the original custodian.
Far more robust are designs which guarantee that the storage provider can not recover user data, even if it were compelled or its employees were hit with the evil stick. Until a few years ago, that level of security would have seemed a quixotic requirement out of paranoid minds. Along came Snowden and a sea change happened: cloud providers started trying to build actual cryptographic protections instead of vouching for their good intentions.
The solution Box has announced with much fanfare decidedly fails to achieve that objective. To see why, let’s review the outline of that design, to the extent it can be gleaned from public sources:
- There is a master-key for each “customer” (defined as an enterprise, rather than end-user; recall that Box distinguishes itself from Dropbox and similar services by focusing on managed IT environments.)
- As before, individual files uploaded are encrypted with a key that Box generates.
- The new twist is that those individual bulk-encryption keys are in turn encrypted by the master-key.
- So far, this is only adding a hierarchical aspect to key management. Where EKM differs is in transferring custody of the master-key back to the customer, specifically to HSMs hosted at AWS (via CloudHSM), backed up by additional HSMs in the customer’s own data-center carrying replicas of the same secrets. (It is unclear whether this is a symmetric key or an asymmetric key-pair, but the latter design would make far more sense: it would allow encryption to proceed locally without involving remote HSMs, with only decryption requiring interaction.)
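The hierarchical scheme described above can be sketched in a few lines. This is a minimal illustration of envelope encryption, not Box’s actual implementation: a toy HMAC-based stream cipher stands in for whatever authenticated cipher (AES-GCM or similar) a real deployment would use, and all names are made up.

```python
import hashlib
import hmac
import secrets

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy stream cipher: HMAC-SHA256 in counter mode. Illustrative only;
    # a real system would use vetted authenticated encryption.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hmac.new(key, nonce + counter.to_bytes(4, "big"),
                         hashlib.sha256).digest()
        out.extend(block)
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

def encrypt_file(master_key: bytes, plaintext: bytes):
    # The provider generates a fresh per-file bulk key...
    file_key = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)
    ciphertext = keystream_xor(file_key, nonce, plaintext)
    # ...then wraps it under the customer's master key. Under EKM, the
    # unwrap operation would live inside the customer-controlled HSM.
    wrap_nonce = secrets.token_bytes(16)
    wrapped_key = keystream_xor(master_key, wrap_nonce, file_key)
    # file_key is now supposedly discarded; only the wrapped copy is stored.
    return nonce, ciphertext, wrap_nonce, wrapped_key

def decrypt_file(master_key: bytes, nonce, ciphertext, wrap_nonce, wrapped_key):
    # Every decryption requires unwrapping with the master key; this is the
    # HSM call that generates the audit trail Box points to.
    file_key = keystream_xor(master_key, wrap_nonce, wrapped_key)
    return keystream_xor(file_key, nonce, ciphertext)
```

Note that the security of the whole construction hinges on the per-file key being truly random and truly discarded, a point we return to below.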
Box implies that this last step is sufficient to provide “Exclusive key control – Box can’t see the customer’s key, can’t read it or copy it.”
Is that sufficient? Let’s ask what could go wrong.
Trust by any other name
First observe that the bulk-data encryption keys are generated by Box. These keys need to be generated “randomly” and discarded afterwards, keeping only the version wrapped by the master-key. A trivial way for Box to retain access to customer data — for example, if ordered to by law enforcement — is to generate keys using a predictable scheme, or simply stash aside the original key.
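To make that attack concrete, here is a purely hypothetical sketch of a backdoored key generator. Nothing here is drawn from Box’s actual code; it merely shows that a key derived from a retained secret seed is indistinguishable from a random one to any outside observer, yet fully recoverable by whoever holds the seed.

```python
import hashlib
import hmac

# Hypothetical backdoor: instead of drawing from a real CSPRNG, the provider
# derives each "random" bulk key from a secret seed it quietly retains.
BACKDOOR_SEED = b"seed-the-provider-never-discarded"

def backdoored_file_key(file_id: str) -> bytes:
    # Produces what looks like a fresh 256-bit random key for each file.
    return hmac.new(BACKDOOR_SEED, file_id.encode(), hashlib.sha256).digest()

# Upload time: the key is used, wrapped by the customer's HSM, then "discarded".
key_at_upload = backdoored_file_key("file-42")

# Any time later: the seed regenerates the same key. No HSM call is made,
# so no entry ever appears in the customer's audit trail.
key_recovered = backdoored_file_key("file-42")
assert key_at_upload == key_recovered
```

The point is not that Box does this, but that no customer can rule it out: the wrapped keys and audit logs look identical either way.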
Second, note that Box can still decrypt data anytime, as long as the HSM interface is up. Consider what happens when employee Alice uploads a file and shares it with employee Bob. At some future instant, Bob will need to get a decrypted copy of this file on his machine. Since Box must be given access to the HSMs, there must exist at least one path where that decryption takes place within Box’s environment, with Box making an authenticated call to the HSM. (Tangent: Box has a smart client and mobile app, so in theory decryption could also take place on the end-user PC. In that model HSM access is granted to customer devices instead of the Box service itself, keeping the trust boundary internal to the organization. But that model faces practical difficulties in implementation. Among other things, HSM access involves shared credentials: for example, in the case of the SafeNet Luna SA7000s used by CloudHSM, there is a partition passphrase that would need to be distributed to all clients. There is also the problem that user Alice could then decrypt any document, even those she was not granted permission to access. Working around such issues would require additional infrastructure, such as placing another service in front of the HSMs to authenticate users based on their own enterprise identity rather than their Box account. Even then there is the scenario of files downloaded from a web browser, where no such intelligence exists to perform on-the-fly decryption client-side.) That raises two problems. The most obvious one is that the call does not capture user intent. As Box notes, any request to the HSM will create an audit trail, but that is not sufficient to distinguish between two cases:
- Employee Bob is really trying to download the file Alice uploaded
- Some Box insider went rogue and wants to read that document
While there is an authentication step required to access the HSMs, those protocols can not express whether Box is acting autonomously or on behalf of a user on the other side of the transaction requesting a document.
The second problem applies even if Box refrains from making additional HSM calls in order to avoid arousing suspicion (just to be on the safe side, in case the enterprise is checking HSM requests against records of which documents its own employees accessed, even though those records are also provided by Box and presumably subject to falsification). During routine use of Box, in the very act of sharing content between users, plaintext of the document is exposed. If Box wanted to start logging documents — because it has gone rogue or is being compelled by law enforcement — it could simply wait until another user tries to download the same document, at which point decryption will happen anyway. No spurious HSM calls are required. For that matter, Box could wait until Alice makes some revisions to the document and uploads a new version in plaintext.
Point-in-time trust vs ongoing trust
Stepping back from these specific objections, there is an even more fundamental flaw in this concept: customers still have to trust that Box has in fact implemented a system that works as advertised. This is not one-time trust at the outset, but ongoing trust for the lifetime of the service. The first case would have been easy to accept. It is the type of optimistic assumption one makes all the time: when purchasing hardware from a manufacturer, one hopes the units were not backdoored. If the manufacturer decides to go rogue after the units are shipped, it is too late; they can not go back and compromise existing inventory in the field. (Barring auto-update or remote-access mechanisms, of course.) Here the problem is much worse: Box can go “rogue” anytime, to start logging cleartext data, silently escrow keys to another party or simply use weak keys that can be recovered later. Now current Box employees will no doubt swear upon a stack of post-IPO shares that no such shenanigans are taking place. This is the same refrain: “trust us, we are honest.” They are almost certainly right. But to outsiders their cloud service is an opaque black box: there is no way to verify that such claims are accurate. At best an independent audit may confirm the claims made by the service provider, corroborating one pledge of good intentions with another (“trust Ernst & Young, they are honest too”) without altering the core dynamic: this design critically relies on competent and faithful execution by the cloud provider to guarantee privacy.
A miss is as good as a mile
Why single out Box when this is the modus operandi for most popular cloud-storage services? Because of all cloud scenarios, storage is the most amenable to end-to-end privacy. Unlike searching encrypted documents or computing complex functions over a spreadsheet of encrypted columns, there is no cutting-edge research problem here. There are no breakthroughs required in efficient homomorphic encryption, no waiting for more iterations of Moore’s law to deliver hardware fast enough. It is already possible to implement remote storage in the cloud with nearly zero trust. In the same way that one does not have to worry about whether the disk drive in their laptop is selling out their owner, one can design systems such that service providers are truly glorified disk drives in the cloud, with no visibility into the bits they are storing. (It is of course a different question whether such services can be as profitable as those that “add value” by actively processing stored data and trying to offer additional services around it.) Such providers compete on availability: providing service without outages or data loss. They never have to compete on privacy by making heartfelt statements about their good intentions. Cryptography, properly employed, already solves that problem.
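The “glorified disk drive” model is simple enough to sketch. In this minimal illustration — again using a toy HMAC-based cipher as a stand-in for real authenticated encryption, with all names hypothetical — the key lives only on the user’s machine, and the provider stores bytes it has no way to interpret:

```python
import hashlib
import hmac
import secrets

def _stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy HMAC-SHA256 counter-mode keystream; stands in for AES-GCM.
    out = bytearray()
    for i in range((len(data) + 31) // 32):
        out.extend(hmac.new(key, nonce + i.to_bytes(4, "big"),
                            hashlib.sha256).digest())
    return bytes(a ^ b for a, b in zip(data, out))

class ZeroTrustClient:
    """Client-side encryption: the key never leaves the user's machine."""

    def __init__(self):
        self.key = secrets.token_bytes(32)

    def upload(self, store: dict, name: str, plaintext: bytes) -> None:
        # The provider (modeled as a plain dict) receives only ciphertext.
        nonce = secrets.token_bytes(16)
        store[name] = (nonce, _stream(self.key, nonce, plaintext))

    def download(self, store: dict, name: str) -> bytes:
        nonce, ciphertext = store[name]
        return _stream(self.key, nonce, ciphertext)
```

Unlike the EKM design, there is no HSM call for the provider to make and no key for it to mishandle: compromise of the store yields nothing but opaque bytes.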