Cloud backup and privacy: the problem with SpiderOak (part II)


In the crowded field of online backup services, SpiderOak is an example of a company trying to distinguish itself on privacy. Billing itself as a “zero-knowledge privacy environment,” the company emphasizes what it can NOT do:

SpiderOak is, in fact, truly zero knowledge.  The only thing we know for sure about your data is how many encrypted data blocks it uses […]  On the servers, we only see sequentially numbered data blocks — not your foldernames, filenames, etc.

As expected, this also translates into a limitation around password reset:

How is this reconciled with our ability to do a password reset?  The short answer is: It isn’t!  We cannot reset your password.  When you create a SpiderOak account, the setup process happens on your computer […] and there your password is used in combination with a strong key derivation function to create your outer layer encryption keys. Your password is never stored as part of the data sent to SpiderOak servers.

So far, so good. All user data is encrypted using keys derived from the password, before that information is backed up to the cloud. That password in turn is never communicated to the cloud provider. On the surface this appears to satisfy property #3 (and by implication #2) alluded to in the previous post: the service provider can not access user data even with full use of its own resources.

But there is a catch: values derived from the password are stored. The details are buried in the engineering matters section, under “User Authentication Details.” Ostensibly written to assure users that the protocol for verifying knowledge of the password is sound, it amounts to an admission that there is something stored by the service provider that can be used to distinguish correct versus incorrect password submissions. Specifically:

  1. Two random salts, stored in the clear by necessity
  2. A serialized RSA public key, also stored as plaintext
  3. A “challenge key” that is computed as the output of a specific key-derivation function with the first salt, namely PBKDF2(password, first salt)
  4. Full RSA key including the private-half, AES-encrypted using the output of PBKDF2(password, second salt) as the encryption key

That combination serves as a password hash. It can be brute-forced. Given the first random salt and challenge key, it is possible to check if a password guess such as “asdfgh” is correct by re-computing the same key-derivation process via PBKDF2 and comparing the result to the stored value. That means it is in fact possible to recover data by trying a large number of possible passwords. While the effectiveness of such an attack depends on the user’s choice of password and the computing power available to the attacker, the risk calculus is the same in all cases. Data recovery can be attempted by the service provider going rogue, a disgruntled employee acting independently, or a law-enforcement or intelligence agency that obtains access to the encrypted data from the provider. This is in fact corroborated by one of the privacy FAQs, which directly takes up the question of whether user data can be recovered with access to bits stored in the cloud:

Unless there are significant advances in mathematics […] password derivation techniques on the SpiderOak key structure are very difficult. The key derivation functions we use are strongly designed to withstand heavy brute force password techniques and pre-computation, such that even on a very modern computer, each password guess takes about one second. […]  Of course, if you were to choose a password that is made entirely from words in a dictionary, fewer attempts may be needed to guess it.

That is the glass-half-full view. Key derivation is indeed using PBKDF2 with a reasonable number of iterations set to 16384. But hobbyists have already built password-cracking rigs that achieve billions of hashes per second, where the hash function is the underlying primitive operation. Bumping up the repetitions helps quantitatively, but does not address the root cause. As ArsTechnica found out much to their surprise, that random-looking “qeadzcwrsfxv1331” may not be a great choice after all.
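The offline check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the post does not say which hash function or output length SpiderOak pairs with PBKDF2, so HMAC-SHA256 with a 32-byte output is assumed here, and the function names and the short candidate list are hypothetical.

```python
import hashlib
import hmac
import os

ITERATIONS = 16384  # iteration count quoted in the post


def derive_challenge_key(password: str, salt: bytes) -> bytes:
    # Assumption: HMAC-SHA256 with a 32-byte output. The post does not
    # state which hash is used underneath PBKDF2.
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=32
    )


def find_password(candidates, salt, stored_challenge_key):
    """Return the first candidate that reproduces the stored value, else None."""
    for guess in candidates:
        derived = derive_challenge_key(guess, salt)
        if hmac.compare_digest(derived, stored_challenge_key):
            return guess
    return None


# Hypothetical provider-side record for a user who picked a weak password.
first_salt = os.urandom(16)
stored = derive_challenge_key("asdfgh", first_salt)

# Anyone holding (first_salt, stored) can test guesses without the user.
recovered = find_password(["password", "letmein", "asdfgh"], first_salt, stored)
print(recovered)
```

Nothing in the loop needs cooperation from the user or the client software. The iteration count only sets the cost of each guess; it does not change the fact that guesses can be tested.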

In case this seems like an inescapable consequence of how encryption works, consider a hypothetical alternative design. Suppose a user manages their own RSA encryption key, stored on their machine. This key is used to encrypt a randomly generated AES key, which is in turn used to encrypt bulk data uploaded to the cloud. In this model, there is no password to brute-force from any data uploaded to the cloud. Ciphertext available to the cloud provider is encrypted under a truly random 128-bit key, where all possible choices of the key are equally likely. (As an aside: that RSA private key may be locally encrypted with a user-chosen passphrase, which sounds like rearranging deck chairs. There is a critical difference: brute-forcing that key will require access to the user’s machine. There is nothing uploaded to the cloud that helps.) Of course this would mean the data is not accessible on other devices unless the private key can be roamed there. That is why the ideal implementation would utilize smart cards instead of locally storing keys on disk. Still, the possibility of excluding brute-force attacks can be demonstrated without resorting to any fancy gadgets.
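To put rough numbers on the gap between a password-derived key and a random 128-bit key, here is a small sketch using only the Python standard library. The two password shapes are arbitrary examples chosen for illustration, and the figures are upper bounds: they assume every character is picked uniformly at random, which human-chosen passwords rarely are.

```python
import math
import secrets

# Data key in the hypothetical design: 128 bits drawn uniformly at random.
data_key = secrets.token_bytes(16)


def search_space_bits(alphabet_size: int, length: int) -> float:
    """Bits of entropy for a password whose characters are chosen uniformly."""
    return length * math.log2(alphabet_size)


lowercase_8 = search_space_bits(26, 8)     # 8 lowercase letters
printable_12 = search_space_bits(95, 12)   # 12 printable ASCII characters
random_key_bits = len(data_key) * 8        # 128

print(f"8 lowercase letters:  about {lowercase_8:.1f} bits")
print(f"12 printable chars:   about {printable_12:.1f} bits")
print(f"random AES key:       {random_key_bits} bits")
```

Even the stronger of the two example passwords falls well short of 128 bits, which is why the attacker's best target in a password-derived scheme is the password rather than the key.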

There is a broader architectural flaw here. Designs in the spirit of SpiderOak are badly conflating two orthogonal problems:

  • Encrypting user data with keys that are managed directly by the user and not available to any third-party
  • Saving the resulting ciphertext after encryption to a third-party cloud provider

Many popular solutions already exist for the first problem, with different security properties, key management options and cross-platform availability: BitLocker, PGP disk encryption, TrueCrypt, loop-AES, FileVault, … There is little reason to introduce yet another arbitrary scheme with new risks, in this case susceptibility to brute forcing by the cloud provider.

Following posts will look at experimental ways to “compose” existing local encryption schemes with cloud backup services transparently, without giving up any control over cryptography and key management.

CP

4 thoughts on “Cloud backup and privacy: the problem with SpiderOak (part II)”

  1. I look after customer service at Conformal Systems and a customer of ours recently asked me to compare our online backup Cyphertite with SpiderOak. I knew our encryption process was very strong, but wanted to contrast it with what SpiderOak does. After spending some time digging pretty deep into their site, I was unable to find specific details on the encryption process they use. Reading this blog got me thinking about SpiderOak’s claim to be open source. It is true that they have released some of the libraries they have built, but some major details about the crypto are missing. If you have a strong encryption process, it is unlikely it will be broken even with the details published. Why not share the encryption details and the code so the community can review and verify it? We have our entire encryption process readily available on our web site and the code is also there to verify we do what we say we do with your data. If you are interested:

    Site: https://www.cyphertite.com
    Crypto White Paper: https://www.cyphertite.com/papers/WP_Crypto.pdf
    Source Code: https://www.cyphertite.com/download.php

  2. Why do you imply that the brute force for what SpiderOak stores falls under the fast hash category? It clearly states in the Ars article you link that it is a slow hash, reducing that few billions per second to a few thousand per second. That is a factor of a few million in difference and an egregious error in your analysis.

    • Where does it imply that it is a fast hash? The paragraph specifically says:

      “Key derivation is indeed using PBKDF2 with a reasonable number of iterations set to 16384.”

      Meanwhile the ArsTechnica article has no connection to SpiderOak. It provides general context around speed of the primitive operation (which is repeated 16384 times for SpiderOak as stated in the post) and how sophisticated dictionary models have become in generating guesses.

      (BTW this is taking SpiderOak at their word, since there is no way to confirm the assertion. For all we know they could be storing passwords in the clear or could make a targeted change to use a weaker hash for a particular user in response to a subpoena.)
