A clear view into AI risks: watching the watchers

A recent NYT expose on ClearView only scratches the surface of the problems with outsourcing critical law-enforcement functions to private companies. To recap: ClearView AI is possibly the first startup to have commercialized face-recognition-as-a-service (FRaaS?) and is riding high on a recent string of successes with police departments in the US. The usage model could not be any easier: upload an image of a person of interest, and ClearView locates other pictures of the same person from its massive database of images scraped from public sources such as social media. Imagine going from a grainy surveillance image taken from a security camera to the LinkedIn profile of the suspect. It is worth pointing out that the services hosting the original images, including Facebook, were none too happy about the unauthorized scraping. Nor was there any consent from users to participate in this AI experiment; as with all things social-media, privacy is just an afterthought.

Aside from the blatant disregard for privacy, what could go wrong here?
The NYT article already hints at one troubling dimension of the problem. While investigating ClearView, the NYT journalist asked various members of police departments with authorized access to the system to search for himself. This experiment initially turned up several hits as expected, demonstrating the coverage of the system. But halfway through the experiment, something strange happened: suddenly the author “disappeared” from the system with no information returned on subsequent searches, even when using the same image successfully matched before. No satisfactory explanation for this came forward. At first it was chalked up to a deliberate “security feature” where the system detects and blocks unusual patterns of queries, presumably the same image being searched repeatedly. Later the founder claimed it was a bug, and it was eventually resolved. (Reading between the lines suggests a more conspiratorial interpretation: ClearView got wind of a journalist writing an expose about the company and decided to remove some evidence that demonstrates the uncanny coverage of its database.)

Going with Hanlon’s razor and attributing this case of the “disappearing” person to an ordinary bug, the episode highlights two troubling issues:

  • ClearView learns which individuals are being searched
  • ClearView controls the results returned

Why is this problematic? Let’s start with the visibility issue, which is practically unavoidable. It means that a private company effectively knows who is under investigation by law enforcement and in which jurisdiction. Imagine if every police department CCed Facebook every time they sent an email to announce that they are opening an investigation into citizen John Smith. That is a massive amount of trust placed in a private entity that is neither accountable to public oversight nor constrained in what it can do with that information.

Granted there are other situations when private companies are necessarily privy to ongoing investigations. Telcos have been servicing wiretaps and pen-registers for decades and more recently ISPs have been tapped as a treasure trove of information on the web-browsing history of their subscribers. But as the NYT article makes clear, ClearView is no Facebook or AT&T. Large companies like Facebook, Google and Microsoft receive thousands of subpoenas every year for customer information, and have developed procedures over time for compartmentalizing the existence of these requests. (For the most sensitive category of requests such as National Security Letters and FISA warrants, there are even more restrictive  procedures.) Are there comparable internal controls at ClearView? Does every employee have access to this information stream? What happens when one of those employees or one of their friends becomes the subject of an investigation?

For that matter, what prevents ClearView from capitalizing on its visibility into law-enforcement requests and trying to monetize both sides of the equation? What prevents the company from offering an “advance warning” service— for a fee of course— to alert individuals whenever they are being investigated?

Even if one posits that ClearView will act in an aboveboard manner and refrain from abusing its visibility into ongoing investigations for commercial gain, there is the question of operational security. Real-time knowledge of law enforcement actions is too tempting a target for criminals and nation states alike to pass up. What happens when ClearView is breached by the Russian mob or an APT group working on behalf of China? One can imagine face-recognition systems also being applied to counter-intelligence scenarios to track foreign agents operating on US soil. If you are the nation sponsoring those agents, you want to know when their names come under scrutiny. More importantly you care whether it is the Poughkeepsie police department or the FBI asking the questions.

Being able to modify search results has equally troubling implications. It is a small leap from alerting someone that they are under investigation to withholding results or better yet, deliberately returning bogus information to throw off an investigation or frame an innocent person. The statistical nature of face-recognition and incompleteness of a database cobbled together from public sources makes it much easier to hide such deception. According to the Times, ClearView returns a match only about 75% of the time. (The article did not cite a figure for the false-positive rate, where the system returns results which are later proven to be incorrect.) Results withheld on purpose to protect designated individuals can easily blend in with legitimate failures to identify a face. Similarly ClearView could offer “immunity from face recognition” under the guise of Right To Be Forgotten requests, offering to delete all information about a person from their database— again for a fee presumably.

As before, even if ClearView avoids such dubious business models and remains dedicated to maintaining the integrity of its database, attackers who breach ClearView infrastructure can not be expected to have similar qualms. A few tweaks to metadata in the database could be enough to skew results. Not to mention that a successful breach is not necessary to poison the database to begin with: Facebook and LinkedIn are full of fake accounts with bogus information. Criminals almost certainly have been building such fake online personae by mashing bits of “true” information from different individuals.

This is a situation where ClearView spouting bromides about the importance of privacy and security will not cut it. Private enterprises afforded this much visibility into active police investigations and with this much influence over the outcome of those investigations need oversight. At a minimum companies like ClearView must be prevented from exploiting their privileged role for anything other than the stated purpose— aiding US law enforcement agencies. They need periodic independent audits to verify that sufficient security controls exist to prevent unauthorized parties from tapping into the sensitive information they are sitting on or subverting the integrity of results returned.



Filecoin, StorJ and the problem with decentralized storage (part II)

[continued from part I]

Quantifying reliability

Smart-contracts can not magically prevent hardware failures or compel a service provider at gun point to perform the advertised services. At best blockchains can facilitate contractual arrangements with a fairness criterion: the service provider gets paid if and only if they deliver the goods. Proofs-of-storage verified by a decentralized storage chain are an example of that model. They keep service providers honest by making their revenue contingent on living up to the stated promise of storing customer data. But as the saying goes, past performance is no guarantee of future results. A service provider can produce the requisite proofs 99 times and then report all data is lost when it is time for the next one. This can happen because of an “honest” mistake or, more troubling, because it is more profitable for decentralized providers to break existing contracts.

When it comes to honest mistakes (random hardware failures resulting in unrecoverable data loss) the guarantees that can be provided by decentralized storage alone are slightly weaker. This follows from a limitation of existing decentralized designs: their inability to express the reliability of storage systems, except in the most rudimentary ways. All storage systems are subject to risks of hardware failure and data loss. That goes for AWS and Google. For all the sophistication of their custom-designed hardware they are still subject to the laws of physics. There is still a mean-time-to-failure associated with every component. It follows that there must be a design in place to cope with those failures across the board, ranging from making regular backups to having diesel generators ready to kick in when the grid power fails. We take for granted the existence of this massive infrastructure behind the scenes when dealing with the likes of Amazon. There is no such guarantee for a random counterparty on the blockchain.

Filecoin uses a proof-of-replication intended to show that not only does the storage provider have the data but they have multiple copies. (Ironically that involves introducing even more work on the storage provider to format data for storage— otherwise they can fool the test by re-encrypting one copy into multiple replicas when necessary— further pushing the economics away from the allegedly zero marginal cost.) That may seem comparable to the redundancy of AWS but it is not. Five disks sitting in the same basement hooked up to the same PC can claim “5-way replication.” But it is not meaningful redundancy because all five copies are subject to correlated risk, one lightning-strike or ransomware infection away from total data loss. By comparison Google operates data-centers around the world and can afford to put  each of those five copies in a different facility. Each one of those facilities still has a non-zero chance of burning to the ground or losing power during a natural disaster. As long as the locations are far enough from each other, those risks are largely uncorrelated. That key distinction is lost in the primitive notion of “replication” expressed by smart-contracts.
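To put rough numbers on the difference between correlated and uncorrelated replicas, here is a minimal sketch. The 1% annual chance of a site-wide disaster is a made-up figure for illustration, and the model deliberately ignores individual disk failures and partial correlation between distant sites.

```python
def loss_probability_colocated(p_site: float, copies: int) -> float:
    """All replicas share one location, so a single site-wide event
    (fire, lightning strike, ransomware) destroys every copy at once.
    The number of copies does not help against that event."""
    return p_site


def loss_probability_independent(p_site: float, copies: int) -> float:
    """Each replica sits in a separate facility whose failures are assumed
    independent, so data is lost only if every site fails."""
    return p_site ** copies


# Hypothetical figure: 1% chance per year of a site-wide disaster.
p = 0.01
print("5 disks in one basement:", loss_probability_colocated(p, 5))
print("5 copies in 5 facilities:", loss_probability_independent(p, 5))
```

Under these assumptions both setups can truthfully advertise "5-way replication", yet the chance of total loss differs by roughly eight orders of magnitude.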

Unreliable by design

Reliability questions aside, there is a more troubling problem with the economics of decentralized storage. It may well be the most rational— read: profitable— strategy to operate an unreliable service deliberately designed to lose customer data. Here are two hypothetical examples to demonstrate the notion that on a blockchain, there is no success like failure.

Consider a storage system designed to store data and publish regular proofs of storage as promised, but with one catch: it would never return that data if the customer actually requested it. (From the customer perspective: you have backups but unbeknownst to you, they are unrecoverable.) Why would this design be more profitable? Because streaming a terabyte back to the customer is dominated by an entirely different type of operational expense than storing that terabyte in the first place: network bandwidth. It may well be profitable to set up a data storage operation in the middle-of-nowhere with cheap real-estate, abundant power but expensive bandwidth. Keeping data in storage while publishing the occasional proof involves very little bandwidth, because proof-of-storage protocols are very efficient in space. The only problem comes up if the customer actually wants their entire data streamed back. At that point a different cost structure involving network bandwidth comes into play and it may well be more profitable to walk away.

To make this more concrete: at the time of writing AWS charges ~1¢ per gigabyte per month for “infrequently accessed” data but 9¢ per gigabyte of data outbound over the network. Conveniently inbound traffic is free; uploading data to AWS costs nothing. As long as the prevailing Filecoin market price is higher than S3 prices, one can operate a Filecoin storage miner on AWS to arbitrage the difference; this is exactly what Dropbox used to do before figuring out how to operate its own datacenter. The only problem with this model is if the customer comes calling for their data too early or too often. In that case the bandwidth costs may well disrupt the profitability equation. If streaming the data back would lose money overall on that contract, the rational choice is to walk away.
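The arithmetic behind that decision can be sketched as follows. The two AWS prices are the approximate figures quoted above; the market price of 1.5¢ per gigabyte per month, the 1 TB contract size and the assumption that retrieval is not billed separately are all hypothetical simplifications.

```python
STORAGE_COST_PER_GB_MONTH = 0.01  # ~1 cent, approximate figure from the text
EGRESS_COST_PER_GB = 0.09         # 9 cents per GB outbound, figure from the text


def contract_profit(gb: float, months: int, price_per_gb_month: float,
                    full_retrievals: int) -> float:
    """Profit in dollars for a provider reselling cloud storage.

    Assumes the customer pays only for storage and that each retrieval
    streams the entire data set back (a simplification).
    """
    revenue = gb * months * price_per_gb_month
    storage_cost = gb * months * STORAGE_COST_PER_GB_MONTH
    egress_cost = gb * full_retrievals * EGRESS_COST_PER_GB
    return revenue - storage_cost - egress_cost


# Hypothetical contract: 1000 GB for 12 months at 1.5 cents per GB per month.
print("never retrieved:", contract_profit(1000, 12, 0.015, 0))
print("retrieved once: ", contract_profit(1000, 12, 0.015, 1))
```

With these numbers a single full retrieval turns a profit of about $60 into a loss of about $30, which is the point at which walking away becomes the rational move.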

Walking away from the contract for profit

Recall that storage providers are paid in funny money, namely the utility token associated with the blockchain. That currency is unlikely to work for purchasing anything in the real world and must be converted into dollars, euros or some other unit of measure accepted by the utility company to keep the datacenter lights on. That conversion in turn hinges on a volatile exchange rate. While there are reasonably mature markets in major cryptocurrencies such as Bitcoin and Ethereum, the tail-end of the ICO landscape is characterized by thin order-books and highly speculative trading. Against the backdrop of what amounts to an extreme version of FX risk, the service provider enters into a contract to store data for an extended period of time, effectively betting that the economics will work out. It need not be profitable today but perhaps it is projected to become profitable in the near future based on rosy forecasts of prices going to the moon. What happens if that bet proves incorrect? Again the rational choice is to walk away from the contract and drop all customer data.

For that matter, what happens when a better opportunity comes along? Suppose the exchange rate is stable or those risks are managed using a stablecoin while the market value of storage increases. Buyers are willing to pay more of the native currency per byte of data stashed away. Or another blockchain comes along, promising more profitable utilization of spare disk capacity. That may seem like great news for the storage provider except for one problem: they are stuck with existing customers paying lower rates negotiated earlier. The optimal choice is to renege on those commitments: delete existing customer data and reallocate the scarce space to higher-paying customers.

It is not clear if blockchain incentives can be tweaked to discourage this without creating unfavorable dynamics for honest service providers. Suppose we impose penalties on providers for abandoning storage contracts midway. These penalties can not be clawback provisions for past payments. The provider may well have already spent that money to cover operational expenses. For the same reason, it is not feasible to withhold payment until the very end of the contract period, without creating the risk that the buyer may walk away. Another option is requiring service providers to put up a surety bond. Before they are allowed to participate in the ecosystem, they must set aside a lump sum on the blockchain held in escrow. These funds would be used to compensate any customers harmed by failure to honor storage contracts. But this has the effect of creating additional barriers to entry and locking away capital in a very unproductive way. Similarly the idea of taking monetary damages out of future earnings does not work. It seems plausible that if a service provider screws over Alice because Bob offered a better price, recurring fees paid by Bob should be diverted to compensate Alice. But the service provider can trivially circumvent that penalty while still doing business with Bob: just start over with a new identity completely unlinkable to that “other” provider who screwed over Alice. To paraphrase the New Yorker cartoon on identity: on the blockchain nobody knows you are a crook.

Reputation revisited

Readers may object: surely such an operation will go out of business once the market recognizes their modus operandi and no one is willing to entrust them with storing data? Aside from the fact that lack of identity on a blockchain renders it meaningless to go out-of-business, this posits there is such a thing as “reputation” buyers take into account when making decisions. The whole point of operating a storage market on chain is to allow customers to select the lowest bidder while relying on the smart-contract logic to guarantee that both sides hold up their side of the bargain. But if we are to invoke some fuzzy notion of reputation as selection criteria for service providers, why bother with a blockchain? Amazon, MSFT and Google have stellar reputations in delivering high-reliability, low-cost storage with  no horror stories of customers randomly getting ripped-off because Google decided one day it would be more profitable to drop all of their files. Not to mention, legions of plaintiffs’ attorneys would be having a field day with any US company that reneges on contracts in such cavalier fashion, assuming a newly awakened FTC does not get in on the action first. There is no reason to accept the inefficiencies of a blockchain or invoke elaborate cryptographic proofs if reputation is a good enough proxy.



Filecoin, StorJ and the problem with decentralized storage (part I)

Blockchains for everything

Decentralized storage services such as Filecoin and StorJ seek to disrupt the data-storage industry, using blockchain tokens to create a competitive marketplace that can offer more space at lower cost. They also promise to bring a veneer of legitimacy to the Initial Coin Offering (ICO) space. At a time when ICOs were being mass-produced as thinly-veiled, speculative investment vehicles likely to run afoul of the Howey test as unregistered securities, file-storage looks like a shining example of an actual utility token, one that has some utility. Instead of betting on the “greater fool” theory of offloading the token on the next person willing to pay a higher price, these tokens are good for a useful service: paying someone else to store your backups. This blog post looks at some caveats and overlooked problems in the design.

Red-herring: privacy

A good starting point is to dispel the alleged privacy advantage. Decentralized storage systems often tout their privacy advantage: data is stored encrypted by its owner, such that the storage provider can not read it even if they wanted to. That may seem like an improvement over the current low bar, which relies on service providers swearing on a stack of pre-IPO shares that, pinky-promise, they will never dip into customer data for business advantage, a promise more often honored in the breach as the examples of Facebook and Google repeatedly demonstrate. But there is no reason to fundamentally alter the data-storage model to achieve E2E security against rogue providers. While far from being the path of least resistance, there is a long history of alternative remote backup services such as tarsnap for privacy-conscious users. (All 17 of them.) Previous blog posts here have demonstrated that it is possible to implement bring-your-own-encryption with vanilla cloud storage services such as AWS such that the cloud service is a glorified remote drive storing random noise it can not make sense of. These models are far more flexible than the arbitrary, one-size-fits-all encryption model hard-coded into protocols such as StorJ. Users are free to adopt their preferred scheme, compatible with their existing key management model. For example with AWS Storage Gateway, Linux users can treat cloud storage as an iSCSI volume with LUKS encryption while those on Windows can apply Bitlocker-To-Go to protect that volume exactly as they would encrypt a USB thumb-drive. Backing up rarely accessed data in an enterprise is even easier: nothing fancier than scripts to GPG-sign and encrypt backups before uploading them to AWS/Azure/GCP is necessary.
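As a rough illustration of how little is involved in that last option, here is a sketch of the sign-and-encrypt step. The recipient address and file name are placeholders, the script assumes GnuPG is installed with the relevant keys already in the keyring, and the upload itself is left to whatever tool the chosen cloud provider offers.

```python
import subprocess
from pathlib import Path


def gpg_encrypt_command(archive: Path, recipient: str) -> list[str]:
    """Build the gpg command line that signs and encrypts a backup archive.

    The output is written next to the input with a .gpg suffix.
    """
    output = archive.with_name(archive.name + ".gpg")
    return [
        "gpg",
        "--sign",                  # sign with the default secret key
        "--encrypt",               # encrypt to the recipient's public key
        "--recipient", recipient,
        "--output", str(output),
        str(archive),
    ]


def encrypt_backup(archive: Path, recipient: str) -> Path:
    """Run gpg and return the path of the encrypted file.

    Only the returned file should ever be uploaded; the storage
    provider never sees the plaintext archive.
    """
    cmd = gpg_encrypt_command(archive, recipient)
    subprocess.run(cmd, check=True)
    return Path(cmd[-2])


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(gpg_encrypt_command(Path("backup.tar"), "backups@example.com")))
```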

Facing the competition

Once we accept the premise that privacy alone can not be a differentiator for backup services—users can already solve that problem without depending on the service provider—the competitive landscape reverts to that of a commodity service. Roughly speaking, providers compete on three dimensions: reliability, cost and speed.

  • Cost is the price paid for storing each gigabyte of data for a given period of time.
  • Speed refers to how quickly that data can be downloaded when necessary and to a lesser extent, how quickly it can be uploaded during the backup process.
  • Reliability is the probability of being able to get all of your data back whenever you need it. A company that retains 99.999% of customer data while irreversibly losing the remaining 0.001% will not stay in business long. Even a 100% retention rate is not great if the service only operates from 9AM-4PM.

The economic argument against decentralized storage can be stated this way: it is very unlikely that a decentralized storage market can offer an alternative that can compete against centralized providers— AWS, Google, Azure— when measured on any of these dimensions. (Of course nothing prevents Amazon or MSFT from participating in the decentralized marketplace to sell storage, but this would be another example of doing with increased friction something on a blockchain that can be done much more efficiently via existing channels.)

Among the three criteria, cost is the easiest one to forecast. Here is the pitch from the StorJ website:

“Have unused hard drive capacity and bandwidth?
Storj pays you for your unused hard drive capacity and bandwidth in STORJ tokens!”

Cloud services are ruled by ruthless economies of scale. This is where Amazon, Google, MSFT and a host of other cloud providers shine, reaping the benefits of investment in data-centers and petabytes of storage capacity. Even if we ignore the question of reliability, it is very unlikely that the hobbyist with a few spare drives sitting in their basement can have a lower per-gigabyte cost.

The standard response to this criticism is pointing out that decentralized storage can unlock spare, unused capacity at zero marginal cost. Returning to our hypothetical hobbyist, he need not add new capacity to compete with AWS. Let us assume he owns excess storage, already paid for, that sits underutilized; there is only so much space you can take up with vacation pictures. Disks consume about the same energy whether they are 99% or 1% full. Since the user is currently getting paid exactly $0 for that spare capacity, any value above zero is a good deal, according to this logic. In that case, any non-zero price point is achievable, including one that undercuts even the most cost-effective cloud provider. Our hobbyist can temporarily boot up those ancient PCs, stash away data someone on the other side of the world is willing to pay to safeguard and shut down the computer once the backups are written. The equipment remains unplugged from the wall until such time as the buyer comes calling for their data.

Proof-of-storage and cost of storage

The problem with this model is that decentralized storage demands much more than mere inert storage of bits. Such systems must achieve reliability in the absence of the usual contractual relationship, namely, someone you can sue for damages if the data disappears. Instead the blockchain itself must enforce fairness in the transaction: the service provider gets paid only if they are actually storing the data entrusted for safeguarding. Otherwise the provider could pocket the payment, discard uploaded data and put that precious disk space to some other use. Solving this problem requires a cryptographic technique called proofs-of-data-possession (PDP) or alternatively proofs-of-storage. Providers periodically run a specific computation over the data they promised to store (a computation that is only possible if they still have 100% of that data) and publish the results on the blockchain, which in turn facilitates payment conditional on periodic proofs. Because the data-owner can observe these proofs, they are assured that their precious data is still around. The key property is that the owner does not need access to the original file to check correctness: only a small “fingerprint” of the uploaded data is retained. That in a nutshell is the point of proof-of-storage; if the owner needed access to the entire dataset to verify the calculation, it would defeat the point of outsourcing storage.
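To give a feel for how a small fingerprint can stand in for the whole file, here is a toy sketch based on a Merkle-tree spot check. This is a simpler, swapped-in technique chosen for illustration, not the actual protocol used by Filecoin or StorJ; the tiny block size and all function names are hypothetical. A single challenge also only proves possession of one block, so a real scheme would sample many blocks or use a construction that covers all of the data.

```python
import hashlib
import secrets

BLOCK_SIZE = 4  # unrealistically small, to keep the demo readable


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def build_tree(blocks: list[bytes]) -> list[list[bytes]]:
    """Return every level of the Merkle tree, leaves first, root last.
    Odd-sized levels are padded by duplicating their last node."""
    levels = [[h(b"\x00" + b) for b in blocks]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur.append(cur[-1])
        levels.append([h(b"\x01" + cur[i] + cur[i + 1])
                       for i in range(0, len(cur), 2)])
    return levels


def prove(blocks, levels, index):
    """Provider side: return the challenged block plus sibling hashes."""
    path, i = [], index
    for level in levels[:-1]:
        path.append(level[i ^ 1])
        i //= 2
    return blocks[index], path


def verify(root: bytes, index: int, block: bytes, path: list[bytes]) -> bool:
    """Owner side: recompute the root from the block and sibling hashes."""
    node, i = h(b"\x00" + block), index
    for sibling in path:
        node = h(b"\x01" + node + sibling) if i % 2 == 0 else h(b"\x01" + sibling + node)
        i //= 2
    return node == root


data = b"customer backup contents"
blocks = split_blocks(data)
levels = build_tree(blocks)
root = levels[-1][0]  # the owner keeps only this 32-byte fingerprint

challenge = secrets.randbelow(len(blocks))  # owner picks a random block
block, path = prove(blocks, levels, challenge)
print("proof accepted:", verify(root, challenge, block, path))
```

The owner's state is a single hash regardless of file size, while the provider can only answer an unpredictable challenge by actually reading the stored block.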

While proofs of storage may keep service providers honest, they break one of the assumptions underlying the claimed economic advantage: leveraging idle capacity. Once we demand periodically going over the bits and running cryptographic calculations, the storage architecture can not be an ancient PC unplugged from the wall. There is a non-zero marginal cost to implementing proof-of-storage. In fact there is an inverse relationship between latency and price. Tape archives sitting on a shelf have a much lower cost per gigabyte than spinning disks attached to a server. These tradeoffs are even reflected in the pricing model used by Amazon: AWS offers a storage tier called Glacier which is considerably cheaper than S3 but comes with significant latency, on the order of hours, for accessing data. Requiring periodic proof-of-storage undermines precisely the one model (offline media gathering dust in a vault) that has the best chance of undercutting large-scale centralized providers.

Beyond the economics, there is a more subtle problem with proof-of-storage: knowing your data is there does not mean that you can get it back when needed. This is the subject of the next blog post.



Off-by-one: the curious case of 2047-bit RSA keys

This is the story of an implementation bug discovered while operating an enterprise public-key infrastructure system. It is common in high-security scenarios for private keys to be stored on dedicated cryptographic hardware rather than managed as ordinary files on the commodity operating system. Smart-cards, hardware tokens and TPMs are examples of popular form factors. In this deployment, every employee was issued a USB token designed to connect to old-fashioned USB-A ports. USB tokens have a usability advantage over smart-cards in situations when most employees are using laptops: there is no separate card reader required, eliminating one more piece of gear to lug around. The gadget presents itself to the operating system as a combined card reader with a card always present. Cards on the other hand have an edge for “converged access” scenarios involving both logical and physical access. Dual-interface cards with NFC can also be tapped against badge readers to open doors. (While it is possible to shoe-horn NFC into compact gadgets and this has been done, physical constraints on antenna size all but guarantee poor RF performance. Not to mention one decidedly low-tech but crucial aspect of an identity badge is having enough real-estate for the obligatory photograph and name of the employee spelled out in legible font.)

The first indication of something awry with the type of token used came from a simple utility rejecting the RSA public-key for a team member. That public-key had been part of a pair generated on the token, in keeping with the usual provisioning process that guarantees keys live on the token throughout their entire lifecycle. To recap that sequence:

  • Generate a key-pair of the desired characteristics, in this case 2048-bit RSA. This can be surprisingly slow with RSA, on the order of 30 seconds, considering the hardware in question is powered by relatively modest SoCs under the hood.
  • Sign a certificate signing-request (CSR) containing the public-key. This is commonly done as a single operation at the time of key-generation, due to an implementation quirk: most card standards such as PIV require a certificate present before clients can use the card because they do not have a way to identify private-keys in isolation.
  • Submit that CSR to the enterprise certificate authority to obtain a certificate. In principle certificates can be issued out of thin air. In reality most CA software can only accept a CSR containing the public-key of the subject, signed with the corresponding private key, and will verify that criterion.
  • Load issued certificate on the token. At this point the token is ready for use in any scenario demanding PKI credentials, be it VPN, TLS client authentication in a web-browser, login to the operating system or disk-encryption.

On this particular token, that sequence resulted in a 2047-bit RSA key, one bit off the mark and falling short of the NIST recommendations to boot. A quick glance showed the provisioning process was not at fault. Key generation was executed on Windows using the tried-and-true certreq utility. (Provisioning new credentials is commonly under-specified compared to steady-state usage of existing credentials, and vendors often deign to publish software for Windows only.) That utility takes an INF file as configuration specifying the key type to generate. A quick look at the INF file showed the number 2048 had not bit-rotted into 2047.

Something else lower in the stack was ignoring those instructions or failing to generate keys according to the specifications. Looking through other public-keys in the system showed that this was not in fact an isolated case. The culprit appeared to be the key-generation logic on the card itself.

Recall that when we speak of a “2048-bit RSA key” the count refers to the size of the modulus. An RSA modulus is the product of two large primes of comparable size. Generating a 2048-bit RSA key is then done by generating two random primes of half that size, 1024 bits each, and multiplying those two values together.

There is one catch with this logic: the product of two numbers N bits in size each is not guaranteed to be 2·N bits long. That intuitive-sounding 2·N result is an upper-bound: the actual product can be 2·N or 2·N-1 bits. Here is an example involving tractable numbers and the more familiar decimal notation. It takes two digits to express the prime numbers 19 and 29. But multiplying them we get 19 * 29 = 551, a number spanning three digits instead of four. By contrast the product of two-digit primes 37 and 59 is 2183, which is four digits as expected.
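The same effect is easy to reproduce in binary. The snippet below is a quick illustration; the 1024-bit values are simply the smallest and largest numbers of that length rather than actual primes, which is enough to show both ends of the range.

```python
def product_bits(p: int, q: int) -> int:
    """Number of bits needed to represent the product p * q."""
    return (p * q).bit_length()


# Decimal example from the text: two-digit factors, three- or four-digit product.
print(19 * 29, len(str(19 * 29)))   # 551 has 3 digits
print(37 * 59, len(str(37 * 59)))   # 2183 has 4 digits

# Binary analogue with 4-bit numbers.
print(product_bits(8, 9))     # 72 needs only 7 bits
print(product_bits(15, 15))   # 225 needs the full 8 bits

# At RSA scale: extremes of the 1024-bit range (not primes, for illustration).
smallest = 1 << 1023
largest = (1 << 1024) - 1
print(product_bits(smallest, smallest))  # 2047
print(product_bits(largest, largest))    # 2048
```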

Informally, we can say not all N-bit numbers are alike. Some are “small,” meaning they are close to the lower bound of 2^(N-1), the smallest possible N-bit number. At the other end of the spectrum are “large” N-bit numbers, closer to the high end of the permissible range at 2^N. Multiplying large N-bit numbers produces the expected 2N-bit product, while multiplying small ones can fall short of the goal.

RSA implementations commonly correct for this by setting some of the leading bits of the prime to 1, forcing each generated prime to be “large.” In other words, the random primes are not randomly selected from the full interval [2^(N-1), 2^N - 1] but a tighter interval that excludes “small” primes. (Why not roll the dice on the full interval and check the product after the fact? See earlier point about the time-consuming nature of RSA key generation. Starting over from scratch is expensive.) This is effectively an extension of logic that is already present for prime generation, namely setting the most significant bit to one. Otherwise the naive way to “choose random N-bit prime” by considering the entire interval [0, 2^N - 1] can result in a much shorter prime, one that begins with an unfortunate run of leading zeroes. That guarantees failure: if one of the factors is strictly less than N bits, the final modulus can never hit the target of 2N bits.
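A minimal sketch of that approach follows, using a textbook Miller-Rabin test. It is an illustration of the general technique and not the firmware of any particular token; the function names and the choice of forcing two leading bits are assumptions, and the demo uses small 64-bit primes purely so it runs instantly.

```python
import secrets

SMALL_PRIMES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for small in SMALL_PRIMES:
        if n % small == 0:
            return n == small
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def random_prime(bits: int, top_bits_set: int = 2) -> int:
    """Pick a random prime of exactly `bits` bits whose leading
    `top_bits_set` bits are forced to 1, so the prime is 'large'."""
    mask = ((1 << top_bits_set) - 1) << (bits - top_bits_set)
    while True:
        candidate = secrets.randbits(bits) | mask | 1  # also force odd
        if is_probable_prime(candidate):
            return candidate


p, q = random_prime(64), random_prime(64)
print((p * q).bit_length())  # always 128 when two leading bits are forced
```

With two leading bits set, each prime is at least 1.5 times the smallest N-bit number, so the product is at least 2.25 times the smallest (2N-1)-bit number and always reaches the full 2N bits.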

So we know this token has a design flaw in prime generation that occasionally outputs a 2047-bit modulus when asked for 2048. How occasional? If it were only setting the MSB to one and otherwise picking primes uniformly in the permissible interval, the error rate can be approximated by the probability that two random variables X and Y selected independently and uniformly at random from the range [1, 2] have a product less than 2. Visualized geometrically, this is the area of the region xy < 2 inside the square whose sides span that same interval. That is a standard calculus problem that can be solved by integration. It predicts just under 40% of RSA moduli falling short by one bit (the integral evaluates to 2·ln 2 - 1, roughly 0.386). That fraction is not consistent with the observed frequency which is closer to 1 in 10, an unlikely outcome from a Bayesian perspective if that were an accurate model for what the token is doing. (Note that if two leading bits were forced set on both primes, the error case is completely eliminated. From that perspective, the manufacturer was “off-by-one” according to more than one meaning of the phrase.)
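For readers who want to check that estimate, here is a short sketch comparing the closed-form area with a simulation. It models the primes as uniform real numbers scaled to [1, 2], which is the same approximation used in the text and not an exact statement about the distribution of primes; the trial count and seed are arbitrary.

```python
import math
import random


def shortfall_closed_form() -> float:
    """Area of {(x, y) in [1, 2]^2 : x * y < 2}.

    For each x in [1, 2], y ranges over [1, 2/x), so the area is the
    integral from 1 to 2 of (2/x - 1) dx = 2 ln 2 - 1.
    """
    return 2 * math.log(2) - 1


def shortfall_simulated(trials: int, low: float = 1.0, seed: int = 0) -> float:
    """Fraction of random pairs from [low, 2] whose product is below 2.

    low=1.0 models forcing only the top bit; low=1.5 models forcing the
    top two bits on both primes.
    """
    rng = random.Random(seed)
    hits = sum(
        rng.uniform(low, 2.0) * rng.uniform(low, 2.0) < 2.0
        for _ in range(trials)
    )
    return hits / trials


print(f"closed form:        {shortfall_closed_form():.4f}")
print(f"top bit forced:     {shortfall_simulated(200_000):.4f}")
print(f"top 2 bits forced:  {shortfall_simulated(200_000, low=1.5):.4f}")
```

The last line comes out as zero because any two values of at least 1.5 multiply to at least 2.25, matching the observation that forcing two leading bits eliminates the shortfall.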

So how damaging is this particular quirk to the security of RSA keys? It is certainly embarrassing by virtue of how easy it should have been to catch this during testing. That does not reflect well on the quality assurance. Yet the difficulty of factoring 2047-bit keys is only marginally lower than that for full 2048 bits— which is to say far outside the range of currently known algorithms and computing power available to anyone outside the NSA. (Looked another way, forcing another bit to 1 when generating the primes also reduces the entropy.) Assuming this is the only bug in RSA key generation, there is no reason to throw away these tokens or lash out at the vendor. Also to be clear: this particular token was not susceptible to the ROCA vulnerability that affected all hardware using Infineon chips. In contrast to missing one bit from the generated key, the Infineon RSA library produced full-size keys that were in fact much weaker than they appeared due to the special structure of the primes. That wreaked havoc on many large-scale systems, including the latest generation Yubicrap (after a dubious switch from NXP to Infineon hardware) and the Estonian government electronic ID card system. In fact the irony of ROCA is proving that key length is far from being the only criterion for security. Due to the variable strategies used by Infineon to generate primes at different lengths, customers were better off using shorter RSA keys on the vulnerable hardware:

[Figure 1, taken from the paper “The Return of Coppersmith’s Attack: Practical Factorization of Widely Used RSA Moduli”: estimated worst-case factorization time plotted against key length]

The counter-intuitive nature of ROCA is that the estimated worst-case factorization time (marked by the blue crosses above) does not increase in an orderly manner with key length. Instead there are sharp drops around 1000 and 2000 bits creating a sweet spot for the attack where the cost of recovering keys is drastically lower. Meanwhile the regions shaded yellow and orange correspond to key lengths where the attack is not feasible. To pick one example from the above graph, 1920-bit keys would not have been vulnerable to the factorization scheme described in the paper. Even 1800-bit keys would have been a better choice than the NIST-anointed choice of 2048. While 1800-bit keys were still susceptible to the attack, it would have required too much computing power (note the Y-axis has logarithmic scale) while 2048-bit keys were well within range of factoring with commodity hardware that can be leased from AWS.

It turns out that sometimes it is better to miss the mark by one bit than to hit the target dead-on with the wrong algorithm.


Airbnb reviews and the prisoner’s dilemma

Reputation in the sharing economy

[Full disclosure: this blogger was head of information security for Airbnb 2013-2014]

In a recently published Vice article titled “I Accidentally Uncovered a Nationwide Scam on Airbnb” a journalist goes down the rabbit-hole of tracking down instances of fraud at scale on the popular sharing economy platform. The scam hinges on misrepresentation: unsuspecting guests sign up for one listing based on the photographs, only to be informed minutes before their check-in time about an unforeseen problem with the unit that precludes staying there. Instead the crooks running the scam direct the guests to an allegedly better or more spacious unit also owned by the host. As expected, this bait-and-switch does not turn out very well for the guests, who discover upon arrival that their new lodgings are less than stellar: run-down, unsanitary and in some cases outright dangerous.

First, there is no excuse for the failure to crack down on these crooks. As the headline makes clear, this is not an isolated incident. Multiple guests were tricked by the same crook in exactly the same manner. In an impressive bit of sleuthing, the Vice journalist proceeds to identify multiple listings on the website using staged pictures with the same furniture and reach out to other guests conned by the same perpetrator. (She even succeeds in digging up property records for the building where the guests are routed after their original listing mysteriously becomes unavailable, identifying the owner and his company on LinkedIn.) Airbnb is not a struggling early-stage startup. It has ample financial resources to implement basic quality assurance: every listing must be inspected in person to confirm that its online depiction does not contain materially significant misrepresentations. The funds used to fight against housing ordinances or insult public libraries in San Francisco are better off redirected to combatting fraud or compensating affected customers. Ironically the company exhibited such a high-touch approach in its early days while it was far more constrained in workforce and cash: employees would personally visit hosts around the country to take professional photographs of their listings.

Second, even if one accepts the premise that 100% prevention is not possible— point-in-time inspection does not guarantee the host will continue to maintain the same standards— there is no excuse for the appalling response from customer support. One would expect that guests are fully refunded for the cost of their stay or better yet, that Airbnb customer support can locate alternative lodgings in the same location in real time once guests discover the bait-and-switch. These guests were not staying on some remote island with few options; at least some of the recurring fraud took place in large, metropolitan areas such as Chicago where the platform boasts thousands of listings to choose from. Failing all else, Airbnb can always swallow its pride and book the guest into a hotel. Instead affected guests are asked to navigate a Kafkaesque dispute resolution process to get their money back even for one night of their stay. In one case the company informs the guest that the “host”— in other words, the crooks running this large-scale fraudulent enterprise— has a right to respond before customer support can take action.

Third, the article points to troubling failures of identity verification on the platform, or at least identity misrepresentation. It is one thing for users of social networks to get by with pseudonyms and nicknames. A sharing platform premised on the idea that strangers will be invited into each other’s places of residence is the one place where verified, real-world identity is crucial for deterring misconduct. If there is a listing hosted by “Becky” and “Andrew,” customers have every reason to believe that there are individuals named Becky and Andrew involved with that listing in some capacity. The smiling faces in the picture need not necessarily be the property owners or current lease-holders living there. They could be agents of the owner helping manage the listing or even professional employees at a company that specializes in brokering short-term rentals on Airbnb. But there is every expectation that such individuals exist, along with a phone number where they can be reached— otherwise, what is the point of collecting this information? Instead as the article shows, they appear to be fictitious couples with profile pictures scraped from a stock-photography website. The deception was in plain sight: an Airbnb review from 2012 referred to the individual behind the profile by his true name, not the fabricated couple identity. While there is an argument for using shortened versions, diminutives, middle-names or Anglicized names instead of the “legal” first name printed on official government ID, participants should not be allowed to make arbitrary changes to an existing verified profile.

To be clear: identity verification can not necessarily stop bad actors from joining the platform any more than the receptionist’s perfunctory request for driver’s license stops criminals from staying at hotels. People can and do commit crimes under their true identity. One could argue that Airbnb ought to run a background check on customers and reject those with prior convictions for violent offenses. Aside from being obviously detrimental to the company bottom line and possibly even running afoul of laws against discrimination (not that violating laws has been much of a deterrent for Airbnb) such an approach is difficult to apply globally. It is only for US residents that a wealth of information can be purchased on individuals, conveniently indexed by their social security number. More to the point, there is no “precrime unit” a la The Minority Report for predicting whether an individual with an otherwise spotless record will misbehave in the future once admitted on to the platform.

Far more important is to respond swiftly and decisively once misbehavior is identified, in order to guarantee the miscreants will never be able to join the platform again under some other disguise. At the risk of sounding like the nightmarish social-credit system being imposed in China as an instrument of autocratic control, one could envision a common rating system for the sharing economy: if you are kicked out of Airbnb for defrauding guests, you are also prevented from signing up for Lyft. (Fear not, Uber will likely accept you anyway.) In this case a single perpetrator brazenly operated multiple accounts on the platform, repeatedly bait-and-switching guests over to units in the same building he owned, leaving behind an unmistakable trail of disgruntled guest reviews. Airbnb still could not connect the dots.

The problem with peer-reviews

Finally, and this is the most troubling aspect, the article suggests the incentive system for reviews is not working as intended. In a functioning market, peer reviews elicit honest feedback and accurately represent the reputation of participants. The article points to several instances where guests inconvenienced by fraudulent listings were reluctant to leave negative feedback. Even worse, there were situations when the perpetrators of the scams left scathing reviews full of fabrications for the guests, in an effort to cast doubt on the credibility of the understandably negative reviews those guests were expected to leave.

Incidentally, Airbnb did change its review system around 2014 to better incentivize both parties to provide honest feedback without worrying about what their counterparty will say. Prior to 2014, reviews were made publicly visible as soon as the guest or host provided them in the system. This created a dilemma: both sides were incentivized to wait for the other to complete their review first, so they could adjust their feedback accordingly. For example, if guests are willing to overlook minor issues with the listing, the host may be willing to forgive some of their minor transgressions. But if the guest review consisted of nitpicking about every problem with the listing (“too few coffee mugs— what is wrong with this place?”) the host will be inclined to view guest conduct through an equally harsh lens (“they did not separate the recycling— irresponsible people”). That creates an incentive for providing mostly anodyne, meaningless feedback and avoiding confrontation at all costs. After all, the side that writes a negative review first is at a distinct disadvantage. Their counterparty can write an even harsher response, not only responding to the original criticism but also piling on far more serious criticisms against the author. It also means that reviews may take longer to arrive. When neither side wants to go first, the result is a game-of-chicken between guest & host played against the review deadline.

In the new review model, feedback is hidden until both sides complete their reviews. After that point, it is revealed simultaneously. That means both sides are required to provide feedback independently, without visibility into what their counterparty wrote ahead of time. In theory this elicits more honest reviews— there is no incentive to suppress negative feedback out of a concern that the other side will modify their review in response. (There is still a 30-day deadline to make sure feedback is provided in a timely manner; otherwise either side could permanently hold the reviews hostage.) The situation is similar to the prisoner’s dilemma from game theory: imagine both guest and host having grievances about a particular stay. The optimal outcome from a reputation perspective is one where both sides suppress the negative feedback (“cooperate”) leaving positive reviews, which looks great for everyone— and Airbnb. But if one side defects and leaves a negative review featuring their grievance, the other side will look even worse. Imagine a scenario where the guests say everything was great about the listing and host, while the host claims the guests were terrible people and demands payment from Airbnb for the damage. Even if these charges were fabricated, the guests have lost much of their credibility to counter the false accusations by going on the record with a glowing review about the host. So the stable strategy is to “defect:” include negative feedback in the review, expecting that the counterparty will likewise include their own version of the same grievance.
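To make the game-theoretic argument concrete, here is a small Python sketch that enumerates the pure-strategy equilibria of a two-player review game. The payoff numbers are hypothetical, chosen only to respect the ordering described above (mutual cooperation beats mutual defection, and the lone cooperator fares worst); they are not derived from any Airbnb data.

```python
from itertools import product

COOPERATE, DEFECT = "cooperate", "defect"
STRATEGIES = (COOPERATE, DEFECT)

# Hypothetical reputation payoffs as (guest, host); higher is better.
# "cooperate" = suppress the grievance, "defect" = include it in the review.
PAYOFFS = {
    (COOPERATE, COOPERATE): (3, 3),
    (COOPERATE, DEFECT): (0, 4),
    (DEFECT, COOPERATE): (4, 0),
    (DEFECT, DEFECT): (1, 1),
}

def is_nash_equilibrium(guest: str, host: str) -> bool:
    """True when neither side can improve its own payoff by switching alone."""
    guest_payoff, host_payoff = PAYOFFS[(guest, host)]
    guest_can_improve = any(PAYOFFS[(g, host)][0] > guest_payoff for g in STRATEGIES)
    host_can_improve = any(PAYOFFS[(guest, h)][1] > host_payoff for h in STRATEGIES)
    return not (guest_can_improve or host_can_improve)

def nash_equilibria() -> list:
    """All pure-strategy profiles that are stable against unilateral deviation."""
    return [pair for pair in product(STRATEGIES, repeat=2) if is_nash_equilibrium(*pair)]

if __name__ == "__main__":
    print(nash_equilibria())
```

Under these assumed payoffs the only stable profile is mutual defection, even though both sides would be better off with mutual cooperation.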

But game-theoretical outcomes are only observed in the real world when participants follow the optimal strategies expected of rational agents. Decades of behavioral economics research suggest that actual choices made by humans can deviate significantly from that ideal. The Vice article quotes guests who were reluctant to leave negative reviews about the fraudulent hosts even after their decidedly unhappy experiences. This is not surprising either. There are other considerations that go into providing feedback beyond fear of retaliation. For example there are social norms against harshly criticizing other people; recall that all reviews are visible on Airbnb. Other users can look up a prospective guest and observe that he/she has been providing 1-star reviews to all of their hosts. In the absence of such constraints, the game-theoretical conclusion would be taken to an extreme where both sides write the most negative review possible, constrained only by another social norm against making false statements.

Either way, the incentive structure for reviews clearly needs some tweaks to elicit accurate feedback.


Update: Airbnb announced that the company will be manually verifying all 7 million of their listings.

Cloud storage with end-to-end encryption: AWS Storage Gateway (part III)

[continued from part II]

The final post in this series pushes the threat model into tinfoil-hat territory: we posit Amazon going rogue or being compelled to recover private, encrypted data associated with users of the AWS Storage Gateway service.

Data corruption: attacks against integrity

An obvious avenue for Amazon to attack customers is by modifying gateway software to return corrupted data. As noted earlier, full-disk encryption schemes operate under a strict constraint that 1 block of plaintext must encrypt to exactly 1 block of ciphertext. No expansion of data is allowed. This rules out use of integrity checks to detect corruption of ciphertext. Every block returned by the gateway will decrypt to something; that could be the original data stored by the customer— or not. The most FDE can guarantee is that without the encryption key, Amazon can not return a ciphertext crafted to yield some plaintext of their choosing.
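The "decrypts to something" property can be illustrated with a toy length-preserving sector cipher. The Python sketch below is an assumption-laden stand-in: a four-round Feistel construction built from SHAKE-256, used here only because it fits in the standard library. It is not what LUKS or Bitlocker do and must not be used to protect real data. All names and parameters are illustrative.

```python
import hashlib

SECTOR_SIZE = 512
HALF = SECTOR_SIZE // 2
ROUNDS = 4

def _round_function(key: bytes, round_index: int, sector_number: int, half: bytes) -> bytes:
    """Keyed pseudo-random function returning HALF bytes (toy construction)."""
    material = key + bytes([round_index]) + sector_number.to_bytes(8, "big") + half
    return hashlib.shake_256(material).digest(HALF)

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_sector(key: bytes, sector_number: int, plaintext: bytes) -> bytes:
    """Encrypt one sector; output is exactly as long as the input (no room for a MAC)."""
    assert len(plaintext) == SECTOR_SIZE
    left, right = plaintext[:HALF], plaintext[HALF:]
    for i in range(ROUNDS):
        left, right = right, _xor(left, _round_function(key, i, sector_number, right))
    return left + right

def decrypt_sector(key: bytes, sector_number: int, ciphertext: bytes) -> bytes:
    """Invert encrypt_sector; never fails, whatever ciphertext it is handed."""
    assert len(ciphertext) == SECTOR_SIZE
    left, right = ciphertext[:HALF], ciphertext[HALF:]
    for i in reversed(range(ROUNDS)):
        left, right = _xor(right, _round_function(key, i, sector_number, left)), left
    return left + right

if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for the demo
    plaintext = (b"quarterly totals: 1234.56\n" * 20)[:SECTOR_SIZE]
    ciphertext = encrypt_sector(key, 7, plaintext)

    tampered = bytearray(ciphertext)
    tampered[0] ^= 0x01  # the storage provider flips a single bit
    recovered = decrypt_sector(key, 7, bytes(tampered))

    print("same length:", len(ciphertext) == len(plaintext))
    print("tampering detected by decryption: no exception was raised")
    print("plaintext intact:", recovered == plaintext)
```

Decryption of the tampered sector completes without error and yields 512 bytes of unrelated data; nothing in the format itself signals that corruption took place.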

What can be achieved in this limited attacker model is highly dependent on content type. Scrambling a few pixels in a picture or ruining the beat for a few seconds in a song is unlikely to ruin anyone’s day. On the other hand, randomly changing figures in a spreadsheet used for financial reporting may have downstream implications— although such files have additional structure that will likely break when a 512-byte sector is perturbed. If remote storage is used for storing executable files, worst-case scenario is arbitrary code execution. Yet achieving that requires a level of control over data corruption that is unlikely with FDE. Replacing a block of code with random x86 instructions will result in a crash with high probability. 

More interesting potential targets are files holding configuration data which influence how software behaves. Again it is possible to come up with contrived examples where even a random change could alter system security level. For example consider a binary format where one field is assumed to hold a boolean flag determining whether a dangerous feature such as macros or debugging is enabled. A value of zero indicates off, any other value stands for on. With very high probability any change to ciphertext will flip this flag from “off” to “on,” enabling a potentially exploitable feature. (Granted exploiting that is still non-trivial: attacker would have to know the exact location of the file on disk along with offset of that critical field inside the file. Recall that the concept of a “file” does not exist at iSCSI layer: cloud storage provider only observes requests for reading and writing blocks containing encrypted data. Metadata including file names is not visible, although some properties such as existence of a file spanning particular blocks could be inferred from access patterns.)
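How high is "very high probability"? A rough Python sketch follows, under the simplifying assumption that a corrupted block decrypts to uniformly random bytes, so that the flag field ends up holding a random value. The function names and field widths are hypothetical.

```python
import random

def probability_flag_enabled(field_bytes: int = 1) -> float:
    """Chance that a uniformly random field of the given width is non-zero ("on")."""
    return 1 - 256.0 ** -field_bytes

def simulate_flag(trials: int, field_bytes: int = 1, seed: int = 0) -> float:
    """Empirical fraction of random field values that read as "on"."""
    rng = random.Random(seed)
    enabled = sum(1 for _ in range(trials) if rng.getrandbits(8 * field_bytes) != 0)
    return enabled / trials

if __name__ == "__main__":
    print(f"1-byte flag, exact:     {probability_flag_enabled(1):.6f}")
    print(f"1-byte flag, simulated: {simulate_flag(100_000):.6f}")
    print(f"4-byte flag, exact:     {probability_flag_enabled(4):.12f}")
```

For a single-byte flag the exact figure is 255/256; wider fields push it even closer to certainty.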

Lateral movement: attacks against customer systems

A more promising attack strategy is Amazon pushing out a malicious update to the gateway VM that seeks to escalate privileges outside the gateway itself, taking over other systems in the customer environment. Objective: discover the secret key used for full-disk encryption by compromising the system that applies FDE over the raw block device. That attack vector exists even if a TPM or smart-card is used for key management; at the end of the day there is a symmetric key released to the FDE implementation, regardless of whether that secret was originally wrapped by a password typed by the user or cryptographic hardware.

Considering the gateway VM as a malicious system seeking lateral movement, we find that it has few resources available. These VMs require very little integration with the surrounding environment: they do not need to be joined to an Active Directory, they do not SSH into any other customer system or access resources from a file share. In fact they can be isolated from the environment with the exception of inbound iSCSI ports and external communication to AWS. Focusing on inbound connections, only a handful of other customer systems interact with the VM, and only over the iSCSI interface. It would take a remote-code execution bug in the iSCSI initiator to allow lateral movement, by having the iSCSI target return malicious iSCSI responses to the vulnerable initiator.

A more likely target is the physical machine hosting the gateway VM. While not common, vulnerabilities are periodically discovered in hypervisors that would permit guest VMs to “break out” and take over the host. (It is also a safe assumption that Amazon could conduct reconnaissance to identify which hypervisor their customer is running and choose an appropriate exploit, not wasting a VMware 0-day on a customer running Hyper-V.) This risk is much higher for the simple deployment model, where the gateway VM is colocated with the second machine (virtual or physical) that implements full-disk encryption. In that scenario, a full VM escape may not even be required: micro-architectural side channel attacks could permit AWS to recover an AES key or plaintext lying around in memory from computations performed by the colocated FDE implementation. Hosting the VM on dedicated hardware physically separated from the machine that implements full disk encryption mitigates this risk.

One deterring factor against these tactics, in addition to their long odds of success, is that they would be very noisy. Deploying a malicious update or serving a tampered iSCSI volume to the gateway VM leaves trails on local disk. Amazon can not rule out the possibility that the customer is making periodic backups of their VM or copying remote files to another local device outside Amazon control. That would result in the customer having a permanent record of an attempted attack for future forensics.

Economics: cost of privacy

To summarize: AWS Storage Gateway provides a pragmatic model for using cloud-storage services for private storage— as a glorified drive storing encrypted bits, with no way for the service provider to decrypt those bits. This is in marked contrast to the standard model of cloud storage where providers have full visibility into customer data, and either promise not to peek or only peek for specific business objectives such as advertising. As with all DIY projects, there are tradeoffs to achieving that level of privacy. From a cost perspective, there is no charge for using a gateway appliance per se. Instead AWS charges standard S3 storage rates for the space used, along with standard EC2 rates for outbound data transfer from AWS cloud to the gateway. Inbound bandwidth from the gateway appliance to the cloud for backing up modified data is free, but there is a per-GB charge for new data written to the volume. To pick one data point, keeping 10TB of private storage of which 10% gets modified every month would cost about two to three times more compared to a standard consumer offering such as Google Drive, depending on AWS region since Amazon prices vary geographically. That does not include the cost for hardware and operational expenses for running the gateway appliance itself. That incremental expense may vary from nearly zero— when adding one more VM to an existing load that already runs 24/7— to significant if new hardware must be acquired for hosting the gateway appliance.
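The fee structure described above reduces to three per-GB terms, sketched below in Python. The `Rates` class and its field names are hypothetical, and the numbers in the example are placeholders chosen for readability, not quoted AWS prices; actual rates vary by region and should be taken from the current price list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rates:
    """Per-GB prices supplied by the caller (hypothetical structure)."""
    storage_per_gb_month: float  # S3 storage for space in use
    write_per_gb: float          # charge for new data written to the volume
    egress_per_gb: float         # outbound transfer from the cloud to the gateway

def monthly_cost(stored_gb: float, written_gb: float, downloaded_gb: float, rates: Rates) -> float:
    """Sum of the three usage-based charges; the gateway appliance itself is free."""
    return (
        stored_gb * rates.storage_per_gb_month
        + written_gb * rates.write_per_gb
        + downloaded_gb * rates.egress_per_gb
    )

if __name__ == "__main__":
    # Placeholder rates for illustration only; NOT actual AWS pricing.
    placeholder = Rates(storage_per_gb_month=1.0, write_per_gb=2.0, egress_per_gb=3.0)
    # Scenario from the text: 10 TB stored, 10% rewritten each month.
    # Assumes no data is pulled back down from the cloud that month.
    print(monthly_cost(stored_gb=10_000, written_gb=1_000, downloaded_gb=0, rates=placeholder))
```

Hardware and operating costs for the machine hosting the gateway are deliberately left out, matching the caveat in the paragraph above.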

Unlike other DIY projects such as hosting your own VPN service where AWS is less cost-effective than alternative hosting services, in this scenario AWS fares much better competitively. Cost of self-hosted VPN is largely dominated by bandwidth, where the AWS fee structure for charging by GB quickly loses against flat pricing models. But when it comes to using the cloud as a glorified disk drive storing bits, the specialized nature of AWS Storage Gateway has an edge. Replicating the same solution by attaching large SSD storage to a leased virtual server would neither be cost-effective nor achieve a level of redundancy comparable to leveraging AWS.


Cloud storage with end-to-end encryption: AWS Storage Gateway (part II)

[continued from part I]

Greater flexibility for cloud storage can be achieved by deploying both the iSCSI target and the iSCSI initiator on one dedicated device on the local network. This could be a small form-factor PC running a hypervisor or even a dedicated NAS device with virtualization support. The initiator would run a standard Windows or Linux image, encapsulating the remote storage and presenting a simpler interface such as network file share. For example, in the case of MSFT environments the VM could represent the iSCSI volume as an ordinary SMB share that other devices on the network can remotely mount and use. In this case read/write access to the underlying device is managed by the initiator VM. That second VM becomes responsible for handling concurrent access, specifically to avoid parallel writes from different clients clobbering the file-system. It would also present additional access controls based on Windows authentication at the level of individual files. (“File share” is a misnomer on Windows since the unit of storage shared is a directory or even entire drive.) This allows further compartmentalizing data: for example the media-player can be granted access to the directory containing the MP3 collection but not an adjacent directory that contains backups of financial information. It’s worth pointing out that all of these abstractions— including that of individual “files” with permissions— only exist on the initiator VM. As far as the gateway and AWS are concerned, there is just one flat volume containing seemingly random bits; one of the properties of an “ideal” cipher is that its output is indistinguishable from random data.

This model allows multiple devices to use cloud storage concurrently with one caveat: they must have network access to the device hosting the initiator/target VM combination. That still falls short of the “access your files from anywhere” goal already achieved by popular cloud storage services. For example if the VMs are hosted on a home network, those files remain inaccessible when the user is traveling, unless they have the capability to VPN into their home network. This restriction is less significant in an enterprise scenario since large enterprises typically have an internal network perimeter along with VPN services for employees to connect. (It may even be considered a “feature” that inadvertently implements a control popular with IT auditors: corporate data is only accessible from trusted networks.) There are two workarounds, both kludgy:

  • Mount cloud storage as read-only from a local gateway VM that can roam with the user— this is more tricky than it sounds. Depending on filesystem, the notion of whether a volume is “read-only” is itself part of the filesystem.
  • Create multiple iSCSI targets in AWS Storage Gateway, each one designed for exclusive access from a specific environment such as home network or specific laptop.

Privacy: what Amazon sees

Now we turn to the question of what the cloud storage provider can learn about the data owned by the customer in this setting. It is helpful to separate this into two distinct threat models:

  1. Passive observation or “honest-but-curious” adversary. Amazon behaves as promised and provides the storage service without going out of its way to undermine customer privacy. They have full visibility into what is going on in the cloud but not on the gateway VM. For example they can not pull system logs from the gateway or observe raw iSCSI traffic received.
  2. Active attacker: in this model anything goes— for example Amazon can send malicious updates to the gateway and tamper with contents after storage, returning different blocks than what the customer wrote. This will be taken up in the section on security.

Let’s start with the first problem. First we observe that any data sent to the cloud has been previously encrypted using a sound full-disk encryption scheme such as LUKS or Bitlocker. As long as this is strictly true, the bits stored at AWS provide no meaningful information about the original contents. There are two important qualifications to this.

First, the volume must have been encrypted before any use or, if encryption is added after-the-fact, it must have been applied to the entire volume. Otherwise the scheme risks exposing fragments of data leftover from previous, unencrypted content. For example Bitlocker will not encrypt unused space on disk by default, assuming that it is blank. That assumption does not hold if the disk has been in use— in that case “free space” on disk could actually hold leftover, unencrypted data from previously deleted files that is now exposed to the cloud provider.

Conversely, when disk encryption is applied from the start but under the more generous assumption that unused space need not be encrypted, the “encrypted” disk image will contain a large number of all zero blocks. This allows the storage provider to observe exactly how much of the disk is being used at any given time. It also allows the provider to make inferences on new storage: for example if a new 1MB file is added to the filesystem, zero blocks totaling approximately that much space will be overwritten with random data. (Also note that changing this has an effect on the bottom line: ASG only charges for “used” space in the remote volume. Encrypting the entire disk including empty space will result in paying for total capacity out of the gate, instead of only the fraction in use.)
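As a rough illustration of how little effort that inference takes, the following Python sketch scans a disk image and reports the fraction of blocks that are not all zeroes. The 512-byte block size and the function name are assumptions made for the example; a real provider would look at whatever unit the gateway uploads.

```python
import io
import os

BLOCK_SIZE = 512  # assumed sector size for the example

def used_block_fraction(image, block_size: int = BLOCK_SIZE) -> float:
    """Fraction of blocks in a binary file-like object that contain any non-zero byte."""
    total = used = 0
    while True:
        block = image.read(block_size)
        if not block:
            break
        total += 1
        if block != bytes(len(block)):  # compare against an all-zero block of equal length
            used += 1
    return used / total if total else 0.0

if __name__ == "__main__":
    # Simulated volume: 3 blocks of "encrypted" (random) data followed by 5 untouched blocks.
    volume = io.BytesIO(os.urandom(3 * BLOCK_SIZE) + bytes(5 * BLOCK_SIZE))
    print(f"fraction of volume in use: {used_block_fraction(volume):.3f}")
```

Running the same scan at two points in time and comparing the results would also reveal roughly how much new data was written in between.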

The previous observation about observing write-access in the cloud applies more broadly than merely distinguishing used vs unused space. Recall that gateway VMs synchronize writes to the cloud. Corollary: Amazon can observe which blocks on disk are being modified. While they have no idea what the contents of those blocks are, they could soon build a list of which blocks change more frequently and if there are specific access patterns such as block #5 always being written immediately after block #8. This alone may allow fingerprinting of remote operating system and applications; filesystems are constantly busy doing book-keeping on file meta-data even when clients are not explicitly writing. (For example, the last-access time of a file gets modified whenever the file is read.) Similarly background tasks such as those responsible for indexing disk contents for full-text search can leave tell-tale signs in their access pattern. Note that filesystem type is not hidden at all: both LUKS and Bitlocker use a well-defined header structure that would be visible to the cloud provider.
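A minimal Python sketch of the kind of bookkeeping involved follows, assuming the provider has a chronological log of block numbers written to the volume. The log format and function names are hypothetical.

```python
from collections import Counter

def hot_blocks(write_log: list, top: int = 3) -> list:
    """Most frequently written block numbers, as (block, count) pairs."""
    return Counter(write_log).most_common(top)

def frequent_successors(write_log: list, top: int = 3) -> list:
    """Most common (block, next_block) pairs among consecutive writes."""
    pairs = Counter(zip(write_log, write_log[1:]))
    return pairs.most_common(top)

if __name__ == "__main__":
    # Hypothetical log: block 5 is always written right after block 8.
    log = [8, 5, 100, 8, 5, 42, 8, 5, 7]
    print(hot_blocks(log, top=2))
    print(frequent_successors(log, top=1))
```

None of this requires decrypting anything; block numbers and timing alone are enough to build such a profile.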

By contrast read operations are not visible to the cloud, being served directly from the local gateway VM. Amazon does not learn which blocks are requested since those requests are not propagated up to the cloud.

One bit of good news is that full disk encryption in this situation really means full-disk: unlike the case of encrypting boot volumes, here 100% of storage can be encrypted. By contrast a volume that contains the operating system must have some unencrypted partitions to make the boot sequence possible— including one containing the FDE implementation that can decrypt the rest of the operating system.

Security: what Amazon could see

The picture gets more complicated once we expand the threat model to take into account active attacks from AWS going rogue. In particular, all gateway software is controlled by Amazon and supports remote updates. Under normal circumstances they are considerate enough to give notice and ask for permission before applying updates:

[Screenshot: AWS Console makes it clear the company can remotely force update gateway VMs]

But this is merely a courtesy; a conservative assumption is that the company can remotely install malicious software on any gateway anytime. This is not because of the restricted shell given by default— since the gateway is distributed as a VM, it is trivial to modify its virtual disk image and obtain root on the box for looking around:


AWS Storage Gateway uses a Java application and some glue scripts to present an iSCSI interface to local clients while synchronizing contents to S3

A determined customer could install additional software or change the configuration of the VM, and possibly disable remote updates. But the problem is that without fully reverse-engineering every piece of software on the gateway, it is still not possible to rule out a backdoor inserted by AWS. For this reason, we will model the gateway VM as a blackbox controlled by the adversary.

In that threat model denial-of-service can not be prevented: adversary can always brick the gateway or decline to serve any data. But this is not any different from AWS carrying out the same attack in the cloud. Nothing prevents Amazon from deleting customer data from S3 or holding it hostage, other than the fear of PR reprisals or future impact on lost revenue.

Access patterns: ORAM threat model

With control of the gateway, Amazon can also observe read operations, in addition to observing writes from the vantage point of the cloud. This begins to look a lot like the oblivious RAM (ORAM) threat model in theoretical cryptography: an adversary observes all memory accesses made by a program and attempts to deduce information about the program— such as secret inputs— from that pattern alone. Quantifying such risks in the context of remote storage is difficult. It is common for volatile memory access to be dependent on secrets. For example, bits of an RSA private key are often used to index into a lookup table in memory, resulting in different access patterns based on the key. That property is directly exploitable for side-channel attacks. It is much less common to use persistent storage that way. Nevertheless there is information contained in the sequence of blocks read or written that the gateway learns. For example, it can highlight “hot spots” on disk containing frequently accessed contents. It may even be possible to link an iSCSI volume to a particular service by observing a correlation between requests. Contrived example: suppose a customer is running a web-server backed by the iSCSI volume on a gateway. Sending HTTP requests to the server to download different files and observing resulting iSCSI requests to the gateway VM could build a convincing case that this site is serving up content straight out of the gateway.