Airbnb reviews and the prisoner’s dilemma

Reputation in the sharing economy

[Full disclosure: this blogger was head of information security for Airbnb 2013-2014]

In a recently published Vice article titled “I Accidentally Uncovered a Nationwide Scam on Airbnb,” a journalist goes down the rabbit hole of tracking down instances of fraud at scale on the popular sharing-economy platform. The scam hinges on misrepresentation: unsuspecting guests book one listing based on the photographs, only to be informed minutes before check-in about an unforeseen problem with the unit that precludes staying there. Instead, the crooks running the scam direct the guest to an allegedly better or more spacious unit also owned by the host. As expected, this bait-and-switch does not turn out well for the guests, who discover upon arrival that their new lodgings are less than stellar: run-down, unsanitary and in some cases outright dangerous.

First, there is no excuse for the failure to crack down on these crooks. As the headline makes clear, this is not an isolated incident: multiple guests were tricked by the same crook in exactly the same manner. In an impressive bit of sleuthing, the Vice journalist proceeds to identify multiple listings on the website using staged pictures with the same furniture and to reach out to other guests conned by the same perpetrator. (She even succeeds in digging up property records for the building where the guests are routed after their original listing mysteriously becomes unavailable, identifying the owner and his company on LinkedIn.) Airbnb is not a struggling early-stage startup. It has ample financial resources to implement basic quality assurance: every listing must be inspected in person to confirm that its online depiction does not contain materially significant misrepresentations. The funds used to fight housing ordinances or insult public libraries in San Francisco would be better redirected to combating fraud or compensating affected customers. Ironically, the company exhibited exactly this high-touch approach in its early days, when it was far more constrained in workforce and cash: employees would personally visit hosts around the country to take professional photographs of their listings.

Second, even if one accepts the premise that 100% prevention is not possible— a point-in-time inspection does not guarantee the host will continue to maintain the same standards— there is no excuse for the appalling response from customer support. One would expect that guests are fully refunded for the cost of their stay or, better yet, that Airbnb customer support can locate alternative lodgings in the same location in real time once guests discover the bait-and-switch. These guests were not staying on some remote island with few options; at least some of the recurring fraud took place in large metropolitan areas such as Chicago, where the platform boasts thousands of listings to choose from. Failing all else, Airbnb can always swallow its pride and book the guest into a hotel. Instead affected guests are asked to navigate a Kafkaesque dispute resolution process to get their money back for even one night of their stay. In one case the company informs the guest that the “host”— in other words, the crooks running this large-scale fraudulent enterprise— has a right to respond before customer support can take action.

Third, the article points to troubling failures of identity verification on the platform, or at least identity misrepresentation. It is one thing for users of social networks to get by with pseudonyms and nicknames. A sharing platform premised on the idea that strangers will be invited into each other’s place of residence is the one place where verified, real-world identity is crucial for deterring misconduct. If there is a listing hosted by “Becky” and “Andrew,” customers have every reason to believe that there are individuals named Becky and Andrew involved with that listing in some capacity. The smiling faces in the picture need not necessarily be the property owners or the current lease-holders living there. They could be agents of the owner helping manage the listing or even professional employees at a company that specializes in brokering short-term rentals on Airbnb. But there is every expectation that such individuals exist, along with a phone number where they can be reached— otherwise, what is the point of collecting this information? Instead, as the article shows, “Becky” and “Andrew” appear to be a fictitious couple with profile pictures scraped from a stock-photography website. The deception was in plain sight: an Airbnb review from 2012 referred to the individual behind the profile by his true name, not the fabricated couple identity. While there is an argument for using shortened versions, diminutives, middle names or Anglicized names instead of the “legal” first name printed on official government ID, participants should not be allowed to make arbitrary changes to an existing verified profile.

To be clear: identity verification can not necessarily stop bad actors from joining the platform any more than the receptionist’s perfunctory request for a driver’s license stops criminals from staying at hotels. People can and do commit crimes under their true identity. One could argue that Airbnb ought to run a background check on customers and reject those with prior convictions for violent offenses. Aside from being obviously detrimental to the company’s bottom line and possibly even running afoul of laws against discrimination (not that violating laws has been much of a deterrent for Airbnb), such an approach is difficult to apply globally. It is only for US residents that a wealth of information can be purchased on individuals, conveniently indexed by their social security number. More to the point, there is no “precrime unit” a la The Minority Report for predicting whether an individual with an otherwise spotless record will misbehave in the future once admitted onto the platform.

Far more important is to respond swiftly and decisively once misbehavior is identified, in order to guarantee the miscreants will never be able to join the platform again under some other disguise. At the risk of sounding like the nightmarish social-credit system being imposed in China as an instrument of autocratic control, one could envision a common rating system for the sharing economy: if you are kicked out of Airbnb for defrauding guests, you are also prevented from signing up for Lyft. (Fear not, Uber will likely accept you anyway.) In this case a single perpetrator brazenly operated multiple accounts on the platform, repeatedly bait-and-switching guests over to units in the same building he owned, leaving behind an unmistakable trail of disgruntled guest reviews. Airbnb still could not connect the dots.

The problem with peer reviews

Finally, and this is the most troubling aspect, the article suggests the incentive system for reviews is not working as intended. In a functioning market, peer reviews elicit honest feedback and accurately represent the reputation of participants. The article points to several instances where guests inconvenienced by fraudulent listings were reluctant to leave negative feedback. Even worse, there were situations when the perpetrators of the scams left scathing reviews full of fabrications for the guests, in an effort to cast doubt on the credibility of the understandably negative reviews those guests were expected to leave.

Incidentally, Airbnb did change its review system around 2014 to better incentivize both parties to provide honest feedback without worrying about what their counterparty will say. Prior to 2014, reviews were made publicly visible as soon as the guest or host provided them in the system. This created a dilemma: both sides were incentivized to wait for the other to complete their review first, so they could adjust their feedback accordingly. For example, if guests are willing to overlook minor issues with the listing, the host may be willing to forgive some of their minor transgressions. But if the guest review consisted of nitpicking about every problem with the listing (“too few coffee mugs— what is wrong with this place?”) the host will be inclined to view guest conduct through an equally harsh lens (“they did not separate the recycling— irresponsible people”). That creates an incentive for providing mostly anodyne, meaningless feedback and avoiding confrontation at all costs. After all, the side that writes a negative review first is at a distinct disadvantage. Their counterparty can write an even harsher response, not only replying to the original criticism but also piling on far more serious accusations against its author. It also means that reviews may take longer to arrive. When neither side wants to go first, the result is a game of chicken between guest and host, played against the review deadline.

In the new review model, feedback is hidden until both sides complete their reviews. After that point, it is revealed simultaneously. That means both sides are required to provide feedback independently, without visibility into what their counterparty wrote ahead of time. In theory this elicits more honest reviews— there is no incentive to suppress negative feedback out of a concern that the other side will modify their review in response. (There is still a 30-day deadline to make sure feedback is provided in a timely manner; otherwise either side could permanently hold the reviews hostage.) The situation is similar to the prisoner’s dilemma from game theory: imagine both guest and host having grievances about a particular stay. The optimal outcome from a reputation perspective is one where both sides suppress the negative feedback (“cooperate”), leaving positive reviews, which looks great for everyone— and Airbnb. But if one side defects and leaves a negative review airing their grievance while the other cooperates, the cooperating side looks even worse. Imagine a scenario where the guests say everything was great about the listing and host, while the host claims the guests were terrible people and demands payment from Airbnb for the damage. Even if these charges were fabricated, the guests have lost much of their credibility to counter the false accusations by going on the record with a glowing review about the host. So the stable strategy is to “defect”: include negative feedback in the review, expecting that the counterparty will likewise include their own version of the same grievance.
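
To make the defection equilibrium concrete, here is a toy payoff matrix with made-up reputational "scores" ordered as in the classic prisoner's dilemma (T > R > P > S); the numbers are purely illustrative, not anything Airbnb computes.

```python
# C = cooperate (suppress the grievance), D = defect (put it on the record).
PAYOFF = {                      # (my move, their move) -> my reputational score
    ("C", "C"): 3,              # mutual glowing reviews
    ("D", "C"): 4,              # my complaint is on record, their review of me stays positive
    ("C", "D"): 0,              # I praised them, they trashed me: worst outcome
    ("D", "D"): 1,              # both sides air their grievances
}

def best_response(their_move: str) -> str:
    return max(("C", "D"), key=lambda mine: PAYOFF[(mine, their_move)])

# Defecting is the dominant strategy: it is the best response to either move.
assert best_response("C") == "D" and best_response("D") == "D"
```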

But game-theoretic outcomes are only observed in the real world when participants follow the optimal strategies expected of rational agents. Decades of behavioral economics research suggest that actual choices made by humans can deviate significantly from that ideal. The Vice article quotes guests who were reluctant to leave negative reviews about the fraudulent hosts even after their decidedly unhappy experiences. This is not surprising either. There are other considerations that go into providing feedback beyond fear of retaliation. For example there are social norms against harshly criticizing other people; recall that all reviews are visible on Airbnb. Other users can look up a prospective guest and observe that they have been giving 1-star reviews to all of their hosts. In the absence of such constraints, the game-theoretic conclusion would be taken to an extreme where both sides write the most negative review possible, constrained only by another social norm against making false statements.

Either way, the incentive structure for reviews clearly needs some tweaks to elicit accurate feedback.

CP

Update: Airbnb announced that the company will be manually verifying all 7 million of their listings.

Cloud storage with end-to-end encryption: AWS Storage Gateway (part III)

[continued from part II]

The final post in this series pushes the threat model into tinfoil-hat territory: we posit Amazon going rogue or being compelled to recover private, encrypted data associated with users of the AWS Storage Gateway service.

Data corruption: attacks against integrity

An obvious avenue for Amazon to attack customers is by modifying gateway software to return corrupted data. As noted earlier, full-disk encryption schemes operate under a strict constraint that 1 block of plaintext must encrypt to exactly 1 block of ciphertext. No expansion of data is allowed. This rules out the use of integrity checks to detect corruption of ciphertext. Every block returned by the gateway will decrypt to something; that could be the original data stored by the customer— or not. The most FDE can guarantee is that without the encryption key, Amazon can not return a ciphertext crafted to yield some plaintext of their choosing.
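
As a small illustration of this malleability, here is a sketch using AES in XTS mode (a construction typical of FDE schemes) with Python's cryptography package: flipping a single bit of ciphertext raises no error on decryption, it merely garbles the affected portion of the sector.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = os.urandom(64)                      # AES-256-XTS takes two 256-bit keys
tweak = (5).to_bytes(16, "little")        # tweak derived from the sector number (sector 5)
sector = b"important ledger row: balance = 1000 USD".ljust(512, b"\x00")

enc = Cipher(algorithms.AES(key), modes.XTS(tweak)).encryptor()
ciphertext = enc.update(sector) + enc.finalize()

# The "provider" flips a single bit of the stored ciphertext.
tampered = bytearray(ciphertext)
tampered[40] ^= 0x01

dec = Cipher(algorithms.AES(key), modes.XTS(tweak)).decryptor()
recovered = dec.update(bytes(tampered)) + dec.finalize()

print(recovered == sector)                # False: silent corruption, no exception raised
```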

What can be achieved in this limited attacker model is highly dependent on content type. Scrambling a few pixels in a picture or ruining the beat for a few seconds in a song is unlikely to ruin anyone’s day. On the other hand, randomly changing figures in a spreadsheet used for financial reporting may have downstream implications— although such files have additional structure that will likely break when a 512-byte sector is perturbed. If remote storage is used for storing executable files, the worst-case scenario is arbitrary code execution. Yet achieving that requires a level of control over data corruption that is unlikely with FDE. Replacing a block of code with random x86 instructions will result in a crash with high probability.

More interesting potential targets are files holding configuration data that influences how software behaves. Again, it is possible to come up with contrived examples where even a random change could alter the security level of a system. For example, consider a binary format where one field is assumed to hold a boolean flag determining whether a dangerous feature such as macros or debugging is enabled. A value of zero indicates off; any other value stands for on. With very high probability any change to the ciphertext will flip this flag from “off” to “on,” enabling a potentially exploitable feature. (Granted, exploiting that is still non-trivial: the attacker would have to know the exact location of the file on disk along with the offset of that critical field inside the file. Recall that the concept of a “file” does not exist at the iSCSI layer: the cloud storage provider only observes requests for reading and writing blocks containing encrypted data. Metadata including file names is not visible, although some properties such as the existence of a file spanning particular blocks could be inferred from access patterns.)

Lateral movement: attacks against customer systems

A more promising attack strategy is Amazon pushing out a malicious update to the gateway VM that seeks to escalate privileges outside the gateway itself, taking over other systems in the customer environment. Objective: discover the secret key used for full-disk encryption by compromising the system that applies FDE over the raw block device. That attack vector exists even if a TPM or smart-card is used for key management; at the end of the day there is a symmetric key released to the FDE implementation, regardless of whether that secret was originally wrapped by a password typed by the user or by cryptographic hardware.

Considering the gateway VM as a malicious system seeking lateral movement, we find that it has few resources available. These VMs require very little integration with the surrounding environment: they do not need to be joined to an Active Directory, they do not SSH into any other customer system or access resources from a file share. In fact the VM can be isolated from the environment with the exception of inbound iSCSI ports and outbound communication to AWS. Focusing on inbound connections, only a handful of other customer systems interact with the VM, and only over the iSCSI interface. It would take a remote-code-execution bug in the iSCSI initiator to allow lateral movement, by having the iSCSI target return malicious responses to the vulnerable initiator.

A more likely target is the physical machine hosting the gateway VM. While not common, vulnerabilities are periodically discovered in hypervisors that would permit guest VMs to “break out” and take over the host. (It is also a safe assumption that Amazon could conduct reconnaissance to identify which hypervisor their customer is running and choose an appropriate exploit, not wasting a VMware 0-day on a customer running Hyper-V.) This risk is much higher for the simple deployment model, where the gateway VM is colocated with the second machine (virtual or physical) that implements full-disk encryption. In that scenario, a full VM escape may not even be required: micro-architectural side-channel attacks could permit AWS to recover an AES key or plaintext lying around in memory from computations performed by the colocated FDE implementation. Hosting the VM on dedicated hardware physically separated from the machine that implements full-disk encryption mitigates this risk.

One deterring factor against these tactics— in addition to their long odds of success— is that they would be very noisy. Deploying a malicious update or serving a tampered iSCSI volume to the gateway VM leaves trails on local disk. Amazon can not rule out the possibility that the customer is making periodic backups of their VM or copying remote files to another local device outside Amazon’s control. That would result in the customer having a permanent record of an attempted attack for future forensics.

Economics: cost of privacy

To summarize: AWS Storage Gateway provides a pragmatic model for using cloud-storage services for private storage— as a glorified drive storing encrypted bits, with no way for the service provider to decrypt those bits. This is in marked contrast to the standard model of cloud storage, where providers have full visibility into customer data and either promise not to peek or only peek for specific business objectives such as advertising. As with all DIY projects, there are tradeoffs to achieving that level of privacy. From a cost perspective, there is no charge for using a gateway appliance per se. Instead AWS charges standard S3 storage rates for the space used, along with standard EC2 rates for outbound data transfer from the AWS cloud to the gateway. Inbound bandwidth from the gateway appliance to the cloud for backing up modified data is free, but there is a per-GB charge for new data written to the volume. To pick one data point, keeping 10TB of private storage of which 10% gets modified every month would cost about two to three times more than a standard consumer offering such as Google Drive, depending on the AWS region, since Amazon prices vary geographically. That does not include the hardware and operational expenses for running the gateway appliance itself. That incremental expense may vary from nearly zero— when adding one more VM to an existing host that already runs 24/7— to significant, if new hardware must be acquired for hosting the gateway appliance.
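
As a sanity check on that ratio, here is a back-of-the-envelope calculation; the per-GB rates and the consumer plan price below are illustrative assumptions, not current AWS or Google prices.

```python
# Rough monthly cost of 10TB on AWS Storage Gateway vs a flat-rate consumer plan.
GB_PER_TB = 1024

s3_per_gb_month = 0.023      # assumed S3 standard storage rate, $/GB-month
write_per_gb    = 0.01       # assumed ASG charge per GB written to the volume
consumer_plan   = 100.0      # assumed flat monthly price of a ~10TB consumer plan

stored_gb  = 10 * GB_PER_TB
churned_gb = 0.10 * stored_gb                 # 10% of contents rewritten each month

asg_monthly = stored_gb * s3_per_gb_month + churned_gb * write_per_gb
# Data transfer out of AWS (cloud back to the gateway) would add roughly $0.09/GB
# on top, but with stored volumes it only applies when data is pulled back down.

print(f"ASG: ~${asg_monthly:.0f}/month, roughly {asg_monthly / consumer_plan:.1f}x a consumer plan")
```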

Unlike other DIY projects such as hosting your own VPN service, where AWS is less cost-effective than alternative hosting services, in this scenario AWS fares much better competitively. The cost of a self-hosted VPN is largely dominated by bandwidth, where the AWS fee structure of charging by the GB quickly loses against flat pricing models. But when it comes to using the cloud as a glorified disk drive storing bits, the specialized nature of AWS Storage Gateway has an edge. Replicating the same solution by attaching large SSD storage to a leased virtual server would neither be cost-effective nor achieve a comparable level of redundancy as leveraging AWS.

CP

Cloud storage with end-to-end encryption: AWS Storage Gateway (part II)

[continued from part I]

Greater flexibility for cloud storage can be achieved by deploying both the iSCSI target and the iSCSI initiator on one dedicated device on the local network. This could be a small form-factor PC running a hypervisor or even a dedicated NAS device with virtualization support. The initiator would run a standard Windows or Linux image, encapsulating the remote storage and presenting a simpler interface such as a network file share. For example, in the case of MSFT environments the VM could represent the iSCSI volume as an ordinary SMB share that other devices on the network can remotely mount and use. In this case read/write access to the underlying device is managed by the initiator VM. That second VM becomes responsible for handling concurrent access, specifically to avoid parallel writes from different clients clobbering the file-system. It would also present additional access controls based on Windows authentication at the level of individual files. (“File share” is a misnomer on Windows since the unit of storage shared is a directory or even an entire drive.) This allows further compartmentalizing data: for example the media-player can be granted access to the directory containing the MP3 collection but not an adjacent directory that contains backups of financial information. It’s worth pointing out that all of these abstractions— including that of individual “files” with permissions— only exist on the initiator VM. As far as the gateway and AWS are concerned, there is just one flat volume containing seemingly random bits; one of the properties of an “ideal” cipher is that its output is indistinguishable from random data.

This model allows multiple devices to use cloud storage concurrently with one caveat: they must have network access to the device hosting the initiator/target VM combination. That still falls short of the “access your files from anywhere” goal already achieved by popular cloud storage services. For example if the VMs are hosted on a home network, those files remain inaccessible when the user is traveling, unless they have the capability to VPN into their home network. This restriction is less significant in an enterprise scenario since large enterprises typically have an internal network perimeter along with VPN services for employees to connect. (It may even be considered a “feature” that inadvertently implements a control popular with IT auditors: corporate data is only accessible from trusted networks.) There are two workarounds, both kludgy:

  • Mount cloud storage as read-only from a local gateway VM that can roam with the user— this is trickier than it sounds. Depending on the filesystem, the notion of whether a volume is “read-only” is itself part of the filesystem.
  • Create multiple iSCSI targets in AWS Storage Gateway, each one designed for exclusive access from a specific environment such as home network or specific laptop.

Privacy: what Amazon sees

Now we turn to the question of what the cloud storage provider can learn about the data owned by the customer in this setting. It is helpful to separate this into two distinct threat models:

  1. Passive observation or “honest-but-curious” adversary. Amazon behaves as promised and provides the storage service without going out of its way to undermine customer privacy. They have full visibility into what is going on in the cloud but not on the gateway VM. For example they can not pull system logs from the gateway or observe raw iSCSI traffic received.
  2. Active attacker: in this model anything goes— for example Amazon can send malicious updates to the gateway and tamper with contents after storage, returning different blocks than what the customer wrote. This will be taken up in the section on security.

Let’s start with the first problem. First we observe that any data sent to the cloud has been previously encrypted using a sound full-disk encryption scheme such as LUKS or Bitlocker. As long as this is strictly true, the bits stored at AWS provide no meaningful information about the original contents. There are two important qualifications to this.

First, the volume must have been encrypted before any use, or, if encryption is added after the fact, it must have been applied to the entire volume. Otherwise the scheme risks exposing fragments of data left over from previous, unencrypted content. For example Bitlocker will not encrypt unused space on disk by default, assuming that it is blank. That assumption does not hold if the disk has been in use— in that case “free space” on disk could actually hold leftover, unencrypted data from previously deleted files that is now exposed to the cloud provider.

Conversely, when disk encryption is applied from the start but under the more generous assumption that unused space need not be encrypted, the “encrypted” disk image will contain a large number of all-zero blocks. This allows the storage provider to observe exactly how much of the disk is being used at any given time. It also allows the provider to make inferences about new data: for example, if a new 1MB file is added to the filesystem, zero blocks totaling approximately that much space will be overwritten with random data. (Also note that changing this has an effect on the bottom line: ASG only charges for “used” space in the remote volume. Encrypting the entire disk including empty space will result in paying for the total capacity out of the gate, instead of only the fraction in use.)

The previous observation about write access being visible to the cloud applies more broadly than merely distinguishing used vs unused space. Recall that gateway VMs synchronize writes to the cloud. Corollary: Amazon can observe which blocks on disk are being modified. While they have no idea what the contents of those blocks are, they could soon build a list of which blocks change more frequently and whether there are specific access patterns such as block #5 always being written immediately after block #8. This alone may allow fingerprinting of the remote operating system and applications; filesystems are constantly busy doing book-keeping on file metadata even when clients are not explicitly writing. (For example, the last-access time of a file gets modified whenever the file is read.) Similarly background tasks such as those responsible for indexing disk contents for full-text search can leave tell-tale signs in their access pattern. Note that the encryption scheme itself is not hidden at all: both LUKS and Bitlocker use a well-defined header structure that would be visible to the cloud provider.
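
To make the inference concrete, here is a sketch of the kind of profiling an observer could run purely on the stream of write requests it sees; the event format is hypothetical.

```python
from collections import Counter, defaultdict

def profile_writes(events):
    """events: iterable of (timestamp, block_number) pairs for observed write requests."""
    freq = Counter()
    successors = defaultdict(Counter)
    prev = None
    for _, block in sorted(events):
        freq[block] += 1
        if prev is not None:
            successors[prev][block] += 1     # "block B tends to follow block A"
        prev = block
    hot_spots = freq.most_common(10)         # likely filesystem metadata or journal areas
    common_pairs = {a: s.most_common(1)[0] for a, s in successors.items()}
    return hot_spots, common_pairs
```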

By contrast read operations are not visible to the cloud, being served directly from the local gateway VM. Amazon does not learn which blocks are requested since those requests are not propagated up to the cloud.

One bit of good news is that full disk encryption in this situation really means full-disk: unlike the case of encrypting boot volumes, here 100% of storage can be encrypted. By contrast a volume that contains the operating system must have some unencrypted partitions to make the boot sequence possible— including one containing the FDE implementation that can decrypt the rest of the operating system.

Security: what Amazon could see

The picture gets more complicated once we expand the threat model to take into account active attacks from AWS going rogue. In particular, all gateway software is controlled by Amazon and supports remote updates. Under normal circumstances they are considerate enough to give notice and ask for permission before applying updates:

AWS Console makes it clear the company can remotely force update gateway VMs

But this is merely a courtesy; a conservative assumption is that the company can remotely install malicious software on any gateway at any time. This is not because customers are locked into the restricted shell provided by default: since the gateway is distributed as a VM, it is trivial to modify its virtual disk image and obtain root on the box to look around:

AWS Storage Gateway uses a Java application and some glue scripts to present an iSCSI interface to local clients while synchronizing contents to S3

A determined customer could install additional software or change the configuration of the VM, and possibly disable remote updates. But the problem remains that without fully reverse-engineering every piece of software on the gateway, it is not possible to rule out the possibility of a backdoor inserted by AWS. For this reason, we will model the gateway VM as a blackbox controlled by the adversary.

In that threat model denial-of-service can not be prevented: the adversary can always brick the gateway or decline to serve any data. But this is not any different from AWS carrying out the same attack in the cloud. Nothing prevents Amazon from deleting customer data from S3 or holding it hostage, other than the fear of PR reprisals or the impact on future revenue.

Access patterns: ORAM threat model

With control of the gateway, Amazon can also observe read operations, in addition to observing writes from the vantage point of the cloud. This begins to look a lot like the oblivious RAM (ORAM) threat model in theoretical cryptography: an adversary observes all memory accesses made by a program and attempts to deduce information about the program— such as secret inputs— from that pattern alone. Quantifying such risks in the context of remote storage is difficult. It is common for volatile memory access to be dependent on secrets. For example, bits of an RSA private key are often used to index into a lookup table in memory, resulting in different access patterns based on the key. That property is directly exploitable for side-channel attacks. It is much less common to use persistent storage that way. Nevertheless the sequence of blocks read or written contains information that the gateway learns. For example, it can highlight “hot spots” on disk containing frequently accessed contents. It may even be possible to link an iSCSI volume to a particular service by observing a correlation between requests. Contrived example: suppose a customer is running a web-server backed by the iSCSI volume on a gateway. Sending HTTP requests to the server to download different files and observing the resulting iSCSI requests to the gateway VM could build a convincing case that this site is serving up content straight out of the gateway.
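
A toy version of that correlation test, assuming the observer can timestamp both the HTTP requests it issues and the iSCSI reads arriving at the gateway:

```python
from collections import Counter

def correlates(http_times, iscsi_read_times, window=1.0, threshold=0.8):
    """Return True if iSCSI read activity lines up with the HTTP requests we sent."""
    http_buckets = {int(t / window) for t in http_times}          # one-second buckets
    read_buckets = Counter(int(t / window) for t in iscsi_read_times)
    hits = sum(1 for b in http_buckets if read_buckets[b] > 0)
    return hits / max(len(http_buckets), 1) >= threshold
```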

[continued]

CP

 

Revocation is hard, the code-signing edition (part I)

Issuance vs revocation

Certificate revocation remains the Achilles heel of public-key infrastructure systems. The promise of public-key cryptography was stand-alone credentials that can be used in a peer-to-peer manner: a trusted issuer provisions the credential and then steps aside. The issuer is not involved each time the credential is used to prove your identity. Contrast that with prevalent models on the web today such as “login with Facebook” or “login with Google.” These are glorified variants of an enterprise authentication system called Kerberos that dates from the 1980s. Facebook or Google sit in the middle of every login, mediating the transaction. In the PKI model, they would be replaced by a trusted third-party, known as the “certificate authority,” issuing long-lived credentials to users. Armed with that certificate and the corresponding private key, a person can prove their identity anywhere without having to involve the CA again. This is the model for proving the identity of websites: the TLS protocol, behind that once-ubiquitous lock icon in the address bar or green highlight around the name of the website, is predicated on the owner of that site having obtained a digital certificate from a trusted CA.

Revocation throws a wrench in that simple model. Credentials can expire but they can also be revoked: the website could lose its private key, or one of the assertions made in the certificate (“this company controls the domain acme.com”) may stop being true at any point. There are two main approaches to revocation, not counting a home-brew design used by Google in Chrome, presumably because the relevant standards were not invented in Mountain View. The first one is Certificate Revocation Lists or CRLs: the certificate authority periodically creates a digitally signed list of revoked certificates. That blacklist is made available at a URL that is itself encoded in the certificate, in a field appropriately called the CDP, for “CRL Distribution Point.” Verifiers download CRLs periodically— each CRL defines its lifetime, to help with caching and avoid unnecessary polling— and check to see if the serial number of the certificate they are currently inspecting appears in the blacklist. There are additional optimizations such as delta-CRLs that can be used to reduce the amount of data transferred when dealing with large-scale PKI deployments where millions of certificates could have a revoked status at any time.
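
For illustration, here is a minimal sketch of such a CRL lookup using Python's cryptography package; it omits the caching, CRL signature verification and error handling a real verifier needs.

```python
import urllib.request
from cryptography import x509

def is_revoked(cert: x509.Certificate) -> bool:
    # Pull the CRL URL out of the CRL Distribution Points (CDP) extension
    cdp = cert.extensions.get_extension_for_class(x509.CRLDistributionPoints).value
    url = cdp[0].full_name[0].value          # first distribution point, first URI
    crl = x509.load_der_x509_crl(urllib.request.urlopen(url).read())
    # Revoked if the certificate's serial number appears on the blacklist
    return crl.get_revoked_certificate_by_serial_number(cert.serial_number) is not None
```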

If CRLs are the batch mode for checking revocation, the Online Certificate Status Protocol is the interactive version. OCSP defines an HTTP-based protocol for verifiers to query a service affiliated with the CA about the current status of a certificate. Again, the location of that service is embedded in the certificate, in the “Authority Information Access” field. Interestingly, OCSP responses are also signed and cacheable with a well-defined lifetime, which allows optimizations such as OCSP stapling.
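
The corresponding OCSP query, again as an illustrative sketch with the same caveats and assuming a successful responder reply:

```python
import urllib.request
from cryptography import x509
from cryptography.x509 import ocsp
from cryptography.x509.oid import AuthorityInformationAccessOID
from cryptography.hazmat.primitives import hashes, serialization

def ocsp_status(cert: x509.Certificate, issuer: x509.Certificate):
    # Find the responder URL in the Authority Information Access (AIA) extension
    aia = cert.extensions.get_extension_for_class(x509.AuthorityInformationAccess).value
    url = next(d.access_location.value for d in aia
               if d.access_method == AuthorityInformationAccessOID.OCSP)
    req = ocsp.OCSPRequestBuilder().add_certificate(cert, issuer, hashes.SHA1()).build()
    http_req = urllib.request.Request(url, data=req.public_bytes(serialization.Encoding.DER),
                                      headers={"Content-Type": "application/ocsp-request"})
    resp = ocsp.load_der_ocsp_response(urllib.request.urlopen(http_req).read())
    return resp.certificate_status           # GOOD, REVOKED or UNKNOWN
```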

Taken in isolation, the cost of doing a revocation check may seem daunting: web-browsers try to shave milliseconds from the time it takes to render a web page or start streaming that video. Imagine having to pause everything while a CRL is downloaded in the background or OCSP status queried. But recall that this cost is amortized over thousands of certificates: one CRL covers not just one website but all websites with a certificate from that CA. Considering that a handful of public certificate authorities account for the majority of TLS certificates, not having a fresh CRL already present would be the exceptional scenario, not the norm. More importantly the retrieval of this information need not be done in real-time: for example Windows has sophisticated heuristics to download CRLs for frequently used CAs in the background. It can even prefetch OCSP responses for frequently visited individual sites, based on a predictive model that those sites are likely to be visited again.

Weakest link in the (certificate) chain

There is one aspect of PKI operation that no amount of client-side optimization can compensate for: CA incompetence. If CAs are not promptly acting on reports of lost credentials, are failing to maintain CRLs correctly or are experiencing frequent outages with their OCSP responders, relying-parties will at best remain in the dark about the status of certificates— having to make a difficult decision on what to do when revocation status can not be determined— or worse, reach the wrong conclusion and make trust decisions based on a bogus credential. Cases of CA ineptitude with TLS certificates are legion and need not be repeated here, other than to point out an economic truth: it would be astonishing to expect a different outcome. All CAs trusted by popular web-browsers are effectively on equal standing; there is nothing distinguishing a certificate issued by a competent CA from one issued by a CA that constantly screws up. In both cases the website gets its all-important lock icon and avoids being publicly shamed by Google Chrome. (Extended validation certificates were introduced a decade ago with better vetting requirements, but that only had the effect of creating two tiers. All EV-qualified issuers still stand on equal ground in terms of trust afforded by browsers.) That means issuers can only compete on price. There is zero incentive for CAs to compete on operational competence, beyond the minimal baseline required by the CA/Browser Forum to avoid getting kicked out. It turns out even that modest baseline is too high a bar: Symantec managed to get kicked out of both the Mozilla and Google root programs.

Operational errors by certificate authorities have been well documented in the context of TLS certificates. From relatively “minor” infractions such as issuing certificates to unauthorized parties pretending to be affiliated with major websites, to epic failures such as accidentally signing subordinate CA certificates— effectively delegating the ability to issue any certificate to the bad guys— public PKI has been a veritable zoo of incompetence. It turns out that code signing and revocation create even more novel ways for CAs to stumble.

Same mathematics, different purposes

A public-key algorithm such as RSA can be used for different purposes: encryption, authentication or digital signatures. The underlying computations are the same: whether we are using RSA to protect email messages in transit or to prove our identity to a remote website, there is a secret value that is used according to the same mathematical function. While there are superficial differences in formatting and padding, there is no reason to expect a fundamental difference in risk profile. Yet each of these scenarios has different consequences associated with the loss or compromise of keys. (To be clear, “loss” here is used to mean a key becoming permanently unusable, both for the legitimate owner and any potential attacker. By contrast “key compromise” implies the adversary managed to get their hands on the key.)

Consider the case of a key used for authentication. In a realistic enterprise scenario, this could be a private key residing on a smart-card used for logging into Active Directory. If that key is rendered unusable for some reason— let’s say the card inadvertently went through a cycle in the washing machine— the user will be temporarily inconvenienced. Until they replace the credential, they can not access certain resources. Granted, that inconvenience could be significant. Depending on group policy, they may not even be able to log in to their own machine if smart-card logon is mandatory. But as soon as the IT department issues a new card with a new set of credentials, access is restored. The important point is that the new card will have equivalent credentials but a different key-pair. The previous key was permanently bound to the malfunctioning hardware. It can not be resurrected. Instead a new key-pair is generated on the replacement card and a corresponding new X509 certificate is issued, containing the new public key. As far as the user is concerned, the new key is just as good as the previous one: there is nothing they are prevented from doing that used to be possible with the previous key material.

It would be an entirely different story if that same smart-card was used for email encryption, for example S/MIME, which is popular in enterprise deployments. While we can reissue this employee a new card, the replacement can not decrypt email encrypted to the previous key. If there was no backup or key-escrow in place, there has been a permanent loss of functionality: past encrypted correspondence has become unreadable. That goes beyond a “temporary inconvenience” the IT department can resolve. This is one example of a more general observation: loss or compromise of a key can have different consequences depending on the high-level scenario it is used for, even when all of the cryptographic operations are otherwise identical.

Loss vs compromise

At first glance, the signature scenario looks similar to authentication as far as the consequences of key loss are concerned. If a signing key becomes unusable, simply generate a new one and start signing new messages with that replacement key. This is indeed the approach taken in systems such as PGP: users still have to distribute the new key, perhaps by publishing it on a key-server. There is no need to go back and re-sign previous messages: they were signed with a valid key, and recipients can continue to validate them using the previous key. (Note we are assuming that recipients are flexible about the choice of signing key. There are other systems— notably cryptocurrencies such as Bitcoin— where control over funds is asserted by signing messages with one particular key which can not be changed. In that model, loss of the private key has serious repercussions, directly translating to dollars.)

The picture gets murky once again when considering compromise of the key, defined as adversaries coming into possession of the secret. For encryption keys, the results are straightforward: attackers can read all past messages encrypted under that key. The best-case scenario one can hope for is publicizing the replacement quickly, to prevent any more messages from being encrypted to the compromised key. Not much can be done to save past messages, beyond purging local copies. If an attacker can get hold of those ciphertexts by any avenue, they can decrypt them. There is no way to “revoke” all existing copies of previously encrypted ciphertexts. In fact the adversary may already have intercepted and stored those ciphertexts, optimistically betting on a future capability to obtain the decryption key.

Consequences from the compromise of a signing key are more subtle, because time enters the picture as a variable. Authentication commonly occurs once and in real time, with a single known counterparty attempting to verify your identity. Signatures are created once and then verified an unbounded number of times by an open-ended collection of participants indefinitely into the future. That creates two complications when responding to a key compromise:

  • Future messages signed by the attacker using the compromised key must be invalidated, such that recipients are not accidentally tricked into accepting them
  • Yet past messages signed by the authorized signer while they still had full control over the private key must continue to validate properly.

To appreciate why that second property is critical, consider code-signing: software publishers digitally sign the applications they distribute, to prevent malware authors from tricking users into installing bogus, back-doored versions imitating the original. Suppose a publisher discovers that their signing key has been compromised and used to sign malware— a very common scenario that figured prominently in high-profile campaigns such as Stuxnet. The publisher can quickly race to revoke their signing certificate. Clients properly implementing revocation checks will then be protected against the signed malware, rejecting the signature as invalid. But what about legitimate pieces of software distributed by the same publisher? Indiscriminate revocation would result in all of those apps also becoming untrusted, requiring the publisher to not only re-sign every app they ever shipped but also guarantee that every single copy existing anywhere a customer may download apps from is replaced by the updated version.

[continued]

CP

Cloud storage with end-to-end encryption: AWS Storage Gateway (part I)

Cloud storage sans surveillance-capitalism

This post picks up on a series of experiments from 2013, originally written in the aftermath of the Snowden disclosures. These experiments started with one question: how feasible is it to use cloud storage services as a glorified remote drive, without giving the service provider any visibility into the data stored? This approach stands in stark contrast to how most cloud providers would prefer their services to be used. Call it the Surveillance Capitalism approach to storage: Google Drive, Dropbox and Microsoft OneDrive all operate in terms of individual files. While each provider may proudly tout their encryption-in-transit and encryption-at-rest to protect those files as they bounce around the internet, they all gloss over one inconvenient detail: the provider has access to the contents of those files. In fact the whole business model is predicated on being able to “add value” to contents. For example if it is a PDF, index the text and allow searching by keywords across your entire collection of documents. If it is a photograph, analyze and automatically tag the image with names using facial recognition. For the most part, all of these applications require access to the cleartext content. While there is a nascent research field for working with encrypted data— where the service provider only has access to encrypted contents but can not recover the original plaintext— these applications are largely confined to a research setting. “Trust us,” the standard Silicon Valley bargain goes: “we need access to your data so we can provide valuable services at zero (perceived, upfront) cost to you.”

This approach underserves a vocal segment of consumers who are uncomfortable with that trade-off, who would gladly dispense with these so-called value-adds or pay a premium in exchange for better privacy. Specialized services cropped up in the aftermath of the Snowden revelations catering to that segment, promising to provide encryption-at-rest with keys only held by the customer— the Bring Your Own Keys or BYOK model. Yet each offering was accompanied by the baggage of home-brew design and proprietary clients required to access data protected according to that model. This made integration tricky, because protecting remote data looked nothing like protecting local data. Each platform already has a de facto standard for encrypting local disk drives: Bitlocker for Windows, LUKS for Linux and FileVault on OSX. Their prevalence led many individuals and organizations to adopt key management strategies tailored to that specific standard, designed to achieve a desired level of security and reliability. For example an organization may want encryption keys rooted in hardware such as a TPM while also requiring some recovery option in case that TPM gets bricked. Proprietary designs for encrypting remote storage are unlikely to fit into that framework or achieve the same level of security assurance.

AWS Storage Gateway

The AWS Storage Gateway product is hardly new. Part of the expanding family of Amazon Web Services features, it was first introduced in 2012. Very little has changed in the way of high-level functionality— this blog post could have been published seven years ago. While AWS also provides file-oriented storage options such as S3 and Glacier, ASG operates on a different model: it provides an iSCSI interface, presenting the abstraction of a block device. An iSCSI volume is accessed the same way a local solid-state or spinning drive would be addressed: in terms of chunks of storage (“fetch the contents of block #2” or “write these bits to block #5”). One corollary is that existing operating system features that work on local disks also work on remote iSCSI volumes— modulo minor caveats. In particular, full-disk encryption schemes can be used to encrypt them in the same way they can encrypt local drives— an earlier blog post walked through the example of Windows Bitlocker-To-Go encrypting an iSCSI volume using smart-cards.

The “gateway” itself is either a dedicated hardware appliance available for sale or a virtual machine that can be hosted on any common virtualization platform. Management of the appliances is split between the AWS Management Console and a restricted shell running on the appliance itself.

Gateway VM on VMware Fusion, logged into the restricted shell for configuration
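
The same inventory is also visible through the AWS API; as a minimal sketch, here are a couple of read-only boto3 calls listing gateways and the volumes they expose (the region is a placeholder):

```python
import boto3

sgw = boto3.client("storagegateway", region_name="us-east-1")   # region is a placeholder

for gw in sgw.list_gateways()["Gateways"]:
    arn = gw["GatewayARN"]
    print(gw.get("GatewayName"), arn)
    # Enumerate the iSCSI volumes exposed by this gateway
    for vol in sgw.list_volumes(GatewayARN=arn)["VolumeInfos"]:
        print("  ", vol["VolumeARN"], vol["VolumeType"], vol["VolumeSizeInBytes"])
```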

iSCSI beyond the local neighborhood

ASG represents more than a shift from one protocol to another, more convenient one. After all, no one needed help from Amazon to leverage iSCSI; it is commodity technology dating back two decades. It does not even require specialized storage hardware. Windows Server 2012 can act as a virtual iSCSI target, providing any number of volumes of a specified size that can be accessed remotely. So why not launch a few Windows boxes in the cloud— perhaps at AWS even— create iSCSI volumes and call it a day?

The short answer is that iSCSI is not designed to operate over untrusted networks. It provides relatively weak, password-based initial authentication and, more importantly, provides no security on the communication link. The lack of confidentiality is not necessarily a problem when one assumes the data itself is already encrypted, but the lack of integrity is a deal breaker: it means an adversary can modify bits on the wire, resulting in data corruption during reads or writes. Granted, sound full-disk encryption (FDE) schemes seek to prevent attackers from making controlled changes to data. Corrupted blocks will likely decrypt to junk instead of a malicious payload. But this is hardly consolation for customers who lose valuable data. For this reason iSCSI is a better fit inside trusted local networks, such as one spanning a datacenter. On the other hand, if the servers providing those iSCSI targets are inside the datacenter, one has not achieved true “cloud storage” in the sense of having remote backups— now those servers have to be backed up someplace else in the cloud, outside the datacenter.

ASG provides a way out of that conundrum. On the front end, it presents a traditional iSCSI target that other devices on the network can mount and access. While that part is not novel, the crucial piece happens behind the scenes:

  • ASG synchronizes contents with AWS S3. That channel does not use iSCSI; otherwise it would be turtles all the way down. Instead Amazon has authored a custom Java application that communicates with AWS using an HTTP-based transport protected by TLS.
  • ASG also has intelligent buffering to synchronize write operations in the background, based on available bandwidth. To be clear, ASG maintains a full copy of the entire disk. It is not a caching optimization designed to keep only a small slice of contents based on frequency of access. All data is locally present for read operations. But writes must propagate to the cloud, and this is where local, persistent buffering provides a performance boost by not having to block on slow and unreliable network connections to the cloud. If the VM crashes before incoming writes are synchronized to the cloud, it can pick up and continue after it restarts.

Encrypted personal cloud storage

Diagram: local VM deployment for AWS Storage Gateway

Here is a simple model for deploying AWS Storage Gateway to provide end-to-end encrypted personal storage on one device. This example assumes Windows with a virtualization platform such as Hyper-V or VMware Workstation:

  • AWS Storage Gateway will run as a guest VM. While ASG is very much an enterprise technology focused on high-end servers, its resource requirements are manageable for moderate desktops and high-end laptops. iSCSI is not particularly CPU intensive, but AWS calls for ~8GB of memory allocated to the VM, although the service will run with a little less. It is however more demanding of storage: buffers alone require half a terabyte of disk space even when the underlying iSCSI volume itself is only a handful of GB. AWS software will complain and refuse to start until this space is allocated. Luckily most virtualization platforms support dynamically resized virtual disk images. The resulting image has an upper bound on storage, but only takes up as much space as required to hold current contents. This makes it possible to appease the gateway software without dedicating massive amounts of space upfront.
  • Configure networking to allow inbound connections to the VM over the internal network shared between host & guests. The VM must still have a virtual adapter connected to an external network, since it needs to communicate with AWS. But it will not be allowed to accept inbound iSCSI connections from that interface.
  • Use the Windows iSCSI initiator to access the iSCSI target over the local virtual network shared between hosts & guests
  • After the disk is mounted, create an NTFS-formatted volume and configure Bitlocker disk encryption as usual. The Windows Disk Manager utility treats the ASG volume as an ordinary local disk. In fact the only hint that it is not a vanilla hard drive is contained in the self-reported device name from the gateway software.

Windows Disk Management view of an iSCSI volume hosted by AWS Storage Gateway

This model works for personal storage but poses some usability problems. In particular it requires a local VM on every device requiring access to cloud storage and does not allow concurrent access from multiple devices. (Mounting the same iSCSI target from multiple initiators in read/write mode is an invitation to data corruption.) The next post will consider a slightly more flexible architecture for accessing cloud data from multiple devices. More importantly we circle back to the original question around privacy: does this design achieve the objective of using cloud storage as a glorified drive, without giving Amazon any ability to read customer data? Considering that the AWS Storage Gateway is effectively blackbox software provided by Amazon and accepts remote updates from Amazon, we need to carefully evaluate the threat model and ask what could happen in the event that AWS goes rogue or is compelled by law enforcement to target a specific customer.

[continued]

CP

 

Ethereum mixing with RSA: getting by without zero-knowledge proofs

Old-fashioned and unpopular as RSA may have become, it is a versatile public-key crypto-system. Starting with Bitcoin, cryptocurrencies have shunned RSA in favor of signatures based on elliptic curves, initially ECDSA and later moving towards pairing-based cryptography. Ethereum is the lone exception, having added native RSA support with EIP-198. “Native” being the operative keyword. In principle the smart-contract language used by Ethereum is Turing-complete and can implement any computation. In reality computations are bounded by the amount of gas charged, which creates two types of limitations. The first is a hard limit: the maximum amount of gas that can be consumed by all transactions in one block, no matter how motivated a user may be to get a complex transaction mined. The second is a soft, economic incentive to favor cheaper computations when possible. ECDSA signature verification is artificially “cheap” because it is not implemented as ordinary EVM bytecode. Instead it is a special precompiled contract that can be invoked by anyone, at a deeply discounted price compared to what it would have cost to implement the same complex operation from scratch. EIP-198 brings RSA into this model, although the discount is not quite as deep; on any reasonable hardware architecture RSA signature verification is much faster than ECDSA verification. But the arbitrary gas pricing set by the EVM inexplicably charges more for the former.

Strictly speaking, EIP-198 adds support for modular exponentiation, a useful primitive operation that enables more than just RSA signatures. For example, ElGamal over the integers and other crypto-systems based on discrete logarithms are now in the realm of possibility. More importantly, RSA has useful mathematical properties that are notably absent from ECDSA and enable new scenarios. This post covers one example: trustless mixing of Ethereum.
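
To see why a bare modular-exponentiation primitive is enough for RSA, note that verifying a textbook RSA signature is a single modexp; a minimal sketch, with the padding check that a real contract would also perform elided:

```python
def rsa_verify(sig: int, e: int, n: int, expected: int) -> bool:
    """expected: the padded message hash, encoded as an integer."""
    # This single modular exponentiation is exactly what the MODEXP precompile computes.
    return pow(sig, e, n) == expected
```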

Background on mixers

Mixers are designed to improve privacy by shuffling funds on the blockchain in such a way that sources and destinations are not connected. To take a concrete example: suppose Alice and Bob have 1 ETH each, stored at known addresses. These are the inputs. They would like to shuffle these funds such that each one ends up with 1 ETH again— perhaps minus some transaction fees— but at new addresses which are not linkable to the original ones by anyone else with a full view of the blockchain. These are the outputs. The choice of 1 ETH is arbitrary, but it is important that everyone contributes the same amount, with exactly as many outputs as inputs. Otherwise the quantity itself becomes a signal, deanonymizing the link between inputs and outputs: if Alice contributed a larger input, the larger output also belongs to her.

With two people, the privacy improvement is marginal: since there are only two addresses to begin with, any given output from this process could only have come from one of the two original inputs. The best-case scenario one could achieve in this setting is that any outside observer will have no more than a 50/50 chance of guessing which way the inputs got shuffled. Also Alice and Bob will always know exactly which input corresponds to which output, since each of them controls one of the two. As the saying goes, privacy loves the company of large numbers. Uncertainty in the origin of funds increases with more participants.

But an even more important criterion is soundness. Informally stated: participants will not lose money under any circumstance during the execution of the protocol. Here is a straw-man example that fails on that dimension:

  1. Alice informs Bob of her preferred destination Ethereum address
  2. Bob responds with his choice of address
  3. Alice and Bob flip a coin to decide how to map inputs to outputs
  4. If  the coin lands heads, Bob sends 1ETH to Alice’s address. Otherwise he sends the same amount to his own address
  5. After observing Bob’s transaction confirm on the blockchain, Alice in turn sends 1ETH either to Bob or herself

This protocol has a glaring weakness. If Alice receives 1ETH from Bob after step #4, she can diverge from the script and keep the extra ether for herself. For that matter, Alice may simply be unable to complete the protocol: suppose her hardware wallet failed and she no longer has access to any of her private keys. Nothing guarantees that steps #4 and #5 are done atomically: either both happen or neither does. Once started, the exchange must either run to completion or be rolled back by refunding the money Bob contributed.

While it is possible to repair that flaw using fair-exchange protocols, there is a more fundamental problem with this approach: it assumes that Alice and Bob somehow found each other ahead of time and have a private channel for coordinating their activities off-chain. That does not scale, and it only gets worse when the protocol is generalized to a mixer that can accept hundreds of inputs. All of those actors must coordinate and agree on one permutation of inputs to outputs while minimizing what each person learns— otherwise the mixer is easily defeated by a single rogue participant who infiltrated the group— while guaranteeing that no one will lose funds even if everyone else in the group has conspired against them.

Trusted third-parties as deus ex machina

If we posit the existence of a trusted third-party Trent, trivial solutions emerge:

  1. Alice, Bob, Carol and anyone else interested in mixing funds sends 1ETH to Trent via the Ethereum blockchain
  2. Every participant also sends their preferred output address to Trent. This part must be done privately and cannot be sent as additional metadata along with the ETH contribution in step #1; otherwise the entire world knows the intended shuffle.
  3. Trent sends approximately 1ETH to every participant at their preferred address, minus some commission for facilitating the transaction

This protocol loses both soundness and privacy if Trent goes rogue. After collecting all the funds in step #1, he can simply abscond with them instead of carrying out step #3. Similarly, Trent has full knowledge of the mapping between inputs and outputs. He can disclose this information at any point in the future, voiding any privacy advantage afforded by a service the participants have already paid for.

Contracts for holding trusted third-parties accountable

Ideally Trent is replaced with an immutable smart contract guaranteed to stick to the script. Once Ethereum has support for zero-knowledge proofs via SNARKs, this can be done in a fully autonomous way. But old-fashioned RSA alone is sufficient to enable a middle-ground solution: there is still a trusted third party involved to facilitate the protocol, but they are prevented from stealing funds or deanonymizing the permutation.

Protocol sketch

Trent publishes a new contract with known EVM bytecode and deposits some funds as “earnest money.” This is effectively a way to incentivize Trent to behave properly. If Trent executes the protocol faithfully, he will recover the deposit along with any commissions taken for the service. If he diverges from the protocol, the funds will be distributed to participants. Trent generates a unique RSA key-pair and embeds the corresponding public key in the contract. Once the contract is initialized, execution proceeds in three stages: collection, opening and distribution, each with specific deadlines hard-coded into the contract. In the first stage, the contract collects funds along with messages to be signed by Trent (blinded versions of the preferred output addresses, as explained below).

  1. Every participant interested in using the mixer sends 1ETH to the contract, along with a message to be signed by Trent. The contract records the deposit along with the sending address, but does not consider it finalized until Trent has acknowledged it.
  2. Trent must call a different contract method and provide an RSA signature over the requested message. The contract verifies this signature to determine whether Trent signed correctly and only then considers the contribution final.
  3. If Trent fails to complete that step by some deadline, the participant can request a refund. (The contract could even be designed to penalize Trent for discriminating against participants, by sending them extra funds taken from his initial deposit.)
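
The real mixer would be a Solidity contract calling the EIP-198 precompile, but the collection-stage bookkeeping can be sketched with a toy Python model. The class name, fields and the simplified textbook-RSA check below are assumptions for illustration, not the actual contract:

```python
class MixerCollectionStage:
    """Toy model of the collection stage: deposits stay pending until Trent signs."""

    def __init__(self, trent_pubkey, deadline, deposit_amount=1):
        self.e, self.n = trent_pubkey          # Trent's RSA public key, baked into the contract
        self.deadline = deadline               # block height or timestamp cutoff for acknowledgments
        self.deposit_amount = deposit_amount
        self.pending = {}                      # sender address -> blinded message awaiting signature
        self.finalized = 0                     # number of acknowledged contributions

    def contribute(self, sender, amount, blinded_message):
        assert amount == self.deposit_amount, "everyone must contribute the same amount"
        self.pending[sender] = blinded_message

    def acknowledge(self, sender, signature):
        # Trent supplies an RSA signature over the blinded message; textbook check for illustration.
        message = self.pending[sender]
        assert pow(signature, self.e, self.n) == message % self.n, "invalid signature from Trent"
        del self.pending[sender]
        self.finalized += 1

    def refund(self, sender, now):
        # If Trent never acknowledged the deposit by the deadline, the participant gets their ether back.
        assert now > self.deadline and sender in self.pending
        del self.pending[sender]
        return self.deposit_amount
```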

In the second stage, users reveal their choice of destination address along with the signature obtained from Trent in step #2 above, by invoking another method on the contract. One subtlety related to blockchains: participants must use a different Ethereum address when interacting with the contract in each stage. Otherwise the origin of the transaction itself links the revealed address to the original request for signing.

Blinded by randomness

This is where the unique properties of RSA shine. If users submitted the destination address verbatim in step #1, it would defeat the point of using a mixer— everyone could observe these calls and learn which address each person picked. But RSA enables blind signatures: Trent can sign a message without knowing what message he signed. The crucial property is that the message submitted for signing is related to the real message the participant wants signed, namely the destination address. But that relationship is only known to the participant: as far as everyone watching the blockchain is concerned, the submission is indistinguishable from a random message. Not even the party in possession of the signing key (Trent, in this case) learns the “true” message being signed, nor can he correlate a signed message presented in the future with the original signing request. In effect, users mask their address before submitting it to Trent for signature and then recover the intended signature using properties of RSA. (There is some additional complexity being glossed over: signatures are not computed over the address directly. Instead the address is hashed and suitably padded with a scheme such as PSS. Otherwise RSA allows existential forgery: one can find unlimited message-signature pairs, although these messages will not have any particular meaning as far as corresponding to an Ethereum address.)
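
Here is a minimal Python sketch of the blinding round-trip using textbook RSA, with no hashing or padding and illustrative variable names:

```python
import secrets
from math import gcd

def blind(message: int, e: int, n: int):
    """Participant: mask the message with a random factor r before sending it to Trent."""
    while True:
        r = secrets.randbelow(n - 2) + 2
        if gcd(r, n) == 1:
            break
    blinded = (message * pow(r, e, n)) % n    # m' = m * r^e mod n, looks random to observers
    return blinded, r

def sign(blinded: int, d: int, n: int) -> int:
    """Trent: signs the blinded message without learning the underlying message."""
    return pow(blinded, d, n)                 # s' = (m')^d mod n

def unblind(blinded_sig: int, r: int, n: int) -> int:
    """Participant: strip the blinding factor to recover a signature on the original message."""
    return (blinded_sig * pow(r, -1, n)) % n  # s = s' * r^-1 = m^d mod n

def verify(message: int, signature: int, e: int, n: int) -> bool:
    return pow(signature, e, n) == message % n
```

Because the blinded signature equals (m·r^e)^d = m^d·r (mod n), multiplying by the inverse of r recovers m^d mod n, a valid signature on a message Trent never saw in the clear.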

To avoid waiting on stragglers indefinitely, a deadline is imposed on the second stage. After this deadline is reached, the contract can start actually disbursing funds to the addresses opened in the second stage. There is no concept of a “scheduled task” on the Ethereum blockchain, so the final stage is initiated when any participant— including potentially Trent— calls into the contract to request distribution after the deadline has elapsed. At that point the contract confirms that the timestamp or block height is past the deadline and starts sending ether to previously verified output addresses.

Detecting and punishing dishonest operators

There is one flaw in the protocol as described: Trent can cheat by issuing extra signed messages. Recall that the presence of a valid RSA signature on a message authorizes the disbursal of 1 ETH to the address contained in that message. That turns every valid signature into a check worth 1ETH. While every participant is supposed to receive one signature for every ETH contributed, nothing prevents Trent from issuing signatures over his own addresses and attempting to cash these in.

This is where the initial “earnest money” contributed by Trent comes in, combined with the deliberate delay in releasing funds. Recall that funds are not disbursed immediately when a participant calls into the contract with a valid signature. Instead the contract waits until a predetermined deadline, giving everyone a chance to chime in with their address. As long as every honest participant supplies their address, cheating will become obvious. The contract has a record of exactly how many participants received valid signatures in response to their contribution. If more signatures turn up than participants, it is clear Trent has diverged from the protocol. It could be a deliberate attempt to steal funds, or perhaps Trent’s RSA signing key was compromised. Either way, the contract will respond by “refunding” all ether to the original input addresses, along with a share of the earnest money put up by Trent. In this case the protocol execution has failed: no mixing has occurred. But all participants receive 100% of their original contribution back, along with some compensation for Trent’s incompetence or dishonesty.
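
The retaliation logic boils down to a counting argument. Continuing the toy Python model from earlier, with a few additional illustrative fields (revealed_outputs, earnest_money and so on are assumptions, not names from any real contract):

```python
def send(address, amount):
    """Stub for an on-chain transfer; a real contract would use an EVM value transfer."""
    print(f"transfer {amount} ETH to {address}")

def settle(contract, now):
    """Toy distribution stage: pay out if the signature count adds up, otherwise punish Trent."""
    assert now > contract.reveal_deadline, "distribution cannot begin before the deadline"
    if len(contract.revealed_outputs) > contract.finalized:
        # More valid signatures turned up than acknowledged contributions: Trent, or whoever
        # holds his signing key, forged extras. Refund every input and split the earnest money.
        penalty = contract.earnest_money // len(contract.input_addresses)
        for address in contract.input_addresses:
            send(address, contract.deposit_amount + penalty)
    else:
        # Counts match: carry out the mix by paying each revealed output address,
        # and return the earnest money to Trent.
        for address in contract.revealed_outputs:
            send(address, contract.deposit_amount - contract.commission)
        send(contract.trent_address, contract.earnest_money)
```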

Participants do not have to do anything special to catch and punish Trent for cheating, or monitor the blockchain. As long as they are following the protocol and revealing their own signed address before the deadline, any attempt by Trent to fabricate additional unsolicited signatures will backfire and result in “gifting” money. The only way Trent can get away with cheating is if some participants have failed to reveal their signed address in a timely manner, effectively abandoning their funds. Even in that scenario, Trent would be disincentivized from claiming those funds with forged signatures: he would be taking the risk that missing participants may turn up at the last minute and trigger the retaliation logic.

CP

 

DIY VPN (part II)

[continued from part I]

Economics of a self-hosted VPN

This section considers the overall cost of operating a VPN server. First, we can rule out software acquisition costs. On the server side, the community edition of OpenVPN is freely available as open-source software. (The commercial version, dubbed OpenVPN Access Server, comes with a per-user licensing model. While adding useful fit-and-finish such as a configuration GUI, that version does not change the fundamental scaling or security properties of the system.) On the client side, there are freeware options on all major platforms. One exception is when ADCS is leveraged for the certificate authority. Even there, one can lease hourly access to a private Windows server from major cloud providers, rolling that into operational expenses instead of paying outright for the software. As far as operational expenses are concerned, what follows is a back-of-the-envelope calculation using AWS as the infrastructure provider. While AWS has a free tier for new customers, this analysis assumes that discount has already been exhausted, in the interest of arriving at a sustainable cost basis.

Computation

OpenVPN runs on Linux and is not particularly CPU intensive at the scale envisioned here— supporting a handful of users with 2-3 simultaneous connections at peak. A single t3.nano instance, featuring a single virtual CPU and half a GB of RAM, is perfectly up to the task for that load. Based on current AWS pricing, these cost roughly half a cent per hour when purchased on demand in US regions such as Virginia or Oregon. The operational cost can be reduced by committing to reserved instances: paying upfront for one year takes the price down to 0.30¢/hour while a three-year term yields 0.20¢/hour. Even greater savings and flexibility may be achieved with what AWS terms spot instances: current spot prices for nano instances hover around 0.16-0.20¢ per hour depending on availability zone. However, spot instances come with the risk that the server may be preempted if demand spikes to the point where prices exceed the bid. That level of reliability may still be acceptable for a personal setup, since one can always bid on another spot instance at a higher price. Conservatively using the 0.20¢ estimate, we arrive at EC2 instance costs just shy of $1.50 per month.
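
The arithmetic behind that estimate, using the spot-price figure quoted above (prices drift over time and vary by region):

```python
HOURS_PER_MONTH = 24 * 365 / 12          # roughly 730 hours

spot_price_per_hour = 0.0020             # $0.0020, i.e. 0.20 cents/hour for a t3.nano spot instance
compute_per_month = spot_price_per_hour * HOURS_PER_MONTH

print(f"EC2 compute: ${compute_per_month:.2f}/month")   # about $1.46, just shy of $1.50
```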

Storage

It turns out that the cost of running an instance is not the primary consideration, because of the modest computational requirements of OpenVPN. When using minimal instance types such as t3.nano, disk space becomes a significant contributor to cost. At the time of writing, AWS charges 10¢ per GB-month for entry-level SSD volumes attached to instances. Costs only go up from there for more demanding loads and guaranteed IOPS. (While there are cheaper options such as spinning disks, those can only be used in addition to the existing boot volume and not as a replacement.) That means simply keeping a 15GB disk around for a month exceeds the cost of running the instance for that same period.

Fortunately Linux distributions are also modest in their disk space requirements. Minimum requirements for Ubuntu 18 server (not the desktop edition, which is considerably more resource hungry) weigh in at a svelte 256MB of memory and 1.5GB of disk space. It is not possible to translate that directly into an AWS image: all popular AMIs offered by Amazon come with a minimum of 8GB disk space, more than five times the stated baseline. AWS will not allow launching an EC2 instance based on one of these AMIs with a smaller boot volume.

Bring-your-own-image

There is a workaround: AWS also allows customers to import their own virtual machine images from common virtualization platforms. That permits crafting a properly slimmed-down Linux image locally and converting it into an AMI.

Fair warning: this process is clunky and error-prone. After uploading an image to S3 and initiating the conversion as a batch process, several minutes can elapse before AWS reports a problem (typically having to do with the image format) that takes the user back to square one. It turns out only OVF-format images are accepted. VMware products, including Fusion on OSX, can export to that format using the included ovftool.

With some buffer to accommodate additional software inserted by AWS into the image, log files and future updates, we end up with an Ubuntu 18 server AMI that squeezes into a 3GB SSD volume. That comes out to a recurring storage expense of 30¢ per EC2 instance. There is also the storage associated with the AMI itself, one copy per region. That carries the slightly lower price-tag charged for EBS snapshots and can be amortized across multiple instances launched in the same region. Assuming the worst case of a single server, we arrive at an upper bound of 50¢ per month.
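
The storage arithmetic, with the snapshot figure inferred from the 50¢ upper bound above rather than quoted AWS pricing:

```python
SSD_PRICE_PER_GB_MONTH = 0.10        # entry-level EBS SSD volume, USD per GB-month
boot_volume_gb = 3                   # slimmed-down Ubuntu 18 image from above

volume_cost = boot_volume_gb * SSD_PRICE_PER_GB_MONTH   # $0.30 per instance
snapshot_cost_upper_bound = 0.20                        # AMI snapshot, single-server worst case

print(f"Storage: ~${volume_cost + snapshot_cost_upper_bound:.2f}/month")   # upper bound of 50 cents
```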

Bandwidth

While computation and storage expenses are highly competitive with typical prices charged by commercial VPN services, bandwidth is one area where Amazon is much less favorable for operating a 24/7 VPN. Amazon charges 9¢ per GB for outbound traffic, defined as bits heading out of AWS infrastructure in that region. A VPN server has an unusual bandwidth profile. Most servers receive a small amount of data inbound (such as a request for a web page) and respond with a large amount of data outbound, for example a high-resolution image or video stream. But a VPN server is effectively acting as a proxy that shuttles traffic in both directions. Every incoming request from the client is forwarded to its true destination, becoming outbound traffic. Likewise every inbound response from that destination is sent back to the client as more outbound traffic. So the total traffic volume for metering purposes is closely approximated by the sum of upstream and downstream traffic generated by users connected to the VPN, plus a small overhead introduced by the VPN protocol itself.

The average US household consumes 190GB of bandwidth per month according to the latest iGR report from 2017. While that report does not distinguish upstream from downstream, it is reasonable to assume downloading is responsible for the bulk of this figure. Adjusting by 50% for projected growth and adding another 50GB for mobile devices with their own data plan yields a figure around 300GB of traffic per month. If 100% of that traffic is routed over the VPN hosted at AWS, bandwidth costs alone would exceed all other operational expenses by a factor of ten.
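
Putting the estimates together shows just how lopsided the bill becomes (figures carried over from the sections above):

```python
EGRESS_PRICE_PER_GB = 0.09          # AWS outbound data transfer, USD per GB
monthly_traffic_gb = 300            # estimated household traffic routed through the VPN

bandwidth_per_month = monthly_traffic_gb * EGRESS_PRICE_PER_GB    # $27.00
compute_and_storage = 1.46 + 0.50                                 # EC2 spot instance + EBS, from above

print(f"Bandwidth: ${bandwidth_per_month:.2f}/month")
print(f"Everything else: ${compute_and_storage:.2f}/month")
print(f"Ratio: {bandwidth_per_month / compute_and_storage:.0f}x") # bandwidth dominates by more than 10x
```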

This cost scales linearly with the amount of VPN usage, or equivalently the number of users sharing the VPN. At large enough scale, the same also holds true for EC2 costs, since a single nano instance cannot service an unbounded number of subscribers. But those limits are unlikely to be reached for a system intended for a handful of people, while the impact of additional bandwidth usage will be directly reflected in the monthly bill.

Verdict on self-hosting

AWS turns out to be highly competitive for hosting a VPN, with the caveat that usage must remain moderate. Considering that popular consumer VPN services charge in the neighborhood of $3-$10 per month, a single nano instance can offer a better solution at lower cost, especially when the server is shared by multiple people. Advantages of self-hosting include:

  • Eliminating trust in a third-party service, including random client apps. A VPN provider is in a position to observe traffic metadata, such as websites visited and frequency of visits. For unenlightened websites not using HTTPS, the VPN service can even observe the full traffic exchange. Unscrupulous providers have taken advantage of this, most recently Facebook with its Onavo spyware product masquerading as a VPN. Hosting your own VPN avoids that dependency, although one must still implicitly trust the infrastructure provider. Amazon has visibility into EC2 traffic which can be used to identify activity: “user coming from IP address 1.2.3.4 connected to the VPN server, and that resulted in the VPN server reaching out to websites at address 6.7.8.9.” While one can be confident AWS (unlike Facebook) respects privacy and will not randomly mine those logs to spy on its own customers, it can still be compelled to disclose records by law enforcement.
  • High-availability with the backing of Amazon infrastructure. Hardware, storage and networking failures are low probability events.
  • Geodistributed points of presence, with ability to host VPN servers in the US, Europe, South America and Asia.
  • Large selection of IPv4 and IPv6 addresses from the AWS range, compared to hosting from a residential connection.
  • More importantly, freedom to change IP addresses at will. A new elastic IP can be requested from AWS and assigned to an instance anytime the service operator wants to sever the association with the existing address
  • Unlimited number of devices at no additional cost (within the capacity limits of the EC2 instance)

The self-hosting route also comes with two notable caveats:

  • Overhead of maintaining the service, particularly the PKI necessary for issuing/revoking certificates. OpenVPN does not help matters here: it does not directly implement CRL or OCSP fetching for retrieving revocation information. It requires the presence of a CRL file deployed locally. (But one can get the same effect by crafting a custom script executed server-side for each connection, invoking openssl for proper chain validation that pays attention to CDP or OCSP responder links in the certificate. Alternatively a cron job can periodically retrieve CRLs and update the local file where openvpn expects to find this information; see the sketch after this list.)
  • Steep pricing for bandwidth. This is less an argument against the DIY approach and more a cautionary note about using AWS for hosting, which is the obvious choice along with Azure and GCE. It turns out Azure and GCE are not much different when it comes to egress traffic pricing. More cost-effective plans are available from alternative providers such as Linode and Digital Ocean, featuring upwards of 1 terabyte of egress traffic included in the fixed monthly price of server hosting.
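
As a minimal sketch of the cron-job approach mentioned above: a small Python script, run periodically, that downloads the CA’s CRL and writes it where the OpenVPN crl-verify directive points. The URL and file path below are placeholders to adapt to your own CA and server layout.

```python
#!/usr/bin/env python3
# Periodically refresh the CRL file that openvpn's crl-verify directive points at.
# CRL_URL and CRL_PATH are placeholders for your CA's distribution point and server layout.
import shutil
import tempfile
import urllib.request

CRL_URL = "https://ca.example.com/crl/vpn-ca.crl"   # hypothetical CRL distribution point
CRL_PATH = "/etc/openvpn/crl.pem"                   # path referenced by crl-verify in server.conf

def refresh_crl():
    # Download to a temporary file first so openvpn never sees a partially written CRL.
    with urllib.request.urlopen(CRL_URL, timeout=30) as response, \
         tempfile.NamedTemporaryFile(delete=False) as tmp:
        shutil.copyfileobj(response, tmp)
        tmp_path = tmp.name
    shutil.move(tmp_path, CRL_PATH)

if __name__ == "__main__":
    refresh_crl()
```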

CP