Cloud storage with end-to-end encryption: AWS Storage Gateway (part III)

[continued from part II]

The final post in this series pushes the threat model into tinfoil-hat territory: we posit Amazon going rogue or being compelled to recover private, encrypted data associated with users of the AWS Storage Gateway service.

Data corruption: attacks against integrity

An obvious avenue for Amazon to attack customers is by modifying gateway software to return corrupted data. As noted earlier, full-disk encryption schemes operate under a strict constraint that 1 block of plaintext must encrypt to exactly 1 block of ciphertext. No expansion of data is allowed. This rules out use of integrity checks to detect corruption of ciphertext. Every block returned by the gateway will decrypt to something; that could be the original data stored by the customer— or not. At most FDE can guarantee is that without the encryption key, Amazon can not return a ciphertext crafted to yield some plaintext of their choosing.

What can be achieved in this limited attacker model is highly dependent on content type. Scrambling a few pixels in a picture or ruining the beat for a few seconds in a song is unlikely to ruin anyone’s day. On the other hand, randomly changing figures in a spreadsheet used for financial reporting may have downstream implications— although such files have additional structure that will likely break when a 512-byte sector is perturbed. If remote storage is used for storing executable files, worst-case scenario is arbitrary code execution. Yet achieving that requires a level of control over data corruption that is unlikely with FDE. Replacing a block of code with random x86 instructions will result in a crash with high probability. 

More interesting potential targets are files holding configuration data which influence how software behaves. Again it is possible to come up with contrived examples where even a random change could alter system security level. For example consider a binary format where one field is assumed to hold a boolean flag determining whether a dangerous feature such as macros or debugging are enabled. A value of zero indicates off, any other value stands for on. With very high probability any change to ciphertext will flip this flag from “off” to “on,” enabling a potentially exploitable feature. (Granted exploiting that is still non-trivial: attacker would have to know the exact location of the file on disk along with offset of that critical field inside the file. Recall that the concept of a “file” does not exist at iSCSI layer: cloud storage provider only observes requests for reading and writing blocks containing encrypted data. Metadata including file names is not visible, although some properties such as existence of a file spanning particular blocks could be inferred from access patterns.)

Lateral movement: attacks against customer systems

A more promising attack strategy is Amazon pushing out a malicious update to the gateway VM that seeks to escalate privileges outside the gateway itself, taking over other systems in the customer environment. Objective: discover the secret key used for full-disk encryption by compromising the system that applies FDE over the raw block device. That attack vector exists even if a TPM or smart-card is used for key management; at the end of the day there is a symmetric key released to the FDE implementation, regardless of whether that secret was originally wrapped by a password typed by the user or cryptographic hardware.

Considering the gateway VM as a malicious system seeking lateral movement, we find that it has few resource available. These VMs require very little integration with the surrounding environment: they do not need to be joined to an Active Directory, they do not SSH into any other customer system or access resources from a file share. In fact it can be isolated from the environment with the exception of inbound ports for iSCSI ports and external communication to AWS. Focusing on inbound connections, only a handful of other customers systems interact with the VM and specifically over iSCSI interface. It would take a remote-code execution bug in the iSCSI initiator to allow lateral movement, by having the iSCSI target return iSCSI responses to the vulnerable initiator.

A more likely target is the physical machine hosting the gateway VM. While not common, vulnerabilities are periodically discovered in hypervisors that would permit guest VMs to “break out” and take over the host. (It is also a safe assumption that Amazon could conduct reconnaissance to identify which hypervisor their customer is running and choose an appropriate exploit, not wasting a VMware 0-day on a customer running Hyper-V) This risk is much higher for the simple deployment model, where gateway VM is colocated with the second machine (virtual or physical) that implements full-disk encryption. In that scenario, a full VM escape may not even be required: micro-architectural side channel attacks could permit AWS to recover an AES key or plaintext lying around in memory from computations performed by the colocated FDE implementation. Hosting the VM on dedicated hardware physically separated from the machine that implements full disk encryption mitigates this risk.

One deterring factor against these tactics— in addition to their long odds of success—is they would be very noisy. Deploying a malicious update or serving a tampered iSCSI volume to the gateway VM leaves trails on local disk. Amazon can not rule out the possibility that the customer is making periodic backups of their VM or copying remote files to another local device outside Amazon control. That would result in the customer having a permanent record of an attempted attack for future forensics.

Economics: cost of privacy

To summarize: AWS Storage Gateway provides a pragmatic model for using cloud-storage services for private storage— as a glorified drive storing encrypted bits, with no way for the service provider to decrypt those bits. This is in marked contrast from the standard model of cloud storage where providers have full visibility into customer data, and either promise to not to peek or only peek for specific business objectives such as advertising. As with all DIY projects, there are tradeoffs to achieving that level of privacy. From a cost perspective, there is no charge for using a gateway appliance per se. instead AWS charges standard S3 storage rates for the space used, along with standard EC2 rates for outbound data transfer from AWS cloud to the gateway. Inbound bandwidth from the gateway appliance to the cloud for backing up modified data is free, but there is a charge per-GB for new data written to the volume. To pick one data point, keeping 10TB of private storage of which 10% gets modified every month would cost about two to three times more compared to a standard consumer offering such as Google Drive, depending on AWS region since Amazon prices vary geographically. That does not include the cost for hardware and operational expenses for running the gateway appliance itself. That incremental expense may vary from nearly zero— when adding one more VM to an existing load that already runs 24/7— to significant if new hardware must be acquired for hosting the gateway appliance.

Unlike other DIY projects such as hosting your own VPN service where AWS is less cost-effective than alternative hosting services, in this scenario AWS fares much better competitively. Cost of self-hosted VPN is largely dominated by bandwidth, where the AWS fee structure for charging by GB quickly loses against flat pricing models. But when it comes to using the cloud as a glorified disk drive storing bits, the specialized nature of AWS Storage Gateway has an edge. Replicating the same solution by attaching large SSD storage to a leased virtual server would neither cost effective or achieve a comparable level of redundancy as leveraging AWS.


Cloud storage with end-to-end encryption: AWS Storage Gateway (part II)

[continued from part I]

Greater flexibility for cloud storage can be achieved by deploying both the iSCSI target and the iSCSI initiator on one dedicated device on the local network. This could be a small form-factor PC running a hypervisor or even dedicated NAS device with virtualization support. The initiator would run a standard Windows or Linux image, encapsulating the remote storage and presenting a simpler interface such as network file share. For example, in the case of MSFT environments the VM could represent the iSCSI volume as an ordinary SMB share that other devices on the network can remotely mount and use. In this case read/write access to the underlying device is managed by the initiator VM. That second VM becomes responsible for handling concurrent access, specifically to avoid parallel writes from different clients clobbering the file-system. It would also present additional access controls based on Windows authentication at the level of individual files. (“File share” is a misnomer on Windows since the unit of storage shared is a directory or even entire drive.) This allows further compartmentalizing data: for example the media-player can be granted access to the directory containing the MP3 collection but not an adjacent directory that has contains backups of financial information. It’s worth pointing out that all of these abstractions— including that of individual “files” with permissions— only exists on the initiator VM. As far as the gateway and AWS is concerned, there is just one flat volume containing seemingly random bits; one of properties of an “ideal” cipher is that its output is indistinguishable from random data.

This model allows multiple devices to use cloud storage concurrently with one caveat: they must have network access to the device hosting the initiator/target VM combination. That still falls short of the “access your files from anywhere” goal already achieved by popular cloud storage services. For example if the VMs are hosted on a home network, those files remain inaccessible when the user is traveling, unless they have the capability to VPN into their home network. This restriction is less significant in an enterprise scenario since large enterprises typically have an internal network perimeter along with VPN services for employees to connect. (It may even be considered a “feature” that inadvertently implements a control popular with IT auditors: corporate data is only accessible from trusted networks.) There are two workarounds, both kludgy:

  • Mount cloud storage as read-only from a local gateway VM that can roam with the user— this is more tricky than it sounds. Depending on filesystem, the notion of whether a volume is “read-only” is itself part of the filesystem.
  • Create multiple iSCSI targets in AWS Storage Gateway, each one designed for exclusive access from a specific environment such as home network or specific laptop.

Privacy: what Amazon sees

Now we turn to the question of what the cloud storage provider can learn about the data owned by the customer in this setting. It is helpful to separate this into two distinct threat models:

  1. Passive observation or “honest-but-curious” adversary. Amazon behaves as promised and provides the storage service without going out of its way to undermine customer privacy. They have full visibility into what is going on in the cloud but not on the gateway VM. For example they can not pull system logs from the gateway or observe raw iSCSI traffic received.
  2. Active attacker: in this model anything goes— for example Amazon can send malicious updates to the gateway and tamper with contents after storage, returning different blocks than what the customer wrote. This will be taken up in the section on security.

Let’s start with the first problem. First we observe that any data sent to the cloud has been previously encrypted using a sound full-disk encryption scheme such as LUKS or Bitlocker. As long as this is strictly true, the bits stored at AWS provide no meaningful information about the original contents. There are two important qualifications to this.

First the volume must have been encrypted before any use or if encryption is added after-the-fact, it must have been applied to the entire volume. Otherwise the scheme risks exposing fragments of data leftover from previous, unencrypted content. For example Bitlocker will not encrypt unused space on disk by default, assuming that it is blank. That assumption does not hold if the disk has been in use— in that case “free space” on disk could actually hold leftover, unencrypted data from previously deleted files that is now exposed to cloud provider.

Conversely, when disk encryption is applied from the start but under the more generous assumption that unused space need not be encrypted, the “encrypted” disk image will contain a large number of all zero blocks. This allows the storage provider to observe exactly how much of the disk is being used at any given time. It also allows the provider to make inferences on new storage: for example if a new 1MB file is added to the filesystem, zero blocks totaling approximately that much space will be overwritten with random data. (Also note that changing this has an effect on the bottom line: ASG only charges for “used” space in the remote volume. Encrypting the entire disk including empty space will result in paying for total capacity out of the gate, instead of only the fraction in use.)

The previous observation about observing write-access in the cloud applies more broadly than merely distinguishing used vs unused space. Recall that gateway VMs synchronize writes to the cloud. Corollary: Amazon can observe which blocks on disk are being modified. While they have no idea what the contents of those blocks are, they could soon build a list of which blocks change more frequently and if there are specific access patterns such as block #5 always being written immediately after block #8. This alone may allow fingerprinting of remote operating system and applications; filesystems are constantly busy doing book-keeping on file meta-data even when clients are not explicitly writing. (For example, the last-access time of a file gets modified whenever the file is read.) Similarly background tasks such as those responsible for indexing disk contents for full-text search can leave tell-tale signs in their access pattern. Note that filesystem type is not hidden at all: both LUKS and Bitlocker use a well-defined header structure that would be visible to the cloud provider.

By contrast read operations are not visible to the cloud, being served directly from the local gateway VM. Amazon does not learn which blocks are requested since those requests are not propagated up to the cloud.

One bit of good news is that full disk encryption in this situation really means full-disk: unlike the case of encrypting boot volumes, here 100% of storage can be encrypted. By contrast a volume that contains the operating system must have some unencrypted partitions to make the boot sequence possible— including one containing the FDE implementation that can decrypt the rest of the operating system.

Security: what Amazon could see

The picture gets more complicated once we expand the threat model to take into account active attacks from AWS going rogue. In particular, all gateway software is controlled by Amazon and supports remote updates. Under normal circumstances they are considerate enough to give notice and ask for permission before applying updates:

Screen Shot 2019-07-20 at 10.12.33 AM.png

AWS Console makes it clear the company can remotely force update gateway VMs

But this is merely a courtesy; a conservative assumption is that the company can remotely install malicious software on any gateway anytime. This is not because of the restricted shell given by default— since the gateway is distributed as a VM, it is trivial to modify its virtual disk image and obtain root on the box for looking around:


AWS Storage Gateway uses a Java application and some glues scripts to present iSCSI interface to local clients while synchronizing contents to S3

A determined customer could install additional software or change the configuration of the VM, and possibly disable remote updates. But the problem is without fully reverse-engineering every piece of software on the gateway, it is still not possible to rule out the possibility of a backdoor inserted by AWS. For this reason, we will model the gateway VM as a blackbox controlled by the adversary.

In that threat model denial-of-service can not be prevented: adversary can always brick the gateway or decline to serve any data. But this is not any different from AWS carrying out the same attack in the cloud. Nothing prevents Amazon from deleting customer data from S3 or holding it hostage, other than the fear of PR reprisals or future impact on lost revenue.

Access patterns: ORAM threat model

With control of the gateway, Amazon can also observe read operations, in addition to observing writes from the vantage point of the cloud. This begins to look a lot like the oblivious RAM (ORAM) threat model in theoretical cryptography: an adversary observes all memory accesses made by a program and attempts to deduce information about the program— such as secret inputs— from that pattern alone. Quantifying such risks in the context of remote storage is difficult. It is common for volatile memory access to be dependent on secrets. For example, bits of an RSA private key are often used to index into a lookup table in memory, resulting in different access patterns based on the key. That property is directly exploitable for side-channel attacks. It is much less common to use persistent storage that way. Nevertheless there is information contained in the sequence of blocks read or written that the gateway learns. For example, it can highlight “hot spots” on disk containing frequently accessed contents. It may even be possible link an iSCSI volume to a particular service by observing a correlation between requests. Contrived example: suppose a customer is running a web-server backed by the iSCSI volume on a gateway. Sending HTTP requests to the server to download different files and observing resulting iSCSI requests to the gateway VM could build a convincing case that this site is serving up content straight out of the gateway.




Revocation is hard, the code-signing edition (part I)

Issuance vs revocation

Certificate revocation remains the Achilles heel of public-key infrastructure systems. The promise of public-key cryptography was stand-alone credentials that can be used in a peer-to-peer manner: a trusted issuer provisions the credential and then steps aside. They are not involved each time the credential is used to prove your identity. Contrast that with prevalent models on the web today such “login with Facebook” or “login with Google.” These are glorified variants of an enterprise authentication system called Kerberos that dates from the 1980s. Facebook or Google sit in the middle of every login, mediating the transaction. In the PKI model, they would be replaced by a trusted third-party, known as the “certificate authority,” issuing long-lived credentials to users. Armed with that certificate and corresponding private key, the person could now prove their identity anywhere without having to involve the CA again. This is the model for proving identities for websites: the TLS protocol, that once ubiquitous lock-icon on the address bar or green highlight around the name of the website, is predicated on the owner of that site having obtained a digital certificate from a trusted CA.

Revocation throws a wrench in that simple model. Credentials can expire but they can also be revoked: the website could lose their private-key or one of the assertions made in the certificate (“this company controls the domain”) may stop being true at any point. There are two main approaches to revocation not counting a home-brew design used by Google in Chrome, presumably because the relevant standards were not invented in Mountain View. The first one is Certificate Revocation Lists or CRLs: the certificate authority periodically creates a digitally signed list of revoked certificates. That blacklist is made available at a URL that is itself encoded in the certificate, in a field appropriately called the CDP, for “CRL Distribution Point.” Verifiers download CRLs periodically— each CRL defines its lifetime, to help with caching and avoid unnecessary polling— and check to see if the serial number of the certificate they are currently inspecting appears in the blacklist. There are additional optimizations such as delta-CRLs that can be used to reduce the amount of data transferred when dealing with large-scale PKI deployments where millions of certificates could have a revoked status at any time.

If CRLs are the batch mode for checking revocation, Online Certificate Status Protocol is the interactive version. OCSP defines an HTTP-based protocol for verifiers to query a service affiliated with the CA to ask about the current status of a certificate. Again the location of that service is embedded in the certificate in the “Authority Information Access” field. Interestingly OCSP responses are also signed and cacheable with a well-defined lifetime, which allows optimizations such as OCSP stapling.

Taken in isolation, the cost of doing a revocation check may seem daunting: web-browsers try to shave milliseconds from the time it takes to render a web page or start streaming that video. Imagine having to pause everything while a CRL is downloaded in the background or OCSP status queried. But recall that this cost is amortized over thousands of certificates: one CRL covers not just one website but all websites with a certificate from that CA. Considering that a handful of public certificate authorities account for the majority of TLS certificates, not having a fresh CRL already present would be the exceptional scenario, not the norm. More importantly the retrieval of this information need not be done in real-time: for example Windows has sophisticated heuristics to download CRLs for frequently used CAs in the background. It can even prefetch OCSP responses for frequently visited individual sites, based on a predictive model that those sites are likely to be visited again.

Weakest link in the (certificate) chain

There is one aspect of PKI operation that no amount of client-side optimizations can compensate for: CA incompetence. If CAs are not promptly acting on reports of lost credentials, failing to maintain CRLs correctly or experiencing frequent outages with their OCSP responders, relying-parties will at best remain in the dark about the status of certificates— having to make a difficult decision on what to do if revocation status can not be determined— or worse, reach the wrong conclusion and make trust decisions based on a bogus credential. Cases of CA ineptitude in the case of TLS certificates are legion, and need not bear repeating here other than to point out an economic truth: it would be astonishing to expect a different outcome. All CAs trusted by popular web-browsers are effectively on equal standing; there is nothing distinguishing a certificate issued by a competent CA from one that constantly screws up. In both cases the websites get their all-important lock icon or avoid being publicly shamed by Google Chrome. (Extended validation certificates were introduced a decade ago with better vetting requirements, but that only had the effect of creating two tiers. All EV-qualified issuers still stand on equal ground in terms of trust afforded by browsers.) That means issuers can only compete on price. There is zero incentive for CAs to compete on operational competence, beyond the minimal baseline required by the CA/Browser Forum to avoid getting kicked out. It turns out even that modest baseline is too high a bar: Symantec managed to get kicked out of both Mozilla and Google roots.

Operational errors by certificate authorities have been well documented in the context of TLS certificates. From relatively “minor” infractions such as issuing certificates to unauthorized parties pretending to be affiliated with major websites to epic failures such as accidentally signing subordinate certificates—effectively delegating ability to issue any certificate to the bad guys— public PKI has been veritable zoo of incompetence. It turns out that code signing and revocation create even more novel ways for CAs to stumble.

Same mathematics, different purposes

A public-key algorithm such as RSA can be used for different purposes such as encryption, authentication or digital signatures. The underlying computations are same: whether we are using RSA to protect email messages in transit or prove our identity to a remote website, there is a secret value that is used according to the same mathematical function. While  there are superficial differences in formatting and padding, there is no reason to expect a fundamental difference in risk profile. Yet each of these scenarios have different consequences associated with loss or compromise of keys. (To be clear,  “loss” here is used to mean a key becoming permanently unusable, both for the legitimate owner and any potential attacker. By contrast “key compromise” implies the adversary managed to get their hands on the key.)

Consider the case of a key used for authentication. In a realistic enterprise scenario, this could be a private key residing on a smart-card used for logging into Activity Directory. If that key is rendered unusable for some reason— let’s say the card inadvertently went through a cycle in the washing machine— the user will be temporarily inconvenienced. Until they replace the credential, they can not access certain resources. Granted, that inconvenience could be significant. Depending on group policy, they may not even be able to login to their own machine if smart-card logon is mandatory. But as soon as the IT department issues a new card with a new set of credentials, access is restored. The important point is that the new card will have equivalent credentials but different key-pair. Previous key was permanently bound to the malfunctioning hardware. It can not be resurrected. Instead a new key-pair is generated on the replacement card and a corresponding new X509 certificate is issued, containing the new public key. As far as the user is concerned, the new key is just as good as the previous one: there is nothing they are prevented from doing that used to be possible with the previous key material.

It would be an entirely different story if that same smart-card was used for email encryption, for example S/MIME which is popular in enterprise deployments. While we can reissue this employee a new card, the replacement can not decrypt email encrypted to the previous key. If there was no backup or key-escrow in place, there has been permanent loss of functionality: past encrypted correspondence has become unreadable. That goes beyond a “temporary inconvenience” IT department can resolve. This is one example of a more general observation: loss or compromise of a key can have different consequences depending on the high-level scenario it is used for, even when all of the cryptographic operations are otherwise identical.

Loss vs compromise

At first glance, the signature scenario looks similar to authentication as far as the consequences from key loss are concerned. If a signing key becomes unusable, simply generate a new one and start signing new messages with that replacement key. This is indeed the approach taken in systems such as PGP: users still have distribute the new key, perhaps by publishing it on a key-server. There is no need to go and re-sign previous messages: they were signed with a valid key, and recipients can continue to validate them using the previous key. (Note we are assuming that recipients are flexible about choice of signing key. There are other systems— notably cryptocurrencies such as Bitcoin— where control over funds is asserted by signing messages with one particular key which can not be changed. In that model, loss of the private key has serious repercussions, directly translating to dollars.)

The picture gets murky once again when considering compromise of the key, defined as adversaries coming into possession of the secret. For encryption keys, the results are straightforward: attackers can read all past messages encrypted in that key. The best-case scenario one can hope for is publicizing the replacement quickly, to prevent any more messages from being encrypted to the compromised key. Not much can be done to save past messages, beyond purging local copies. If an attacker can get hold of those ciphertexts by any avenue, they can decrypt them. There is no way to “revoke” all existing copies of previously encrypted ciphertexts. In fact the adversary may already have intercepted and stored those ciphertexts, optimistically betting on future capability for obtaining the decryption key.

Consequences from the compromise of a signing key are more subtle, because time enters into the picture as a variable. Authentication commonly occurs in real time and once, with a single known counter-party attempting to verify your identity. Signatures are created once and then verified an unbounded number of times by an open-ended collection of participants indefinitely into the future. That creates two complications when responding to a key compromise:

  • Future messages signed by the attacker using the compromised key must be invalidated, such that recipients are not accidentally tricked into accepting them
  • Yet past messages made by the authorized signer when they still had full control over the private-key must continue to validate properly.

To appreciate why that second property is critical, consider code-signing: software publish digitally sign the applications they distribute, to prevent malware authors from tricking users into installing bogus, back-doored versions imitating the original. Suppose a publisher discovers that their signing key has been compromised and used to sign malware— a very common scenario that figured prominently in high-profile campaigns such as Stuxnet. The publisher can quickly race to revoke their signing certificate. Client properly implementing revocation checks will then be protected agains the signed malware, rejecting the signature as invalid. But what about legitimate pieces of software that distributed by the same publisher? Indiscriminate revocation would result in all of those apps also becoming untrusted, requiring the publisher to not only re-sign every app they ever shipped but also guarantee that every single copy existing anywhere a customer may download apps from is replaced by the updated version.



Cloud storage with end-to-end encryption: AWS Storage Gateway (part I)

Cloud storage sans surveillance-capitalism

This post picks up on a series of experiments from 2013, originally written in the aftermath of the Snowden disclosures. These experiments started with one question: how feasible is it to use cloud storage services as a glorified remote drive, without giving the service provider any visibility into data stored? This approach stands in stark contrast to how most cloud providers would prefer their services to be used. Call it the Surveillance Capitalism approach to storage: Google Drive, Dropbox and Microsoft One Drive all operate in terms of individual files. While each provider may proudly tout their encryption-in-transit and encryption-at-rest to protect those files as they bound around the internet, they all gloss over one inconvenient detail: the provider has access to the contents of that file. In fact the whole business model is predicated on being able to “add value” to contents. For example if it is a PDF, index the text and allow searching by keywords across your entire collection of documents. If it is a photograph, analyze and automatically tag the image with names using facial recognition. For the most part, all of these applications require access to the cleartext content. While there is a nascent research field for working with encrypted data—where the service provider only has access to encrypted contents but can not recover the original plaintext— these applications are largely confined to a research setting. “Trust us,” the standard Silicon Valley bargain goes: “we need access to your data so we can provide valuable services at zero (perceived, upfront) cost to you.

This approach underserves a vocal segment of consumers who are uncomfortable with that trade-off, who would gladly dispense with these so-called value adds or pay a premium in exchange for better privacy. Specialized services cropped up in the aftermath of Snowden revelations catering to that segment, promising to provide encryption-at-rest with keys only held by the customer— Bring Your Own Keys or BYOK model. Yet each offering was accompanied by the baggage of home-brew design and proprietary clients required to access data protected according to that model. This made integration tricky, because protecting remote data looked nothing like protecting local data. Each platform already has a de facto standard for encrypting local disk drives: Bitlocker for Windows, LUKS for Linux and Filevault on OSX.  Their prevalence lead many individuals and organizations to adopt key management strategies tailored to that specific standard, designed to achieve desired security and reliability level. For example an organization may want encryption keys rooted in hardware such as TPM while also requiring some recovery option in case that TPM gets bricked. Proprietary designs for encrypting remote storage are unlikely to fit into that framework or achieve the same level of security assurance.

AWS Storage Gateway

AWS Storage Gateway product is hardly new. Part of the expanding family of Amazon Web Services features, it was first introduced in 2012. Very little has changed in the way of high-level functionality— this blog post could have been published seven years ago. While AWS also provides file-oriented storage options such as S3 and Glacier, ASG operates on a different model: it provides an iSCSI interface, presenting the abstraction of a block device. An iSCSI volume is accessed the same way a local solid-state or spinning drive would be addressed in terms of chunks of storage: “fetch the contents of block #2” or “write these bits to block #5”). One corollary is that existing operating system features that work on local disks also work on remote iSCSI volumes— modulo minor caveats. In particular full disk encryption schemes can be used to encrypt them in the same way they can encrypt local drives— an earlier blogpost walked through the example of Windows Bitlocker-To-Go encrypting an iSCSI volume using smart-cards.

The “gateway” itself is either a dedicated hardware appliance available for sale or virtual machine that can be hosted on any common virtualization platform. Management of the appliances is split between the AWS Management Console and a restricted shell running on the appliance itself.

Screen Shot 2019-07-15 at 7.27.15 AM.png
Gateway VM on VMware Fusion, logged into the restricted shell for configuration

iSCSI beyond the local neighborhood

ASG represents more than a shift from one protocol to another more convenient protocol. After all, no one needed help from Amazon to leverage iSCSI; it is commodity technology dating back two decades. It does not even require specialized storage hardware. Windows Server 2012 can act as a virtual iSCSI target, providing any number of volumes with specific size that can be accessed remotely. So why not launch a few Windows boxes in the cloud— perhaps at AWS even— create iSCSI volumes and call it a day?

The short answer is iSCSI is not designed to operate over untrusted networks. It provides relatively weak, password-based initial authentication and more importantly, provides no security on the communication link. The lack of confidentiality is not necessarily a problem when one assumes data itself is already encrypted, but lack of integrity is a deal breaker: it means an adversary can modify bits on the wire, resulting in data corruption during reads or writes. Granted, sound full-disk encryption (FDE) schemes seek to prevent attackers from making controlled changes to data. Corrupted blocks will likely decrypt to junk instead of a malicious payload. But this is hardly consolation for customers who lose valuable data. For this reason iSCSI is a better fit inside trusted local networks, such as one spanning a datacenter. On the other hand, if servers providing those iSCSI targets are inside the datacenter, one has not achieved true “cloud storage” in the sense of having remote backups— now those serves have to be backed up some place else in the cloud, outside the datacenter.

ASG provides a way out of that conundrum. On the front end, it presents a traditional iSCSI target that other devices on the network can mount and access. While that part is not novel, the crucial piece happens behind the scenes:

  • ASG synchronizes contents with AWS S3. That channel does not use iSCSI; otherwise tit would be turtles all the way down. Instead Amazon has authored a custom Java application that communicates with Amazon using an HTTP-based transport protected by TLS.
  • ASG also has intelligent buffering to synchronize write operations in the background, based on available bandwidth. To be clear, ASG maintains a full copy of the entire disk. It is not a caching optimization designed to keep a small slice of contents on frequency of access. All data is locally present for read operations. But writes must propagate to the cloud and this is where local, persistent buffering provides a performance boost by not having to block on slow and unreliable network connections to the cloud. If the VM crashes before incoming writes are synchronized to the cloud, it can pick up and continue after the VM restarts.

Encrypted personal cloud storage


Here is a simple model for deploying AWS Storage Gateway to provide end-to-end encrypted personal storage on one device. This example assumes Windows with a virtualization platform such as Hyper-V or VMware Workstation:

  • AWS Storage Gateway will run as a guest VM. While ASG is very much an enterprise technology focused on high-end servers, its resource requirements are manageable for moderate desktops and high-end laptops. iSCSI is not particularly CPU intensive but AWS calls for ~8GB memory allocated to the VM, although the service will run with a little less. It is however more demanding of storage: buffers alone require half terabyte of disk space even when the underlying iSCSI volume itself is only a handful of GB. AWS software will complain and refuse to start until this space is allocated. Luckily most virtualization platforms support dynamically resized virtual disk images. The resulting image has an upper bound on storage, but only takes up as much space as required to hold current contents. This makes it possible to appease the gateway software without dedicating massive amounts of space upfront.
  • Configure networking to allow inbound connections to the VM over the internal network shared between host & guests. The VM must still have a virtual adapter connected to an external network, since it needs to communicate with AWS. But it will not be allowed to accept inbound iSCSI connections from that interface.
  • Use the Windows iSCSI initiator to access the iSCSI target over the local virtual network shared between hosts & guests
  • After the disk is mounted, create an NTFS-formatted volume and configure Bitlocker disk encryption as usual. Windows Disk Manager utility treats the AGS volume as ordinary local disk. In fact the only hint that it is not a vanilla hard drive is contained in the self-reported device name from the gateway software.

Windows Disk Management view of an iSCSI volume hosted by AWS Storage Gateway

This model works for personal storage but poses some usability problems. In particular it requires a local VM on every device requiring access to cloud storage and does not allow concurrent access from multiple devices. (Mounting the same iSCSI target from multiple initiators in read/write mode is an invitation to data corruption.) The next post will consider a slightly more flexible architecture for accessing cloud data from multiple devices. More importantly we circle back to the original question around privacy: does this design achieve the objective of using cloud storage as a glorified drive, without giving Amazon any ability to read customer data? Considering that the AWS Storage Gateway is effectively blackbox software provided by Amazon and accepts remote updates from Amazon, we need to carefully evaluate the threat model and ask what could happen in the event that AWS goes rogue or is compelled by law enforcement to target a specific customer.




Ethereum mixing with RSA: getting by without zero-knowledge proofs

Old-fashioned and unpopular as RSA may have become, it is a versatile public-key crypto-system. Starting with Bitcoin, cryptocurrencies have shunned RSA in favor of  signatures based on elliptic curves, initially ECDSA and later moving towards pairing-based cryptography. Ethereum is the lone exception, having added native RSA support with EIP-198. “Native” being the operative keyword. In principle the smart-contract language used by Ethereum is Turing-complete and can implement any computation. In reality computations are bounded by the amount of gas charged which creates two types of limitations. First is a hard-limit by the maximum amount of gas that can be consumed by all transactions in one block, no matter how motivated a user may be to get a complex transaction mined. The second one is a soft, economical incentive to favor cheaper computations when possible. ECDSA signature verification is artificially “cheap” because it is not implemented as ordinary EVM bytecode. Instead it is a special external contract that can be invoked by anyone, at a deeply discounted price compared to what it would have cost to implement the same complex operation from scratch. EIP-198 brings RSA into this model, although the discount is not quite as deep; on any reasonable hardware architecture RSA signature verification is much faster than ECDSA verification. But the arbitrary gas pricing set by EVM inexplicably charges more for the former.

Strictly speaking EIP-198 adds support for modular exponentiation, which is a useful primitive operation that enables more than only RSA signatures. For example El Gamal over the integers and other crypto-systems based on discrete-logarithms are now in the realm of possibility. More importantly, RSA has useful mathematical properties that are notably absent from ECDSA and enable new scenarios. This post covers one example: trustless mixing of Ethereum.

Background on mixers

Mixers are designed to improve privacy by shuffling funds on the blockchain in such a way that sources and destinations are not connected. To take a concrete example: suppose Alice and Bob have 1 ETH each, stored at known addresses. These are the inputs. They would like to shuffle these funds such that each one ends up with 1ETH again— perhaps minus some transaction fees— but at new addresses which are not linkable to original ones by anyone else with full view of the blockchain. These are the outputs. The choice of 1 ETH is arbitrary but it is important that everyone contributes the same amount when there are exactly as many inputs as outputs. Otherwise the quantity itself becomes a signal, deanonymizing the link between inputs and outputs. If Alice contributed a larger input, the larger output also belongs to her.

With two people, the privacy improvement is marginal: since there are only two addresses to begin with, any given output from this process could only have come from exactly one of two original inputs. The best case scenario one could achieve in this setting is that any outside observer will have no more than 50/50 chance of guessing which way the inputs got shuffled. Also Alice and Bob will always know exactly which input corresponds to which output since they control the other one. As the saying goes, privacy loves the company of large numbers.  Uncertainty in the origin of funds increases with more participants.

But an even more important criteria is soundness. Informally stated, participants will not lose money under any circumstance during the execution of the protocol. Here is a straw-man example that fails on that dimension:

  1. Alice informs Bob her preferred destination Ethereum address
  2. Bob responds with his choice of address
  3. Alice and Bob flip a coin to decide how to map inputs to outputs
  4. If  the coin lands heads, Bob sends 1ETH to Alice’s address. Otherwise he sends the same amount to his own address
  5. After observing Bob’s transaction confirm on the blockchain, Alice in turn sends 1ETH either to Bob or herself

This protocol has a glaring weakness. If Alice receives 1ETH from Bob after step #4, she can diverge from the script and keep the extra ether for herself. For that matter, Alice may be unable to complete the protocol: suppose her hardware wallet failed and she no longer has access to any of her private keys. There is nothing to guarantee that steps #4 and #5 are done atomically: either both happen or neither happen. Once started, it must be either run to completion or rolled-back by refunding the money Bob contributed.

While it is possible to repair that flaw using fair-exchange protocols, there is a more fundamental problem with this approach: it assumes that Alice and Bob somehow found each other ahead of time and they have a private channel for coordinating their activities off-chain. This is not a scalable solution, especially when the protocol is generalized to support more than two participants. That only gets worse with scaling beyond two users to design a mixer that can accept hundreds of inputs. All of those actors must coordinate and agree on one permutation of inputs to outputs while minimizing what each person learns— otherwise the mixer is easily defeated by a single rogue participant who infiltrated the group— while guaranteeing that no one will lose funds even if everyone else in the group has conspired against them.

Trusted third-parties as deus ex machina

If we posit the existence of a trusted third-party Trent, trivial solutions emerge:

  1. Alice, Bob, Carol and anyone else interested in mixing funds, sends 1ETH to Trent via Ethereum blockchain
  2. Every participant also sends their preferred output address to Trent. This part must be done privately and can not be additional metadata sent along with ETH contribution in step #1; otherwise the entire world knows the intended shuffle.
  3. Trent sends every participant approximately 1ETH to their preferred address, minus some commission for facilitating this transaction

This protocol loses both soundness and privacy if Trent goes rogue. After collecting all the funds in step #3, he can abscond with them. Similarly Trent has full knowledge of the mapping between inputs and outputs. He can disclose this information at any point in the future, voiding any privacy advantages afforded by a service the participants have already paid for.

Contracts for holding trusted third-parties accountable

Ideally Trent is replaced with an immutable smart contract guaranteed to stick to the script. Once Ethereum has support for zero-knowledge proofs via SNARKs, this can be done in an autonomous way. But old-fashioned RSA alone is sufficient to enable a middle-ground solution: there is still a trusted third-party involved  to facilitate the protocol but they are prevented from stealing funds or deanonymizing the permutation.

Protocol sketch

Trent publishes a new contract with known EVM bytecode and deposits some funds as “earnest money.” This is effectively a way to incentivize Trent into behaving properly. If Trent executes the protocol faithfully, he will recover the deposit along with any commissions taken for the service. If he diverges from the protocol, the funds will be distributed to participants. Trent generates a unique RSA key-pair and embeds the corresponding public-key in the contract.  Once the contract is initialized, execution operates in three stages: collection, opening and distribution, each with specific deadlines that are hard-coded into the contract. In the first stage, the contract collects funds along with preferred output addresses from participants.

  1. Every participant interested in using the mixer sends 1ETH to the contract, along with a message to be signed by Trent. The contract records the deposit along with the sending address, but does not consider it finalized until Trent has acknowledged it.
  2. Trent must call a different contract method and provide an RSA signature over the requested message. The contract can verify this RSA signature to determine whether Trent signed correctly and only then consider the contribution final.
  3. If Trent fails to complete that step after some deadline, the participant can request a refund (The contract could even be designed to penalize Trent for discriminating against participants, by sending extra funds taken from the initial deposit.)

In the second stage, users reveal their choice of destination address along with a signature obtained from Trent in step #2 above. This is done by invoking another method on the contract to provide the address along with a signature from Trent by calling the contract. One subtlety related to blockchains: participants must use a different Ethereum address when interacting with the contract in each stage. Otherwise the origin of the TX itself allows linking the revealed address to the original request for signing.

Blinded by randomness

This is where unique properties of RSA shine. If users were submitting the destination address verbatim in step #1, it would defeat the point of using a mixer— everyone can observe these calls and learn which address each person picked. But RSA enables blind signatures: Trent can sign a message without known what message he signed. The crucial property is that the message submitted for signing is related to the real message the participant wants signed namely, the destination address. But that relationship is only known to the participant: as far as everyone watching the blockchain is concerned, it appears indistinguishable from a random message. Not even the party in possession of the signing key (Trent, in this case) learns the “true” message being signed or can correlate a signed message presented in the future with the original request on the related message. In effect, users mask their address before submitting it to Trent for signature and then recover the intended signature by using properties of RSA. (There is some additional complexity being glossed over: signatures are computed over the address directly. Instead the address is hashed and suitably padded with a scheme such as PSS. Otherwise RSA allows existential forgery such that one can find unlimited message-signature pairs, although these messages will not have any particular meaning as far as corresponding to an ethereum address.)

To avoid waiting on stragglers indefinitely, a deadline is imposed on the second stage. After this deadline is reached, the contract can start actually disbursing funds to the addresses opened in the second stage. There is no concept of a “scheduled task” on the Ethereum blockchain so the final stage will be initiated when any participant— including potentially Trent— calls into the contract to request distribution after the deadline has elapsed. At this point the contract can confirm that the timestamp or block height is past that deadline and start sending ether to previously verified output addresses.

Detecting and punishing dishonest operators

There is one flaw in the protocol as described: Trent can cheat by issuing extra signed messages. Recall that the presence of a valid RSA signature on a message authorizes the disbursal of 1 ETH to the address contained in that message. That turns every valid signature into a check worth 1ETH. While every participant is supposed to receive one signature for every ETH contributed, nothing prevents Trent from issuing signatures over his own addresses and attempting to cash these in.

This is where the initial “earnest money” contributed by Trent comes in, combined with the deliberate delay in releasing funds. Recall that that funds are not disbursed immediately when a participant calls into the contract with a valid signature. Instead the contract waits until a predetermined deadline, giving everyone a chance to chime in with their address. As long as every honest participant supplies their address, cheating will become obvious. The contract has a record of exactly how many participants have received valid signatures in response to their contribution. If more signatures turn up than participants, it is clear Trent has diverged from the protocol. It could be a deliberate attempt to steal funds or perhaps Trent’s RSA signing key was compromised. Either way, the contract will respond by “refunding” all ether to the original input addresses, along with a sjare of the earnest money put up by Trent. In this case the protocol execution has failed: no mixing has occurred. But all participants receive 100% of their original contribution back along with some compensation for Trent’s incompetence/dishonesty.

Participants do not have to do anything special to catch and punish Trent for cheating, or monitor the blockchain. As long as they are following the protocol and revealing their own signed address before the deadline, any attempt by Trent to fabricate additional unsolicited signatures will backfire and result in “gifting” money. The only way Trent can get away with cheating is if some participants have failed to reveal their signed address in a timely manner, effectively abandoning their funds. Even in that scenario, Trent would be disincentivized from claiming those funds with forged signatures: he would be taking the risk that missing participants may turn up at the last minute and trigger the retaliation logic.



DIY VPN (part II)

[continued from part I]

Economics of a self-hosted VPN

This section considers the overall cost of operating a VPN server. First, we can rule out software acquisition costs. On the server side, the community edition for OpenVPN is freely available as open-source software. (The commercial dubbed OpenVPN Access Server comes with a per-user licensing model. While adding useful fit-and-finish including configuration GUI, that version does not change the fundamental scaling or security properties of the system.) On the client side, there are freeware options on all major platforms. One exception is when ADCS is leveraged for the certificate authority. Even there one can lease hourly access to a private Windows server from major cloud providers, rolling that into operational expenses instead of paying outright for the software. As far as operational expenses are concerned, what follows is a back of the envelope calculation using AWS as the infrastructure provider. While AWS has a free-tier for new customers, this analysis assumes that discount has already been exhausted, in the interest of arriving at a sustainable cost basis.


OpenVPN runs on Linux and is not particularly CPU intensive at the scale envisioned here— supporting a handful of users with 2-3 simultaneous connections at peak capacity. A single t3-nano instance featuring a single virtual CPU and half GB of RAM is perfectly up to task for that load. Based on current AWS pricing, these cost roughly half cent per hour when purchased on demand in US regions such as Virginia or Oregon. The operational cost can be reduced by committing to reserved instances. Paying upfront for one-year takes the price down to 0.30¢/hour while a three year term yields 0.20¢/hour. Even greater savings and flexibility may be achieved with what AWS terms  spot instances: current spot-prices for nano instances hover around 0.16-0.20¢ per hour depending on availability zone. However spot instances comes with the risk that the server may be preempted if demand spikes to the point where prices exceed the bid. That level of reliability may still be acceptable for a personal setup, since one can always bid on another spot instance with higher price. Conservatively using the 0.20¢ estimate, we arrive at EC2 instance costs just shy of $1.50 per month.


It turns out that the cost of running an instance is not the primary consideration, because of the modest computational requirements of openvpn. When using minimal instance types such as t3-nano, disk space becomes a significant contributor to cost. At the time of writing, AWS charges 10¢ per GB/month for entry-level SSD volumes attached to instances. Costs only go up from there for more demanding loads and guaranteed IOPS. (While there are cheaper options such as spinning disks, those can only be used in addition to existing boot volume and not as replacement.) That means simply keeping a 15GB disk around for a month alone exceeds the cost of running the instance for that same period.

Fortunately Linux distributions are also modest in their demands for disk space requirements. Minimum requirements for Ubuntu18 servernot the desktop edition, which is considerably more resource hungry— weight in at a svelte 256MB of memory and 1.5GB of disk space. It is not possible to translate that directly into an AWS image: all popular AMIs offered by Amazon come with minimum 8GB disk space, more than five times the stated baseline. AWS will not allow launching an EC2 instance based on one of these AMIs with a smaller boot volume.


There is a work around: AWS also allows customers to import their own virtual machine images from common virtualization platforms. That permits crafting a properly slimmed-down Linux image locally and converting it into an AMI.

Fair warning: this process is clunky and error-prone. After uploading an image to S3 and initiating the conversion as batch process, several minutes can elapse before AWS reports a problem (typically having to do with the image format) that takes users back to square one. It turns out only OVF format images accepted. VMware distributions including Fusion on OSX can export to that format using the included ovftool.

With some buffer to accommodate additional software inserted by AWS into the image, log files and future updates, we end up with an Ubuntu18 server AMI that can squeeze into 3GB SSD volume. That comes out to a recurring storage expense of 30¢ per EC2 instance. There is also the storage associated with the AMI itself, one copy per region. That carries the slightly lower price-tag charged to EBS snapshots and can be amortized across multiple instances launched in the same region. Assuming the worst case scenario of a single server, we arrive at an upper bound of 50¢ per month.


While computation and storage expenses are highly competitive with typical prices charged by commercial VPN services, bandwidth is one area where Amazon is much less favorable for operating 24/7 VPN. Amazon charges 9¢ per GB for outbound traffic, defined as bits heading out of AWS infrastructure in that region. A VPN server has an unusual bandwidth symmetry. Most servers receive a small amount of data inboundsuch as a request for a web pageand  respond back with a large amount of data outbound, for example a high-resolution image or video stream. But a VPN server is effectively acting as proxy that shuttles traffic in both directions. Every incoming request from the client is forwarded to its true destination, becoming outbound traffic. Likewise every inbound response from that destination is sent back to the client as outbound traffic. So the total traffic volume for purpose of metering is closely approximated by the upstream and downstream traffic generated by users connected to the VPN, plus a small overhead introduced by the VPN protocol itself.

The average US household consumes 190GB of bandwidth per month according to the latest iGR report from 2017. While that report does not distinguish upstream/downstream, it is reasonable to assume downloading is responsible for the bulk of this figure. Adjusting by 50% for projected growth and including another 50GB for mobile devices with their own data plan yields a number around 300GB month of traffic. If 100% of that traffic is routed over the VPN service hosted at AWS, bandwidth costs alone would exceed all other other operational expenses by a factor of ten.

This cost scales linearly with the amount of VPN usage, or equivalently the number of users sharing the VPN. At large enough scales, the same also holds true for EC2 costs, since a single nano instance can not service an unbounded number of subscribers. But those limits are unlikely to be reached for a system intended for a handful of people, while the impact of additional bandwidth usage will be directly reflected in the monthly bill.

Verdict on self-hosting

AWS turns out to be highly competitive for hosting a VPN solution, with the caveat that only moderate usage is required. Considering that popular VPN services for consumers charge in the neighborhood of $3-$10 per month, a single nano instance can offer a better solution at lower cost, especially when the server is shared by multiple people. Advantages to self-hosting include:

  • Eliminating trust in a third-party service, including random client apps. VPN provider is in a position to observe traffic metadata, such as websites visited and frequency For unenlightened websites not using HTTPS, VPN service can even observe the full traffic exchange. Unscrupulous providers have taken advantage of this, most recently Facebook with its Onavo spyware product masquerading as VPN. Hosting your own VPN avoids that dependency, although one must still implicitly trust the infrastructure provider. Amazon has visibility into EC2 traffic which can be used to identify activity: “user coming from IP address connected to the VPN server, and that resulted in VPN server reaching out to websites at address” While one can be confident AWS (unlike Facebook) respects privacy and will not randomly mine those logs to spy on their own customers, they can still be compelled to disclose records by law enforcement.
  • High-availability with the backing of Amazon infrastructure. Hardware, storage and networking failures are low probability events.
  • Geodistributed points of presence, with ability to host VPN servers in the US, Europe, South America and Asia.
  • Large selection of IP4 and IP6 addresses from the AWS range, compared to hosting from a residential connection.
  • More importantly, freedom to change IP addresses at will. New elastic IP can be requested from AWS and assigned to an instance anytime the service operator wants to discard their association with the existing address
  • Unlimited number of devices at no additional cost (within the capacity limits of the EC2 instance)

The self-hosting route also comes with two notable caveats:

  • Overhead of maintaining the service, particularly the PKI necessary for issuing/revoking certificates. OpenVPN does not help matters here: it does not directly implement CRL or OCSP for retrieving revocation information. It requires the presence of a CRL file deployed locally. (But one can get the same effect by crafting a custom script to be executed server-side for each connection, to invoke openssl for proper chain validation that pays attention to CDP or OCSP responder links in the certificate. Alternatively a cron job can periodically retrieve CRLs and update the local file openvpn expects to find this information.)
  • Steep pricing for bandwidth. This is less an argument against the DIY approach and more a cautionary note about using AWS for hosting, which is the obvious choice along with Azure and GCE. It turns out Azure and GCE are not that different when it comes to egress traffic. More cost-effective plans are available from alternative providers such as Linode and Digital Ocean, featuring upwards of 1 terabyte egress traffic for the fixed monthly price of server hosting.



DIY VPN (part I)


Net neutrality is not the only casualty of an FCC under the sway of a new administration. In a little noticed development, ISPs have free reign to spy on customers and sell information based on their Internet traffic. Google has been on a quest to encrypt all web traffic using the TLS protocol— engaging in subtle games of extortion by tweaking the default web browser UI to throw shade at sites still using the default unencrypted HTTP protocol. But there is one significant privacy limitation with TLS: it does not hide the identity of the website the user is visiting. Such metadata is still visible, both in the TLS protocol but also in the addressing scheme used for the underlying IP protocol.

This is where virtual private networks come in handy. By encapsulating all traffic through an encrypted tunnel into another connection point, they can also protect metadata from the prying eyes of surveillance-happy ISPs. But That is only the beginning: VPNs also reduce privacy risks from arguably the apex predator of online privacy— advertising-based revenue models. In the grand tradition of surveillance capitalism, every crumb of information about a customer is enlisted into building comprehensive profiles. Even if consumers never login to Facebook or Google, these networks can still build predictive profiles using alternative identifiers that help track consumers over time. Of all the stable identifiers that users carry around like unwanted, mandatory name-tags forced on them— HTTP cookies, their counterparts complements of Flash and creative browser fingerprinting techniques— IP addresses are one of the hardest to strip away. Incognito/private browsing modes allow shedding cookies and similar trackers at the end of the session, but IP addresses are assigned by internet service providers. In many cases they are surprisingly sticky: most residential broadband IPs rarely change.

From clunky enterprise technology to consumer space

In the post-Snowden era, the default meaning of “VPN” also changed. Once an obscure enterprise technology that allowed employees to connect to internal corporate networks from home, it shifted into the consumer space, pitched to privacy-conscious mainstream endusers as a way to protect themselves online. In an ironic twist, even Facebook the apex predator of online privacy infringement got into the act: the company acquired the analytics service Onavo and offered the existing Onavo VPN app to teenagers free of charge, in exchange for spying on their traffic. (When TechCrunch broke the news, Apple quickly put the kibosh on that particular privacy invasion by revoking the enterprise certificate Facebook had misused to create an app for general distribution with VPN privileges.) The result has been a competitive landscape for consumer VPN services, complete with the obligatory reviews and ranking from pundits.

For power users there is another option: hosting your own VPN service. This post chronicles experiences doing that, organized into two sections. First section an overview of what goes into building a functioning VPN server, using common, off-the-shelf open source components. Second half looks at the economics of the DIY approach, comparing the costs against commercial offerings and more importantly, looking at how different usage patterns— number of presence points, simultaneous users, monthly bandwidth consumed— affect those numbers.

Implementation considerations

“The nice thing about standards is that you have so many to choose from”

Andrew Tannenbaum

Maze of protocols

One of the first challenges in trying to create a VPN solution is the fragmented nature of the standard. Unlike the web where all web-browsers and web-servers have converged on the protocol TLS for protecting communications, there is no comparable “universal” VPN standard with comparable ubiquity. This is either surprising or utterly predictable depending on perspective. It is surprising considering VPNs are 20+ years old. Attempts to standardize VPN protocols are the same vintage: IPSec RFCs were published in 1995. L2TP followed a few years later. With the benefit of two decades to in protocol tinkering— proprietary, standards-based or somewhere in between— you might expect the ecosystem to converge on some least common denominator by now. Instead VPNs remain highly fragmented, with incompatible options pushed by every vendor. The roots of the technology in enterprise computing may have something to do with that outcome. If Cisco, Palo Alto and Fortinet compete on selling pricy VPN gear to large companies with the assumption that every employee will be required to also install a  VPN client from the same vendor who manufactures the hardware, there is little incentive to interoperate with each other.

In this morass of proprietary blackbox protocols, OpenVPN stands out as one of the few options with some semblance of transparency. It is not tied to specific hardware, which makes for convenient DIY setups using cloud hosting: it is much easier— not to mention cost effective— to install software in a virtual server than rack a physical appliance in a datacenter cage. It also has good cross-platform support client side, with a good mixture of freeware and commercial applications on Windows, OSX and Linux flavors— the last one in particular being the black-sheep platform frequently neglected by commercial VPN offerings. (WireGuard is another more recent entry distinguished by modern cryptography, but it does not quite have the same level of implementation maturity.)


OpenVPN on iPad

OpenVPN server-side

In the interest of reach and portability, this examples uses Ubuntu Server 18 LTS, since many commercial hosting services such as AWS and Linode offer virtual hosting for this operating system. There are extensive online tutorials on setting up the server-side online so this blog post will only summarize the steps here:

  1. Enable IPv4 & IPv6 forwarding via sysctl configuration
  2. Setup iptables rules (again, for IPv4 & IPv6) and make them persistent using iptables-persistent package.
  3. Tweak openvpn server configuration file
  4. Provision credentials
    • Generate Diffie-Hellman parameters
    • Use openssl to generate an RSA key-pair for the server & create a certificate signing request (CSR) based on that pair
    • Submit the certificate request to a certificate authority to obtain a server certificate and install this on the server— more on this in the next section

PKI tar-pit

OpenVPN supports authentication based on different models but the most commonly used design involves digital certificates. Every user of the VPN service as well every server they connect to has a private key and associated credential uniquely identifying that entity. There are no shared secrets such as a single password known to everyone, unlike for example in the case of L2TP VPNs where all users have the same “PSK” or preshared key. One crucial advantage of using public-key credentials is easy revocation: if one user loses their device or stops using the VPN service, their certificate can be revoked without affecting anyone else. By contrast, if an entire company is sharing the same PSK for every employee, it will be painful to rotate that secret for every departure. From an operational perspective, distributing credentials is simpler. Passwords have to be synchronized on both sides in order for the server to verify them. With digital certificates, the secret private key only exists on user device and the server does not need to be informed ahead of time whenever a new certificate is issued. (Although they need access to the certificate revocation list or CRL, in order to check for invalidated credentials.)

Public-key credentials avoid many of the security headaches of passwords but come with the operational burden of managing a PKI. That is arguably the most heavy-weight aspect for spinning up a stand alone VPN infrastructure. Large enterprises have an easier time because the overhead is effectively amortized into other use-cases of PKI in their existing IT infrastructure. Chances are any Window shop at scale is already running a certificate authority to support some enterprise use case. For example employees badges can be smart-cards provisioned with a digital certificate. Or every machine can be automatically issued a certificate using a feature called auto-enroll when they are connected to the domain controller, which is then leveraged for VPN. That seamless provisioning experience is difficult to recreate cross-platform once we accept the requirement to support OSX, Linux, Android and iOS platforms. Instead credentials must be provisioned and distributed manuallyand that includes both VPN users and VPN servers.

There are open-source solutions for running a certificate authority, notably EJBCA. OpenSSL itself can function as a primitive CA. OpenVPN introduced its own solution called easy-rsa, glorified wrapper scripts around OpenSSL to simplify common certificate management tasks.  But arguably the most user-friendly solution here is Windows Active Directory Certificate Services or ADCS. ADCS features both a GUI for simpler use-cases as well as automation through the command-line via old-school certutil and more modern powershell scriplets.


Active Directory Certificate Services GUI on Windows 2012

ADCS is not free software, but it is widely available as cost-effective, hosted solution from Amazon AWS and MSFT Azure among other providers. Considering the CA need only be powered on for the occasional provisioning and revocation, operational expenses will be dominated by the storage cost for the virtual disk image as opposed to the runtime of the virtual machine. One might expect a CA has to stay up 24/7 to serve revocation lists or operate an OSCP responder— and ADCS can do both of those, when the server also has IIS role installed. But OpenVPN can only parse revocation information from a local copy of the CRL. That rules out OCSP and implies CRLs must be manually synchronized to the server, defeating the point of having a server running around the clock to serve up revocation information. In effect OpenVPN only requires an offline CA.