Lessons from the first instant-messaging war
In the late 1990s and early 2000s instant messaging was all the rage. A tiny Israeli startup, Mirabilis, set the stage with ICQ, but IM quickly became a battleground of tech giants, running counter to the usual dot-com era mythology of small startups disrupting incumbents on the way to heady IPO valuations. AOL Instant Messenger had taken a commanding lead out of the gate, while MSFT was living up to its reputation as a “fast-follower” (or, put less charitably, tail-light chaser) with MSN Messenger. Google had yet to throw its hat into the ring with GChat. These IM networks were completely isolated: an AOL user could only communicate with other AOL users. As a result, most users had to maintain multiple accounts to participate in different networks, each with its own barrage of notifications and task-bar icons.
While these companies were locked in what they viewed as a zero-sum game for market share, the benefits of interoperability to consumers were clear. In fact one software vendor even made a multiple-network client, Trillian, that effectively aggregated the protocols for all the different IM services. Standards for interoperability such as SIP and XMPP were still a ways off from becoming relevant; everyone invented their own client/server protocol for instant messaging and expected to provide both sides of the implementation from scratch. But there was a more basic reason why some IM services were resistant to adopting an open standard: it was not necessarily good for the bottom line. Interop is asymmetric: it helps the smaller challenger compete against the incumbent behemoth. If you are MSN Messenger trying to win customers away from AOL, it is a selling point if you can build an IM client that can exchange messages with both MSN and AOL customers. Presto: AOL users can switch to your application and keep in touch with their existing contacts while becoming part of the MSN ecosystem. Granted the same dynamics operate in the other direction: in principle AOL could have built an IM client that connected its customers with MSN users. But this is where existing market share matters: AOL had more to lose by allowing such interoperability and opening itself up to competition with MSN, compared to keeping its users locked up in the walled garden.
Not surprisingly then, the companies went out of their way to keep each IM service an island unto itself. Interestingly for tech giants, this skirmish was fought in code instead of the more common practice of lawyers exchanging nastygrams. AOL tried to prevent any client other than the official AIM client from connecting to its service. You would think this is an easy problem: after all, AOL controlled the software on both sides. It could ship a new IM client that included a subtle, specific quirk when communicating with the IM server. AOL servers would in turn look for that quirk and reject any “rogue” clients missing it.
White lies for compatibility
This idea runs into several problems. A practical engineering constraint in the early 2000s was the lack of automatic software updates. AOL could ship a new client, but in those Dark Ages of software delivery “ship” meant uploading the new version to a website— itself a much-heralded improvement over the “shrink-wrap” model of actually burning software on CDs and selling them in a retail store. There was no easy way to force-upgrade the entire customer base. If the server insisted on enforcing the new client fingerprint, AOL would have to turn away a large percentage of customers running legacy versions or make them jump through hoops to download the latest version— and who knows, maybe some of those customers would decide to switch to MSN in frustration. That problem is tractable and was ultimately solved with better software engineering: Windows Update and later Google Chrome made automatic software updates into a feature customers take for granted today. But there is a more fundamental problem with attempting to fingerprint clients: competitors can reverse-engineer the fingerprint and incorporate it into their own software.
This may sound vaguely nefarious, but software impersonating other pieces of software is in fact quite common for compatibility reasons. Web browsers practically invented that game. Take a look at the user-agent string early versions of Internet Explorer sent to every website:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; SLCC2; .NET CLR 2.0.50727; Media Center PC 6.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729)
These white lies are ubiquitous on the web because compatibility is in everyone’s interest. Site publishers just want things to work. Amazon wants to sell books. With the exception of a few websites closely affiliated with a browser vendor, they do not care whether customers use Netscape, IE or Lynx to place their orders. There is no reason for websites to be skeptical about user-agent claims or run additional fingerprinting code to determine if a given web browser is really Netscape 4 or simply pretending to be. (Even if they wanted to, such fingerprinting would have been difficult; for example, IE often aimed for bug-for-bug compatibility with Netscape, even when that meant diverging from official W3C standards.)
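The dynamic is easy to see in a toy server-side check (the function name and feature are hypothetical; real sites of the era did variations of this in CGI scripts):

```python
def supports_frames(user_agent: str) -> bool:
    """Naive browser sniffing: gate a feature on the 'Mozilla' token."""
    return user_agent.startswith("Mozilla/")

# Netscape identifies itself directly...
assert supports_frames("Mozilla/4.0 (Win95; I)")
# ...and IE passes the same check only because it claims to be Mozilla.
assert supports_frames("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)")
```

Since the check costs the site nothing and every browser benefits from passing it, nobody had an incentive to look deeper than the claimed string.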
Software discrimination and the bottom line
For the reasons noted above, the competitive dynamics of IM were fundamentally different from those of web browsers. Most of the business models built around IM assumed full control over the client stack. For example, MSN Messenger floated ideas of making money by displaying ads in the client. This model runs into problems when customers run a different but interoperable client: Messenger could connect to the AOL network— effectively consuming resources and generating costs for AOL— while displaying ads chosen by MSN, earning revenue for MSFT.
Not surprisingly, this resulted in an escalating arms-race. AOL included more and more subtle features in AIM that the server could use for fingerprinting. MSFT attempted to reverse-engineer that functionality out of the latest AIM client and incorporate identical behavior into MSN Messenger. It helped that the PC platform was— and to a large extent still is— very much open to tinkering. Owners can inspect binaries running on their machine, inspect network communications originating from a process, or attach a debugger to a running application to understand exactly what that app is doing under specific circumstances. (Intel SGX is an example of a recent hardware development on x86 that breaks that assumption. It allows code to run inside protected “enclaves” shielded from any debugging or inspection by an outside observer.)
In no small measure of irony, the Messenger team voluntarily threw in the towel on interoperability when AOL escalated the arms-race to a point MSFT was unwilling to go: AOL deliberately included a remote code execution vulnerability in AIM, intended for its own servers to exploit. Whenever a client connected, the server would exploit the vulnerability to execute arbitrary code, look around the process and check the identity of the application. Today such a bug would earn a critical severity rating and an associated CVE if it were discovered in an IM client. (Consider that in the 1990s most Internet traffic was not encrypted, so it would have been much easier to exploit that bug; the AIM client had very little assurance that it was communicating with the legitimate AOL servers.) If it were alleged that a software publisher deliberately inserted such a bug into an application used by millions of people, it would be all over the news and possibly result in the responsible executives being dragged in front of Congress for a ritual public flogging. In the 1990s it was business-as-usual.
Trusted computing and the dream of remote attestation
While the MSN Messenger team may have voluntarily hoisted the white flag in that particular battle with AOL, a far more powerful division within the company was working to make AOL’s wishes come true: a reliable solution for verifying the authenticity of software running on a remote peer, preferably without playing a game of chicken with deliberately introduced security vulnerabilities. This was the Trusted Computing initiative, later associated with the anodyne but awkward acronym NGSCB (“Next Generation Secure Computing Base”) though better remembered by its codename “Palladium.”
The linchpin of this initiative was a new hardware component, the “Trusted Platform Module,” meant to be included as an additional chip on the motherboard. The TPM was an early example of a system-on-a-chip, or SoC: it had its own memory, processor and persistent storage, all independent of the rest of the PC. That independence meant the TPM could function as a separate root of trust. Even if malware compromises the primary operating system and gets to run arbitrary code in kernel mode— the highest privilege level possible— it still cannot tamper with the TPM or alter security logic embedded in that chip.
While the TPM specification defined a kitchen sink of functionality, ranging from key management (generating and storing keys on-board the TPM in non-extractable fashion) to serving as a generic cryptographic co-processor, one feature stood out for securing the integrity of the operating system during the boot process: the notion of measured boot. At a high level, the TPM maintains a set of values in RAM dubbed “platform configuration registers” or PCRs. When the TPM is started, these all start out at zero. What distinguishes PCRs is the way they are updated. It is not possible to write an arbitrary value into a PCR. Instead the existing value is combined with the new input and run through a cryptographic hash function such as SHA-1; this is called “extending” the PCR in TCG terminology. Similarly it is not possible to reset the values back to zero, short of restarting the TPM chip, which only happens when the machine itself is power-cycled. In this way the final PCR value becomes a concise record of all the inputs that were processed through that PCR. Any slight change to any of the inputs— or even changing the order of inputs— results in a completely different value with no discernible relationship to the original.
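The extend operation itself is tiny. A sketch in Python, using SHA-1 as in the original TPM 1.2 design; the event names are made up for illustration:

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    """The only way a PCR changes: new value = SHA1(old value || input)."""
    return hashlib.sha1(pcr + measurement).digest()

pcr = bytes(20)  # PCRs start at all zeros when the TPM powers on
for event in (b"firmware", b"boot-sector", b"os-loader"):
    pcr = extend(pcr, hashlib.sha1(event).digest())

# The same events in a different order produce an unrelated final value.
reordered = bytes(20)
for event in (b"boot-sector", b"firmware", b"os-loader"):
    reordered = extend(reordered, hashlib.sha1(event).digest())

assert pcr != reordered
```

Because each new value folds in the old one, the final digest commits to the entire history of inputs, not just the most recent.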
This enables what TCG called the “measurement of trust” during the boot process, by updating the PCR with measurements of all code executed. For example, the initial BIOS code that takes control when a machine is first powered on updates PCR #0 with a hash of its own binary. Before passing control to the boot sector on disk, it records the hash of that sector in a different PCR. Similarly the early-stage boot loader first computes a cryptographic hash of the OS boot-loader and updates a PCR with that value, before executing the next stage. In this way, a chain of trust is created for the entire boot process with every link in the chain except the very first one recorded in some PCR before that link is allowed to execute. (Note the measurement must be performed by the predecessor. Otherwise a malicious boot-loader could update the PCR with a bogus hash instead of its own. Components are not allowed to self-certify their code; it must be an earlier piece of code that performs the PCR update before passing control.)
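Under that convention the final PCR value pins down the entire chain, so swapping out any later stage is detectable. A simulation of the idea (component names are hypothetical):

```python
import hashlib

def measure_chain(stages):
    """Fold each boot stage into the PCR; each stage is measured by its
    predecessor before control is transferred to it."""
    pcr = bytes(20)
    for binary in stages:
        pcr = hashlib.sha1(pcr + hashlib.sha1(binary).digest()).digest()
    return pcr

good = measure_chain([b"bios", b"boot-sector", b"os-loader", b"kernel"])
evil = measure_chain([b"bios", b"boot-sector", b"os-loader", b"rootkit"])

assert good != evil  # the tampered kernel shows up in the measurement
```

Note the tampered stage cannot hide: by the time it runs, its hash has already been recorded by code that executed before it.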
TCG specifications define the conventions for which components are measured into which PCR. These are different between legacy BIOS and the newer UEFI specifications. Suffice it to say that by the time a modern operating system boots, close to a dozen PCRs will have been extended with a record of the different components involved.
So what can be done with this cryptographic record of the boot process? While these values look random, they are entirely deterministic. If the exact same system is powered on on two different occasions, identical PCR values will result. For that matter, if two different machines have the exact same installation— same firmware, same version of the operating system, same applications installed— their PCRs are expected to be identical. These examples hint at two immediate security applications:
- Comparison over time: verify that a system is still in the same known-good state it was at a given point in the past. For example we can record the state of PCRs after a server is initially provisioned and before it is deployed into production. By comparing those measurements against the current state, it is possible to detect if critical software has been tampered with.
- Comparison against a reference image: Instead of looking at the same machine over time, we can also compare different machines in a data-center. If we have PCR measurements for a known-good “reference image,” any server in healthy state is expected to have the same measurements in the running configuration.
Interestingly, neither scenario requires knowing what the PCR values should be ahead of time, or even the exact details of how PCRs are extended. We are only interested in deltas between two sets of measurements. But since PCRs are deterministic, it is also possible to predict ahead of time exactly what values should result for a given set of binaries involved in a boot process. There is a different use-case where those exact values matter: ascertaining whether a remote system is running a particular configuration.
Getting better at discrimination
Consider the problem of distinguishing a machine running Windows from one running Linux. These operating systems use different boot-loaders, and the hash of the boot-loader gets captured in a specific PCR during measured boot. The value of that PCR will now act as a signal of which operating system was booted. Recall that each step in the boot-chain is responsible for verifying the next link; a Windows boot-loader will not pass control to a Linux kernel image.
This means PCR values can be used to prove to a remote system that you are running Windows, or even running it in a particular configuration. There is one more feature required for this: a way to authenticate those PCRs. If clients were allowed to self-certify their own PCR measurements, a Linux machine could masquerade as a Windows box by reporting the “correct” PCR values expected after a Windows boot. The missing piece is called “quoting” in TPM terminology. Each TPM can digitally sign its PCR measurements with a private key permanently bound to that TPM. This is called the attestation key (AK), and it is used only for signing such structures unique to the TPM. (The other use case is certifying that some key-pair was generated on the TPM, by signing a structure containing the public key.) Restricting the AK in this way prevents the owner from forging bogus quotes by asking the TPM to sign arbitrary messages.
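A quote is essentially a signature over the selected PCR values plus a verifier-chosen nonce (the nonce prevents replaying an old quote). A toy sketch of the flow; note that a real TPM signs with an asymmetric attestation key and a TCG-defined structure, for which the shared-secret HMAC below is only a stand-in:

```python
import hashlib
import hmac
import os

ATTESTATION_KEY = os.urandom(32)  # stand-in for the TPM-resident signing key

def quote(pcrs: dict, nonce: bytes) -> bytes:
    """Sign the selected PCR values together with the verifier's nonce."""
    payload = nonce + b"".join(
        i.to_bytes(1, "big") + v for i, v in sorted(pcrs.items())
    )
    return hmac.new(ATTESTATION_KEY, payload, hashlib.sha256).digest()

# Verifier side: issue a fresh nonce, then check the returned quote.
pcrs = {0: bytes(20), 4: bytes(20)}
nonce = os.urandom(16)
sig = quote(pcrs, nonce)
assert hmac.compare_digest(sig, quote(pcrs, nonce))
```

The verifier only trusts the reported PCRs because the signature checks out against a key it believes lives inside a genuine TPM, which is exactly the provenance problem discussed next.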
This shifts the problem to a different plane: verifying the provenance of the alleged attestation key, namely that it really belongs to a TPM. After all, anyone can generate a key-pair and sign a bunch of PCR measurements with a worthless key. This is where the protocols get complicated and kludgy, partly because TCG tried hard to placate privacy advocates. If every TPM had a unique, global AK for signing quotes, that key could be used as a global identifier for the device. The TPM 2.0 specification instead creates a level of indirection: there is an endorsement key (EK) and an associated X509 certificate baked into the TPM at manufacture time. But the EK is not used to directly sign quotes; instead users generate one or more attestation keys and prove that a specific AK lives on the same TPM as the EK, using a challenge-response protocol. That links the AK to a chain of trust anchored in the manufacturer via the X509 certificate.
The resulting end-to-end protocol provides a higher level of assurance than is possible with software-only approaches such as “health agents.” Health agents are typically pieces of software running inside the operating system that perform various checks (whether the latest software updates have been applied, the firewall is enabled, there are no unexpected listening ports, etc.) and report the results. The problem is that those applications rely on the OS for their own security. A privileged attacker with administrator rights can easily subvert the agent by feeding it bogus observations or forging a report. Boot measurements, on the other hand, are implemented by firmware and the TPM, outside the operating system, and are safe against interference by OS-level malware regardless of how far it has escalated its privileges.
On the Internet, no one knows you are running Linux?
The previous example underscores a troubling link between measured boot and platform lock-in. Internet applications are commonly defined in terms of a protocol. As long as both sides conform to the protocol, they can play. For example XMPP is an open instant-messaging standard that emerged after the IM wars of the 1990s. Any conformant XMPP client following this protocol can interface with an XMPP server written according to the same specifications. Of course there may be additional restrictions associated with each XMPP server— such as being able to authenticate as a valid user, making payments out-of-band if the service requires them, etc. Yet these conditions exist outside the software implementation. There is no a priori reason an XMPP client running on Mac or Linux could not connect to the same service as long as the same conditions are fulfilled: the customer paid their bill and typed in the correct password.
With measured boot and remote attestation, it is possible for the service to unilaterally dictate new terms, such as “you must be running Windows.” There is no provision in the XMPP spec today to convey PCR quotes, but nothing stops MSFT from building an extension to accommodate that. The kicker: that extension can be completely transparent and openly documented. There is no need to rely on security through obscurity and hope no one reverse-engineers the divergence from XMPP. Even with full knowledge of the change, authors of XMPP clients for other operating systems are prevented from creating interoperable clients.
No need to stop with the OS itself. While TCG specs reserve the first few PCRs for use during the boot process, there are many more available. In particular, PCRs 8-15 are intended for the operating system itself to record other measurements it cares about. (Linux Integrity Measurement Architecture, or IMA, does exactly that.) For example the OS can reserve a PCR to measure all device drivers loaded, all installed applications or even the current choice of default web browser. Using Chrome instead of Internet Explorer? Access denied. Assuming attestation keys were set up in advance and the OS itself is in a trusted state, one can provide reliable proof of any of these criteria to a remote service and create a walled garden that only admits consumers running approved software.
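Linux IMA, for instance, extends PCR 10 with a hash of each file before it executes. The gist, greatly simplified (the real measurement format, template entries and policy handling are omitted; the file contents below are placeholders):

```python
import hashlib

PCR10 = bytes(20)  # OS-controlled PCR reserved for runtime measurements

def measure_before_exec(binary: bytes) -> None:
    """Record a hash of the program into the PCR before letting it run."""
    global PCR10
    PCR10 = hashlib.sha1(PCR10 + hashlib.sha1(binary).digest()).digest()
    # ...only after the measurement does the kernel actually execute the binary.

measure_before_exec(b"contents of /usr/bin/some-browser")
# A remote service that obtains a quote over PCR 10 can now tell
# which applications have run since boot.
```

The same one-way extend semantics apply: an application cannot erase the record of its own execution after the fact.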
The line between security feature and platform lock-in
Granted, none of the scenarios described above have come to pass yet— at least not in the context of general-purpose personal computers. Chromebooks come closest, with their own notion of remote verification and attempts to create walled gardens that limit access to applications running on a Chromebook. Smart-phones are a different story: starting with the iPhone, they were pitched as closed, black-box appliances where owners had little hope of tinkering. De facto platform lock-in due to “iOS only” availability of applications is very common for services designed with mobile use in mind. This is the default state of affairs even when the service provider is not making any deliberate attempt to exclude other platforms or using anything heavyweight along the lines of remote attestation.
This raises the question: is there anything wrong with a service provider restricting access based on implementation? The answer depends on the context.
Consider the following examples:
- Enterprise case: an IT department wants to enforce that employees connect to the VPN only from a company-issued device (not their own personal laptop).
- Historic instant-messaging example: AOL wants to limit access to its IM service to users running the official AIM client (not a compatible open-source clone or the MSN Messenger client published by MSFT).
- Leveraging online services to achieve browser monopoly: Google launches a new service and wants to restrict access to consumers running Google Chrome as their web browser.
It is difficult to argue with the first one. The company has identified sensitive resources— customer PII, health records, financial information etc.— and is trying to implement reasonable access controls around those systems. Given that company-issued devices are often configured to higher security standards than personal devices, it seems entirely reasonable to mandate that access to these sensitive systems take place only from the more trustworthy devices. Remote attestation is a good solution here: it proves that the access originates from a device in a known configuration. In fact PCR quotes are not the only way to get this effect; there are other ways to leverage the TPM to similar ends. For example, the TPM specification allows generating key-pairs with an attached policy saying the key is only usable when the PCRs are in a specific state. Using such a key as the credential for connecting to the VPN provides an indirect way to verify the state of the device. Suppose employees are expected to be running a particular Linux distribution on their laptop. If they boot that OS, the PCR measurements will be correct and the key will work. If they install Windows on the system and boot that instead, the PCR measurements will be different and their VPN key will not work. (Caveat: this is glossing over some additional risks. In a more realistic setting, we have to make sure VPN state cannot be exported to another device after authentication— or for that matter, that a random Windows box cannot SSH into the legitimate Linux machine and use its TPM keys for impersonation.)
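The PCR-bound key idea can be sketched as sealing: the secret is usable only when the current PCR digest matches the policy recorded at creation time. The following is a simulation of those semantics, not of the TPM wire protocol, and all names and values are illustrative:

```python
import hashlib

def policy_digest(pcrs: dict) -> bytes:
    """Hash the selected PCR values into a single policy digest."""
    h = hashlib.sha256()
    for i, v in sorted(pcrs.items()):
        h.update(i.to_bytes(1, "big") + v)
    return h.digest()

class SealedKey:
    def __init__(self, secret: bytes, pcrs: dict):
        self._secret = secret
        self._policy = policy_digest(pcrs)  # PCR state required at unseal time

    def unseal(self, current_pcrs: dict) -> bytes:
        if policy_digest(current_pcrs) != self._policy:
            raise PermissionError("PCR state does not satisfy key policy")
        return self._secret

linux_pcrs = {4: b"linux-loader-hash"}
windows_pcrs = {4: b"windows-loader-hash"}

vpn_key = SealedKey(b"vpn-credential", linux_pcrs)
assert vpn_key.unseal(linux_pcrs) == b"vpn-credential"
# Booting Windows produces different measurements; the key is unusable.
try:
    vpn_key.unseal(windows_pcrs)
except PermissionError:
    pass
```

In the real design the enforcement happens inside the TPM chip rather than in software, so even a privileged attacker on the booted OS cannot bypass the policy check.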
By comparison, the second case is motivated by strategic considerations. AOL deemed interoperability between IM clients a threat to its business interests. That is not an unreasonable view: interop gives challengers in the market a leg up against entrenched incumbents by lowering switching costs. At the time AOL was the clear leader, far outpacing MSN and similar competitors in number of subscribers. The point is that AOL was not acting to protect its customers’ privacy or save them from harm; AOL was only trying to protect the AOL bottom line. Since IM was offered as a free service, the only potential sources of revenue were:
- Displaying advertisements in the client
- Selling data obtained by surveilling users
- Other applications bundled with the client
The first one requires absolute control over the client: if an MSN Messenger user connects to the AOL network, that client will be displaying ads selected by Microsoft, not AOL. In principle the second one still works as long as the customer is using AIM: every message sent is readable by AOL, along with metadata such as usage frequency and the IP addresses used to access the service. But a native client can collect far more information by tapping into the local system: hardware profile, other applications installed, even browsing history, depending on how unscrupulous the vendor is. (Given that AOL deliberately planted a critical vulnerability, there is no reason to expect they would stop short of mining browsing history.) The last option also requires full control over the client. For example, if Adobe were to offer AOL 1¢ for distributing Flash with every install of AIM, AOL could only collect this revenue from users installing the official AIM client, not interoperable ones that do not bundle Flash. In all cases AOL stands to lose money if people can access the IM service without running the official AOL client.
The final hypothetical is a textbook example of leveraging monopoly in one business— online search for Google— to gain market share in an “adjacent” vertical by artificially bundling two products. That exact pattern of behavior was at the heart of the DOJ antitrust lawsuit against MSFT in the late 1990s, alleging that the company illegally used its Windows monopoly to handicap Netscape Navigator and gain an unfair market-share advantage for Internet Explorer. Except that by comparison the Google example is even more stark. While it was not a popular argument, some rallied to MSFT’s defense by pointing out that the contours of an “operating system” are not fixed, and web browsers may one day be seen as an integral component, no different than TCP/IP networking. (In a delightful irony, Google itself proved this point later by grafting a lobotomized Linux distribution around the Chrome web-browser to create ChromeOS. This was an inversion of the usual hierarchy: instead of being yet another application included with the OS, the browser is now the main attraction that happens to include an operating system as a bonus.) There is no such case to be made for creating a dependency between search engines in the cloud and the web browsers used to access them. If Google resorted to using technologies such as measured boot to enforce that interdependency— and in fairness, it has not; this remains a hypothetical at the time of writing— the company would be adding to a long rap-sheet of anticompetitive behavior that has placed it in the crosshairs of regulators on both sides of the Atlantic.