GoDaddy outage and lessons for certificate revocation (1/2)

One of the unintended side-effects of recent GoDaddy outage was providing a data point on the debate around whether certificate revocation checks can be made to fail-hard. In addition to being the registrar and DNS provider for several million websites, GoDaddy also operates a certificate authority. The outage inadvertently created an Internet-wide test of  revocation checks to operate in offline mode.

Quick recap: when establishing a secure connection  to a website using SSL, web browsers will verify the identity of that site, in a format known as X509 digital certificates. Most of these checks can be done locally and are very efficient in nature. For example, verifying that the certificate has been issued by a trusted authority, that it is not expired and the validated name specified in the certificate is consistent with the name of the website the user expected to visit– eg what appears on the address bar. But there is an additional check that may require looking up additional information from the web: verifying that the certificate has not been revoked by since its time of issuance.

It is fair to say that web browser developers hate revocation checking because of its performance implications. The web is all about speed, with each browser vendor cherry picking their own set of performance benchmarks. Over time an extensive bag of tricks has been developed to squeeze every last ounce of bandwidth available from the network and fetch those webpages an imperceptible fraction of a second faster than the competing browser. Revocation checking throws a wrench into that by stopping everything in its tracks, until information about the validity of the certificate has been retrieved. (In fact it is more than one connection that is stalled: one of the standard speed improvements involves creating multiple connections to request resources in parallel, such as the style-sheet and images concurrently. All of these are blocked on vetting the certificate status.)

Almost always, the revocation checks pass uneventfully. In rare cases the client may discover that a certificate was indeed revoked, saving the user from a man-in-the-middle attack, although there have been no recorded cases of that happening in recent memory. (For the most epic CA failures such as DigiNotar incident, the certificates in question were blacklisted by out-of-band channels and the attacks stopped as soon as they were publicized, long before revocation checks could save the day.) Then there is the third possibility of a gray area: revocation check is inconclusive because the network request to ascertain the certificate status has failed. A conservative design favors failing-safe and assuming the worst. In reality, due to historically low confidence in services providing revocation status (imagined or real) most implementations fail open, and assume the certificate is not revoked. For example Internet Explorer does not even warn about failed checks in the default configuration. This is the basis for rants to the effect that revocation checking is useless— after all, in any scenario where an adversary has enough control over the network to orchestrate a man-in-the-middle attack, they are also capable of blocking any traffic from that user to the revocation provider.

Luckily the situation is not quite as bleak as described above. There are several optimizations in place to avoid having costly, unreliable network look ups for each certificate validated. First there is aggregation within each CA: a certificate revocation list or CRL contains list of all revoked certificates for that issuer. This spreads the network cost of a revocation across multiple certificates. This list in turn can be broken up into incremental updates called delta CRLs, to avoid downloading an ever-expanding list each time.  Finally each update has a well-defined lifetime, including an expected point when the next update will be published.  Combined with the fact that CRLs are signed,  they can be cached both locally and by intermediate proxies at the edge of the network for scaling.

The story gets more complicated when considering there is another way to perform revocation checking: online certificate status protocol or OCSP. This is a more targeted approach to querying trust status– instead of downloading a list of all the bad certificates, one queries a server about a particular certificate identified by serial number. OCSP does not amortize cost over multiple queries, since finding the status for one website does not help answer the same question about a different one. On the bright side, OCSP responses also have well-defined lifetimes times much like CRLs, obviating the need for additional queries during that period. Also in the spirit of CRLs, they are signed objects, permitting caching by intermediate proxies to decentralize distribution.

All of this would suggest that perhaps clients could cope with a temporary outage of  revocation servers, even if they opted for hard-fail approach. (Recall that hard-fail means connections will not proceed without positive proof that certificate was not revoked.) In principle all that caching could permit users to still visit websites when the revocation infrastructure becomes unreachable, when both the CRL distribution point (CDP) and OCSP responder are down– exactly what happened to GoDaddy.

The question is, how well would that have worked out for GoDaddy during its outage?
Not very well, it turns out.



Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s