Reliability in the cloud: hidden dependencies

The much-discussed Skype outage last month, eventually attributed in part to Windows Update (which was itself functioning exactly as designed), leads to a number of interesting observations.

  • Distributed peer-to-peer systems were heralded for their reliability, owing to the lack of any single point of failure (SPOF). Skype routes calls over a P2P network of machines owned by its own members (although authentication is centralized), and if there is anything in plentiful supply on the Net, it’s machines with idle time and bandwidth. The outage suggested that the parts were not quite as loosely coupled as classic distributed systems theory would have one believe: they could fail in a quite coordinated manner because they all sport the identical configuration.
  • Diversity fans will probably jump at the occasion to point out the evils of software uniformity. If some larger fraction of the Skype clients had been running Linux or Mac, the argument runs, they would not all have rebooted at the same time and the outage would have been avoided. But this is unlikely to make much of a quantitative difference, as even an egalitarian market divided three ways among Windows, Linux and Mac would still have a substantial number of nodes of any one variant. It is also possible to get diversity of behavior on the cheap without diversity of platform: in this case, randomly spreading out patch installation and reboot times will do the trick.
  • This was a completely unexpected interaction between two cloud services, one for VoIP and one for software distribution. It’s taken for granted that two client applications installed on the same machine can have allergic reactions and blow up the machine. (This is the well-known DLL hell problem in Windows.) But the dramatic demonstration that Windows Update (WU), a service hosted “out there” in the cloud, could impact another completely independent service hosted elsewhere is news.
  • What does this mean for those engineering services? Keeping Skype up and running was not among the design criteria for WU. If anything, the goal of getting security patches out to vulnerable machines ASAP increases the pressure on Skype, because it reboots all machines quickly. A lot of software these days has auto-update capability, much of it poorly designed and not even giving users a chance to consent. It’s not a stretch to assume that one could initiate a forced reboot of most Windows or Mac machines in quick succession. Is Skype at fault, then, for depending on the uptime of machines that it has no control over? The architects did the right thing and hedged their bets statistically by requiring only some fraction of their nodes to be operational. Is that more of a gamble than building a giant data center stacked with wall-to-wall racks of servers? (It’s certainly cheaper and more efficient, and more environmentally friendly considering the power usage of the modern data center. And redundancy would have required multiple data centers distributed around the world.) Until now the gamble paid off, but one day WU pushed the system beyond its critical threshold.
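
To make the point about spreading out reboots concrete, here is a rough sketch in Python. It is a toy model and not a description of how Windows Update or Skype actually behaves: the node count, the five-minute downtime and the four-hour window are all made-up numbers, and the function names are hypothetical. Each node goes offline once for a fixed time, and the code reports the smallest fraction of nodes that is online at any sampled minute.

```python
import random


def min_fraction_online(reboot_starts, downtime, horizon):
    """Smallest fraction of nodes online at any whole minute in [0, horizon)."""
    n = len(reboot_starts)
    worst = 1.0
    for t in range(horizon):
        # A node is offline from its reboot start until downtime minutes later.
        offline = sum(1 for s in reboot_starts if s <= t < s + downtime)
        worst = min(worst, (n - offline) / n)
    return worst


def simulate(num_nodes=1000, downtime=5, window=0, seed=0):
    """Reboot every node once, at a random minute within `window` (0 = all at once).

    All parameter values are illustrative assumptions, not measured figures.
    """
    rng = random.Random(seed)
    starts = [rng.uniform(0, window) for _ in range(num_nodes)]
    return min_fraction_online(starts, downtime, horizon=int(window) + downtime + 1)


if __name__ == "__main__":
    print("all at once:     ", simulate(window=0))
    print("spread over 4 h: ", simulate(window=240))
```

With no spreading, every node is down at the same instant and the online fraction drops to zero. With the same reboots scattered over a few hours, only a small share of nodes is ever down together, so a design that needs "some fraction" of nodes alive stays on the safe side of its threshold.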

