Time-Warner cable experienced an outage in Internet access two weeks ago in New York that lasted almost a full day. The service is a joint offering with Earthlink, so it is not clear where the blame goes. Such large-scale service failures can happen: a number of undersea cables were cut in the Middle East, affecting net access in Egypt among other countries. The fact that this can happen in Manhattan of all places is another story. But even more disconcerting is the way Time-Warner believes customers should be compensated: by offering a refund for a single day, which amounts to roughly 3% off the monthly bill. TW/Earthlink is trying to price reliability here, and they have significantly undervalued it.
No service can guarantee 100% uptime. But a service that is advertised as available 99.999% of the time is not simply worth another 2.999% over one that only works 97% of the time. It is far more valuable, because at that level diminishing returns have kicked in: adding one more nine to the availability number requires a lot of investment. As the service-level guarantee increases, the system designers must contend with increasingly esoteric and improbable events. A very simplified example: a RAID array can ensure that a computer will survive a single drive failure (an event that happens with disturbingly high frequency for machines that are running under load all the time) by using multiple drives as redundancy. So if disks fail 1% of the time and this is the most likely problem, 99% uptime is exceeded by investing in improved storage solutions. But suppose there is a smaller 0.1% chance that the entire data center goes up in smoke, or that the power fails for longer than the on-site generators can compensate. This is a lower-probability event, but preparing for it is more difficult. Adding more drives does not help because their failures are correlated: the same fire will take out all of them. Dealing with the less likely but more catastrophic event calls for building a brand new data center somewhere else and adding software logic to handle fail-over when the primary site goes down, a much more expensive proposition.
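To put rough numbers on this, here is a small Python sketch. It is an illustration under stated assumptions, not a model of any real provider: it assumes a 30-day billing month, treats the 1% and 0.1% figures above as the fraction of time each component is unavailable, and assumes drive failures are independent of one another and of the site-wide failure. The function names are made up for this example.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month


def downtime_hours(availability_pct, period_hours=HOURS_PER_MONTH):
    """Expected hours of downtime in a period at a given availability percentage."""
    return (1 - availability_pct / 100) * period_hours


def combined_failure(p_drive, n_drives, p_site):
    """Chance the service is down: every drive has failed, or the whole site has.

    Assumes drives fail independently of each other and of the site-wide event.
    """
    p_all_drives = p_drive ** n_drives
    return 1 - (1 - p_all_drives) * (1 - p_site)


# How much downtime each availability level allows per month.
for pct in (97, 99, 99.9, 99.999):
    print(f"{pct}% available: {downtime_hours(pct):.4f} hours down per month")

# Extra drives stop helping once the site-wide risk dominates.
for n in (1, 2, 3):
    print(f"{n} drive(s): failure chance {combined_failure(0.01, n, 0.001):.6f}")
```

Under these assumptions, 97% availability allows roughly 21.6 hours of downtime a month, while 99.999% allows about 26 seconds. The second loop shows the point about correlated failures: going from one drive to two cuts the failure chance by about a factor of ten, but a third drive barely moves it, because the 0.1% site-wide risk sets a floor that no amount of local redundancy can get under.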
Time-Warner assumed that if customers are paying $30 for an almost always reliable service, they should have no problem paying a few percent less for one that experiences a massive outage every month. In fact, Internet access advertised up front as working only 97% of the time would be worth much less, and would give customers a stronger incentive to switch to an alternative such as fiber to the home.
Update: TW/Earthlink experienced another outage on Friday. This time they were apparently prepared: customers calling the support number were greeted with an automated recording announcing that New York was experiencing service problems. Meanwhile the otherwise reliable Verizon wireless access card slowed to a crawl when this blogger pressed it into service as a back-up, probably because other users had the same idea and Verizon did not expect to become the alternative broadband provider for a chunk of Manhattan.