A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.
-- Leslie Lamport
The Business Technology blog at the Wall Street Journal has a post about a service outage on Amazon's S3, a web service for online data storage.  Service subscribers are expressing anger at the three-hour outage because it undermines Amazon's claim of 99.99 percent availability.  Blog author Ben Worthen says that the incident raises questions about the prudence of entrusting your computing operations to a third party.
A two to three hour outage, while not common, is the sort of problem that a corporate information-technology department experiences from time to time. It’s bad for business, but not the end of the world. But when you’re dealing with something that challenges the status quo routine, problems invariably draw attention to shortcomings with the model. In this case, it isn’t your techies that are trying to get your business up and running, but Amazon’s.
Putting aspects of IT operations in the hands of others is not really that new or rare.  Lots of companies do not keep their customer-facing webservers in-house.  Instead, they place them at co-location centers, which have the responsibility of providing reliable network connectivity, power, and security.  Lots of companies have also outsourced their mail servers, relying on a hosting service to provide e-mail delivery and storage.

I think Amazon's S3 outage is more significant for two reasons:
  1. Amazon's web services, like S3 and EC2, are getting more attention in the mainstream press.  Amazon's VP of Product Development is quoted indirectly by the Associated Press as saying that the goal of their web services is to help entrepreneurs focus on ideas rather than server crashes.
  2. Highly consolidated services produce more severe outages.  This is true not only for hosted services like S3 but also for in-house systems where a single machine hosts several servers, each of which runs on a virtual machine.
For cloud computing to live up to its full promise, service providers will need to build in ways to recover quickly from server outages, perhaps through some form of redundancy or clustering.  Customers will also need to rethink how they build their applications around these services so that a failure is something the application expects and handles.
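
One way a customer's application might account for a storage failure is to retry briefly and then fall back to a second copy of the data.  The Python sketch below is only an illustration under stated assumptions: the names fetch_with_fallback and StorageUnavailable are hypothetical, and each "backend" is just a function that returns the stored bytes or raises an error.  It is not Amazon's API or anything S3 itself provides.

```python
import time


class StorageUnavailable(Exception):
    """Hypothetical error a backend raises when it cannot serve a request."""


def fetch_with_fallback(key, backends, retries=3, base_delay=0.1, sleep=time.sleep):
    """Try each backend in order, retrying each with exponential backoff.

    `backends` is a list of callables taking a key and returning the data.
    These stand in for a primary store and one or more replicas.
    """
    last_error = None
    for backend in backends:
        for attempt in range(retries):
            try:
                return backend(key)
            except StorageUnavailable as err:
                last_error = err
                # Back off before retrying the same backend; skip the wait
                # after the final attempt and move straight to the next one.
                if attempt < retries - 1:
                    sleep(base_delay * 2 ** attempt)
    raise StorageUnavailable(f"all backends failed for {key!r}") from last_error
```

An application would pass its primary store first and a replica (say, a copy kept at another provider or on local disk) second.  The trade-off is the usual one: keeping a second copy costs money and effort, but it means a provider's outage degrades the application instead of stopping it.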

As for Amazon, I expect them to recover from this embarrassment.  They probably have one of the most advanced software development teams in the world, having tackled some very challenging system tasks in the past.  They'll figure this one out, too.