In quite a week for Internet-based problems, the services side of Amazon has promised a full inquiry into the problems that beset its Northern Virginia data center last week.

The Board is Green

With Amazon's status page back to a healthy row of green ticks, it is now time for the company to analyze and reflect on what went wrong in the enterprise 2.0 world at its Northern Virginia data center, how to prevent it from happening again, and how to compensate those who suffered from the failure.

The outage caused a number of small but popular sites, including foursquare and Reddit, to lose their online presence, while many Amazon business users lost access to their data. According to Amazon, the outage left a small fraction (0.07%) of the affected data unrecoverable, and you have to hope those affected had effective backup strategies.

Lessons to be Learned

For Amazon, its customers, and any business that uses cloud services, there are a few obvious lessons to learn here. We don't need to go into detail, but the relevant executives and department heads should ask, and quickly answer, some basic questions:

  • Do we have a valid backup system? Is it working?
  • Do we have an alternate cloud or local data store/service? If not, can we start finding one?
  • If our cloud services failed, what parts of our business could continue?
  • If our site/service failed, how do we reach out to customers?
  • What is our service-level agreement and do we need to change it?
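The first question on that list, whether the backup is actually working, can be checked mechanically. What follows is only a minimal sketch, assuming the backup is a plain directory copy that can be read alongside the original; the names `sha256_of` and `verify_backup` are hypothetical, and a real cloud backup would need the provider's own tooling to fetch the copies first.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: str, backup_dir: str) -> list[str]:
    """List files that are missing from the backup or differ from the source."""
    source, backup = Path(source_dir), Path(backup_dir)
    problems = []
    for original in sorted(p for p in source.rglob("*") if p.is_file()):
        relative = original.relative_to(source)
        copy = backup / relative
        if not copy.is_file():
            # The file exists in the source but was never backed up.
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(copy) != sha256_of(original):
            # The backup copy exists but its contents differ.
            problems.append(f"mismatch: {relative.as_posix()}")
    return problems
```

An empty list means every source file has an identical copy in the backup. Run on a schedule, a check like this turns "Is it working?" from a hopeful assumption into a routine report.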

In reality, this specific outage is unlikely to happen again -- Amazon will learn the lessons and fix this problem. But something similar at another provider or service will occur every now and again, and act as a timely reminder to all in the web business to keep the i's dotted and t's crossed.

The Customers Speak

Amazon customers affected by the outage have started putting their services back together and getting on with life. Reddit is still running in a reduced emergency mode, while there's a full story on HootSuite's recovery on BrainYard.

Foursquare looks to be back to normal, while Drupal Gardens noted that:

"Late last week, Drupal Gardens experienced an unexpected delay in service due to a widely-publicized outage at our partner Amazon Web Services (AWS). Sites were restored through Friday and, by midnight Friday, all sites were back up and fully functional.

"The purpose of this message is to offer our sincere apologies to you for this disruption and to give affected customers a free month of service. Annual subscribers will get a one-month extension. For those with monthly plans, your next month of service will be completely free of charge."

There are also many articles highlighting the lessons to be learned and counteracting some of the wrong assumptions that people are making. As Amazon gives out the actual details, there will be a lot of peering into the technical failings. It will make for interesting reading and could help others avoid the same pitfalls, but it won't stop something similarly dramatic from happening in the future, because there is simply no such thing as 100% reliable.