Most cloud vendors promise great uptime and brag about having redundant offsite facilities.

How many are planning for DUAL (or MULTIPLE) datacenter failures?

Twitter, Salesforce, and Amazon have each had massive outages caused by multiple, cascading failures.



Twitter went down last night for several hours because – the company has now confirmed – redundancy in the micro-blogging site’s data centres failed to kick in.

The result was a catastrophic system collapse, Twitter’s engineering veep Mazen Rawashdeh explained:

The cause of today’s outage came from within our data centers. Data centers are designed to be redundant: when one system fails (as everything does at one time or another), a parallel system takes over. What was noteworthy about today’s outage was the coincidental failure of two parallel systems at nearly the same time.

The company is now “aggressively” investigating what Rawashdeh described as an “infrastructural double-whammy” to find out what went wrong with its failover system and to prevent it happening in the future.

“On behalf of our infrastructure team, we apologise deeply for the interruption you had today. Now – back to making the service even better and more stable than ever,” the exec added.

via Twitter titsup: Our failover was actually just FAIL ALL OVER • The Register.