"The service that keeps track of the state of the world has a fail-safe mode whe...

Domenic_S · on July 6, 2012

That was my thought as well when reading that sentence (actually, I was thinking "was this overengineered for no good reason?"). However, they go on to say that there is a purpose -- mitigating "network partition events", which I can only guess refers to AWS's version of netsplits.

It sounds like there was some technical debt to that implementation, but hey, I for one am glad they gave us some insight into what happened.

adrianco · on July 7, 2012

"Technical debt" is a nice way of saying it had bugs. It was mostly a configuration problem, if it had been setup better we would have had no outage or a much shorter one. The work to test all our zone level resilience (Chaos Gorilla) was underway but hadn't got far enough to uncover this bug.