
"The service that keeps track of the state of the world has a fail-safe mode where it will not remove unhealthy instances in the event that a significant portion appears to fail simultaneously."

You should keep your logic dumb.



That was my thought as well when reading that sentence (actually, I was thinking "was this overengineered for no good reason?"). However, they go on to say that there is a purpose -- mitigating "network partition events", which I can only guess refers to AWS's version of netsplits.
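For anyone trying to picture the fail-safe in the quoted sentence, here is a minimal sketch of how I imagine it, not the actual implementation. The function name, the `health` mapping, and the 50% threshold are all my own assumptions; the quote only says "a significant portion".

```python
from typing import Dict, List

# Hypothetical threshold: the quoted text only says "a significant portion".
FAIL_SAFE_FRACTION = 0.5


def instances_to_remove(health: Dict[str, bool],
                        fail_safe_fraction: float = FAIL_SAFE_FRACTION) -> List[str]:
    """Return the unhealthy instances that should be removed from the registry.

    If the unhealthy share reaches fail_safe_fraction, assume the checker
    itself may be cut off (e.g. by a network partition) and remove nothing.
    """
    if not health:
        return []
    unhealthy = sorted(name for name, ok in health.items() if not ok)
    if len(unhealthy) / len(health) >= fail_safe_fraction:
        return []  # fail-safe mode: keep everything registered
    return unhealthy
```

With one failure out of four, that instance is removed; with three out of four failing at once, nothing is removed. In this sketch a real simultaneous failure looks the same as a partition, so genuinely dead instances would stay registered until something else intervenes.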

It sounds like there was some technical debt to that implementation, but hey, I for one am glad they gave us some insight into what happened.


"Technical debt" is a nice way of saying it had bugs. It was mostly a configuration problem; if it had been set up better, we would have had no outage or a much shorter one. The work to test all our zone-level resilience (Chaos Gorilla) was underway but hadn't got far enough to uncover this bug.



