"The service that keeps track of the state of the world has a fail-safe mode where it will not remove unhealthy instances in the event that a significant portion appears to fail simultaneously."
That was my thought as well when reading that sentence (actually, I was thinking "was this overengineered for no good reason?), however they go on to say that there is a purpose -- mitigating "network partition events" which I can only guess is referring to AWS's version of netsplits.
It sounds like there was some technical debt to that implementation, but hey, I for one am glad they gave us some insight into what happened.
"Technical debt" is a nice way of saying it had bugs. It was mostly a configuration problem, if it had been setup better we would have had no outage or a much shorter one. The work to test all our zone level resilience (Chaos Gorilla) was underway but hadn't got far enough to uncover this bug.
You should keep your logics dumb.