If your sanity check in your load balancer passes only on a 200, it will fail on...

nbm · on Feb 17, 2015

Okay, if we're only considering active health checks, then I'm not sure any load balancer considers a 5xx a success by default, let alone a "production" load balancer.

For a non-healthcheck 5xx response, it is almost never clear that the host itself is responsible. 5xx is the correct response when there is an error on the server side (i.e., not a client error and not a successful response), but it doesn't mean the server is the problem; it just means that the server experienced a problem in serving the request. That failure may itself come from one of many RPCs that server made to other services. In that case, all web servers behind the load balancer for that request type will exhibit the 5xx response (at some rate, and depending on any state in connection sharing/reuse between the servers and their upstream service), and all would subsequently be removed, which isn't the correct outcome at all.
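To make the distinction concrete, here is a minimal sketch of one way a balancer could avoid ejecting the whole pool: only treat a host as suspect when its 5xx rate is an outlier compared with every other host, and cap how many hosts can be removed at once. The function name, the stats format, and all thresholds are hypothetical choices for illustration, not any particular load balancer's behavior.

```python
from typing import Dict, List, Tuple


def hosts_to_eject(
    stats: Dict[str, Tuple[int, int]],
    min_requests: int = 100,        # assumed: ignore hosts with too little traffic
    outlier_factor: float = 3.0,    # assumed: how much worse than its peers a host must be
    min_error_rate: float = 0.05,   # assumed: absolute floor before a host is considered
    max_eject_fraction: float = 0.34,  # assumed: never remove more than ~1/3 of the pool
) -> List[str]:
    """Return hosts whose 5xx rate is an outlier relative to the rest of the pool.

    `stats` maps host name -> (total_requests, responses_5xx) for a recent window.
    """
    candidates = []
    for host, (total, errors) in stats.items():
        if total < min_requests:
            continue  # not enough data to judge this host
        rate = errors / total
        # Compare against the combined error rate of every *other* host.
        other_total = sum(t for h, (t, _) in stats.items() if h != host)
        other_errors = sum(e for h, (_, e) in stats.items() if h != host)
        other_rate = other_errors / other_total if other_total else 0.0
        if rate >= min_error_rate and rate > outlier_factor * other_rate:
            candidates.append((rate, host))
    # Cap ejections so a widespread failure cannot empty the pool.
    limit = int(len(stats) * max_eject_fraction)
    candidates.sort(reverse=True)  # worst offenders first
    return [host for _, host in candidates[:limit]]
```

With one bad host among healthy peers, only that host is returned. When every host shows a similar 5xx rate (the shared-upstream case described above), nothing is ejected, because no host stands out from the others.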

As someone who has had the job title "Systems Administrator" and the job title "Software Engineer", and currently has neither but still does exactly what he's always done - solving problems by understanding systems and, among other things, by writing code - I wouldn't consider load balancing and failure domains/types/handling as the sole or even primary purview of a systems administrator - especially in the case of large installations.

0xbadcafebee · on Feb 17, 2015

There are different kinds of load balancers, and so different responses to different criteria. If you don't want to serve 500 error pages to all your users, one of your load balancers (or "proxy layers", for more or less intelligent forms of load balancers) should be doing something when you're getting 500s, like moving traffic around or serving different content. 500s are far too commonly caused by a machine-specific or network-specific problem to just assume they'll resolve themselves or are unresolvable.
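As a rough illustration of "moving traffic around, or serving different content", the sketch below models a proxy layer that retries a request on another backend when it sees a 5xx and falls back to a static response if every attempt fails. Backends are plain callables here, and `proxy_request`, `FALLBACK`, and `max_attempts` are hypothetical names; a real proxy would also need to restrict retries to requests that are safe to repeat.

```python
from typing import Callable, Iterable, Tuple

Response = Tuple[int, str]          # (status code, body)
Backend = Callable[[str], Response]  # stand-in for a real upstream call

# Assumed fallback content served when no backend can answer.
FALLBACK: Response = (503, "Service temporarily unavailable")


def proxy_request(
    path: str,
    backends: Iterable[Backend],
    max_attempts: int = 2,
    fallback: Response = FALLBACK,
) -> Response:
    """Try backends in order; on a 5xx or connection error, move to the next one."""
    attempts = 0
    for backend in backends:
        if attempts >= max_attempts:
            break  # bound the extra load retries put on the pool
        attempts += 1
        try:
            status, body = backend(path)
        except ConnectionError:
            continue  # network-specific failure: try another backend
        if status < 500:
            return status, body  # success or client error: pass through as-is
    return fallback  # every attempt failed: serve different content
```

A 4xx is passed straight through because retrying elsewhere would not change it; only server-side failures trigger a move to another backend.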