
If the sanity check in your load balancer passes only on a 200, it will fail on a 500, disable the host, and keep retrying until it gets a 200 again. It helps to have more than one request to try in your sanity check.
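
A minimal sketch of that kind of multi-request sanity check, in Python. The names here (`passes_sanity_check`, `fake_probe`, the paths) are made up for illustration and do not come from any particular load balancer; `probe` stands in for whatever issues the HTTP request and returns a status code.

```python
from typing import Callable, Iterable


def passes_sanity_check(probe: Callable[[str], int], paths: Iterable[str]) -> bool:
    """Return True only if every probed path answers with exactly 200."""
    return all(probe(path) == 200 for path in paths)


# Hypothetical stand-in for a real HTTP client: maps path -> status code.
fake_statuses = {"/ping": 200, "/login": 200, "/search": 500}


def fake_probe(path: str) -> int:
    return fake_statuses[path]


# A single trivial probe says the host is fine...
print(passes_sanity_check(fake_probe, ["/ping"]))  # True
# ...while probing several request paths catches the broken one.
print(passes_sanity_check(fake_probe, ["/ping", "/login", "/search"]))  # False
```

With only `/ping` in the check, the host stays in rotation even though `/search` is returning 500s, which is the case for trying more than one request.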

For "random" requests, if you get a 500 response, requests of the same "type" should no longer be sent to that host. This can be changed based on scoreboard settings. Depending on the context, you may choose to serve cached content on 500s. This is one of the reasons multiple layers of cache and application intelligence are so handy.
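
Roughly, the idea can be sketched like this. It is a toy model and not any real proxy's scoreboard: the `Scoreboard` class, the `respond` helper, the host and request-type names, and the rule that a later 200 re-enables the type are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class Scoreboard:
    """Tracks, per host, which request types have recently returned a 5xx."""

    def __init__(self) -> None:
        self._blocked: Dict[str, Set[str]] = defaultdict(set)

    def record(self, host: str, request_type: str, status: int) -> None:
        if status >= 500:
            # Stop sending this type of request to this host.
            self._blocked[host].add(request_type)
        elif status == 200:
            # Assumed policy: a later success re-enables the type.
            self._blocked[host].discard(request_type)

    def eligible(self, host: str, request_type: str) -> bool:
        return request_type not in self._blocked[host]


def respond(status: int, body: str, cached: Optional[str]) -> Tuple[int, str]:
    """Serve cached content instead of a 5xx when a cached copy exists."""
    if status >= 500 and cached is not None:
        return 200, cached
    return status, body


board = Scoreboard()
board.record("web1", "search", 500)
print(board.eligible("web1", "search"))  # False
print(board.eligible("web1", "login"))   # True
print(respond(500, "error page", "cached results"))  # (200, 'cached results')
```

Note that only the failing request type is pulled from `web1`; other types, and the same type on other hosts, keep flowing.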

I'm not underselling anything. Domain-specific knowledge comes with experience. If you ask a mechanical engineer 'What's wrong with my car if it makes the noise "bang-sputz-sputz-screech-screech-screech?"', the engineer will start making lists of which parts can make each of those noises and cross-referencing them to see under what conditions a combination might happen. The mechanic will immediately tell you that for your 1991 Mercury Sable, the A/F mixture is off, the MAF sensor needs cleaning, the radiator has a crack and the accessory belt needs replacing. Sysadmin is a trade, not a skill.



Okay, if we're only considering active health checks, then I'm not sure any load balancer considers a 5xx a success by default, let alone a "production" load balancer.

For a non-healthcheck 5xx response, it is almost never clear that the host itself is responsible for the 5xx response. 5xx is the correct response when there is an error on the server side (i.e., not an error on the client side, not a correct response), but it doesn't mean the server is the problem. It just means that the server experienced a problem in serving the request. That failure may itself come from one of many RPCs that the server made to other services. As such, all web servers behind the load balancer for that request type will exhibit the 5xx response (at some rate, and depending on any state in connection sharing/reuse between the servers and their upstream service), and all would subsequently be removed. Which isn't the correct response at all.
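
To make that failure mode concrete, here is a deliberately simplified sketch. It assumes every web server depends on one shared upstream service and fails every request while that upstream is down (in practice it would be "at some rate"); `serve`, `naive_eject`, and the host names are hypothetical.

```python
from typing import List


def serve(upstream_ok: bool) -> int:
    """Every web server makes an RPC to the same shared upstream service."""
    return 200 if upstream_ok else 500


def naive_eject(hosts: List[str], upstream_ok: bool) -> List[str]:
    """Keep only hosts that did not answer 5xx on live traffic."""
    return [host for host in hosts if serve(upstream_ok) < 500]


pool = ["web1", "web2", "web3"]
print(naive_eject(pool, upstream_ok=True))   # ['web1', 'web2', 'web3']
print(naive_eject(pool, upstream_ok=False))  # []
```

When the upstream fails, no individual web server is at fault, yet the ejection rule empties the whole pool.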

As someone who has had the job title "Systems Administrator" and the job title "Software Engineer", and currently has neither but still does exactly what he's always done - solving problems by understanding systems and, among other things, by writing code - I wouldn't consider load balancing and failure domains/types/handling as the sole or even primary purview of a systems administrator - especially in the case of large installations.


There are different kinds of load balancers, and as such different responses to different criteria. If you don't want to serve 500 error pages to all your users, one of your load balancers (or "proxy layers", for more or less intelligent forms of load balancers) should be doing something when you're getting 500s, like moving traffic around, or serving different content. It's far too common for 500s to be due to a machine-specific or network-specific problem to just assume they'll resolve themselves or are unresolvable.



