
> Because with tens or hundreds of thousands of racks it would be a full-time job to go through such alerts.

Sorry, I'm merely a lowly sysadmin and not a godly Google SRE, but environment monitoring is a solved problem, humans need not apply.



> Sorry, I'm merely a lowly sysadmin and not a godly Google SRE

Don't really like how this is framed - plenty of Google SREs are born and bred sysadmins.


My comment was a bit flippant, I agree, and I know Google is a huge company, so there are a great many people there I have never interacted with.

That said, the Google approach _does_ appear to quite literally throw away institutional knowledge from high-throughput, high-uptime, automation-focused systems engineering.

I don’t discount that Google SREs are highly intelligent and often know systems internals to the same or a greater extent than high-performing sysadmins. But there does appear to be a desire to discredit anything seen as “traditional”, and I think that can be harmful, as in this case.

Sometimes they know better, but there is a failure to ask the question: “what does it cost to act only on symptoms, not causes?”


> Sometimes they know better, but there is a failure to ask the question: “what does it cost to act only on symptoms, not causes?”

I imagine hardware breaking is the cost -- but as long as an engineer's time is worth more than the hardware, that tradeoff will be made.


Debugging and replacing are far more time-consuming and costly, in both hardware money and engineer money.


He's talking about the alerts left over after you filter them with automation.

When you have enough machines, 10 of them are always running hot.
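
A quick back-of-the-envelope makes the point (the numbers here are made up purely for illustration):

```python
fleet_size = 100_000  # machines -- illustrative, not anyone's real number
p_hot = 1e-4          # chance a given machine is running hot at any instant
print(fleet_size * p_hot)  # 10.0 -- at this scale, some machines are always hot
```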


The "hot" we're talking about here is beyond what the machines are specced for, and it suggests something is wrong (in this case the rack was tilted, which was messing with the cooling system). That should never happen under normal operation; if it does, it suggests a bigger problem (either the whole building is too hot, or there's a design flaw in the cooling system).


Lots of things that should "never" happen start to happen with frightening regularity once you have enough machines.

Few of them are worth paging someone.


Those are two extremes.

For me, when a machine reports as abnormal, it decommissions itself and files a ticket to be looked at. We have enough extra capacity that it's OK to do this for 5% of compute before we have issues, and we have an alert if it gets close to that.

If you’re doing anything other than that, then either your machines are abnormal and still serving, which is scary because they’re now in an unknown state, or you’re able to just throw out anything that seems weird, which might also hide issues with the environment.
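
A rough sketch of that drain-and-ticket pattern (every name here -- `Machine`, `page_oncall`, `file_ticket`, the 5% budget -- is a hypothetical placeholder, not any real fleet-management API):

```python
from dataclasses import dataclass

DRAIN_BUDGET = 0.05  # the 5% of compute we can lose before it becomes an incident

@dataclass
class Machine:
    name: str
    state: str = "serving"  # "serving" or "drained"

def handle_abnormal(machine: Machine, fleet: list[Machine]) -> None:
    """Drain an abnormal machine and file a ticket; page only near the budget."""
    drained = sum(1 for m in fleet if m.state == "drained")
    if (drained + 1) / len(fleet) > DRAIN_BUDGET:
        # Many machines out at once is an environment problem, not a
        # machine problem -- this is the one case worth waking someone for.
        page_oncall(f"drain budget exceeded: {drained + 1}/{len(fleet)} drained")
        return
    machine.state = "drained"  # stop scheduling work onto it
    file_ticket(f"{machine.name} reported abnormal; needs inspection")

def page_oncall(msg: str) -> None:  # stand-in for a real paging system
    print("PAGE:", msg)

def file_ticket(msg: str) -> None:  # stand-in for a real ticket queue
    print("TICKET:", msg)
```

The budget check is the whole point: one abnormal machine is routine, but many at once points at the environment rather than the machines.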


Whether you page someone or not, a machine hitting thermal throttling repeatedly should be just as notable as the abnormal failure rate that started the investigation.
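
One way to make repeated throttling "notable" without paging is a sliding window of throttle events per machine; this sketch assumes some poll loop feeds events in, and the window and limit are illustrative:

```python
import time
from collections import deque

class ThrottleTracker:
    """Track repeated thermal-throttle events on one machine (hypothetical)."""

    def __init__(self, window_s: float = 3600.0, limit: int = 5):
        self.window_s = window_s
        self.limit = limit
        self.events: deque[float] = deque()

    def record(self, now: float | None = None) -> bool:
        """Log one throttle event; return True once it becomes 'notable'."""
        now = time.time() if now is None else now
        self.events.append(now)
        # Drop events that have aged out of the window.
        while self.events and self.events[0] < now - self.window_s:
            self.events.popleft()
        # Repeated throttling is worth a ticket, even if not a page.
        return len(self.events) >= self.limit
```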


> When you have enough machines, 10 of them are always running hot.

Sure, but isn't there rack-level topology information that should warn when all of the elements in a rack are complaining? That should statistically be a pretty rare occurrence, except over short periods, say, < 5-10 hours.
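
That correlation is straightforward once you have the machine-to-rack mapping; a minimal sketch (the input shapes and the 80% threshold are assumptions, not anyone's production rule):

```python
from collections import defaultdict

HOT_FRACTION = 0.8  # illustrative: "most of the rack is complaining"

def hot_racks(machine_temps: dict[str, float],
              rack_of: dict[str, str],
              spec_max_c: float) -> list[str]:
    """Return racks where most machines exceed their thermal spec.

    machine_temps: machine name -> latest temperature reading (C)
    rack_of:       machine name -> rack identifier
    """
    total: dict[str, int] = defaultdict(int)
    hot: dict[str, int] = defaultdict(int)
    for machine, temp in machine_temps.items():
        rack = rack_of[machine]
        total[rack] += 1
        if temp > spec_max_c:
            hot[rack] += 1
    # One hot machine is noise; a whole hot rack points at cooling or layout.
    return [rack for rack in total if hot[rack] / total[rack] >= HOT_FRACTION]
```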


It's Google; they have sufficient ML capacity to filter cases. It might require additional manpower in the short term, but it also lowers the likelihood of infrastructure failure long-term.

What might not make sense, resource-wise, to smaller companies does start to make sense at their scale.


Anomaly detection on timeseries data works surprisingly badly. The data tends to be bad in all the wrong ways -- seasonal, and not normally distributed.

And worse, you have no labeled training data, so any ML model you train simply learns whatever badness is allowed to persist in production as normal.
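
If you attempt it anyway, the usual mitigations for those two problems are seasonal slotting and robust statistics rather than learned models; a minimal sketch (the threshold `k` is arbitrary):

```python
import statistics

def is_anomalous(history: list[float], value: float, k: float = 5.0) -> bool:
    """Robust outlier test: median/MAD instead of mean/stddev.

    `history` should hold past samples from the same seasonal slot
    (e.g. the same hour-of-week), which sidesteps seasonality; MAD
    tolerates heavy-tailed, non-normal distributions. The caveat
    above still bites: the median tracks whatever production actually
    does, so persistent badness is learned as "normal".
    """
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history) or 1e-9
    return abs(value - med) / mad > k
```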


I worked at a FAANG and every now and then someone proposed or tried to make an anomaly detection system. They never worked well. It’s an extremely difficult thing to get right. Better to just have good monitoring of whether your system is responsive/available/reliable and ways to quickly work out why it’s not.



