My comment was a bit flippant, I agree, and I know Google is a huge company, so there are a great many people there I have never interacted with.
That said, the Google approach _does_ appear to quite literally throw away institutional knowledge from high-throughput, high-uptime, automation-focused systems engineering.
I don’t discount that Google SREs are highly intelligent and often know systems internals as well as or better than high-performing sysadmins. But there does appear to be a desire to discredit anything seen as “traditional”, and I think that can be harmful, as in this case.
Sometimes they do know better, but there is a failure to ask the question: “what does it cost to act only on symptoms rather than causes?”
The "hot" we're talking here is beyond what the machines are specced for and suggests something is wrong (in this case the rack was tilted and it was messing with the cooling system). That should never happen under normal operation, otherwise it suggests a bigger problem (either the whole building is too hot, or there's a design flaw in the cooling system).
For me, when a machine reports as abnormal it decommissions itself and files a ticket to be looked at. We have enough spare capacity that it’s ok for this to happen to 5% of compute before we have issues, and we have an alert if the drained fraction gets close to that.
If you’re doing anything other than that, then either your machines are abnormal and still serving, which is scary because they’re now in an unknown state, or you’re just throwing out anything that seems weird, which can also hide issues with the environment.
Whether you page someone or not, a machine that hits thermal throttling repeatedly should be just as notable as the abnormal failure rate that started the investigation.
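To make the policy above concrete, here is a minimal sketch in Python of drain-on-anomaly with a spare-capacity budget. Everything in it is illustrative: the Machine class, the ticketing and paging stubs, and the 5%/4% thresholds are assumptions standing in for whatever fleet tooling you actually run.

    # Sketch: abnormal machines drain themselves and file a ticket; an alert
    # fires when the drained fraction of the fleet nears the capacity budget.
    # Ticketing and paging are stubbed with prints for illustration.

    from dataclasses import dataclass

    DRAIN_BUDGET = 0.05      # we can tolerate up to 5% of compute being drained
    ALERT_THRESHOLD = 0.04   # warn before the budget is actually exhausted

    @dataclass
    class Machine:
        name: str
        abnormal: bool = False   # e.g. thermal readings out of spec
        drained: bool = False

    def file_ticket(msg: str) -> None:     # stand-in for a real ticketing system
        print(f"TICKET: {msg}")

    def page_oncall(msg: str) -> None:     # stand-in for a real paging system
        print(f"PAGE: {msg}")

    def handle_health_reports(fleet: list[Machine]) -> None:
        for m in fleet:
            if m.abnormal and not m.drained:
                m.drained = True           # decommission: stop scheduling work here
                file_ticket(f"{m.name} reported abnormal; drained for inspection")

        drained_fraction = sum(m.drained for m in fleet) / len(fleet)
        if drained_fraction >= ALERT_THRESHOLD:
            page_oncall(f"{drained_fraction:.1%} of fleet drained, "
                        f"budget is {DRAIN_BUDGET:.0%}")

    # Example: 100 machines, 5 reporting abnormal -> five tickets plus a capacity page.
    fleet = [Machine(f"m{i}", abnormal=(i < 5)) for i in range(100)]
    handle_health_reports(fleet)

The point is that acting on the anomaly (drain, ticket) and watching the aggregate (capacity alert) are both automatic, so nothing abnormal keeps serving silently.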
> When you have enough machines, 10 of them are always running hot.
Sure, but isn't there rack-level topology information that should warn when all of the elements in a rack are complaining? Statistically that should be a pretty rare occurrence, except over short periods, say < 5-10 hours.
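A rough sketch of what that correlation could look like, assuming you have (machine, rack, is_hot) telemetry; the data shape and the 80% threshold are my assumptions, not anything Google-specific. A real version would also require the condition to persist beyond a short window, per the caveat above.

    # Individual hot machines are expected noise; flag a rack when most of
    # its machines are complaining at once.

    from collections import defaultdict

    RACK_HOT_FRACTION = 0.8   # "the whole rack is complaining" threshold

    def racks_running_hot(reports: list[tuple[str, str, bool]]) -> list[str]:
        """reports: (machine_id, rack_id, is_hot) tuples from fleet telemetry."""
        total = defaultdict(int)
        hot = defaultdict(int)
        for _machine, rack, is_hot in reports:
            total[rack] += 1
            hot[rack] += is_hot
        return [rack for rack in total
                if hot[rack] / total[rack] >= RACK_HOT_FRACTION]

    # Example: rack r1 has 3/3 machines hot (suspicious); r2 has 1/3 (noise).
    reports = [("m1", "r1", True), ("m2", "r1", True), ("m3", "r1", True),
               ("m4", "r2", True), ("m5", "r2", False), ("m6", "r2", False)]
    print(racks_running_hot(reports))   # ['r1']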
It's Google; they have sufficient ML capacity to filter these cases. It might require additional manpower in the short term, but it also lowers the likelihood of infrastructure failure in the long term.
What might not make sense, resource-wise, to smaller companies does start to make sense at their scale.
I worked at a FAANG, and every now and then someone proposed or tried to build an anomaly detection system. They never worked well; it’s an extremely difficult thing to get right. Better to just have good monitoring of whether your system is responsive/available/reliable, and ways to quickly work out why it’s not.
Sorry, I'm merely a lowly sysadmin and not a godly Google SRE, but environment monitoring is a solved problem; humans need not apply.