My comment was a bit flippant, I agree, and I know Google is a huge company, so there are a great many people there I have never interacted with.
That said, the Google approach _does_ appear to quite literally throw away institutional knowledge from high-throughput, high-uptime, automation-focused systems engineering.
I don’t discount that Google SREs are highly intelligent and often know systems internals as well as or better than high-performing sysadmins. But there does appear to be a desire to discredit anything seen as “traditional”, and I think that can be harmful, as in this case.
Sometimes they do know better, but there is a failure to ask the question: “what does it cost to act only on symptoms rather than causes?”
The "hot" we're talking here is beyond what the machines are specced for and suggests something is wrong (in this case the rack was tilted and it was messing with the cooling system). That should never happen under normal operation, otherwise it suggests a bigger problem (either the whole building is too hot, or there's a design flaw in the cooling system).
For me, when a machine reports as abnormal it decommissions itself and files a ticket to be looked at. We have enough spare capacity that it’s ok for this to happen to 5% of compute before we have issues, and we have an alert if the drained fraction gets close to that.
If you’re doing anything other than that, then either your machines are abnormal and still serving, which is scary because they’re now in an unknown state, or you’re just throwing out anything that seems weird, which can also hide issues with the environment.
Whether you page someone or not, a machine that hits thermal throttling repeatedly should be just as notable as the abnormal failure rate that started the investigation.
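To make the policy above concrete, here is a minimal sketch in Python of drain-on-anomaly with a spare-capacity budget. Everything in it is illustrative: the Machine class, the ticketing and paging stubs, and the 5%/4% thresholds are assumptions standing in for whatever fleet tooling you actually run.

    # Sketch: abnormal machines drain themselves and file a ticket; an alert
    # fires when the drained fraction of the fleet nears the capacity budget.
    # Ticketing and paging are stubbed with prints for illustration.

    from dataclasses import dataclass

    DRAIN_BUDGET = 0.05      # we can tolerate up to 5% of compute being drained
    ALERT_THRESHOLD = 0.04   # warn before the budget is actually exhausted

    @dataclass
    class Machine:
        name: str
        abnormal: bool = False   # e.g. thermal readings out of spec
        drained: bool = False

    def file_ticket(msg: str) -> None:     # stand-in for a real ticketing system
        print(f"TICKET: {msg}")

    def page_oncall(msg: str) -> None:     # stand-in for a real paging system
        print(f"PAGE: {msg}")

    def handle_health_reports(fleet: list[Machine]) -> None:
        for m in fleet:
            if m.abnormal and not m.drained:
                m.drained = True           # decommission: stop scheduling work here
                file_ticket(f"{m.name} reported abnormal; drained for inspection")

        drained_fraction = sum(m.drained for m in fleet) / len(fleet)
        if drained_fraction >= ALERT_THRESHOLD:
            page_oncall(f"{drained_fraction:.1%} of fleet drained, "
                        f"budget is {DRAIN_BUDGET:.0%}")

    # Example: 100 machines, 5 reporting abnormal -> five tickets plus a capacity page.
    fleet = [Machine(f"m{i}", abnormal=(i < 5)) for i in range(100)]
    handle_health_reports(fleet)

The point is that acting on the anomaly (drain, ticket) and watching the aggregate (capacity alert) are both automatic, so nothing abnormal keeps serving silently.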
> When you have enough machines, 10 of them are always running hot.
Sure, but isn't there rack-level topology information that should warn when all of the elements in a rack are complaining? Statistically that should be a pretty rare occurrence, except over short periods, say < 5-10 hours.
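A rough sketch of what that correlation could look like, assuming you have (machine, rack, is_hot) telemetry; the data shape and the 80% threshold are my assumptions, not anything Google-specific. A real version would also require the condition to persist beyond a short window, per the caveat above.

    # Individual hot machines are expected noise; flag a rack when most of
    # its machines are complaining at once.

    from collections import defaultdict

    RACK_HOT_FRACTION = 0.8   # "the whole rack is complaining" threshold

    def racks_running_hot(reports: list[tuple[str, str, bool]]) -> list[str]:
        """reports: (machine_id, rack_id, is_hot) tuples from fleet telemetry."""
        total = defaultdict(int)
        hot = defaultdict(int)
        for _machine, rack, is_hot in reports:
            total[rack] += 1
            hot[rack] += is_hot
        return [rack for rack in total
                if hot[rack] / total[rack] >= RACK_HOT_FRACTION]

    # Example: rack r1 has 3/3 machines hot (suspicious); r2 has 1/3 (noise).
    reports = [("m1", "r1", True), ("m2", "r1", True), ("m3", "r1", True),
               ("m4", "r2", True), ("m5", "r2", False), ("m6", "r2", False)]
    print(racks_running_hot(reports))   # ['r1']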
It's Google; they have sufficient ML capacity to filter these cases. It might require additional manpower in the short term, but it also lowers the likelihood of infrastructure failure in the long term.
What might not make sense, resource-wise, to smaller companies does start to make sense at their scale.
I worked at a FAANG, and every now and then someone proposed or tried to build an anomaly detection system. They never worked well; it’s an extremely difficult thing to get right. Better to just have good monitoring of whether your system is responsive/available/reliable, and ways to quickly work out why it’s not.
Sorry, I'm merely a lowly sysadmin and not a godly Google SRE, but environment monitoring is a solved problem; humans need not apply.