> This is not counting the weeks where I was “on call” and forced to drop all of this to work on a backlog of DevOps related issues.
I want to say sorry at the beginning if I misread what the author (you?) intended when you wrote this. And I'm coming at this as someone who is not a developer and who, compared to the lofty Software Engineer salaries in this industry, feels underpaid and underappreciated for doing the operations/administration work.
With that said: I truly wish those who are leaving their roles where this kind of requirement exists--feels "forced"--would speak more loudly about the need to have actual people doing system administration work. This is especially true, I feel, in large companies who operate "the Cloud" and who ought to know better; someone needs to be doing that work and it can't always, or usually, be the people who are writing the features and implementing the updates to the core product you are selling.
But what seems to have happened is that everyone in the tech industry has forgotten it, or named it "Site Reliability Engineer" with a job of 55% failure analysis, 35% coding, and 10% "are the infrastructure and products actually online and functional." And then this role gets looked down on and paid less because the people in the role--people like me--are not seen as "delivering" "value".
Which culminates in the proper software developers seeing it as a thing they are forced to do, resulting in disdain, and furthering the cycle.
How, pray tell, is someone utterly unfamiliar with a code base supposed to be able to deal with unforeseen issues in production for that service?
(Worse, the incentives for improvements to production quality, and thus on-call quality, are utterly mis-aligned.)
Ironically, I say this as someone presently on-call for a bunch of stuff whose code base I am utterly unfamiliar with and have no time to become familiar with. It's going predictably badly.
So just page that person directly (the dev, for the purposes of this argument).
(And because I feel like this is bound to draw a strawman, steelman this: while there are definitely pages that might not get routed to BE eng in particular, assume we're routing pages to the person responsible for that system. I.e., in the face of infrastructural problems, those pages get routed to something like "infra eng", although IME there's very little that can be done with those in the middle of the night…)
The SRE role comes from Google, where they are neither looked down on nor underpaid. Perhaps other companies have corrupted that title to mean something else.
Ideally, incident handling should "just" be rolling back the broken change. Fixing the problem should be done in the morning with no time pressure, not in the middle of the night, half asleep, with customers on the other side of the world yelling at you. Of course it's not always that simple, but most of the time that's what on-call should be about.
It would be nice if things only broke during "business" hours and didn't have real-world impact, never mind impact on millions of people around the world. But look at the customers of, say, the code that runs cloud infrastructure: it is running airline reservations and check-ins, government workloads, banks, hospitals, critical infrastructure, Netflix, gaming services. That's a lot of things that typically can't wait for morning.
This is the pat answer Amazon gives to defend this absurd practice, but it breaks down really easily.
> If your code breaks something, you should fix that code. Who else should?
What if it wasn't my code, but code written by someone 3 years ago who quit because most people only work at the company for 2 years? And it's in a part of the codebase I've never touched. That's a much more likely scenario.
The problem is that he has a big pile of half-working spaghetti code that he never has time to touch except when it malfunctions in the middle of the night.
The problem behind that is that Amazon is a completely dysfunctional corporate hellscape. Like TFA said, you just don't have time or resources to actually fix things.
You should join Amazon and do that, and you can come back here and apologize in a couple of years when you get pipped for wasting too much time on legacy code.