> This is not counting the weeks where I was “on call” and forced to drop all of...

deathanatos · on Nov 16, 2022

How, pray tell, is someone utterly unfamiliar with a code base supposed to be able to deal with unforeseen issues in production for that service?

(Worse, the incentives for improvements to production quality, and thus on-call quality, are utterly mis-aligned.)

Ironically, I say this as someone presently on-call for a bunch of stuff whose code base I am utterly unfamiliar with, and have no time to become familiar with. It's going predictably badly.

greedo · on Nov 17, 2022

On call isn't about fixing the problem yourself; it's about knowing who the right person to contact is when an issue arises.

andelink · on Nov 17, 2022

Ehh, if your oncall shift is mostly an exercise in reassigning pages, you need to invest more time in proper incident routing.

deathanatos · on Nov 17, 2022

So just page that person directly (the dev, for the purposes of this argument).

(And because I feel like this is bound to draw a strawman, steelman this: while there are definitely pages that might not get routed to BE eng in particular, assume we're routing pages to the person responsible for that system. I.e., in the face of infrastructural problems, those pages get routed to something like "infra eng", although IME there's very little that can be done with those in the middle of the night…)

titanomachy · on Nov 16, 2022

The SRE role comes from Google, where they are neither looked down on nor underpaid. Perhaps other companies have corrupted that title to mean something else.

p0rkbelly · on Nov 17, 2022

If your code breaks something, you should fix that code. Who else should?

If your system/product/service is down because you have a dependency on something that broke -- well it's up to that team to fix their code.

I_AM_A_SMURF · on Nov 17, 2022

Ideally, incident handling should "just" be rolling back the broken change. Fixing the problem should be done in the morning with no time pressure, not in the middle of the night, half asleep, with customers on the other side of the world yelling at you. Of course it's not always that simple, but most of the time that's what on call should be about.

p0rkbelly · on Nov 17, 2022

It would be nice if things only broke during "business" hours and didn't have real-world impact, never mind impact on millions of people around the world. But if you look at the customers of, say, code that runs cloud infrastructure, it is running airline reservations/check-ins, government workloads, banks, hospitals, critical infrastructure, Netflix, and gaming services. That's a lot of things that typically can't wait for morning.

dimmke · on Nov 17, 2022

This is the pat answer Amazon gives to defend this absurd practice, but it breaks down really easily.

> If your code breaks something, you should fix that code. Who else should?

What if it wasn't my code, but code written by someone 3 years ago who quit because most people only work at the company for 2 years? And it's in a part of the codebase I've never touched. That's a much more likely scenario.

nevon · on Nov 17, 2022

That's still your code ("your" meaning the team that owns the product). Who else would own it? The person that left 3 years ago?

Firmwarrior · on Nov 17, 2022

The problem is that he has a big pile of half-working spaghetti code that he never has time to touch except when it malfunctions in the middle of the night.

The problem behind that is that Amazon is a completely dysfunctional corporate hellscape. Like TFA said, you just don't have the time or resources to actually fix things.

dimmke · on Nov 17, 2022

Also, usually it’s in Java, in an internal framework based on Spring Boot, and I’m a front-end developer with no experience writing backend services.

Totally normal. Not crazy at all. Take ownership.

Firmwarrior · on Nov 17, 2022

You should join Amazon and do that, and you can come back here and apologize in a couple of years when you get PIP'd for wasting too much time on legacy code.