Managing the risk of cascading failure

xyzzy123 · on July 20, 2021

I've seen more places burn to the ground because everyone good on an n < 10 team got poached by someone who left than because of actual systems-level cascading failures.

Person A leaves. Now it kinda sucks here. Person B leaves. It sucks more (and my workload got worse) etc. Guess what, A&B are hiring.

I guess this says more about "places I've worked" than "cascading failure" but I wanted to point out that it definitely happens at an organisational level, not just as some abstract computer thing.

steveBK123 · on July 20, 2021

Rings true. I've seen more "cascading failure" type behavior on the human staffing side of the equation than on the software side, where I am in the NYC bank/fund/fintech space.

There's less of a "move fast & break things" mentality in systems that touch money, with a bias for inaction over action. Systems tend to fall over in isolation more than go into epic retry loops of death. Better to be down than to be up in a bad state spamming orders/money/etc. Knight being an exception that proves the rule, maybe?

On the people side, though, it's totally different. Teams are very lean, most of my teams have been 5ish people, and it's not unusual to join a team and realize 50% of you have been in the seat less than a year. You lose a lot of institutional knowledge and momentum. I once joined a team that was such an attrition disaster they hadn't shipped code to PROD in 6 months.

Been on too many teams where management was so busy trying to whack the 5-10% they want to cut that they don't notice the top 20% making their own way for the exits.

I once had 6 managers in a year. One of those transitions of power involved a Bob being named new department head on Monday night, hosting a pizza lunch on Tuesday, and then us getting an "actually Bob isn't going to be the lead, Jim is" email on Wednesday morning. I was 23, and the new 22-year-old college hires arrived with no one to report to, so I told them what to do for a few months while I interviewed elsewhere and left.

crest · on July 20, 2021

What you just described sounds like a cascading failure in meat space.

steveBK123 · on July 20, 2021

I always tell people the hardest part of tech is the people, not the tech.

francisofascii · on July 20, 2021

Curious, why does one or two people leaving make it suck? If this is often the case, maybe management should find ways to address it. More money, more opportunities, etc.

vinceguidry · on July 20, 2021

Generally speaking on most software teams, one or two devs do most of the work, bear most of the responsibility, and it's a huge game of catch-up for the rest of the team if they leave.

The reason for this is twofold. First, this is knowledge work, and the main task you're doing is fitting stuff in your head. Once one person has something fit into their head, the benefit of a second person doing the work of fitting it into their head too is not nearly as great. So knowledge becomes siloed, and 'knowledge transfer' doesn't work because there's very little the person with the knowledge can do to speed along the process of another person cramming it into their head.

Second, infrastructure becomes unavoidably, yet unnecessarily, complex. Decisions about which software projects to use and depend on get handed down by decision-makers who never have to touch those projects themselves.

An example from my current team is GitHub Actions. We had to throw away a working Cloud Build CI infra because the decision-makers liked the pretty interface. But the rest of our infra is on GCP. This complicated the setup so much that it was very difficult for other people on the team to build working pipelines. So the knowledge about how it all works gets further siloed into a few people, who find it difficult to show others how it works because of the complexity. In practice the people who know end up doing the hard, complex parts for everyone else, no knowledge gets shared, and it all piles on to the list of things the rest of the team has to learn once they leave.

In larger companies, this is solved at the management level like you say. They'll pressure the devs to come up to speed faster as much as they can get away with, and incentivize key resources to stick around. The root cause can't be solved at a management level because management can only manage; they have no insight into software complexity and team dynamics. In smaller companies, the resources for doing this are non-existent, so it can only be solved through prevention. Not over-complicating the stack, avoiding endless churn. Proactively rotating devs onto other parts of the team to facilitate knowledge transfer. Doing it before someone leaves. Spotting silos and getting someone else in there to learn what's going on.

mym1990 · on July 20, 2021

The same reason that LeBron leaving the Lakers would put them at the bottom of the table overnight. While knowledge sharing and thorough documentation are a way to mitigate risk, I don't think they will ever address the loss of an impactful person's vision, knowledge, work ethic, etc...

Sometimes other people will also lose morale or see it as a chance to leave themselves...

It also takes considerable time to onboard a replacement, which can drastically affect timelines/budgets/etc...

steveBK123 · on July 20, 2021

Sometimes you get newer management with NIH syndrome. So existing "legacy" project teams (the ones that are actually in production) can get squeezed on both ends.

First, they get a lot of BAU pressure to not have outages / ship features faster / etc. Second, they also get Change pressure to completely re-architect while doing the BAU and not increasing staffing. For some who play chess in the department head seat, it is win-win.

If the legacy team has more outages / doesn't speed up on features, he has his remit to build his new pet project replacement, and 2-3 years before there's any accountability for the outcome.

If the legacy team succeeds in re-architecting then he also gets a win.