
Why would using microservices reduce the chance of outages? If you break a microservice that is vital to the system, you are as screwed as with a monolith.


Sure, but not all microservices are vital. If your "email report" service has a memory leak (or any of many other noisy-neighbor issues) and is in a crash loop, that won't take down the "search" service or the "auth" service, etc. Many other user paths will remain active and usable. It compartmentalizes risk.
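A toy illustration of that isolation, with hypothetical "search" and "email report" workers running as separate processes (in production an orchestrator or init system would play the supervisor role):

    import multiprocessing as mp
    import time

    def email_report_service() -> None:
        # Hypothetical broken service: leaks memory and dies shortly after start.
        time.sleep(0.2)
        raise MemoryError("report renderer leaked itself to death")

    def search_service() -> None:
        # Hypothetical healthy service: keeps serving no matter what its neighbor does.
        while True:
            print("search: still serving")
            time.sleep(0.5)

    if __name__ == "__main__":
        search = mp.Process(target=search_service, daemon=True)
        search.start()

        # Crude supervisor: the crash loop stays confined to its own process,
        # so it never takes the search process down with it.
        for attempt in range(3):
            reporter = mp.Process(target=email_report_service)
            reporter.start()
            reporter.join()
            print(f"email-report attempt {attempt} exited with code {reporter.exitcode}")

        time.sleep(1)  # search has been running the whole time

The same crash inside a single process would take every request path down with it unless the monolith's own code contained it, which is what the replies below argue about.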


Proper design in a monolith would also protect you from failures of non-vital services, e.g. through exception capture (see the sketch below).

So it seems like we’re trying to compensate for bad design with microservices. It’s orthogonal IMO.
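The kind of containment being described might look something like this in a monolith (the function names are made up for illustration):

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    def run_search(query: str) -> list[str]:
        # Stand-in for the vital code path.
        return [f"result for {query!r}"]

    def email_report(results: list[str]) -> None:
        # Stand-in for a non-vital feature that happens to be broken.
        raise RuntimeError("report renderer blew up")

    def handle_request(query: str) -> dict:
        results = run_search(query)  # vital: let real failures surface

        try:
            # Non-vital: contain its failure so it never breaks the request.
            email_report(results)
        except Exception:
            logger.exception("email report failed; request continues without it")

        return {"results": results}

    if __name__ == "__main__":
        print(handle_request("monolith outage"))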


How does exception capture protect from all failures? The most obvious class I don't see it covering is resource utilization: CPU, memory, thread pools, DB connection pools, etc. (see the bulkhead sketch below).

> we’re trying to compensate for bad design

No, I think we're trying to compensate for developer mistakes and naivety. When you have dozens to hundreds of devs working on an application, many of them are juniors, all of them are human, and impactful mistakes happen. Catching the right exceptions and handling them the right way does not protect against devs who don't catch the right exceptions or don't handle them the right way, but microservices do.

Maybe you call that compensating for bad design, which is fair, and in that case yes, it is! And that compensation helps a large team move faster without perfecting the design on every change.
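To make the resource point concrete: inside a monolith that kind of isolation has to be designed in, for example with a bulkhead that gives each non-vital feature its own small, bounded thread pool so it can only exhaust itself. A rough sketch, with invented names and sizes:

    import logging
    import threading
    from concurrent.futures import ThreadPoolExecutor

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    # Bulkhead for one non-vital feature: a tiny dedicated pool plus a cap on
    # in-flight work, so a slow "email report" can't eat the monolith's threads.
    _report_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="email-report")
    _report_slots = threading.BoundedSemaphore(value=8)  # max queued + running

    def render_and_send(results: list[str]) -> None:
        # Stand-in for slow, failure-prone report generation.
        logger.info("sent report covering %d results", len(results))

    def submit_email_report(results: list[str]) -> None:
        if not _report_slots.acquire(blocking=False):
            # Bulkhead full: shed the non-vital work instead of backing up search.
            logger.warning("email report dropped: bulkhead saturated")
            return
        future = _report_pool.submit(render_and_send, results)
        future.add_done_callback(lambda _f: _report_slots.release())

    if __name__ == "__main__":
        for i in range(20):
            submit_email_report([f"result {i}"])

A microservice gets roughly this for free from the process and host boundary; a monolith only gets it if somebody remembers to build it.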


With microservices there is a tradeoff: a monolith is inherently more testable at the integration level than a microservice-based architecture.

There's significant overhead to building and running tests at the API level, which includes API versioning... and there's much less need to version APIs inside a monolith.


You have fifty (or 10,000) servers running your critical microservice in multiple AZs. You start a deployment to a single host. If the shit hits the fan, you roll back that one host. If it looks fine, you leave it running for a few hours while various canaries and integration tests hit it. If no red flags appear, you deploy to another two, etc. You deploy to different AZs on different days. You can fail your critical service over to different AZs because you previously ensured that the AZs are scaled so that they can handle that influx of traffic (didn't you?). You've tested that. (A rough sketch of this cadence follows below.)

And that's only if it makes it to production at all: before then, a fleet of test hosts runs against production data and is verified against the output of production servers.
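A rough sketch of that rollout cadence, with the wave sizes, bake times, and deploy/health/rollback helpers all invented for illustration (a real deployment pipeline would drive this):

    import time

    # Hypothetical plan: one box first, a long bake, then widening waves, one AZ at a time.
    WAVES = [
        {"az": "us-east-1a", "hosts": 1,  "bake_hours": 4},
        {"az": "us-east-1a", "hosts": 2,  "bake_hours": 2},
        {"az": "us-east-1a", "hosts": 47, "bake_hours": 12},
        {"az": "us-east-1b", "hosts": 50, "bake_hours": 12},  # different AZ, different day
    ]

    def deploy(az: str, hosts: int, version: str) -> list[str]:
        """Stand-in: push `version` to `hosts` machines in `az` and return their ids."""
        return [f"{az}-host-{i}" for i in range(hosts)]

    def healthy(host_ids: list[str]) -> bool:
        """Stand-in: canaries, integration tests, and metric alarms against these hosts."""
        return True

    def rollback(host_ids: list[str]) -> None:
        """Stand-in: revert only the hosts touched by the failed wave."""
        print(f"rolling back {len(host_ids)} hosts")

    def rollout(version: str) -> None:
        for wave in WAVES:
            touched = deploy(wave["az"], wave["hosts"], version)
            time.sleep(wave["bake_hours"])  # stand-in bake; real bakes are hours, not seconds
            if not healthy(touched):
                rollback(touched)
                return  # stop the rollout on any red flag
            print(f"wave ok: {wave['hosts']} hosts in {wave['az']}")

    if __name__ == "__main__":
        rollout("build-1234")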



