> AWS provides 99.5% EC2 uptime guarantees, which is ~2 days a year of outages.
> That is simply not acceptable for most use cases and is why a single server just won't cut it.
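For reference, the "~2 days a year" figure is just arithmetic on the quoted percentage. A minimal sketch, assuming a 365-day year and treating the guarantee as a flat yearly fraction (the function name is made up for illustration):

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours per year a service may be down while still meeting the given uptime."""
    return (1 - uptime_percent / 100) * HOURS_PER_YEAR


# Compare the quoted 99.5% with a few stricter targets.
for pct in (99.5, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime allows about {hours:.1f} hours of downtime per year")
```

At 99.5% that comes to roughly 43.8 hours, a little under two days.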
I agree with your point in principle, but I wonder why this should be applicable to most use cases.
Our national tax reporting system failed on the first day you could submit taxes that year, and has done so across multiple years - and yet everyone just submits their reports a bit later and it's fine in the end. Our e-health system users would be glad to have only a few days of outages per year; instead it doesn't work most of the time, because the project was an abject failure that padded the wallets of many of those involved in "making" it. Our national COVID vaccine signup system failed on the first day even though they attempted to implement queuing of requests, and most people were only able to sign up days later. You also occasionally hear on HN about Cloudflare outages, Google outages, GitHub outages and so on.
I understand why you might want resilience for something like pacemakers or airplane systems, but surely most CRUD apps or cloud-based services out there aren't actually that important. We just make it seem like they are, for fear of users going to a competitor during the outage or something.
The tax and e-health systems have the advantage of being monopolies though. It's not like you can go to another tax service with better ICT (except by emigrating, of course), and it's literally illegal to not use the system at all (because you have to pay your taxes), so you can't even choose not to use it. Most companies do have competitors, so "we are more reliable" becomes a selling point for them.
Definitely agree that some companies take their uptime demands beyond the point of diminishing returns though. I once worked at a company that would absolutely not allow 10 minutes of scheduled downtime in the middle of the night, but didn't mind at all if I took several months to build a workaround.