At that scale, a complete outage is unlikely. I have services which haven't gone down _at all_ for longer than a year. But we lose requests every now and then -- during a deploy, or due to a bug. So we've moved from a time-based view of outage to a request-based view.
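To make the difference between the two views concrete, here's a rough sketch. The `Window` type, its field names and the numbers are all made up for illustration, not taken from any real service:

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical summary of one measurement window (illustrative only)."""
    minutes_total: int    # length of the window in minutes
    minutes_down: int     # minutes in which the service was completely down
    total_requests: int   # requests received in the window
    failed_requests: int  # requests that got an error response


def time_based_availability(w: Window) -> float:
    # Fraction of wall-clock time the service was not completely down.
    return 1 - w.minutes_down / w.minutes_total


def request_based_availability(w: Window) -> float:
    # Fraction of individual requests that succeeded.
    return 1 - w.failed_requests / w.total_requests


# A 30-day window with no complete outage, but one request in a
# thousand lost to deploys and bugs.
month = Window(
    minutes_total=30 * 24 * 60,
    minutes_down=0,
    total_requests=10_000_000,
    failed_requests=10_000,
)

print(f"time-based:    {time_based_availability(month):.3%}")
print(f"request-based: {request_based_availability(month):.3%}")
```

The time-based number stays at 100% however many individual requests fail, which is why it stops being informative once complete outages are rare.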
This helps, too, as it lets us build out services to be more reliable in combination, rather than less reliable. With retries and fail-over, an outage in an entire region may not necessarily result in any user requests failing.
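As a back-of-the-envelope illustration of "more reliable in combination": if a request has to pass through several services in series, the success rates multiply, but if it can be retried against an independent replica or region, the _failure_ rates multiply instead. This sketch assumes failures are independent, which real systems only approximate, and the function names are my own:

```python
import math
from typing import Sequence


def serial_success(success_rates: Sequence[float]) -> float:
    # Every dependency must succeed: probabilities of success multiply.
    return math.prod(success_rates)


def redundant_success(success_rates: Sequence[float]) -> float:
    # The request fails only if every attempt fails: probabilities of
    # failure multiply (assuming independent failures).
    return 1 - math.prod(1 - p for p in success_rates)


two_in_series = serial_success([0.999, 0.999])
two_with_failover = redundant_success([0.999, 0.999])

print(f"two 99.9% services in series:      {two_in_series:.4%}")
print(f"two 99.9% regions with fail-over:  {two_with_failover:.4%}")
```

Correlated failures (a bad deploy rolled out everywhere at once, say) break the independence assumption, so treat the redundant figure as an upper bound.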
For scale, pre-pandemic our published figures claimed >100M MAU.
I find https://andrewaylett.github.io/multi-burn-rate-calculator/ helpful for visualising error rates -- largely cribbed from the project it's forked from :) but with the tweakables switched around and the time between alert and error budget exhaustion in the tooltip.
It's worth noting that we evaluate our alerts at most once a minute.
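For anyone unfamiliar with burn rates, this is roughly the arithmetic involved. It's a simplified sketch and not necessarily how the linked calculator does it: it assumes a 30-day SLO period, an error rate that jumps from zero to a constant value, and a full error budget at the start. The function names and the `eval_interval_minutes` parameter are mine.

```python
import math
from typing import Optional

PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate(error_rate: float, slo: float) -> float:
    # How many times faster than "exactly on budget" we are failing.
    return error_rate / (1 - slo)


def hours_to_exhaustion(rate: float, period_hours: float = PERIOD_HOURS) -> float:
    # At a constant burn rate, a full budget lasts period / rate.
    return period_hours / rate


def minutes_to_detect(
    rate: float,
    threshold: float,
    window_minutes: float,
    eval_interval_minutes: float = 1,
) -> Optional[float]:
    # The windowed average burn rate climbs linearly from zero and crosses
    # the threshold after window * threshold / rate minutes. The alert can
    # only fire at the next evaluation, so round up to the interval.
    if rate < threshold:
        return None  # this window never fires at this burn rate
    raw = window_minutes * threshold / rate
    return math.ceil(raw / eval_interval_minutes) * eval_interval_minutes


# Burning at 12x against a 6-hour window with a threshold of 6x.
detect = minutes_to_detect(rate=12, threshold=6, window_minutes=360)
exhaust = hours_to_exhaustion(12)
print(f"alert fires after about {detect} minutes")
print(f"budget exhausted after {exhaust} hours")
print(f"time left to react: {exhaust - detect / 60} hours")
```

With a one-minute evaluation interval the rounding rarely matters for long windows, but it does put a floor on how quickly a short window can fire.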
OVH didn't suffer a complete outage. If you were relying on that single DC, then you're probably not sufficiently large for this to apply to you.
But perhaps my point wasn't clearly enough made: a claim of "100% uptime" on a service level isn't particularly _useful_ when our users still only see a 99.9% success rate.
I think the weak point is their domain name. I think cloud providers should have a second domain, with a different registrar and managed completely independently, so that if one is subject to a problem (hijacking, DNS outage, etc.) clients can fall back to the alternative domain.