
At that scale, a complete outage is unlikely. I have services which haven't gone down _at all_ for longer than a year. But we lose requests every now and then -- during a deploy, or due to a bug. So we've moved from a time-based view of outage to a request-based view.

This helps, too, as it lets us build out services to be more reliable in combination, rather than less reliable. With retries and fail-over, an outage in an entire region may not necessarily result in any user requests failing.
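To put rough numbers on that, here is a minimal sketch of the request-based view. It assumes each attempt fails independently, which real fail-over only approximates (a bad deploy can hit every region at once). The function names are made up for illustration.

```python
def request_success_rate(succeeded: int, total: int) -> float:
    """Request-based availability: fraction of requests that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return succeeded / total


def combined_success_rate(per_attempt_success: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    Assumes attempts fail independently of each other (an idealisation).
    """
    if not 0.0 <= per_attempt_success <= 1.0:
        raise ValueError("per_attempt_success must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - per_attempt_success) ** attempts


if __name__ == "__main__":
    # One backend at 99.9%, then the same backend with a single retry elsewhere.
    print(combined_success_rate(0.999, 1))
    print(combined_success_rate(0.999, 2))
```

Under that independence assumption, one retry against a second 99.9% backend takes the user-visible failure rate from about 1 in 1,000 to about 1 in 1,000,000.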

For scale, pre-pandemic our published figures claimed >100M MAU.

I find https://andrewaylett.github.io/multi-burn-rate-calculator/ helpful for visualising error rates -- largely cribbed from the project it's forked from :) but with the tweakables switched around and the time between alert and error budget exhaustion in the tooltip.

It's worth noting that we only evaluate our alerts at most once a minute.
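For anyone who hasn't met burn rates before, the arithmetic behind that kind of calculator is roughly as follows. This is a sketch of the usual definition (observed error rate divided by the error rate the SLO allows), not code from the linked project. It assumes a constant error rate and an untouched budget at the start of the window, and the function names and the 30-day default are my own.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'allowed' the error budget is being spent.

    error_rate: observed fraction of failed requests, e.g. 0.0144
    slo: target success fraction, e.g. 0.999
    """
    if not 0.0 <= slo < 1.0:
        raise ValueError("slo must be in [0, 1)")
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be in [0, 1]")
    return error_rate / (1.0 - slo)


def hours_to_exhaustion(error_rate: float, slo: float, window_days: float = 30.0) -> float:
    """Hours until a full budget for the window is gone at this error rate."""
    rate = burn_rate(error_rate, slo)
    if rate == 0.0:
        return float("inf")  # no errors, so the budget is never spent
    return window_days * 24.0 / rate


if __name__ == "__main__":
    print(burn_rate(0.0144, 0.999))
    print(hours_to_exhaustion(0.0144, 0.999))
```

With a 99.9% SLO over 30 days, a sustained 1.44% error rate is a burn rate of about 14.4, which would use up the whole month's budget in roughly 50 hours.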



Said OVH. Then their datacenter burned to the ground.

Said Oracle. Then their DNS was misconfigured and their whole cloud went offline for 2 hours [1].

Shit happens, always, at all scales.

[1] https://ocistatus.oraclecloud.com/incidents/qjxllgkywysj

Edit: typo


OVH didn't suffer a complete outage. If you were relying on that single DC, then you're probably not sufficiently large for this to apply to you.

But perhaps I didn't make my point clearly enough: a claim of "100% uptime" on a service level isn't particularly _useful_ when our users still only see a 99.9% success rate.


I think the weak point is their domain name. I think cloud providers should have a second domain, with a different registrar and managed completely independently, so that if one is subject to a problem (hijacking, DNS outage, etc.) clients can fall back to the alternative domain.


That literally happened, they blogged about it recently. https://www.backblaze.com/blog/recent-outages-why-we-acceler...



