Lessons Netflix Learned from the AWS Storm (netflix.com)
115 points by justinsb on July 6, 2012 | 49 comments


It's not clear they learned the simplest and most important fact: You have to be able to migrate your production traffic away from whole regions.


Paul, I'm sure your skills are top notch, but even you have to admit that the problem you solved -- a globally available write-only system -- is a totally different problem than the one Netflix is solving, which is a read/write workload.

Also, as someone who was a user of your globally available service, I can tell you that while it might have been UP all the time, it certainly had no problems losing data all the time too. Some months there was just simply no data for reddit at all, even though we were sending the service more than a billion data points.

So we could sit here and sling insults all day, or you can operate under the assumption I do -- that each of us is a competent engineer who works for a business that has to make decisions and tradeoffs between costs and reliability.


We're responding to a post, by Netflix, explaining their downtime. That post is missing the single most important fact - that they need to be able to failover across regions. The rest of their explanation is just second-order noise.

Anyone who reads HN can see that a minimum uptime strategy for Amazon is to failover across regions. Each time there is a major AWS outage, we hear about HN readers whose service was affected even though they spanned availability zones within a single region. But to date, Amazon's regions have operated independently.

That observation is not dependent on knowledge of Quantcast (which is incidentally, far more than a write-only system), or the other production systems I've built in the last 35 years.

(I'll follow up by email about your support questions)


A little transparency can make life easier. Try this:

"Don't panic. You are using a backup datacenter. Some very recent queue or account changes may be missing, and some changes you make tonight may be lost. We are working nonstop to resolve this and appreciate your patience"

When stuck, just change the requirements.


Yeah, I'll bet $100 that there will be a global AWS outage (on the scale of the April 2011 or June 2012 outages) in the next 3 years, affecting at least two regions simultaneously for at least one hour.

(I'll assume it will be a routing problem or a software problem.)


Actually, I'd argue the best strategy is to have an alternate cloud provider in each "region".


Something constructive I'd like to hear from Netflix is a bit about what it is all those cloud instances actually do :-)

My understanding: Streaming is offloaded to a CDN. The content & user databases change very slowly and thus are very amenable to caching. 'Current position' syncing is very non-critical and so can be done with e.g. writes to independent Redis instances (or even memcache - it really isn't very important!) Ratings, recommendations and the queue seem like the tricky ones, and while I don't think the throughput is particularly high, because they are all per-user this is "trivially shardable" if you do outgrow a single SQL database.
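To make the "trivially shardable" claim concrete, here's a minimal sketch of per-user shard routing under stated assumptions: the `shard_for_user` helper, shard count, and connection-string naming are all hypothetical, not Netflix's actual scheme.

```python
import hashlib

# Hypothetical: route each user's ratings/queue data to one of N
# independent SQL databases by hashing the user id. Because the data
# is strictly per-user, no query ever needs a cross-shard join, which
# is what makes this kind of sharding "trivial".
NUM_SHARDS = 16

def shard_for_user(user_id: str) -> int:
    """Stable hash of the user id -> shard index."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def dsn_for_user(user_id: str) -> str:
    # One connection string per shard; hostnames are illustrative.
    return f"postgres://queue-shard-{shard_for_user(user_id)}.internal/netflix"
```

The key property is that the mapping is stable: the same user always lands on the same shard, so adding capacity means re-splitting the keyspace (or using consistent hashing), not rewriting queries.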

The big question for me is understanding why streaming breaks, because that should all be served from systems that use read-only data? (where read-only = cacheable for at least a day without significant negative consequences)

I think a better understanding of Netflix's technical challenges would serve everyone well here!


Read here:

http://techblog.netflix.com/2012/04/netflix-recommendations-...

In parts one and two, it's explained that just about every list generated for users is an up-to-date set of recommendations. You point out yourself that recommendations are one area where caching isn't truly viable (at least, not in a traditional sense).

I'm definitely not the right person to go into all the details (nor do I think such a discussion would be prudent on HackerNews)--but I wanted to weigh in quickly that there's a lot of stuff served that goes way beyond the notion of "static" content that's trivially cached.


Are recommendations computed in real-time though? Have you considered e.g. batch recomputation overnight with a 'full' algorithm, and just applying a linearized model to any newly rated content?

I feel like the quality of the Netflix recommendations is not stellar, and if that's because you're constraining yourself to what can be calculated in real-time, I'd willingly trade-off having "perfect" real-time recommendations in favor of better recommendations tomorrow (with the full model). Even if you do try to update recommendations in real time, aren't they easily cacheable if you can't keep up? (Well, as easily cacheable as any dataset on 25 million subscribers can be...)
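The batch-plus-linear-adjustment idea above can be sketched in a few lines. This is purely illustrative of the commenter's suggestion, not Netflix's algorithm: the score tables, the item-item similarity source, and the 0.5 weight are all made up.

```python
# Hypothetical: serve scores from a nightly batch model, then nudge
# them online with a cheap linear term for titles the user rated
# since the last batch run.
BATCH_SCORES = {("u1", "m1"): 3.8, ("u1", "m2"): 2.1}   # from overnight job
SIMILARITY = {("m1", "m2"): 0.6}                         # item-item, also batch

def adjusted_score(user, movie, recent_ratings):
    """recent_ratings: {movie_id: rating delta vs. user mean} since last batch."""
    score = BATCH_SCORES.get((user, movie), 0.0)
    for rated, delta in recent_ratings.items():
        sim = (SIMILARITY.get((movie, rated))
               or SIMILARITY.get((rated, movie)) or 0.0)
        score += 0.5 * sim * delta   # small linear nudge; weight is arbitrary
    return score
```

The appeal of this split is exactly the trade-off described above: the expensive "full" model runs offline where it can take hours, while the online path is a dictionary lookup plus a few multiplications, which is easy to cache and easy to keep up during load spikes.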


Some stuff is in real time, some is pre-calculated. There is an enormous amount of research and testing going on in this space all the time; it's complex and it's evolving fast.


I don't understand how all the recommendation engine stuff really needs to be in the critical path; 99% of the time, netflix behavior I observe (admittedly, sample size of 2) is "watch next episode of same series." And new series are discovered via referral from someone else, or googling e.g. "post apocalyptic sci fi movies", then figuring out what Netflix has, downloading if unavailable, or Amazoning as absolute worst case. The Netflix recommender doesn't really fit in, so all they need is authentication and authorization, a static URL distributor, and CDN.


Visit slideshare.net/netflix and read my architecture slides, there's plenty of detail about how Netflix works available if you have a few hours to look through it.


Fascinating stuff... the latest architectural overview is particularly interesting (http://www.slideshare.net/adrianco/netflix-architecture-tuto...) If I had one criticism, I'd love to see a separate overview of the fundamental (CS?) problems, vs. the ephemeral engineering problems (AWS). We all know AWS will go the way of the mainframe (though we may disagree as to timeframes!), but I think e.g. content recommendation algorithms and architectures will forever remain an interesting problem.

Though I'd love to see the monitoring solution open-sourced :-)


Monitoring is done with two systems, one in-house in-band that we might open source one day (was called Epic, currently called Atlas). The other is AppDynamics running as a SaaS application with no dependencies on AWS. There is some useful overlap for when one or the other breaks, we merge the alerts from both (plus Gomez etc) but they have very different strengths as tools.


I ran one of the recommendation algorithm teams for a few years before we did the cloud migration. The techblog summaries of the algorithms are pretty good. The implementation is lots of fine grain services and data sources, changing continuously. Hard to stick a fork in it and call it done for long enough to document how it works.


Netflix is designed to run on two out of three availability zones in a region. There are tens of TB of customer data triple replicated in that region, which has off-region archive but we don't live replicate the data intensive data sources. We also have the Europe region which does live replicate things like membership (since all members are global members of Netflix).

In this case we had some bugs, we should have had a two minute increase in error rate as a third of the clients retried, then the dead instances would have been out of traffic. That's what happened in the previous power outage, where fewer instances went down, and it didn't trigger this bug.


That's something they just started talking about last year, when everything was still in a single region.

http://techblog.netflix.com/2011/04/lessons-netflix-learned-...

Though their post today makes it sound like they could have (and maybe did) fail over to an entirely different region, but their mechanism for doing so isn't automatic and took longer than expected.


It's literally shocking to me that responsible adults would "just start talking" about basic resiliency, like failing over to a different region, for a major commercial service.

There are ways to do it that take seconds (we do that at Quantcast using anycast), there are ways that take minutes (using DNS failover, which is readily available), and there are ways that basically take forever (Netflix guys - feel free to contact me).
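For illustration, a minimal sketch of the minutes-scale DNS approach the parent describes: keep a low TTL on the service record, health-check each region, and repoint the record at a healthy one when checks fail. The hostnames and the TCP-connect health check are hypothetical; a real setup would drive a managed DNS API (e.g. Route 53 health checks and failover routing) rather than polling by hand.

```python
import socket

# Hypothetical region endpoints, ordered by preference.
REGIONS = ["us-east.example.com", "us-west.example.com"]

def healthy(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Crude health check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_region() -> str:
    """Return the first healthy region; fall back to the primary."""
    for host in REGIONS:
        if healthy(host):
            return host
    return REGIONS[0]
```

The failover latency of this scheme is bounded below by the record's TTL plus resolver caching, which is why it lands in the "minutes" bucket rather than the "seconds" bucket that anycast can reach.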

It's pretty clear they were aware of the problem, and if they had even the simplest and most basic DNS control, they could have moved people over minutes after realizing the problem.


Didn't we just have a discussion of this type of comment today? [1]

Being able to move /millions of requests per second/ (this was a Friday night!) on /thousands of servers/ [2] to a different /thousands of servers/ involves more than just routing requests to the right IP. Like having a separate set of /thousands of servers/ capable of handling that much failover, for one. Plus, if you had read the link, what they just started talking about a year after moving to AWS was moving across regions; they're already set up to fail entire availability zones over to other availability zones in the same region.

Neither architecture at scale nor employee experience magically appears from nowhere. It's not even been two years since they moved from a data center to running the site on AWS. Considering how much they've had to learn, and the kinds of tools they've had to build for themselves to manage their kind of scale (read their tech blog!), I think the cheap insults are unwarranted.

1: http://news.ycombinator.com/item?id=4208134

2: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-...


My comment is far from a cheap insult. At Quantcast we handle 300,000 requests a second in 14 different cities, and we can failover between cities in 6 seconds. And we're doing a lot more than just log a pixel on each request, just to be clear. It makes maintenance and upgrades really easy. It was a bit of work, but we're always up.

Even a child knows that when the power fails, you can't turn on the TV. This isn't specialized technical knowledge.

EDIT: I'm not going to keep responding to your comment below. I'm certain that if I were involved in the design of Netflix's infrastructure, they would be able to survive problems that affect whole regions. (AND I DON'T SEE THAT NETFLIX NEEDS 6 SECOND FAILOVER, THERE ARE MUCH EASIER WAYS TO DO IT IN MINUTES).

EDIT: My repeated, emphatic comments are intended to serve a purpose. Everyone should be aware that this is a real problem and you need to plan for it, and it's pretty clear from the discussions on HN that people are surprised by the Amazon downtime. I personally think Amazon does a fantastic job as it is, and Amazon's reliability issues are not an excuse for the downtime of their customers.


You're doing exactly what that article was describing, and calling Netflix's engineering team irresponsible children is a cheap insult. It's extremely insulting that in the process of saying you didn't insult them, you call them children again. So is repeatedly setting up and taking jabs at this straw man that they had no failover system at all, a year ago or last week.

You're basically saying "if I were in charge, Netflix would've been able to fail over to another region in 6 seconds". But you don't even have a fraction of the background info required to say such a thing. It doesn't matter that you've done reliability at Quantcast; Quantcast is not Netflix.


Perhaps Netflix merely made a judgment call about the amount of effort required to be able to migrate 100% of their production traffic from one region to another in event of downtime.

It sounds like they came to a different decision than Quantcast about the importance of being able to do so.

Which is more likely - that the thought of an AWS outage like this never occurred to them, or that they judged the specific remedy needed to overcome it not a high enough of a priority to have it available within the first years of their AWS usage?


Curious, how did you come to possess enough knowledge of the Netflix application architecture to believe the techniques that have served you well at Quantcast could unconditionally apply to Netflix?


Obviously, it works at quantcast. How could it not work elsewhere?


He would make the necessary architecture changes. It wouldn't have to be unconditional.


I think it's worth considering that Netflix is the single largest source of downstream Internet traffic in all of North America. The most recent estimates I could find--from about a year ago--are that at peak, Netflix streaming consumes between about 20% and 30% of all NA available downstream bandwidth. So while Quantcast is obviously a highly available service, it seems likely to me the server resources involved are probably quite a bit higher at Netflix.


The video stream itself is coming off a separate CDN, not their EC2 infrastructure, AFAIK. So that's probably not part of the comparable stuff. What's running on EC2 is the website, the authentication services for all the devices they support, the API for websites and all those devices they support, the recommendation engine, and perhaps a couple million people pinging them every few seconds with the status of their streams. I don't know how that works, but they must be talking to something since reloading a stream on any device will resume where you left off.


I'd argue the recent deploy to AWS actually ought to increase the ease with which you can fail over: it's basically a new deployment to AWS. You can/should be able to use the same deployment process as a starting point for fail overs, which is a huge advantage as compared to a cluster that nobody knows exactly how it was setup.


It's literally shocking to me that you'd be belligerent about it. That's not how adults solve problems.

AWS doesn't support multicast or anycast, and the AWS EC2 control planes were so hosed it was impossible to recover in any meaningful way. Certainly, the fault was both AWS and Netflix, but both are learning from their mistakes.


I'm sure it didn't literally shock you.


It's called hyperbole and it's perfectly acceptable usage and we all -- every single one of us! -- knew exactly what he meant and dear god am I tired of people using this smug and useless retort.


Lit·er·al·ly/ˈlitərəlē/ Adverb: 1. In a literal manner or sense; exactly: "the driver took it literally when asked to go straight over the traffic circle". 2. Used to acknowledge that something is not literally true but is used for emphasis or to express strong feeling.


Yes, the second definition was added because people lazily try to leverage the first definition to intensify the exaggeration of their figurative statements. The problem is that the second definition now masks the first (because it is essentially a devolved form of the first and actually relies on the first definition), making it difficult in many cases to get across the meaning of the first definition without resorting to other words. I see you've embraced this unfortunate evolution. Some of us are still mourning it.


Language evolves and changes. Please do not bitch about that in unrelated discussions on HN.


I think this story from the front page of HN is relevant: http://pilif.github.com/2012/07/armchair-scientists/


I'm not an armchair scientist, look at my bio. I was personally the driver behind the resiliency that we have at Quantcast, and if we had done anything less we would not have succeeded.

I'm happy to help anyone who is serious about uptime, just email me. There's a reason that I post under my actual name, and there's a reason that I make my contact information available.


Based on your tone here, I don't think you could really "help" anyone. Maybe I'm wrong, but you're coming off like a complete a-hole.


You may not be an armchair scientist but you are sure acting like a huge douchebag in this thread.

Chill out.


It sounds like they detected it but since so many components failed at once, the action was to page someone to have a look. I'm not convinced they can't actually do this, but rather were gun-shy about programmatically pulling the trigger.

Back to AWS, the control plane failures are concerning, killing the ability to deploy new resources. It's expensive to have this kind of capacity pre-deployed and I'm sure it bit many folks.


Dear Netflix, thank you for being transparent and honest about what happened.


Anyone find it really weird that netflix doesn't run its own datacenters?

If the netflix business fails, they would have giant valuable datacenters leftover. Instead by relying on the cloud, they are "all-in" on serving movies. The movie and tv studios have giant leverage here, they can easily make or back a competing service and users will go where the content is. Is their strategy for being on the cloud really, "it's easier than doing it ourselves?".


Maybe not necessarily easier, but it isn't part of their focus. Besides having collateral (the datacenters) what else do they gain?

It's not like Amazon where they're providing infrastructure to other companies. I remember someone else on HN pointing out that there aren't many non-adult video providers the size of Netflix/YouTube/etc that aren't already rolling their own solutions or served by companies like Brightcove.


By not having datacenters they don't have capital tied up in buildings, real estate, or staff and benefits at those data centers; they can be much smaller personnel-wise and have much more of the staff focused on stuff that matters to customers. Customers don't care where the data center is or who is running it, as long as their movies come on when they want to watch.


It sounds like you're recommending that Netflix should be in the data center business, with movies almost a side consequence of it. That certainly might be viable, but it's not the business they're actually in, and apparently not the business they want to be in. Additionally, data centers are a dime a dozen, while good streaming movie providers are much less common.


Netflix always seems to be ahead of its time and I respect them for that! ;)


RESILIENCY = REDUNDANCY + INSULATION. Great post. I look forward to hearing what Heroku is cooking; I heard they are working hard on better handling similar incidents. Redundancy without insulation is what happened to the Titanic, and seems to be the most common mistake when architecting HA systems.


"The service that keeps track of the state of the world has a fail-safe mode where it will not remove unhealthy instances in the event that a significant portion appears to fail simultaneously."

You should keep your logic dumb.
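The fail-safe quoted above can be sketched in a few lines: refuse to evict instances when "too many" look dead at once, since that pattern usually means the monitor's own view is partitioned, not that the fleet actually died. The threshold and names are illustrative, not Netflix's implementation.

```python
# Hypothetical sketch of a "don't evict everything at once" guard.
# If more than MAX_EVICT_FRACTION of the fleet looks unhealthy, assume
# a network partition (we can't see them) rather than mass failure,
# and leave the fleet alone pending human review.
MAX_EVICT_FRACTION = 0.33

def instances_to_evict(fleet: list, unhealthy: set) -> list:
    if not fleet:
        return []
    if len(unhealthy) / len(fleet) > MAX_EVICT_FRACTION:
        return []  # fail safe: too many "failures" to trust the signal
    return [i for i in fleet if i in unhealthy]
```

The downside, as the outage showed, is the flip side of the same guard: when a large fraction of instances really does die (a power event rather than a partition), the system freezes and waits for a human instead of routing around the dead capacity.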


That was my thought as well when reading that sentence (actually, I was thinking "was this overengineered for no good reason?"), however they go on to say that there is a purpose -- mitigating "network partition events", which I can only guess is referring to AWS's version of netsplits.

It sounds like there was some technical debt to that implementation, but hey, I for one am glad they gave us some insight into what happened.


"Technical debt" is a nice way of saying it had bugs. It was mostly a configuration problem; if it had been set up better we would have had no outage, or a much shorter one. The work to test all our zone level resilience (Chaos Gorilla) was underway but hadn't got far enough to uncover this bug.



