
It's literally shocking to me that responsible adults would "just start talking" about basic resiliency, like failing over to a different region, for a major commercial service.

There are ways to do it that take seconds (we do that at Quantcast using anycast), there are ways that take minutes (using DNS failover, which is readily available), and there are ways that basically take forever (Netflix guys - feel free to contact me).

It's pretty clear they were aware of the problem, and if they had even the most basic DNS control, they could have moved people over within minutes of realizing the problem.
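To make the DNS option concrete, here's a minimal sketch of health-check-driven DNS failover, assuming a record with a short TTL and a warm standby already running in another region. It's an illustration, not Quantcast's or Netflix's actual setup, and update_record is a stand-in for whatever API your DNS provider exposes:

    import time
    import urllib.request

    PRIMARY_CHECK = "http://primary.example.com/healthz"  # hypothetical health endpoint
    STANDBY_IP = "203.0.113.10"                            # documentation-range address
    FAILURES_BEFORE_FLIP = 3

    def primary_is_healthy():
        try:
            return urllib.request.urlopen(PRIMARY_CHECK, timeout=5).status == 200
        except Exception:
            return False

    def update_record(name, ip, ttl=60):
        # Stand-in: call your DNS provider's API here (Route 53, Dyn, etc.).
        print("would point %s at %s with ttl=%d" % (name, ip, ttl))

    failures = 0
    while True:
        if primary_is_healthy():
            failures = 0
        else:
            failures += 1
            if failures >= FAILURES_BEFORE_FLIP:
                update_record("www.example.com", STANDBY_IP)
                break
        time.sleep(10)

With a TTL of around 60 seconds, most resolvers pick up the change within a couple of minutes, which is the "minutes, not seconds" tier described above.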



Didn't we just have a discussion of this type of comment today? [1]

Being able to move /millions of requests per second/ (this was a Friday night!) on /thousands of servers/ [2] to a different set of /thousands of servers/ involves more than just routing requests to the right IP. Like having a separate set of /thousands of servers/ capable of handling that much failover, for one. Plus, if you had read the link, what they just started talking about a year after moving to AWS was moving across regions; they're already set up to fail entire availability zones over to other availability zones in the same region.
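For context on what that zone-level failover looks like in AWS terms: a load balancer spread across several availability zones in one region is the standard building block. A rough sketch with the boto Python library of that era (illustrative only; the names and instance IDs are made up, and this is not Netflix's actual tooling):

    import boto.ec2.elb

    # Connect to the ELB API in one region and spread the balancer across
    # several of that region's availability zones.
    conn = boto.ec2.elb.connect_to_region("us-east-1")
    lb = conn.create_load_balancer(
        "www-frontend",                                # hypothetical balancer name
        ["us-east-1a", "us-east-1b", "us-east-1c"],    # zones to balance across
        [(80, 8080, "http")])                          # (lb port, instance port, protocol)
    lb.register_instances(["i-12345678", "i-87654321"])  # hypothetical instance IDs

None of that helps when an entire region or its control plane is down, which is exactly the gap being argued about here.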

Neither architecture at scale nor employee experience magically appears from nowhere. It's not even been two years since they moved from a data center to running the site on AWS. Considering how much they've had to learn, and the kinds of tools they've had to build for themselves to manage their kind of scale (read their tech blog!), I think the cheap insults are unwarranted.

1: http://news.ycombinator.com/item?id=4208134

2: http://techblog.netflix.com/2012/02/fault-tolerance-in-high-...


My comment is far from a cheap insult. At Quantcast we handle 300,000 requests a second in 14 different cities, and we can fail over between cities in 6 seconds. And we're doing a lot more than just logging a pixel on each request, to be clear. It makes maintenance and upgrades really easy. It was a bit of work, but we're always up.

Even a child knows that when the power fails, you can't turn on the TV. This isn't specialized technical knowledge.

EDIT: I'm not going to keep responding to your comment below. I'm certain that if I were involved in the design of Netflix's infrastructure, they would be able to survive problems that affect whole regions. (AND I DON'T SEE THAT NETFLIX NEEDS 6-SECOND FAILOVER; THERE ARE MUCH EASIER WAYS TO DO IT IN MINUTES.)

EDIT: My repeated, emphatic comments are intended to serve a purpose. Everyone should be aware that this is a real problem and you need to plan for it, and it's pretty clear from the discussions on HN that people are surprised by the Amazon downtime. I personally think Amazon does a fantastic job as it is, and Amazon's reliability issues are not an excuse for the downtime of their customers.


You're doing exactly what that article was describing, and calling Netflix's engineering team irresponsible children is a cheap insult. It's extremely insulting that in the process of saying you didn't insult them, you call them children again. So is repeatedly setting up and taking jabs at this straw man that they had no failover system at all, a year ago or last week.

You're basically saying "if I were in charge, Netflix would've been able to fail over to another region in 6 seconds". But you don't even have a fraction of the background info required to say such a thing. It doesn't matter that you've done reliability at Quantcast; Quantcast is not Netflix.


Perhaps Netflix merely made a judgment call about the amount of effort required to be able to migrate 100% of their production traffic from one region to another in the event of downtime.

It sounds like they came to a different decision than Quantcast about the importance of being able to do so.

Which is more likely - that the thought of an AWS outage like this never occurred to them, or that they judged the specific remedy needed to overcome it not a high enough priority to have it available within the first years of their AWS usage?


Curious, how did you come to possess enough knowledge of the Netflix application architecture to believe the techniques that have served you well at Quantcast could unconditionally apply to Netflix?


Obviously, it works at Quantcast. How could it not work elsewhere?


He would make the necessary architecture changes. It wouldn't have to be unconditional.


I think it's worth considering that Netflix is the single largest source of downstream Internet traffic in all of North America. The most recent estimates I could find, from about a year ago, are that at peak, Netflix streaming consumes between about 20% and 30% of all available downstream bandwidth in North America. So while Quantcast is obviously a highly available service, it seems likely to me that the server resources involved at Netflix are quite a bit higher.


The video stream itself is coming off a separate CDN, not their EC2 infrastructure, AFAIK, so that's probably not part of what's comparable. What's running on EC2 is the website, the authentication services for all the devices they support, the API for the website and those devices, the recommendation engine, and perhaps a couple million people pinging them every few seconds with the status of their streams. I don't know how that works, but they must be talking to something, since reloading a stream on any device will resume where you left off.
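Purely as a guess at the shape of that "pinging every few seconds" traffic (the paragraph above is speculating, and so is this; none of these endpoints or fields are Netflix's real API), a resume-where-you-left-off service can be as small as a keyed write per heartbeat:

    from flask import Flask, request, jsonify

    app = Flask(__name__)
    positions = {}  # (user_id, title_id) -> seconds; a real service would use a replicated store

    @app.route("/heartbeat", methods=["POST"])
    def heartbeat():
        # Client reports how far into the title it is, every few seconds.
        data = request.get_json()
        positions[(data["user_id"], data["title_id"])] = data["position_seconds"]
        return jsonify(ok=True)

    @app.route("/resume")
    def resume():
        # On reload, any device asks where this user left off in this title.
        key = (request.args["user_id"], request.args["title_id"])
        return jsonify(position_seconds=positions.get(key, 0))

    if __name__ == "__main__":
        app.run()

The hard part, of course, is doing that for millions of concurrent streams and keeping the state available when a zone or region goes away.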


I'd argue the recent move to AWS actually ought to make failing over easier: it's basically a fresh deployment. You can, and should, be able to use the same deployment process as a starting point for failovers, which is a huge advantage compared to a cluster that nobody knows exactly how it was set up.
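To illustrate the point: if the stack is already launched by a script, the same script parameterized by region is most of a failover plan. A sketch with 2012-era boto (the AMI IDs, counts, and key names are made up; AMIs are region-specific, so the standby region needs its own copy):

    import boto.ec2

    def deploy(region, ami_id, count):
        # Launch the same stack in whichever region you pass in.
        conn = boto.ec2.connect_to_region(region)
        reservation = conn.run_instances(ami_id,
                                         min_count=count, max_count=count,
                                         instance_type="m1.large",
                                         key_name="prod-deploy")
        return [i.id for i in reservation.instances]

    # Normal deploy:
    #   deploy("us-east-1", "ami-11111111", 200)
    # Failover is the same call with a different region and that region's AMI:
    #   deploy("us-west-2", "ami-22222222", 200)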


It's literally shocking to me that you'd be belligerent about it. That's not how adults solve problems.

AWS doesn't support multicast or anycast, and the EC2 control planes were so hosed that it was impossible to recover in any meaningful way. Certainly, the fault lies with both AWS and Netflix, but both are learning from their mistakes.


I'm sure it didn't literally shock you.


It's called hyperbole and it's perfectly acceptable usage and we all -- every single one of us! -- knew exactly what he meant and dear god am I tired of people using this smug and useless retort.


lit·er·al·ly /ˈlitərəlē/ adverb: 1. In a literal manner or sense; exactly: "the driver took it literally when asked to go straight over the traffic circle." 2. Used to acknowledge that something is not literally true but is used for emphasis or to express strong feeling.


Yes, the second definition was added because people lazily try to leverage the first definition to intensify the exaggeration of their figurative statements. The problem is that the second definition now masks the first (because it is essentially a devolved form of the first and actually relies on the first definition), making it difficult in many cases to get across the meaning of the first definition without resorting to other words. I see you've embraced this unfortunate evolution. Some of us are still mourning it.


Language evolves and changes. Please do not bitch about that in unrelated discussions on HN.



