Something constructive I'd like to hear from Netflix is a bit about what it is all those cloud instances actually do :-)
My understanding: streaming is offloaded to a CDN. The content and user databases change very slowly and are thus very amenable to caching. 'Current position' syncing is non-critical, so it can be done with e.g. writes to independent Redis instances (or even memcached; it really isn't that important). Ratings, recommendations and the queue seem like the tricky ones, and while I don't think the throughput is particularly high, because they are all per-user they are "trivially shardable" if you do outgrow a single SQL database.
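The per-user sharding idea above can be sketched roughly as follows. This is a minimal illustration, not anything Netflix actually does; the shard names and counts are made up, and a real deployment would use consistent hashing to handle resharding.

```python
import hashlib

# Hypothetical setup: N independent shards holding per-user data
# (ratings, queue, playback position).
NUM_SHARDS = 8
SHARDS = [f"userdb-{i}" for i in range(NUM_SHARDS)]

def shard_for_user(user_id: str) -> str:
    """Deterministically map a user id to one shard.

    A stable hash (not Python's per-process randomized hash())
    keeps the mapping consistent across servers and restarts.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % NUM_SHARDS]

# Every request for a given user lands on the same shard, so
# per-user reads and writes never need cross-shard coordination.
assert shard_for_user("alice") == shard_for_user("alice")
```

Because ratings, queue entries and positions are keyed by user, no query ever spans two shards, which is what makes this "trivially shardable" in a way that, say, global recommendation models are not.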
The big question for me is why streaming breaks, since that should all be served from systems that use read-only data (where read-only = cacheable for at least a day without significant negative consequences).
I think a better understanding of Netflix's technical challenges would serve everyone well here!
In parts one and two, it's explained that just about every list generated for users is an up-to-date set of recommendations. You point out yourself that recommendations are one area where caching isn't truly viable (at least, not in a traditional sense).
I'm definitely not the right person to go into all the details (nor do I think such a discussion would be prudent on HackerNews), but I wanted to weigh in quickly that there's a lot of stuff served that goes way beyond the notion of "static" content that's trivially cached.
Are recommendations computed in real-time though? Have you considered e.g. batch recomputation overnight with a 'full' algorithm, and just applying a linearized model to any newly rated content?
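The scheme suggested above can be sketched as follows: a "full" model recomputed in a nightly batch, with a cheap linear adjustment folded in for ratings that arrive between batches. All item names, scores and weights here are invented for illustration; the real algorithms are not public.

```python
# Hypothetical output of last night's batch job: item -> base score.
nightly_scores = {"movie_a": 0.82, "movie_b": 0.41, "movie_c": 0.67}

# A simple linearized model: each new rating of a source item nudges
# related items' scores by a per-pair weight learned offline.
similarity_weight = {
    ("movie_a", "movie_c"): 0.2,   # liking movie_a boosts movie_c
    ("movie_a", "movie_b"): -0.1,  # ...and slightly demotes movie_b
}

def adjusted_scores(new_ratings):
    """Apply a linear correction for ratings received since the batch.

    new_ratings: dict of item -> rating delta (e.g. +1.0 for a like).
    The nightly scores stay untouched; only a cheap O(pairs) pass
    runs at request time.
    """
    scores = dict(nightly_scores)
    for rated_item, delta in new_ratings.items():
        for (src, dst), weight in similarity_weight.items():
            if src == rated_item:
                scores[dst] += weight * delta
    return scores

scores = adjusted_scores({"movie_a": 1.0})
```

The trade-off is exactly the one described: the linear correction is a crude approximation of what the full model would produce, but it keeps new ratings visible immediately while the expensive recomputation runs overnight.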
I feel like the quality of the Netflix recommendations is not stellar, and if that's because you're constraining yourself to what can be calculated in real-time, I'd willingly trade-off having "perfect" real-time recommendations in favor of better recommendations tomorrow (with the full model). Even if you do try to update recommendations in real time, aren't they easily cacheable if you can't keep up? (Well, as easily cacheable as any dataset on 25 million subscribers can be...)
Some stuff is in real time, some is pre-calculated. There is an enormous amount of research and testing going on in this space all the time; it's complex and it's evolving fast.
I don't understand how all the recommendation engine stuff really needs to be in the critical path; 99% of the time, the Netflix behavior I observe (admittedly, a sample size of 2) is "watch the next episode of the same series." And new series are discovered via referral from someone else, or by googling e.g. "post apocalyptic sci fi movies", then figuring out what Netflix has, downloading if unavailable, or Amazoning as the absolute worst case. The Netflix recommender doesn't really fit in, so all they need is authentication and authorization, a static URL distributor, and a CDN.
Visit slideshare.net/netflix and read my architecture slides; there's plenty of detail about how Netflix works available if you have a few hours to look through it.
Fascinating stuff... the latest architectural overview is particularly interesting (http://www.slideshare.net/adrianco/netflix-architecture-tuto...) If I had one criticism, I'd love to see a separate overview of the fundamental (CS?) problems, vs. the ephemeral engineering problems (AWS). We all know AWS will go the way of the mainframe (though we may disagree as to timeframes!), but I think e.g. content recommendation algorithms and architectures will forever remain an interesting problem.
Though I'd love to see the monitoring solution open-sourced :-)
Monitoring is done with two systems, one in-house in-band that we might open source one day (was called Epic, currently called Atlas). The other is AppDynamics running as a SaaS application with no dependencies on AWS. There is some useful overlap for when one or the other breaks, we merge the alerts from both (plus Gomez etc) but they have very different strengths as tools.
I ran one of the recommendation algorithm teams for a few years before we did the cloud migration. The techblog summaries of the algorithms are pretty good. The implementation is lots of fine grain services and data sources, changing continuously. Hard to stick a fork in it and call it done for long enough to document how it works.