Hacker News
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE (amazon.com)
131 points by shanxS on July 10, 2016 | 18 comments


If you haven't kept up with Spark, and do not have Amazon-level workloads, the built-in machine learning APIs are extremely robust and well-documented (https://people.apache.org/~pwendell/spark-nightly/spark-mast...), even in the non-Scala languages like Python. There's even a Multilayer Perceptron classifier, which creates artificial neural networks using the feed-forward/backpropagation approach everyone loves.

The new Spark DataFrames also make manipulating data almost as easy as with Python Pandas / R dplyr. The new Berkeley edX course (https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1...) is a very good explainer.

After Spark 2.0.0 is released, it wouldn't surprise me if it really takes off. (as long as setting up a cluster is a bit easier!)


> as long as setting up a cluster is a bit easier!

If you're on AWS, it's already quite easy to set up a cluster today, no?

There's EMR of course, and there are tools like spark-ec2 [0] and Flintrock [1].

There are a few more tools listed on Spark Packages that target different cloud providers [2], too.

Disclaimer: I am the primary author of Flintrock and am a contributor to spark-ec2.

[0] https://github.com/amplab/spark-ec2

[1] https://github.com/nchammas/flintrock

[2] https://spark-packages.org/?q=tags%3Adeployment


Setting up a Spark cluster is easy relative to setting up other kinds of clusters, but it's still not as easy as simply downloading a package in R/Python.


Well, the absolute easiest way to run Spark is to do it locally (e.g. you can brew install it on a Mac and just go) or to pay for a proprietary service like Databricks, which makes setting up a cluster take a few clicks.

That said, I think `flintrock launch my-cluster` is almost as easy as doing `pip install ...`.

You do need an AWS account and you do need to set your preferences like region and key name in a config file, but I don't see how you can get out of doing even that without subscribing to some managed service like Databricks that abstracts everything away and replaces it with a nice Web UI.


If you're running a DC/OS cluster, it actually is!

> dcos package install spark

This gives you a Spark CLI, and installs the Spark Cluster Dispatcher so you can run jobs async. It also has options for installing the history server, and for SSL/Kerberos support.

Disclaimer: I work at Mesosphere and contribute to the DC/OS Spark package.


We already built what Amazon was looking for: An extensible deep-learning framework that works on distributed CPUs and GPUs heterogeneously, integrating with Spark as an access layer to orchestrate multiple host threads.

https://github.com/deeplearning4j

DL4J may be the DL library with the most sophisticated Spark integration at this point. The trick is to avoid using Spark as a computation layer, since it doesn't do that well.

We're pushing a CuDNN wrapper tomorrow.

http://deeplearning4j.org/quickstart

Unlike DSSTNE, Tensorflow or CNTK, our deep learning library is neutral, and not designed with the intention of locking people into a cloud service.


Wow, defensive much? Afraid of a little competition maybe?

Anyway, DSSTNE itself is not in any way locked to any cloud service whatsoever. It is an Apache-licensed library whose only dependencies are a C++11 compiler, CUDA Toolkit 7.0 or later, a Kepler or better GPU, a C++11-friendly MPI library, netcdf, and libjsoncpp. That's it. Please stop insinuating otherwise.

The article here shows how one could use DSSTNE with Spark for recommendations at scale. Speaking of which, how's your sparse data and model-parallel training support? Because as the author of DSSTNE, the poor support for these features in other frameworks was what forced us to "roll our own" code in the first place. Everyone else was optimizing for ImageNet-winning CNNs (including NVIDIA). And there's nothing wrong with that, $25M+ companies have been built from that, but it just wasn't the use case here.

Finally, cuSparse is cuSlow for datasets at Amazon. DSSTNE's hand-coded sparse kernels stomp on cuSparse in the same way that Neon's convolution kernels stomp on cuDNN. And there's nothing wrong with that either. cuSparse is a great choice for other sparse data problems, just not Amazon's (and, I suspect, many other companies') recommendations problems.


>Unlike DSSTNE, Tensorflow or CNTK, our deep learning library is neutral, and not designed with the intention of locking people into a cloud service.

How do any of these lock you into a cloud service?


And to reinforce Fenntrek's point: One of the core reasons for open sourcing TensorFlow is to make sure people have the ability to take their code and run it somewhere else. The exact opposite of lock-in. (I can say that with some knowledge - I'm working there right now.) In fact, the cloud service for "hosted" TF isn't even out of alpha yet, but there are many people already using TF on their own hardware.

I can't speak for MSFT or AMZN, but all three of these frameworks are completely open source with permissive licenses.

I seriously doubt that any of these three providers has lock-in as a goal by releasing their stuff OSS. Goodwill? Absolutely. Getting people to learn the technology they care about internally? Certainly. Hoping that people will play with it and find that they want to run their models on 100 nodes rented from a cloud service? Seems likely. For MSFT, wanting to make sure there was a high-performance DNN framework for Windows? Would make sense to me.

But lock-in by giving away your stuff in a way that lets anyone run it on their own hardware or a competitor's cloud? The DSSTNE benchmarks, for example, ran TensorFlow on AWS...


I've spoken with the TF team and they've been clear that open-sourcing the lib was part of strategy to make Google Cloud more appealing. It was conceived as a lure, among other things.


There's a big difference between "TensorFlow is great, and Google Cloud is great, and you can bet your butt that TensorFlow will run well on GCE" and "Run TF and get locked in to GCE". In fact, as I read it, the selling point is actually much closer to:

"TensorFlow is great, GCE is great, the two work great together, and it's open source so you can trust that you _won't_ get locked in." In other words, no-lockin is a deliberate part of the value proposition of TF+GCE.


This could have been a great opportunity to push OpenCL development, but I guess the whole world is still on CUDA :( So not quite open?


I'm gonna get in trouble saying this, but IMO OpenCL needs to die. It's a half-baked open standard that's hard to use relative to CUDA. What needs to happen is that AMD and Intel provide CUDA runtime API (just say no to the driver API) interfaces to their hardware.

Evidence: OpenGL is a wonderful open standard. But OpenGL happened after nearly a decade of evolution of closed IrisGL that ironed out a lot of weird quirks that infest OpenCL to this day.

I joined the CUDA team in March of 2006 (that's why I have 14 single-inventor CUDA patents, including one that is the basis of the persistent RNNs used by Baidu). IMO CUDA is a mature API. And CUDA needs to escape NVIDIA's vendor lock. AMD is attempting to make this happen with HIP and ROC.

http://gpuopen.com/getting-started-with-boltzmann-components...

http://www.anandtech.com/show/9792/amd-sc15-boltzmann-initia...

IMO as CUDA team member #6, we need to help AMD with this effort. To that end, we need to let go of OpenCL: I'm sorry but it's a dead-end. And we need to get as much of the existing CUDA code out there running on AMD hardware as we can. And I hope (but I don't think they'll agree because reasons(tm)) that Intel will provide a CUDA runtime API to Xeon Phi (Knights Landing or better) because CUDA is a better API for parallel computation than any other existing contender at this point.

And that's my story and I'm sticking to it. But don't get me started on both Google and Apple refusing to expose OpenCL on mobile, where it could have once made a real difference. That time has now passed. CUDA won, but we still don't have a CUDA or OpenCL interface on either Android or iOS.


Hear, hear


What's Deep Learning 4J's performance on the Soumith benchmarks?

https://github.com/soumith/convnet-benchmarks


We wrap cuDNN, which is among the top performers.


Which doesn't answer the question he asked. CuDNN provides efficient convolution and RNN kernels. It is not a framework itself. There are plenty of ways to screw up aspects of a framework other than its convolution kernels.


Their suggestions aren't all that great in my experience. It's either something I already looked at/bought before or something completely arbitrary.



