Hacker News

Has anyone had some good experiences with Spark?

I put several weeks into moving our machine learning pipeline over to Spark, only to find I kept hitting a race condition in its scheduler.

After a bit of searching, it turns out this is actually a known issue (https://issues.apache.org/jira/browse/SPARK-4454) and there's been a fix on their GitHub for a while (https://github.com/apache/spark/pull/3345), yet in that time two releases have swung by bringing a tonne of new features.

I ultimately ended up having to drop Spark because I wasn't confident about putting it into production (the random OOMs and NPEs during development weren't great either). Does anyone have any positive experiences?



Spark is less mature than Hadoop, so you will run into issues like this. In my experience, advocating for a bug to get fixed often results in it getting fixed... on a several-month timeline. This happened with Avro support in Python: I advocated for the patch and someone supplied it in the next version of Spark.

Lemme tell you though... as someone who has used Hadoop for 5+ years... not waiting 5-10 minutes every time you run new code is worth the trouble. Despite more problems owing to immaturity, and Spark 'doing less for you' in terms of data validation than other tools like Pig/Hive, if you can get your stuff running on Spark... development is joyous. You just don't have to wait very long during development anymore.

I feel like 5 years of my life were spent in 10-minute delays. That did terrible things to my coding that I'm only just starting to get over. With Spark I am 10x as productive, and I am limited by my thinking, not my tools.

PySpark in particular is really great.


Seriously ./spark-shell is a godsend for development.

And I love the fact you can press Tab and get autocompletion of methods.


Hey - sorry you had a bad experience. That bug was filed as a "minor" issue with only one user ever reporting it, so it didn't end up high in our triage. We didn't merge the pull request because it was not correct; however, we can add our own fix for it if it's affecting users. In the future, if you chime in on a reported JIRA, that will escalate it in our process.


Sorry, I didn't mean to come across as completely negative.

I appreciate the work that's gone into Spark, and it's clearly well designed. Developing with Spark after coming from a Hadoop background was a very refreshing experience.


No worries. Hopefully you'll reconsider using it!


We've recently adopted Spark SQL, and our queries run 5-200x faster than with Hive.

Our experience with Spark Streaming, on the other hand, has been mixed. Our streaming app runs stably most of the time (up to 4 days in some cases), but we still see the occasional failure, sometimes with no exception or stack trace indicating what failed.

Our goal is to have a 24/7 streaming service, and Spark has gotten us close to that. There are just a couple of unexplained errors standing in our way.


I'm building distributed ML on top of Spark and have found it to be good overall. I've had to work out issues with partitioning and mini-batching, but I've had a good time so far. The DataFrame initiative will certainly help things. The JVM ecosystem needs a scientific environment like Python's (pandas, SciPy, ...). The potential is there with Scala, as we're seeing here today.


Our experience was as a Python shop that was backed into a corner and had to use Apache Pig for our Hadoop batch jobs.

We decided to rewrite some of those jobs from Pig to PySpark, and though there was a little bit of a learning curve and some sharp edges, the development experience is so much better than Pig that my team is generally happy with the switch.


PySpark is really compelling for Pig/Python shops. If it weren't for Pig on Spark, I'd fear for Pig's future.


Spark is pretty fantastic from our perspective. People just think of it as a faster Hadoop MR, but it is so much more. The APIs and the integrations with external systems are so much easier and more intuitive to use.

It really is Hadoop 2.0.


We've been using it in production for about half a year now and it's been great. It especially shines when you have iterative algorithms where you, e.g., first need to group by something, process that a bit, then split it out again and process it a bit more, etc. This kind of task is just so much faster in Spark than in MapReduce; it doesn't even compare. Their API is much nicer too, and so are the underlying ideas (RDDs in particular). I think it will mostly replace MR in the future. If you are starting a new project now, I see few reasons not to use Spark if it fits the bill.


I've used Spark quite successfully for a few small jobs and I love it. I'm sure robustness will improve over time (I haven't been bothered by any immaturity myself). As a system that supports such a variety of Big Data processing styles - batch, streaming, graph, ML, etc. - and is clearly bridging the gap between distributed systems and more traditional analysis languages, I think it's very exciting.


Btw, there is a new PR by Josh Rosen to fix SPARK-4454 here: https://github.com/apache/spark/pull/4660


I've been successfully using Spark in production since 0.7, across three or four significantly different projects.

I don't think I could bring myself to ever write another Hadoop job.


I do, mostly using GraphX. I ran a pretty big EC2 cluster (hundreds of cores) on Spark 1.2 and had no significant issues.



