Hacker News

Has anyone had some good experiences with Spark?

I put several weeks into moving our machine learning pipeline over to Spark, only to find I kept hitting a race condition in its scheduler.

After a bit of searching, it turns out this is actually a known issue (https://issues.apache.org/jira/browse/SPARK-4454) and there's been a fix on their GitHub for a while (https://github.com/apache/spark/pull/3345), yet in that time two releases have swung by bringing a tonne of new features.

I ultimately ended up having to drop Spark because I wasn't confident about putting it into production (the random OOMs and NPEs during development weren't great either). Does anyone have any positive experiences?



Spark is less mature than Hadoop, so you will run into issues like this. In my experience, advocating for a bug to get fixed often results in it getting fixed... on a several-month timeline. This happened with Avro support in Python: I advocated for the patch and someone supplied it in the next version of Spark.

Lemme tell you though... as someone who has used Hadoop for 5+ years... not waiting 5-10 minutes every time you run new code is worth the trouble. Despite more problems owing to immaturity, and Spark 'doing less for you' in terms of data validation than other tools like Pig/Hive, if you can get your stuff running on Spark... development is joyous. You just don't have to wait very long during development anymore.

I feel like 5 years of my life were spent in 10-minute delays. That did terrible things to my coding that I'm only just starting to get over. With Spark I am 10x as productive, and I am limited by my thinking, not my tools.

PySpark in particular is really great.


Seriously ./spark-shell is a godsend for development.

And I love the fact you can press Tab and get autocompletion of methods.


Hey - sorry you had a bad experience. That bug was filed as a "minor" issue with only one user ever reporting it, so it didn't end up high in our triage. We didn't merge the pull request because it was not correct; however, we can add our own fix for it if it's affecting users. In the future, if you chime in on a reported JIRA, that will escalate it in our process.


Sorry, I didn't mean to come across as completely negative.

I appreciate the work that's gone into Spark, and it's clearly well designed. Developing with Spark after coming from a Hadoop background was a very refreshing experience.


No worries. Hopefully you'll reconsider using it!


We've recently adopted Spark SQL, and our queries run 5-200x faster than with Hive.

Our experience with Spark Streaming, on the other hand, has been mixed. Our streaming app runs stably most of the time (up to 4 days in some cases), but we still see the occasional failure, sometimes with no exception or stack trace indicating what failed.

Our goal is to have a 24/7 streaming service, and Spark has gotten us close to that. There are just a couple of unexplained errors standing in our way.


I'm building distributed ML on top of Spark and have found it to be good overall. I've had to work out issues with partitioning and mini-batching, but I've had a good time so far. The DataFrame initiative will certainly help things. The JVM ecosystem needs a scientific environment like Python's (pandas, SciPy, ...). The potential is there with Scala, as we're seeing here today.


Our experience was as a Python shop that was backed into a corner and had to use Apache Pig for our Hadoop batch jobs.

We decided to rewrite some of those jobs from Pig to PySpark, and though there was a little bit of a learning curve and some sharp edges, the development experience is so much better than Pig that my team is generally happy with the switch.


PySpark is really compelling for Pig/Python shops. If it weren't for Pig on Spark, I'd fear for Pig's future.


Spark is pretty fantastic from our perspective. People just think of it as a faster Hadoop MR, but it is so much more. The APIs and the integrations with external systems are so much easier and more intuitive to use.

It really is Hadoop 2.0.


We've been using it in production for about half a year now and it's been great. It especially shines when you have iterative algorithms where you, e.g., first need to group by something, process that a bit, then split it out again and process it a bit more, etc. This kind of task is just so much faster in Spark than in MapReduce; it doesn't even compare. Their API is much nicer too, and so are the underlying ideas (RDDs in particular). I think it will mostly replace MR in the future. If you are starting a new project now, I see few reasons not to use Spark if it fits the bill.


I've used Spark quite successfully for a few small jobs and I love it. I'm sure robustness will improve over time (I haven't been bothered by any immaturity myself). As a system that supports such a variety of Big Data processing styles - batch, streaming, graph, ML, etc. - and is clearly bridging the gap between distributed systems and more traditional analysis languages, I think it's very exciting.


Btw, there is a new PR by Josh Rosen to fix SPARK-4454 here: https://github.com/apache/spark/pull/4660


I've been successfully using Spark in production since 0.7, across three or four significantly different projects.

I don't think I could bring myself to ever write another Hadoop job.


I do, mostly using GraphX. I ran a pretty big EC2 cluster (hundreds of cores) on Spark 1.2 and had no significant issues.



