
(source for everything following: I recently hired entry-level data engineers)

The experience required differs dramatically between [semi]structured transactional data moving into data warehouses and highly unstructured data that the data engineer has to do a lot of munging on.

If you're working in an environment where the data is mostly structured, you will be primarily working in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, and data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.

If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but as an entry-level data engineer, you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.
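
As a rough sketch of that kind of practice script (the file paths and field names here are invented for the example), a minimal Python job that reads newline-delimited JSON in one place and writes a flat CSV somewhere else might look like this:

    import csv
    import json

    # Hypothetical paths -- "Place X" and "Place Y" for the sake of the example.
    SOURCE_PATH = "raw/events.jsonl"   # Structure A: nested JSON, one object per line
    DEST_PATH = "clean/events.csv"     # Structure B: flat CSV ready for loading

    def transform(record):
        # Flatten the nested structure into the target schema.
        return {
            "event_id": record["id"],
            "user_id": record.get("user", {}).get("id"),
            "event_type": record.get("type"),
            "occurred_at": record.get("timestamp"),
        }

    with open(SOURCE_PATH) as src, open(DEST_PATH, "w", newline="") as dst:
        writer = csv.DictWriter(
            dst, fieldnames=["event_id", "user_id", "event_type", "occurred_at"]
        )
        writer.writeheader()
        for line in src:
            writer.writerow(transform(json.loads(line)))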

I agree with a few other commenters here that Hadoop/Spark isn't being used a lot in their production environments - but - there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position, and it will give you an opportunity to show you know a few tools at a deeper level.



Agree 100% with this comment.

Old stack: Hadoop, Spark, Hive, HDFS.

New stack: Kafka/Kinesis, Fivetran/Stitch/Singer, Airflow/Dagster, dbt/Dataform, Snowflake/Redshift


Huh, what replaces Spark in those lists?

For my money, it's the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.


Distributed ML != distributed DWH.

Distributed ML is tough because you have very little control over the training loop. I personally prefer single-server training even on large datasets, or switching to online learning algos that do train/inference/retrain at the same time.
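
As a hedged sketch of that online-learning pattern (the batch source and feature layout are entirely made up), scikit-learn's partial_fit lets you interleave inference and incremental training on a single machine:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Incremental model: trained one batch at a time instead of on the full dataset.
    model = SGDClassifier()
    classes = np.array([0, 1])

    def stream_of_batches():
        # Stand-in for whatever queue/file feeds you data; purely illustrative.
        rng = np.random.default_rng(0)
        for _ in range(100):
            X = rng.normal(size=(256, 20))
            y = (X[:, 0] > 0).astype(int)
            yield X, y

    first = True
    for X, y in stream_of_batches():
        if not first:
            # Score the incoming batch before it is folded into training.
            _ = model.predict(X)
        model.partial_fit(X, y, classes=classes)
        first = False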

As for Snowflake, I haven't heard of people using Snowflake to train ML, but Snowflake is a killer in managed distributed DWH that you don't have to tinker with and tune.


> Snowflake is a killer in managed distributed DWH that you don't have to tinker with and tune

How do Snowflake (and Redshift, mentioned above) compare with CitusDB? I really like the PostgreSQL experience offered by Citus. I've been bit by too many commercial databases where the sales brochure promises the product does X, Y, and Z, only to discover later that you can't do any of them together because reasons.


So do I, theoretically at least.

But Spark is super cool and actually has algorithms which complete in a reasonable time frame on hardware I can get access to.

Like, I understand that the SQL portion is pretty commoditised (though even there, the Spark SQL Python and R APIs are super nice), but I'm not aware of any other frameworks for doing distributed training of ML models.

Have all the hipsters moved to GPUs or something? \s

> Snowflake is a killer in managed distributed DWH that you don't have to tinker with and tune

It's so very expensive though, and their pricing model is frustrating (why the hell do I need credits?).

That being said, tuning Spark/Presto or any of the non-managed alternatives is no fun either, so I wonder if it's the right tradeoff.

One thing I really, really like about Spark is the ability to write Python/R/Scala code to solve the problems that cannot be usefully expressed in SQL.
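
For a purely illustrative flavour of that, here is a minimal PySpark sketch where an ordinary Python function handles parsing logic that would be painful to express in SQL; the column names and parsing rule are invented for the example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.master("local[*]").appName("udf-demo").getOrCreate()

    df = spark.createDataFrame(
        [("order-2021-0042",), ("legacy_13/7",)], ["raw_id"]
    )

    # Arbitrary Python logic -- branching, regexes, external libraries --
    # wrapped as a UDF and used alongside regular SQL expressions.
    def normalise_id(raw):
        if raw.startswith("order-"):
            return raw.split("-")[-1]
        return raw.replace("legacy_", "").replace("/", "-")

    normalise_udf = F.udf(normalise_id, StringType())

    df.select("raw_id", normalise_udf("raw_id").alias("clean_id")).show()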

All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.


>I'm not aware of any other frameworks for doing distributed training of ML models.

TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs on a single machine or multiple machines. So does XGBoost if you don't want deep learning. You can then run them with Kubeflow or on whatever platform your SaaS provider has (GCP AI Platform, AWS SageMaker, etc.).
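
For a rough idea of what that looks like, here is a minimal PyTorch DistributedDataParallel sketch (the model, data, and hyperparameters are placeholders); it assumes the usual launch via torchrun so the rank/world-size environment variables are already set:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Expects RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT in the environment,
    # e.g. when launched with:  torchrun --nproc_per_node=4 train.py
    dist.init_process_group(backend="gloo")  # "nccl" for multi-GPU setups

    model = torch.nn.Linear(20, 2)            # stand-in for a real model
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):
        # Each worker would normally read its own shard via DistributedSampler;
        # random tensors keep the sketch self-contained.
        X = torch.randn(64, 20)
        y = torch.randint(0, 2, (64,))
        optimizer.zero_grad()
        loss = loss_fn(ddp_model(X), y)
        loss.backward()                       # gradients are all-reduced across workers
        optimizer.step()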

edit:

>All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.

Snowflake has support for custom JavaScript UDFs and a lot of built-in features (you can do absurd things with window functions). I also found it much faster than Spark.


> Snowflake has support for custom JavaScript UDFs and a lot of built-in features (you can do absurd things with window functions). I also found it much faster than Spark.

UDF support isn't really the same, to be honest. You're still a prisoner of the SELECT/FROM pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.

I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.

> TensorFlow, PyTorch (not sure if Ray is needed) and MXNet all support distributed training across CPUs/GPUs on a single machine or multiple machines. So does XGBoost if you don't want deep learning.

I forgot about XGBoost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these. I haven't ever tried to do it, but based on my experience of running deep learning frameworks and distributed ML, I suspect the combination of both to be exponentially more annoying ;) (And I deal mostly with structured data, so it doesn't buy me as much).
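
As a sketch of the kind of thing that is convenient in Spark (the data and cluster count are invented for the example), distributed KMeans can generate a cluster id that later feeds a supervised model as an extra feature:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.clustering import KMeans

    spark = SparkSession.builder.master("local[*]").appName("kmeans-demo").getOrCreate()

    df = spark.createDataFrame(
        [(1.0, 3.2), (0.9, 2.8), (8.1, 9.0), (7.9, 8.7)], ["x1", "x2"]
    )

    # Assemble raw columns into the single vector column MLlib expects.
    assembled = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

    # Distributed KMeans; the resulting cluster id can be joined back in
    # as an input feature for a downstream supervised model.
    model = KMeans(k=2, featuresCol="features", predictionCol="cluster").fit(assembled)
    model.transform(assembled).select("x1", "x2", "cluster").show()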

> You can then run them with Kubeflow or on whatever platform your SaaS provider has (GCP AI Platform, AWS SageMaker, etc.).

Do people really find these tools useful? Again, I'm not really sure what SageMaker (for example) buys me on AWS, and their pricing structure is so opaque that I'm hesitant to even invest time in it.


>UDF support isn't really the same, to be honest. You're still a prisoner of the SELECT/FROM pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.

Not sure how it's different from what you can do in Spark in terms of data transformations. Taking a list of objects as an argument basically allows your UDF to do arbitrary computations on tabular data.

> I forgot about XGBoost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these.

That's true, distributed unsupervised methods aren't done in most other places I know of. I'm guessing there are ways to do that with neural networks, although I haven't looked into it. The datasets I deal with have structure in them between events even if they're unlabeled.

>I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.

I used to do that on AWS. For our use case, Athena ate its lunch in terms of performance, latency and cost by an order of magnitude. Snowflake is priced based on demand so I suspect it'd do likewise.


Spark has a superset of the functionality Athena has. Athena is faster, but it's also very limited. They're not designed to do the same thing.


I'd put Spark in both lists. Old is Spark SQL, new is the programming-language interface.


Snowflake I suppose for the average ML use case. Not for your high-performance ML, but for your average data scientist, maybe?

Edit: I may be wrong[1]; I'd be curious to know what users who've used Spark AND Snowflake would add to the conversation.

[1] https://www.snowflake.com/blog/snowflake-and-spark-part-1-wh...


Snowflake hits its limits with complex transformations, I feel. Not just due to using SQL. Its "type system" is simpler than Spark's, which makes certain operations annoying. There's a lack of UDFs for working with complex types (lists, structs, etc.). Having to write UDFs in JavaScript is also not the greatest experience.


> There's a lack of UDFs for working with complex types (lists, structs, etc.). Having to write UDFs in JavaScript is also not the greatest experience.

We load our data into SF in JSON and do plenty of list/struct manipulation using their inbuilt functions[1]. I guess you might have to write a UDF if you are doing something super weird, but inbuilt functions should get you pretty far 90% of the time.

https://docs.snowflake.com/en/sql-reference/functions-semist...


> best distributed ML system out there

I was comparing it for the "traditional" data engineering stack that used Spark for data munging, transformations, etc.

I don't have much insight into ML systems or how Spark fits there. Not all data teams are building 'ML systems', though. The parent comment wasn't referring to any 'ML systems'; not sure why that would be automatically inferred when someone mentions a data stack.


Yeah, I suppose. I kinda think that distributed SQL is a mostly commoditised space, and wondered what replaced Spark for distributed training.

For context, I'm a DS who's spent far too much time not being able to run useful models because of hardware limitations, and a Spark cluster is incredibly good for that.

Additionally, I'd argue in favour of Spark even for ETL, as the ability to write (and test!) complicated SQL queries in R, Python and Scala was super, super transformative.
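
To give a flavour of what "testable SQL" means here (the table and column names are invented for the example), a local SparkSession lets you unit-test a query with an ordinary Python test framework:

    from pyspark.sql import SparkSession

    def revenue_by_country(spark):
        # The SQL under test, run against whatever is registered as `orders`.
        return spark.sql(
            "SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country"
        )

    def test_revenue_by_country():
        spark = SparkSession.builder.master("local[*]").appName("sql-test").getOrCreate()
        spark.createDataFrame(
            [("DE", 10.0), ("DE", 5.0), ("FR", 7.0)], ["country", "amount"]
        ).createOrReplaceTempView("orders")

        result = {r["country"]: r["revenue"] for r in revenue_by_country(spark).collect()}
        assert result == {"DE": 15.0, "FR": 7.0}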

We don't really use Spark at my current place, and every time I write Snowflake (which is great, to be fair), I'm reminded of the inherent limitations of SQL and how wonderful Spark SQL was.

I'm weird though, to be fair.


I agree with this.

Along with ML, it is also a very high-performance extraction and transformation engine.

Would love to hear what other tech is being used to replace Spark.


Can you elaborate more on the "roles" of the "new stack"? To me dbt/Dataform and Airflow/Dagster seem quite similar, so why do you need one of each? Fivetran/Stitch/Singer are all new to me.


I've used all of these, so I might be able to offer some perspective here.

In an ELT/ETL pipeline:

Airflow maps roughly to the "extract" portion of the pipeline: it's great for scheduling tasks and provides the high-level view for understanding the state changes and status of a given system. I'll typically use Airflow to schedule a job that will get raw data from xyz source(s), do something else with it, then drop it into S3. This can then trigger other tasks/workflows/Slack notifications as necessary.
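
As a rough sketch of that pattern (the DAG name, schedule, and the extract function are all made up, and the import path assumes a reasonably recent Airflow 2.x), it might look like:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_to_s3(**context):
        # Placeholder: pull raw data from the source system and drop it into S3,
        # e.g. via boto3 or an Airflow S3 hook.
        ...

    with DAG(
        dag_id="raw_events_to_s3",
        schedule_interval="@daily",
        start_date=datetime(2021, 1, 1),
        catchup=False,
    ) as dag:
        extract = PythonOperator(
            task_id="extract_to_s3",
            python_callable=extract_to_s3,
        )
        # Downstream tasks / Slack notifications would be chained here,
        # e.g. extract >> notify_slack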

You can think of dbt as the "transform" part. It really shines with how it enables data teams to write modular, testable, and version-controlled SQL - similar to how a more traditional software developer writes code. For example, when modeling a schema in a data warehouse, all of the various source tables, transformation and aggregation logic, and materialization methods are able to live in their own files and be referenced elsewhere through templating. All of the table/view dependencies are handled under the hood by dbt. For my organization, it helped untangle the web of views building views building views and made it simpler to grok exactly what might be changing where and how something may affect something else downstream. Airflow could do this too in theory, but given that you write SQL to interface with dbt, it's far more accessible for a wider audience to contribute.

Fivetran/Stitch/Singer can serve as both the "extract" and "load" parts of the equation. Fivetran "does it for you" more or less with their range of connectors for various sources and destinations. Singer simply defines a spec for sources (taps) and destinations (targets) to be used as a standard when writing a pipeline. I think the way Singer drew a line in the sand and approached defining a way of doing things is pretty cool - however, active development on it really took a hit when the company was acquired. Stitch came up with the Singer spec, and their offering is a service that manages and schedules the various taps and targets for you.
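
To make the Singer idea concrete (the stream name and fields are invented, and this emits the spec's message format directly rather than using the singer-python helper library), a tiny tap is really just a program that writes SCHEMA and RECORD messages as JSON lines to stdout:

    import json
    import sys

    def emit(message):
        # Singer taps communicate over stdout, one JSON message per line.
        sys.stdout.write(json.dumps(message) + "\n")

    # Describe the stream once...
    emit({
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
        },
        "key_properties": ["id"],
    })

    # ...then emit records; any Singer target can load them into its destination.
    for row in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
        emit({"type": "RECORD", "stream": "users", "record": row})

    # A STATE message lets the next run resume incrementally.
    emit({"type": "STATE", "value": {"users": {"max_id": 2}}})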


Airflow allows for more complex transformations of data that SQL may not be suited for. dbt is largely stuck utilizing the SQL capabilities of the warehouse it sits on, so, for instance, with Redshift you have a really bad time working with JSON-based data in dbt; Airflow can solve this problem. That's one example, but last time I was working with it we found dbt was great for analytical-modeling-type transformations, but for getting whatever munged-up data into a usable format in the first place, Airflow was king.

We also trained our analysts to write the more analytical dbt transformations, which was nice; it shifted that work onto them.

Don't get me wrong though, you can get really far with just dbt + Fivetran; in fact, it removes like 80% of the really tedious but trivial ETL work. Airflow is just there for the last 20%.

(Plus you can then utilize Airflow as a general job scheduler.)


Just a thought: what about Dremio?


> I agree with a few other commenters here that Hadoop/Spark isn't being used a lot in their production environments

I guess I'm the odd-man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.


I should have specified more thoroughly.

I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)

However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.


I guess it depends on what jobs you're looking for. There are a lot of existing companies/teams (like mine) looking to hire people, but we're on the "old stack" using Kafka, Scala, Spark, etc. We don't do any ML stuff, but I'm on the pipeline side of it. The data scientists down the line tend to use Hive/Spark SQL/Athena for a lot of work, but I'm much less involved with that.

Not all jobs are new pasture and I think that's forgotten very frequently.


I'd love to do some Kafka, Scala and Spark. What kind of exp are you looking for?


I'm also the odd one out; I see so many enterprises moving to Spark on Databricks.


It does rather depend on what sort of data. I bet a data engineer at CERN or JPL has quite a different set of required skills to, say, Google or a company playing at data science because it's the next big thing.

I should imagine at CERN etc. knowing which end of a soldering iron gets hot might still be required in some cases.

I recall back in the mumble extracting data from b&w film shot with a high-speed camera, by projecting it onto graph paper taped to the wall and manually marking the position of the "object".


When I was there, almost 20 years ago, it was all about C++, Python and Fortran, with GUIs being done in Java Swing.

I bet it is still mostly the same, just using Web GUIs nowadays.


We still have to wait for the proper sweet spot: a language that allows SQL-like handling without the restrictions of SQL.

As many advantages as SQL has, in many cases it gets in the way. The closer you move to moving data (instead of doing analysis), the more annoying it becomes.

On the other hand, current languages (such as Python) lack support when it comes to data transformations. Even Scala, which is one of the better languages for this, has severe drawbacks compared to SQL.

Hopefully better type-systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.


What's your opinion of LINQ in C#? It's been a while since I've used it but to me it seems like one of the most powerful ways to manipulate data inside of a language.


I haven't used LINQ but I have a good idea about how it works. It is certainly great to write concrete SQL-like transformations (be it based on a database, a list, ...).

Where it lacks is abstraction. To make that more concrete, let me ask: can you write LINQ that takes an arbitrary structure, selects every numeric field, and calculates the sum over it? And if that is not possible for a given structure, it should not compile.

I.e. can you define a function "numericsSum(...)" and call it with "List(User(salary: Number, deductions: Number, name: String))" and have it calculate the sum (ignoring the name field), but have it fail to compile when called with "List(User(name: String, date_of_birth: Date))"?

Another example: is it possible to create your own special "join" function that joins two data structures if they have exactly one common field (which you don't have to specify)?

In both examples, the LINQ compiler must be able to inspect the structure of a type (such as User) at compile time (not runtime) and do pretty arbitrary calculations. Most languages don't support that, and I think even LINQ only works with concrete types in that sense. Which, by the way, is already better than what most languages offer - don't get me wrong here. But it is not as powerful as what SQL "compilers" offer - however, those are then limited to SQL only, lacking the ability to do general-purpose programming.


Great points. It depends on where the business is at, the scale of their data, how processed their data is, and the timeliness/accuracy requirements of that data.



