
I wholeheartedly agree with the "specialization" comment.

Unless you are in a position where you can rely entirely on managed tools that do the work for you, so that all effort is centered on managing the data rather than on a holistic view of your data pipelines (Talend ETL, Informatica - the "pre-Hadoop" world, if you will, and maybe some modern tools like Snowflake), a good Data Engineer needs a deep understanding of programming languages, networking, some sysadmin work, distributed systems, containerization, statistics, and of course a good "architect" view of the ever-growing zoo of tools and languages with different pros and cons.

Given that at the end of the day most "Data Pipelines" run on distributed Linux machines, I've seen and solved endless issues with kernel and OS configuration (noexec flags, ulimits, permissions, keyring limits ...), network bottlenecks, hotspotting (both in networks and databases), overflowing partitions, odd issues on odd file systems, bad partitioning schemes, JVM flags, auditing and other compliance requirements, heavily multi-threaded custom implementations that don't use "standard" tools and instead rely on language features (goroutines, multiprocessing in Python, thread pools in Java ...), encoding problems, various TLS and other security challenges, and of course endless use of GNU tools and other CLI fun that I would not necessarily expect in a pure SQL use case (not discounting the fact that SQL is, in fact, very important).
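
To make that concrete, here is a minimal sketch (the file layout, worker function, and pool size are made up for illustration, and it assumes a Linux host) of the kind of OS-level detail that bites these jobs: a Python script that raises its own open-file limit (ulimit -n) before fanning work out over a multiprocessing pool.

    import resource
    from multiprocessing import Pool

    def count_rows(path):
        # stand-in for real per-partition work (parse, transform, load)
        with open(path) as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        # many small partition files times many workers can blow past the
        # default soft nofile limit; raise it to the hard limit (going beyond
        # that needs root, limits.conf, or a systemd override)
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

        # hypothetical partition layout, purely illustrative
        paths = [f"/data/part-{i:05d}.csv" for i in range(10_000)]
        with Pool(processes=8) as pool:
            totals = pool.map(count_rows, paths)
        print(sum(totals))

The same limit shows up with JVM-based tools too, just surfaced as "Too many open files" stack traces rather than a clean error.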

Not to mention that a lot of the jobs and workflows Data Engineers design and write tend to be very, very expensive, especially on managed clouds - it's generally a good idea to make sure everything works and that your engineers understand what they are doing.


