>I'm dealing with clients that insist on writing custom applications to do complex ETL instead of using SSIS or decent third party tool.

I have some beefs with SSIS (which I bring up to the SSIS team lead, Matt Masson, every time we meet) but it's not as bad as so many folks seem to make it out to be. However, what I always see is a never-ending supply of clients that write their own ETL systems that have 10-20% of the functionality and 10-20% of the performance (and 10-20x the bugs). Even my current client seems to be doing just this.

Replacing homegrown ETL systems with SSIS, ODI or PowerCenter implementations is a great way to make money. I've seen ETL systems that folks were mighty proud of that had throughput (granted, to some very questionably modeled "data warehouses") measured in bytes/sec. Yes, bytes. We're talking 3 hours to get a 2MB file into the final fact table with some custom "framework" written in C# or Java that uses a web service for all message passing between servers within the same rack (and that will always be in the same rack). Again, that's not necessarily the framework's fault either, just how poorly it's used and how poorly the data is modeled (in the above case, I was able to rewrite it with their tools and get the 3 hour job completing in ~50 seconds after a few days' work and tearing through their framework's source code repo). The bugs you can find are fun too. My favorite was when the web service endpoints weren't reusing the same connection, so they were exhausting all the ephemeral TCP ports on the database server's TCP/IP stack -- when the job ran ~200 times faster, since the messaging was so chatty, the ETL framework was basically DoSing their SQL Server (for any service/application using a non-specific port).
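
For anyone who wants to check the "bytes/sec" claim, here is a quick back-of-the-envelope sketch in Python. It only uses the figures above (2MB file, 3 hours before, ~50 seconds after); treating "2MB" as 2 MiB is my assumption, and the names are made up for illustration.

```python
# Figures from the comment above: a 2MB file, 3 hours before, ~50 seconds after.
FILE_BYTES = 2 * 1024 * 1024      # assumption: "2MB" means 2 MiB
BEFORE_SECONDS = 3 * 60 * 60      # the original 3 hour job
AFTER_SECONDS = 50                # the rewritten job, roughly


def throughput(num_bytes, seconds):
    """Average throughput in bytes per second."""
    return num_bytes / seconds


before = throughput(FILE_BYTES, BEFORE_SECONDS)
after = throughput(FILE_BYTES, AFTER_SECONDS)
speedup = BEFORE_SECONDS / AFTER_SECONDS

print(f"before:  {before:.0f} bytes/sec")
print(f"after:   {after:.0f} bytes/sec")
print(f"speedup: {speedup:.0f}x")
```

That works out to roughly 194 bytes/sec before and about 42 KB/sec after, a ~216x speedup, which is where the "~200 times faster" comes from.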

Honestly, if you enjoy making things 100-10000x faster, live in the database world. The market will present an infinite number of opportunities for you to do so with commodity DBMSes such as SQL Server, MySQL, Postgres and Oracle. Most of it is simply cleaning up bad data modeling decisions, correcting complete misunderstandings of how the database engine works, and simplifying overly complicated systems that do very simple things.

-----

I do agree that SQL Server licensing, especially when using Enterprise, can hurt. A lot. It's really set up for scale-up architectures (and Windows clustering could stand to be improved quite a bit), and if you dare deviate from that notion, you get hurt rather badly.


