Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

> For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me.

I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.

Columnar stores love star schemas. You can get away with a single table model too but you still need some kind of dimensional or at least domain-based thinking. Your single table is going to basically be a Kimball model but already joined together.

No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.

Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.

I'm still waiting for someone to come along and propose something different but honestly Kimball's dimensional mental model still resonates with me. Are there compromises, can you relax the model more? Of course, but you're still going to realize huge benefits from starting with that approach. I don't think there is some "new" way of thinking that really changed the data space. All the innovation is on the compute side.

I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.



I think Kimball has one benefit that single table doesn't have i.e. it's much more business intuitive. For example it's really difficult for me now to explain to our new hire why two completely unrelated fields are stuck in one table and one is NULL for 95% of the time.

The wide table thing is also kind of certified by someone inside of Vertica so I guess it does make sense. I'm not actually working in data team so I don't have the CS knowledge to prove or disprove it. Fortunately our new data lead is going to convert everything into Kimball so we are going to have solid proof eventually.


The CEO of Chartio recently gave a talk advocating for wide tables [1], but I don't fully agree.

Having a wide table can have some performance benefits, but I think the primary benefit of dimensional modelling isn't related to performance, but having standardized definitions, or "conformed" dimensions. For example, having a "calendar" dimension with attributes you can pull into your analysis such as fiscal quarter or reporting week.

In a data model with wide tables, if you need to change the definition of a dimension, it's going to be painful to track down all the different places where that old dimension is used.

[1] https://www.youtube.com/watch?v=3OcS2TMXELU


Those common definitions are priceless. On top of not repeating yourself you also avoid questions like “why does carlineng’s report say it is week 43 but mulmen’s report shows week 44? This is even worse if we both calculated revenue but I did SUM(payment) and you used SUM(price*qty).

I do think there’s a potential for performance benefits too because you can sort the fact and dimensions in ways that make sense for each one. Calendar is sorted by week number, product is sorted by a category. Fact is sorted by a date and locale, etc. since you don’t materialize the full projection more expensive but flexible sorting like interleaving becomes possible.

It’s probably academic but it’s not as hard as people make it sound and I think there are real benefits over these wide tables with simple sorting.

Plus if you use a tool like Looker it will write the reporting SQL for you! Generating star schema queries is easy once the tables are in place.


Yeah I think this is a common misconception with columnar stores. That if they like wide tables (they enable wide tables) that must mean the wider the better.

Sorting (or partitioning) is one of the most powerful optimizations in your toolbox. But only when optimized for some kind of access pattern. When you combine domains to get more width you have to make a compromise on the sorting. Then the wheels come off.

You still need different tables for clicks and orders and payments, even if they are very wide. You may or may not physically conform your dimensions in pure Kimball style but logically you (should) still start there.


Yeah I really wish I could understand all these. There are too many words. Vertica, kafka, spark, and we use all of them. Figured I got to at least know their fundamentals to make good choices.


Yeah I don’t have experience with all the tools. I’m sure they are great and have their strengths. My current setup uses both EMR and Redshift but the data model is the same on both.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: