> For example we use Vertica and DBA told us that Vertica loves wide tables with...

markus_zhang · on Jan 11, 2021

I think Kimball has one benefit that single table doesn't have i.e. it's much more business intuitive. For example it's really difficult for me now to explain to our new hire why two completely unrelated fields are stuck in one table and one is NULL for 95% of the time.

The wide table thing is also kind of certified by someone inside of Vertica so I guess it does make sense. I'm not actually working in data team so I don't have the CS knowledge to prove or disprove it. Fortunately our new data lead is going to convert everything into Kimball so we are going to have solid proof eventually.

carlineng · on Jan 12, 2021

The CEO of Chartio recently gave a talk advocating for wide tables [1], but I don't fully agree.

Having a wide table can have some performance benefits, but I think the primary benefit of dimensional modelling isn't related to performance, but having standardized definitions, or "conformed" dimensions. For example, having a "calendar" dimension with attributes you can pull into your analysis such as fiscal quarter or reporting week.

In a data model with wide tables, if you need to change the definition of a dimension, it's going to be painful to track down all the different places where that old dimension is used.

[1] https://www.youtube.com/watch?v=3OcS2TMXELU

mulmen · on Jan 12, 2021

Those common definitions are priceless. On top of not repeating yourself you also avoid questions like “why does carlineng’s report say it is week 43 but mulmen’s report shows week 44? This is even worse if we both calculated revenue but I did SUM(payment) and you used SUM(price*qty).

I do think there’s a potential for performance benefits too because you can sort the fact and dimensions in ways that make sense for each one. Calendar is sorted by week number, product is sorted by a category. Fact is sorted by a date and locale, etc. since you don’t materialize the full projection more expensive but flexible sorting like interleaving becomes possible.

It’s probably academic but it’s not as hard as people make it sound and I think there are real benefits over these wide tables with simple sorting.

Plus if you use a tool like Looker it will write the reporting SQL for you! Generating star schema queries is easy once the tables are in place.

mulmen · on Jan 11, 2021

Yeah I think this is a common misconception with columnar stores. That if they like wide tables (they enable wide tables) that must mean the wider the better.

Sorting (or partitioning) is one of the most powerful optimizations in your toolbox. But only when optimized for some kind of access pattern. When you combine domains to get more width you have to make a compromise on the sorting. Then the wheels come off.

You still need different tables for clicks and orders and payments, even if they are very wide. You may or may not physically conform your dimensions in pure Kimball style but logically you (should) still start there.

markus_zhang · on Jan 12, 2021

Yeah I really wish I could understand all these. There are too many words. Vertica, kafka, spark, and we use all of them. Figured I got to at least know their fundamentals to make good choices.

mulmen · on Jan 12, 2021

Yeah I don’t have experience with all the tools. I’m sure they are great and have their strengths. My current setup uses both EMR and Redshift but the data model is the same on both.