Maybe Normalizing Isn't Normal (codinghorror.com)
17 points by bdfh42 on July 15, 2008 | 15 comments


That 6-join query is wrong, and the rest of the article follows from that incorrect join.

Why would you ever want a query that returned the user data multiple times, once for every combination of screen_name and phone_number?!? No real site would ever want to do that.

If you need phone numbers, you get phone numbers; if you need screen names, you get those - but why would you need every possible combination of phone_number and screen_name?
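To make it concrete (rough sketch, table names made up):

    -- Joining two independent one-to-many tables in a single query
    -- returns the cross product: a user with 3 screen names and
    -- 2 phone numbers comes back as 6 rows.
    SELECT u.name, s.screen_name, p.phone_number
      FROM users u
      JOIN screen_names  s ON s.user_id = u.user_id
      JOIN phone_numbers p ON p.user_id = u.user_id
     WHERE u.user_id = 42;

    -- What a real site actually does: fetch each set separately.
    SELECT screen_name  FROM screen_names  WHERE user_id = 42;
    SELECT phone_number FROM phone_numbers WHERE user_id = 42;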

He's simply wrong. Speeding up joins is never a reason to denormalize, for the simple reason that it doesn't speed them up!

A valid reason to denormalize is to precalculate data, which he touches on only very briefly. But you always write that data twice: once normalized - that's the primary data - and then again as the "cached/precalculated" version. You should always be able to regenerate the latter from the normalized data.
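Roughly this kind of thing (made-up tables):

    -- Primary, normalized data: comments(comment_id, post_id, ...)
    -- Denormalized cache, written alongside it and regenerable at any time:
    UPDATE posts
       SET comment_count = (SELECT COUNT(*)
                              FROM comments
                             WHERE comments.post_id = posts.post_id);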

OK, one final reason to denormalize, which he didn't even mention: if your WHERE clause filters on one table but your ORDER BY sorts on a different table, you need to denormalize, because you cannot create a combined index across both tables. (Databases with function-based indexes might manage it, but that's pretty complicated.)
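Made-up example of what I mean:

    -- Filter on one table, sort on another: no single index covers both.
    SELECT p.*
      FROM posts p
      JOIN users u ON u.user_id = p.user_id
     WHERE u.country = 'US'
     ORDER BY p.created_at DESC;

    -- Denormalize: copy the column onto posts, and one composite index
    -- can serve both the filter and the sort.
    ALTER TABLE posts ADD COLUMN country varchar(2);
    CREATE INDEX posts_country_created ON posts (country, created_at);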


Firstly, he has misunderstood database design. The design that the system analysts produce will be in 3NF, because their job is to understand the data in its most abstract sense. They will also specify - again, fully normalized - the main queries that will need to be run. Then it will go to the implementors, who will transform the tables and the queries for performance. Denormalizing too early is a premature optimization. Not denormalizing at all is inexperienced.

Secondly, if you need to scale you are most likely using a query cache (e.g. Tangosol) between your RDBMS and your middle tier, so you can preserve a far cleaner data model now than you could back in the day.


I think a lot of people also completely miss the fact that there is more than one way to have a normalized DB. Much like OO design, there's a big gap between smart, good, efficient implementations and the rest.


Signs, of a certain type:

1. A post about relational databases where the word "relation" is never used. Resulting in...

2. Missing the whole idea/point of relational databases--that relations are logical data structures, not physical data structures. Resulting in...

3. Thinking that normalization applies to the physical table structure, rather than the logical relation structure. And...

4. Not realizing that there are only 3 relations being represented by the 6-table representation. Resulting in...

5. A lot of writing about an absurd false dichotomy.


Not a db expert, but I can't imagine joining tables is much of a problem. At least in theory, the db should be able to handle the tables internally as if they were only one table. Not sure if all db engines would be good enough at optimizing, but it is probably one of the easier approaches to performance tuning. Therefore I would expect that most db engines already handle that nicely.

I think it is OK to cache some things, like aggregated values (for example, how many friends a user has). But I would not consider those cached values (or views) part of the actual base data model. Even though those values might be stored in the same db, I would consider them another layer, in a way.
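Something like this, roughly - the aggregate is derived from the base tables, so it lives in its own layer (made-up names):

    -- Base data: friendships(user_id, friend_id)
    CREATE VIEW user_friend_counts AS
    SELECT user_id, COUNT(*) AS friend_count
      FROM friendships
     GROUP BY user_id;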


I'm not sure what you mean by "handle the tables internally", but as a really rough exercise, you can think of just raw table reads.

If you have 6 tables that need to be joined, you're going to need to "look through" each of them to find the rows with a matching id. If it's all one table, it's essentially just one read.

Now, there are many, many caveats to all of this (how indexes are set up, relative sizes of the tables, etc.), but in general it is more computationally expensive to do more joins.
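You can see it with EXPLAIN on most engines (rough sketch, made-up tables):

    -- Each joined table typically shows up as another index lookup or
    -- nested-loop step in the plan; the single-table version is one read.
    EXPLAIN SELECT *
      FROM users u
      JOIN screen_names  s ON s.user_id = u.user_id
      JOIN phone_numbers p ON p.user_id = u.user_id
     WHERE u.user_id = 42;

    EXPLAIN SELECT * FROM users_flat WHERE user_id = 42;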


In theory the db could keep the separate tables in one table internally. But granted, probably it won't always do so.


Yeah, this is called a materialized view.


This depends heavily on implementation details. If you are using Oracle (rare, I know, for any company not in the Fortune 500), then just make a materialized view and you are set. Oracle will pre-compute most of the data in the view and ensure consistency (not an Oracle DBA, so some confirmation here would be nice). Furthermore, some database engines handle joins and the like a lot more efficiently than others.
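Roughly this, as far as I know (made-up names):

    -- Rough Oracle sketch; fast/on-commit refresh needs materialized view
    -- logs (and rowids in the select list) on the base tables, so check the docs.
    CREATE MATERIALIZED VIEW user_profiles_mv
      REFRESH FAST ON COMMIT
    AS
    SELECT u.user_id, u.name, s.screen_name, s.im_service
      FROM users u
      JOIN screen_names s ON s.user_id = u.user_id;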

Furthermore, some kind of query caching mechanism will always be a better choice than de-normalization, and most decent databases have that built in (and if you really want, you are welcome to add memcached on top of that, though that is not guaranteed to be correct). Either way, there are a lot more options than straight-out de-normalization.


SQL Server supports "indexed views", which are basically the same thing as materialized views.
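Rough sketch (made-up names): an indexed view needs SCHEMABINDING, two-part names, a pile of SET-option requirements, and a unique clustered index before SQL Server will materialize it.

    CREATE VIEW dbo.user_profiles
    WITH SCHEMABINDING
    AS
    SELECT u.user_id, u.name, s.screen_name, s.im_service
      FROM dbo.users u
      JOIN dbo.screen_names s ON s.user_id = u.user_id;
    GO
    CREATE UNIQUE CLUSTERED INDEX ix_user_profiles
      ON dbo.user_profiles (user_id, im_service, screen_name);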


Normalization is not an either/or proposition.

You can have one table, normalize all the way to 4th normal form, or stop anywhere in between, depending on the requirements of your app. You can use multi-valued fields to simplify your schema. You can also write your own processes to extract data, instead of using just pure SQL.

There's more than one way to do things, all with pros and cons. Naturally, in the early phases of your project, you're probably best off ensuring the integrity of your data (that is, normalizing) and worrying about scaling later.


It's not mentioned in the article, but I think the advent of services like Amazon SDB and Google App Engine's datastore (a la BigTable), where traditional joins are not possible, is affecting application design.

It's probably a worthwhile exercise to think about how you would model your data if you needed to port it to GAE/AWS SDB, just to figure out where the faults in your original data model were.


Indexing your heavily used queries and page/view caching are also good alternatives to denormalization.
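For the indexing part, something as simple as a composite index that covers the hot query often does it (made-up names):

    -- Serves "show this user's screen names" from the index alone:
    CREATE INDEX ix_screen_names_by_user
        ON screen_names (user_id, im_service, screen_name);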


Duplicated data and synchronization problems are often overstated and relatively easy to work around with cron jobs.

Yikes, I hope he's not actually doing that!


Yeah, I think this article is just flaunted stupidity.

The author assumes that any time any user information is needed, all the information about that user will be needed. That is silly. Secondly, he completely ignores the billion and one problems he will now have.

For example, how do you ensure that the same AIM username is not used on multiple accounts? With the normalized schema, this is trivially done via a unique constraint on the screen_name, im_service pair (user_id, screen_name, and im_service probably make up the composite primary key of the table anyway), so voila, consistent data is ensured.
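i.e. something like (made-up table name):

    -- One screen name per IM service, across all accounts:
    ALTER TABLE screen_names
      ADD CONSTRAINT uq_im_service_screen_name UNIQUE (im_service, screen_name);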

With the de-normalized schema, you are hopelessly fucked. All data consistency would be forced into some kind of ugly trigger.

Finally, joins are really-really-really fast for things like this. Frankly, the real thing that bites you is inconsistent data that violates basic assumptions your application makes, leading to unintended behavior - most often exceptions that are not handled reasonably. In this case, for example, you might have a way to find your friend via AIM login name, but your app blows up when the query returns 2 separate users (or your ORM silently takes the first result, making it impossible to find the actual friend).

And on and on. It is better to have slower, fully reliable data than faster bullshit.



