Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

So are we supposed to be designing all our databases to conform to third-normal form? I am not very adept at DB stuff but doesn't that increase the number of lookups needed to retrieve a row of useful data? Is the performance hit from that less painful than storing redundant (perhaps simply for caching) fields in one table?


This is the classic denormalized model vs. the traditional database model. The denormalized model, of storing all data in one lump, is popular with nosql systems and object oriented systems, in that all the data you need for one key is all in one "place". If you know that that will always be your primary access pattern, then yep, it may be faster for you.

However, if you have a variety of ways to get at your data by this key for this situation, or this key for that one, or if you have large amounts of redundant data that you'd like to be updated "all at once", such as what salesperson is assigned to the Northeast region for all customers, then you may find a normalized form is more efficient.

The big learning is YMMV: for years, relational was a religion. Snowflake and other designs showed that there was a middle ground... and nosql has shown that non-relational designs can work well also, esp in many online and large-data situations. The trick is to pick the right one for the expected workload your app will be facing, not to adhere to any religious dogma (3nf, nosql, or anything inbetween)


So a common thing in tech seems to be "oh this new tool will solve everything!". I think this phenomenon is good in the sense that it means that "religious" adherents to a tech will push its boundaries, find out where it is well suited and where it lacks, etc. I think this is just a natural outgrowth of excitement about the new shiny thing -- software people are surprisingly passionate about their craft. Of course this has a tendency to lead to annoying religious wars (sometimes at the level of pointless minutia like semi-colons :) ).

But for me the exciting part is when the fervor dies down a bit, and there is a clear understanding of "opposing" tools' strengths and weaknesses. That is when interesting tech and techniques are developed to handle both based on the strengths.

I think we are starting to get to that point with relational and nosql. Tools are starting to emerge on the relational DB side to allow more nosql-like structures within traditional DBMS systems. I think in the coming year or two we'll start seeing a lot more interesting tools to mix relational and non-relational datastores. Currently the common way of mixing is to use kv or document stores as a caching layer for relational data, or to have relational systems separate from nosql systems, with custom code to select from each. My hope tho is that there will be more uniform access methods developing, allowing the relational and non-relation data to have a common, well understood access API layer.

For example I have a project where the core data consists of users, certain user objects, and relations between users, between users and their or others objects, and between objects, basically shaped like a graph. Using neo4j + gremlin makes dealing with this portion of the app trivial - the graphdb literall solves problems for me. But another aspect of the app is relational data on objects related to user records, basically it is shaped as rows where n objects' rows need to be merged, filtered and then operated on (aka JOINed). SQL makes this set of operations trivial, in this case a DBMS literally solves problems for me. But, as it stands, I need to query one DB, get the object record pointers I need, then connect to a different DB, and query the relevant tables if I want to let the relevant DBs do the work. This involves different APIs, weird connection management, and sysadminning two different database environments.

Essentially I'm hoping someone smarter then me will come up with a way for me to:

   SELECT MY_COLLATE(d1.a, d2.b) FROM
      GREMLIN{find object o1}.data as d1 JOIN
      GREMLIN{find object o2}.data as d2 ON
          d1.col == d2.col
   WHERE conditions


>I think we are starting to get to that point with relational and nosql. Tools are starting to emerge on the relational DB side to allow more nosql-like structures within traditional DBMS systems. I think in the coming year or two we'll start seeing a lot more interesting tools to mix relational and non-relational datastores.

Yep. Postgres showed some pretty strong commitment to key-value datatypes in 2006 by including the hstore contrib module in the mainline tree, and native JSON support is coming in September with 9.2.


That's a process called de-normalisation, covered here: http://en.wikipedia.org/wiki/Denormalisation You should only consider doing that after thorough profiling showing your joins are causing the performance loss.


Premature optimization.

Normalized data is one of the important elements of relational database modeling, and almost all relational databases expect the data to be normalized, so they have optimizations in place to make doing those lookups fast. It will usually create some automatic special indices for the foreign keys or something.

The first thing to do is use a normalized data model and profile it. If it's too slow add some indices and profile it again. If it's still too slow, update the table to store redundant information and profile it again.


Not really. If you are concerned with extra lookups, then you might want consider a materialised view (Or an indexed view in SQL Server). But bare in mind that write performance will be affected, whether this is worth the cost is a tradeoff. Also consider that databases have some very efficient join mechanisms. YMMV though.


I read somewhere long time ago,

"Normalize till hell, denormalize till scale"

Or some thing like that :).


Normalize 'til it hurts, denormalize 'til it works.


Get it correct, then make it fast.


That's it. Normalisation is about correctness.

Every other aspect of software development using RDMSs will attempt to challenge decisions made for correctness' sake.


More important for storage space, quicker updates and faster indexing. If these things are less important(most cases) don't worry.


Cool, thanks :-)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: