Why CouchDB?

chime · on Dec 10, 2008

I've read tons of documents, articles, and how-tos on schema-free non-RDBMS DBs like SimpleDB, CouchDB, BigTable, DBSlayer etc. by now and I agree they seem very interesting and powerful. However, I just don't see how having each document (e.g. invoice) have its own field structure would be a good thing. Every database I've ever worked on is structured and every new document almost always has the same data as every other document. That is not because of the limitations of the database but rather the nature of the business. Every invoice should have the same exact fields as every other invoice because employees are trained to fill and understand the implications of the data in each field. This has a lot more to do with business process efficiency and nothing to do with computers. RDBMSs just facilitate that business rigidity more naturally. Of course, nobody's saying I can't do this in CouchDB, but then why use CouchDB if I don't need the fluid document structure?

Also, the biggest question I have is, how powerful are the dynamic views? I have tables with 10m rows and often serve 100+ select queries per second with joins/where/having/order/group-by/full-text-search functions. The business goal is not to make group-by queries but to summarize data in real-time as the analysts want, e.g. sales by product per day or shipments by employee per week. How fast would a view that performs an equivalent function work on CouchDB? Can I optimize it somehow? I don't see much mention of index columns anywhere.

What bugs me most is that when dealing with databases like CouchDB and SimpleDB, the "best practice" advice I hear is to just write summary data and search indexes on my own. That is, every time a new invoice is generated, add +1 to the summary.count.invoice record, add +300 to the balance.customer.acme-corp record and -5 to the inventory.widget record. And don't forget to add every unique word in the document to the search index table. That would be great if I knew the exact business needs ahead of time, but in reality, first you capture all the data you can capture and then generate reports based on the needs of the user.
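For concreteness, the hand-rolled bookkeeping described above might look roughly like this in Python. This is only a sketch: the three counter keys are the ones named in the comment, while `record_invoice`, its arguments, and the in-memory `Counter` standing in for stored summary records are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for summary records the application maintains itself.
summary = Counter()


def record_invoice(customer: str, amount: int, items: dict) -> None:
    """Apply one new invoice to every summary record the application keeps."""
    summary["summary.count.invoice"] += 1
    summary[f"balance.customer.{customer}"] += amount
    for item, quantity in items.items():
        summary[f"inventory.{item}"] -= quantity


# One invoice for 300 covering 5 widgets, as in the example above.
record_invoice("acme-corp", 300, {"widget": 5})
print(dict(summary))
```

Any report that was not anticipated when `record_invoice` was written needs new code plus a pass over the existing invoices, which is exactly the objection.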

I fully understand all the wonderful things that CouchDB and SimpleDB provide, like statelessness, replication, caching, and scalability. I know they are not SQL databases and will not offer drop-in replacements. However, I just don't see how I can use them for any of the hundreds of database projects I've been involved in over the past decade and a half as easily as any decent SQL server.

Here's an article I want to read someday. "How CouchDB will be a better solution when you try to do X" where X is something real-world database folks do on a regular basis and X is not a blog engine, recipe manager, or address book. I have tens of GB of data that I'd rather store in SimpleDB than host myself on MySQL or Postgres. However, what I make from all of these document-oriented databases is that they're wonderful, they will solve all of my scalability and concurrency problems, and they will require me to reimplement all of the necessary grouping and indexing features of a typical RDBMS myself.

bitdiddle · on Dec 10, 2008

Business evolves. This is why the focus is often on process rather than objects. In reality there are no real classes of things, just instances, each distinct. The key word you mention is "structured". Sure, if you're building apps to support the DMV, where documents and forms change once every 30 years and people work for life sitting there doing data entry on screens that look like those forms, then it's a safe bet you can design a relational schema that supports this. The same holds for banking and other transactional settings. The relational approach is powerful, with an algebra to back it up and support a query language.

So where do you put the logic, business rules, and integrity constraints that govern this data? In stored procedures, key constraints? Or does it leak into the application code, especially if the code is OO? Using OOP, how do the objects map to the tables? Some high-level abstraction like Hibernate that presumes to make that declarative and automatic? Now what if the schema is not so static, suppose it's dynamic. How does it evolve?

I think what many have recognized is that there is considerable overlap in the db world with app servers and web servers. The web is a much more dynamic place and increasingly I think relational databases are used more for just simple storage.

We've all used RDBMSs for years and they've had loads of PhD theses put into making them what they are. Sometimes a different view is helpful. CouchDB has a dirt-simple REST-based API. It uses JSON for communication and JavaScript in the database. It lets you store almost anything you want and it supports replication. For a real-world scenario think of, say, something like a web-based Lotus Notes that supports online and offline use and collaboration.

Another very interesting aspect of CouchDB is its choice of implementation language, Erlang. What it inherits from OTP is readily seen in the small amount of code needed to implement it. Moreover, when it comes to robustness, Erlang is frightening in how rock solid it is.

For what it's worth, I think it's a keeper.
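To give a feel for the REST-plus-JSON style, here is a minimal Python sketch that builds, but deliberately does not send, the two requests for storing and fetching a document. The database name `notes`, the document id, the document fields, and a server on `localhost` are assumptions for illustration; 5984 is CouchDB's default port.

```python
import json
import urllib.request

# Assumed for illustration: a local server on the default port 5984.
BASE_URL = "http://localhost:5984"


def build_put_document(db: str, doc_id: str, doc: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP PUT that stores a JSON document."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{db}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_get_document(db: str, doc_id: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP GET that fetches the document back."""
    return urllib.request.Request(url=f"{BASE_URL}/{db}/{doc_id}", method="GET")


# Hypothetical database, document id, and document body.
put_req = build_put_document(
    "notes", "meeting-2008-12-10", {"title": "Kickoff", "attendees": ["ann", "bob"]}
)
get_req = build_get_document("notes", "meeting-2008-12-10")

print(put_req.get_method(), put_req.full_url)
print(get_req.get_method(), get_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.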

janl · on Dec 10, 2008

Thanks for taking the time to respond in such detail!

> I've read tons of documents, articles, and how-tos on schema-free non-RDBMS DBs like SimpleDB, CouchDB, BigTable, DBSlayer etc. by now and I agree they seem very interesting and powerful. However, I just don't see how having each document (e.g. invoice) have its own field structure would be a good thing. Every database I've ever worked on is structured and every new document almost always has the same data as every other document. That is not because of the limitations of the database but rather the nature of the business. Every invoice should have the same exact fields as every other invoice because employees are trained to fill and understand the implications of the data in each field. This has a lot more to do with business process efficiency and nothing to do with computers. RDBMSs just facilitate that business rigidity more naturally. Of course, nobody's saying I can't do this in CouchDB, but then why use CouchDB if I don't need the fluid document structure?

It is not useful if every document has its own structure :) But it is useful to be able to model subtle differences, e.g. fax numbers in address records (basically what a NULL column is in a relational table).
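As a small illustration of the fax-number case, here is a Python sketch in which plain dicts stand in for JSON documents. The field names and the `fax_line` helper are made up for the example.

```python
# Two address records; plain dicts stand in for JSON documents.
# All field names here are hypothetical.
alice = {"_id": "addr-alice", "name": "Alice", "phone": "555-0100", "fax": "555-0101"}
bob = {"_id": "addr-bob", "name": "Bob", "phone": "555-0200"}  # no fax field at all


def fax_line(doc: dict) -> str:
    """Format the fax number, tolerating documents that omit the field."""
    fax = doc.get("fax")
    if fax is None:
        return f"{doc['name']}: no fax"
    return f"{doc['name']}: fax {fax}"


print(fax_line(alice))
print(fax_line(bob))
```

The second record simply leaves the field out, where a relational table would carry a NULL in a `fax` column for that row.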

> Also, the biggest question I have is, how powerful are the dynamic views? I have tables with 10m rows and often serve 100+ select queries per second with joins/where/having/order/group-by/full-text-search functions. The business goal is not to make group-by queries but to summarize data in real-time as the analysts want, e.g. sales by product per day or shipments by employee per week. How fast would a view that performs an equivalent function work on CouchDB? Can I optimize it somehow? I don't see much mention of index columns anywhere.

There are two kinds of views in CouchDB: temporary and permanent. Temporary views should only be used in development on small amounts of data. Permanent views can do the exact same things temporary views can do, except their results are cached. For dynamic lookups you use the view's key range, which you can query in numerous ways, as well as reduce, which lets you do all sorts of computations. Something like average/total/max spending over a dynamic period of time (last week, last month, last year) can be generated from a single view in O(log n) time (where n is the number of documents). Concurrent reads are non-blocking. I have seen a couple of hundred req/s on a view running on a dated MacBook.
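The key-range-plus-reduce idea can be imitated in a few lines of Python. This is an in-memory simulation, not the CouchDB API: the invoice documents, their field names, and the `build_view` and `query_sum` helpers are assumptions for illustration, and the sum over the selected range here is linear in the number of matching rows rather than the O(log n) lookup mentioned above.

```python
import bisect

# Hypothetical invoice documents; field names are assumptions.
invoices = [
    {"_id": "inv-1", "date": [2008, 11, 28], "amount": 300},
    {"_id": "inv-2", "date": [2008, 12, 1], "amount": 120},
    {"_id": "inv-3", "date": [2008, 12, 5], "amount": 80},
    {"_id": "inv-4", "date": [2008, 12, 9], "amount": 50},
]


def map_invoice(doc):
    """Map step: emit one (key, value) row per document."""
    yield (tuple(doc["date"]), doc["amount"])


def build_view(docs):
    """Run the map step over all docs and keep the rows sorted by key."""
    return sorted(row for doc in docs for row in map_invoice(doc))


def query_sum(view, startkey, endkey):
    """Reduce step: sum the values whose keys fall in [startkey, endkey]."""
    keys = [key for key, _ in view]
    lo = bisect.bisect_left(keys, startkey)
    hi = bisect.bisect_right(keys, endkey)
    return sum(value for _, value in view[lo:hi])


view = build_view(invoices)

# The same view answers different periods just by changing the key range.
december_total = query_sum(view, (2008, 12, 1), (2008, 12, 31))
first_week_total = query_sum(view, (2008, 12, 1), (2008, 12, 7))
print(december_total, first_week_total)
```

The point of the sketch is that one sorted index serves "last week", "last month", and "last year" alike; only the start and end keys change.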

> What bugs me most is that when dealing with databases like CouchDB and SimpleDB, the "best practice" advice I hear is to just write summary data and search indexes on my own. That is, every time a new invoice is generated, add +1 to the summary.count.invoice record, add +300 to the balance.customer.acme-corp record and -5 to the inventory.widget record. And don't forget to add every unique word in the document to the search index table. That would be great if I knew the exact business needs ahead of time, but in reality, first you capture all the data you can capture and then generate reports based on the needs of the user.

I can't speak for SimpleDB, but in CouchDB the housekeeping is done for you automatically once you have created a view for all your stats & things. You can even notify external services (e.g. full-text search) to index new or changed data.

shaunxcode · on Dec 10, 2008

I can see the point in terms of avoiding table locking etc., but I still don't see it (not that it is pretending to be) as a silver bullet. Surely you would still want to maintain your relational data structure and then have it write to CouchDB when appropriate for read access, almost like a denormalized view onto your relational data.

janl · on Dec 10, 2008

CouchDB is no silver bullet. I hope we never give the impression that we think it is.

On your point: you can still manage relations in CouchDB; they are just implicit.
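One way to read "implicit" is that documents point at each other by id and the application, not the database, decides what those pointers mean. A minimal Python sketch under that assumption follows; the `type` and `customer_id` fields and both helper functions are hypothetical conventions, not anything CouchDB prescribes.

```python
from collections import defaultdict

# Hypothetical documents; "type" and "customer_id" are conventions the
# application chooses, not something the database enforces.
docs = [
    {"_id": "cust-acme", "type": "customer", "name": "Acme Corp"},
    {"_id": "inv-1", "type": "invoice", "customer_id": "cust-acme", "amount": 300},
    {"_id": "inv-2", "type": "invoice", "customer_id": "cust-acme", "amount": 120},
    {"_id": "inv-3", "type": "invoice", "customer_id": "cust-zeta", "amount": 75},
]


def invoices_by_customer(all_docs):
    """Group invoice ids under the customer id they point at."""
    index = defaultdict(list)
    for doc in all_docs:
        if doc.get("type") == "invoice":
            index[doc["customer_id"]].append(doc["_id"])
    return dict(index)


def dangling_references(all_docs):
    """Find customer ids that invoices point at but no document has."""
    known = {doc["_id"] for doc in all_docs}
    referenced = {d["customer_id"] for d in all_docs if d.get("type") == "invoice"}
    return sorted(referenced - known)


by_customer = invoices_by_customer(docs)
print(by_customer)
print(dangling_references(docs))
```

The second helper shows the flip side: nothing stops `inv-3` from pointing at a customer that does not exist, so any such check is the application's job.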

illumen · on Dec 10, 2008

It's not installed, and code I wrote for it a year ago does not work anymore.

illumen · on Dec 10, 2008

To explain my comment...

1. CouchDB is not installed and available everywhere like MySQL.

2. The API is not stable (they haven't reached 1.0 yet), so code you wrote for it a year ago does not work on today's versions. Code you write for it today will likely not work in two years' time.

This is 'why not CouchDB' for me.

nslater · on Dec 10, 2008

If you need to wait for wide adoption in the hosted server market place, this seems like a sensible concern. You can install CouchDB by hand of course, it's really not that hard.

The API changes occasionally, as is natural. The project is improving as we get more popular and people suggest new and better ways of doing things. This is part of having a healthy free software community. It's important to point out that the API is so simple it's crazy, and it changes very infrequently - so it's really not much of an issue.

janl · on Dec 10, 2008

These are fair points, but what do you expect from a project in alpha state not nearly half the age of MySQL? :)

pushcx · on Dec 10, 2008

If they're trying to sell themselves as a reasonable alternative (let alone superior), yes, I do.

nslater · on Dec 10, 2008

Wait, what? Where does it say we're an alternative to relational databases, or MySQL?

illumen · on Dec 10, 2008

The whole article is a comparison to relational databases.

nslater · on Dec 10, 2008

Nope, it's really not meant to be. But it is intended as a primer for those coming from a relational database background. CouchDB is NOT a replacement or an alternative to relational databases. CouchDB is fundamentally a different way to model data. That may work for you, it may not. CouchDB is not a panacea.

caudicus · on Dec 10, 2008

I think they're trying to say how they differ from relational databases since most people know and understand relational databases. It's like telling someone in America how cricket is played and using baseball as a base point to explain it.

illumen · on Dec 10, 2008

Yes. Saying how something differs is called a comparison.

caudicus · on Dec 10, 2008

In response to "Wait, what? Where does it say we're an alternative to relational databases, or MySQL?" you responded: "The whole article is a comparison to relational databases."

Sure sounded to me like that was an answer to his question, and you were saying the author is saying CouchDB is an alternative to relational databases by saying it is a comparison.

I obviously wasn't trying to define what a comparison is, I was just giving a reason FOR this particular comparison.

You need not be so snide, sir.

janl · on Dec 10, 2008

The whole article is the first chapter of a book scheduled to be published in summer 2009. We are writing about the future here :)

rantfoil · on Dec 10, 2008

Does anyone know of large production websites that use CouchDB extensively?

nslater · on Dec 10, 2008

http://wiki.apache.org/couchdb/CouchDB_in_the_wild

rantfoil · on Dec 10, 2008

Hm, so I guess the answer is mostly no. =/

The biggest site on there is wego.com. Great site, and I think they open sourced much of their Rails CouchDB integration. But only 30k unique visits per month.

nslater · on Dec 10, 2008

It's still very early days. If you want to do a very large production setup of CouchDB, you will be sitting on the bleeding edge of community experience. That's not for everyone, of course.

newt0311 · on Dec 10, 2008

One point from the article: RDBMSs do not represent data in the same way as in the real world.

This is a horrible misconception. In fact, CouchDB et al. store data like we do in the real world, but they certainly do not represent data as it is in the real world. That crown goes to RDBMSs. Suppose we have two physical records of a company's logo. Then the company logo changes but only one of the records is updated. That kind of error took place because the physical storage model (two places) did not correspond to the data model (one entity). That is where RDBMSs shine and CouchDB fails. RDBMSs allow data relations to be expressed and then have the DB enforce these relations. This incurs some overhead (and, I suspect, makes most web programmers think that RDBMSs are "old school"), but in most cases this overhead is paid for several times over in future data integrity.
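The logo example can be spelled out in a few lines of Python. Plain dicts stand in for stored records, and every name and value below is made up for illustration.

```python
# Denormalized: each record carries its own copy of the logo (hypothetical data).
letterhead = {"company": "Acme", "logo": "logo-v1.png"}
brochure = {"company": "Acme", "logo": "logo-v1.png"}

# The logo changes, but only one of the two copies is updated.
letterhead["logo"] = "logo-v2.png"
copies_disagree = letterhead["logo"] != brochure["logo"]

# Normalized: the logo is stored once and the records refer to it by key.
companies = {"acme": {"name": "Acme", "logo": "logo-v1.png"}}
letterhead_n = {"company_id": "acme"}
brochure_n = {"company_id": "acme"}

companies["acme"]["logo"] = "logo-v2.png"  # a single update


def logo_for(record: dict) -> str:
    """Resolve a record's logo through its company reference."""
    return companies[record["company_id"]]["logo"]


print(copies_disagree)
print(logo_for(letterhead_n) == logo_for(brochure_n))
```

With two copies the records can drift apart after a partial update; with one copy and two references they cannot.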

jchrisa · on Dec 10, 2008

Oops, commented when I meant to reply. Here goes:

CouchDB is different from a relational DB in so many ways it's almost silly to be doing the comparison. However, many people use and understand SQL, so we must show them what differences to expect.

Your argument about normalization being more realistic has its place. CouchDB models documents, which by necessity are somewhat complete, even when taken from their original context. Document modeling is radically different from relational modeling.

In document modeling, more emphasis is placed on the document life-cycle as opposed to inter-record normalization (relations). And in document modeling the client has a greater responsibility for saving data that is useful to the user, rather than asking the relational store to reconstruct the objects.

I think there are a lot of uses for CouchDB alongside relational DBs, but I'm especially excited to see what kinds of crazy things people figure out how to do with p2p offline replication.

bayareaguy · on Dec 10, 2008

> I'm especially excited to see what kinds of crazy things people figure out how to do with p2p offline replication

http://en.wikipedia.org/wiki/IBM_Lotus_Notes