Badger: A fast key-value store written purely in Golang (2017)

mrjn · on Nov 10, 2018

(Author of Badger here)

First of all, thanks to the growing Go community for using Badger. Within a short period of time since the original announcement [-1], Badger has become very popular.

A lot has changed since the original post. Despite being tagged as a non-goal at the time of release, Badger now runs concurrent ACID transactions providing serializable snapshot isolation guarantees.

In fact, Badger is being used to serve Petabytes of data (yes, PB, not TB). On my long TODO list to write a blog post about it. It has stood the test of time, being directly used by the community and Dgraph [0]. We have put it through many different crash testing scenarios, including ALICE[1], and run Jepsen style bank transaction tests every night for 8h [2].

There're some common questions around benchmarks on our blog posts. They were done before Badger had transactions and are somewhat old by now. But, Badger's read-write performance is still spectacular. In terms of write throughput, it outperforms RocksDB (called from Go), LevelDB BoltDB, and in terms of read performance, it goes neck-to-neck against B+ tree based DBs, which typically perform better than LSM trees. Recently, there was a benchmark by Ethereum 2.0 folks, which I ran [3]. You can see the ASCIInema here [4].

There's an open bounty of $1337 on Badger to help us find scenarios where Badger could lose data [5]. So, if you got ideas, throw them in!

Thanks again for using Badger!

[-1]: Well, the original link here on HN [0]: https://dgraph.io [1]: https://blog.dgraph.io/post/alice/ [2]: https://github.com/dgraph-io/badger/commit/c10276c9d3b0c4274... [3]: https://github.com/rawfalafel/db-benchmarks/pull/1 [4]: https://asciinema.org/a/Qgyf8C50YRbtH7WpixSaBKDUQ [5]: https://github.com/dgraph-io/badger/issues/601

majidazimi · on Nov 10, 2018

I've read the WisKey paper.

1. Wasn't it simpler to just store the Key/Pointer(Offset) in RocksDB and put the values in the value log? Then the whole database just becomes a simple wrapper on top of RockDB(or whatever K/V database) with bunch of files as value log.

2. Why 16 bytes for value pointer? 8 bytes is sufficient.

3. If LSM size is around 5GB, screw LSM. I would rather use a simple immutable HashTable which gives you consistent snapshot out of the box for persisting to disk. Every 15 minutes launch a thread to serialize the index to disk.

Am I missing something or there is some more detail missing? I mean the reasoning to write a whole database from scratch isn't convincing.

tyingq · on Nov 10, 2018

1. They start off with why they don't want to use Cgo. Not a go expert, so I don't have an opinion on whether that makes sense or not. If the goal is to not use Cgo, they can't use RocksDB.

2. It appears to be 12 bytes. ID, length, offset, all as 4 byte uint32 values: https://github.com/dgraph-io/badger/blob/5242a997f5101ee00c4...

mutatio · on Nov 10, 2018

2. I still believe the 16 bytes stands on x64 (8 byte aligned, thus 4 bytes padding).

majidazimi · on Nov 10, 2018

Then any other database written in pure Go would be sufficient to keep the index.

tyingq · on Nov 10, 2018

Sure. Searching a bit, they did benchmark against BoltDB, but didn't try what you're suggesting. The benchmark I found is also prior to them adding transactions to Badger.

Edit: Actually, the benchmarks turn off sync writes, which makes them silly to me. If you're okay with losing data, you probably wouldn't have chosen a disk based kv store. https://github.com/dgraph-io/badger-bench/blob/master/rw-ben...

The default install has sync writes turned on, so why benchmark it differently?

mrjn · on Nov 10, 2018

As mentioned in my top comment, we have run benchmarks again. SyncWrites on or off, Badger outperforms BoltDB. There's a nice ASCIIrema link that you can see, which is a recording of me running Ethereum 2.0 store benchmarks.

mrjn · on Nov 10, 2018

1. (author) That's how WiscKey did it. But, they omitted a lot of details around how an online value log garbage collection would work. Offline GC doesn't work for most users. On top of that, Badger supports 3-dimensional data access, i.e. key-value-versions. The version axis is supported natively, not overlaid on top of an internal versioning.

Supporting versioning while also doing value log GC is a very tough challenge, something that has taken us many iterations to get right. And there're still improvements that we can make, and users ask us for.

2. Skipping, another comment tackles that.

3. Using a hashtable would not give you sorted iterations, something that many users use very heavily. A 15 min thread idea would work a lot worse than LSM tree, just by definition.

mrjn · on Nov 10, 2018

Follow on 3. Or, at the very least, it is no better than using skiplists in memtables for key insertion (LSM design), and then compacting them away in the tree levels.

gregwebs · on Nov 10, 2018

They use go for their database and didn't want to use cgo due to performance and debugging concerns, so they wrote their own storage engine in go. There is some experimental work going on at PingCAP (makers of TiDB) to bring this concept to RocksDB.

mrjn · on Nov 10, 2018

I've been thinking of writing a paper about Badger in general and in particular, our experiments with perfecting the value log GC while keeping the 3-dimensional access provided by Badger (RocksDB doesn't provide direct iteration over versions).

gregwebs · on Nov 11, 2018

More information on practical implementation would definitely be helpful. Maybe give a talk if that is a lower bar? The RocksDB conference just happened [1], but there might be a different appropriate meetup. TiDB has versions in its RocksDB keys [2]

[1] https://www.meetup.com/RocksDB/photos/29331158/475279758/ [2] https://pingcap.com/blog/2016-11-15-Travelling-Back-in-Time-...

mrjn · on Nov 11, 2018

Ah.. didn't know about the effort at TiDB to put this in RocksDB. They surely can learn from our efforts.

maxpert · on Nov 10, 2018

Have personally made contributions to the project once or twice. I think its a fairly young project so far being used part of bigger graph database. I went back to BoltDB for it’s stability and decent performance. I think everybody has his own scale for measuring fast vs stable. It was fast but not stable enough for my scale.

mrjn · on Nov 10, 2018

Based on the commit you made, looks like there was a build breakage on a 32-bit system. We've resolved all known issues with 32-bit systems. Let us know if you encounter more.

trulyrandom · on Nov 10, 2018

What kind of stability issues did you run into?

auxym · on Nov 10, 2018

Can someone clarify for me what is the use case of these key-value stores getting posted around HN recently?

tschellenbach · on Nov 10, 2018

RocksDB is an awesome building block for building your own database. Now usually you don't need that. But if you have a ton of traffic, moving 1 abstraction level lower gives you a lot of room for optimization. Stream's in-house db is for instance using RocksDB. https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...

Mixpanel, DataDog, Linkedin and Pinterest also created their own DB tech. Linkedin and Pinterest use RocksDB. Instagram lately started doing something similar by using RocksDB as the storage layer for their Cassandra cluster.

tyingq · on Nov 10, 2018

They are often used as the base storage layer for more complex things.

CockroachDB has RocksDB at the bottom: https://www.cockroachlabs.com/docs/stable/architecture/stora...

Etcd uses a fork of BoltDB: https://github.com/etcd-io/bbolt

billsmithaustin · on Nov 10, 2018

Indeed uses an LSM tree based key value store for its data about job postings: https://engineering.indeedblog.com/blog/2013/10/serving-over...

folex · on Nov 10, 2018

Is it possible to make Badger completely in-memory? I.e., disable persistence.

mrjn · on Nov 10, 2018

You can keep both the value log and LSM tree in memory, exposed by badger.Options. That'd still give you persistence and crash-resilience while allowing your reads to benefit from direct memory access. If persistence is a non-goal, yeah, tmpfs would work.

If you use it as an in-memory DB, then I'd recommend using the LSM options, instead of Default options.

shazow · on Nov 10, 2018

Related issue discussion: https://github.com/dgraph-io/badger/issues/377

captn3m0 · on Nov 10, 2018

How about using tmpfs?

amelius · on Nov 10, 2018

You might as well use a simple hash-table then.

It seems that most of the design effort of this project was focused on fast disk access.

grumpydba · on Nov 10, 2018

Fast disk access on linux means async io+direct io, which badger does not use.

Files are essentially mmap'd.

Regarding the hash table suggestion, you would lose compression/compaction and indexing/prefix scans.

mrjn · on Nov 10, 2018

Mmaping is optional. Badger doesn't require it, it gives you that option. Async IO in Go has been debated vigorously over the years, but there're little benefits in building that library. Because Goroutines do an equivalent job. Which is what we're doing, random reads (for value log access) spread over many goroutines.

grumpydba · on Nov 10, 2018

My bad, I just glanced at the code. Are you using O_DIRECT?

Goroutines do an equivalent job but write is synchronous and thus blocks one thread of the go scheduler, as far as I can tell.

Also the benefit of the async io API is that it allows sending multiple requests in ones call. Syscalls in go do have an overhead which can be mitigated this way, I guess.

shirakawasuna · on Nov 10, 2018

You'd also lose notions of transactions with a simple hash map - it can be nice to have some built-in locking / ACID-ish features, even in-memory (I use SQLite with :memory: for just that purpose sometimes).

amelius · on Nov 10, 2018

> Files are essentially mmap'd.

Yes, but using LSM tree has a performance penalty over plain hash tables.

Yes, you would lose compression. But you could still have prefix scans by replacing the hash table by e.g. a red/black tree.

herf · on Nov 10, 2018

Would definitely like to see the GC writeup. Are there patterns where replacing or expanding values use lots of storage or where things slow down waiting for GC?

mrjn · on Nov 10, 2018

Lack of value log GC won't affect performance directly, but every write would write to the value log GC, hence increasing its size. So, a GC is needed to deal with the growing size.