I'm struggling to understand the use case. The article offers "keys in an ordered map" or "binary search" - but the idea of a binary search is that the order of the elements is inherently meaningful, like the floors of a building, weights, rankings, etc. GUIDs are meaningless by design, so there must be some underlying property that you're actually interested in searching. And in that case, why not use that property as the key to the collection?
And usually the database puts them into a btree internally, because that's how tables are stored.
The moment you have any kind of load on such a table, your performance goes to hell. This is because the inserts are going to happen all over the place, and the way tables are stored definitely prefers appends.
So a common piece of wisdom is to have an auto-increment primary key, and the UUID merely indexed! Ugh, what a workaround.
A better way is to use ULIDs, which are like UUIDs but with the high bits encoding a timestamp. This turns inserts into, approximately, appends. Much nicer!
https://github.com/ulid/spec is fairly recent, but the 'trick' has been used for years. I've built gazillion-row databases where all the data sources have, independently, used this same high-bits-are-timestamp trick. So even though I didn't see it discussed and called ULID until fairly recently, it's a very old and well-known trick in some circles.
Once you have ULIDs, and you know you have ULIDs, a lot of database tasks become easier because the ID encodes a creation date. I've found myself adding AND id BETWEEN clauses to queries, using the IDs as a partition key, using them to DELETE old data, and doing various other admin tasks that the original creators of the table never thought about.
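The high-bits-are-timestamp trick described above can be sketched in a few lines. This is not the exact ULID spec (which adds a Crockford base32 text encoding); it's just the binary layout the spec uses, a 48-bit millisecond timestamp over 80 random bits:

```python
import os
import time

def make_ulid_like_id(ts_ms=None):
    """Build a 128-bit ID: 48 high bits of millisecond timestamp,
    80 low bits of randomness (the layout the ULID spec uses,
    minus its Crockford base32 text encoding)."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    return (ts_ms << 80) | rand

# IDs minted later compare greater, so a plain sort (or a btree)
# keeps rows in rough creation order:
a = make_ulid_like_id(ts_ms=1_600_000_000_000)
b = make_ulid_like_id(ts_ms=1_600_000_000_001)
assert a < b

# The creation time can be recovered from the ID itself, which is
# what makes "AND id BETWEEN lo AND hi" range queries possible:
assert a >> 80 == 1_600_000_000_000
```

Because the randomness sits below the timestamp, two IDs minted in the same millisecond still almost never collide, while inserts cluster at the right-hand edge of the index.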
thank you. That's the first decent rational explanation I've heard for why anyone would sort anything by UUID.
I use UUIDs for keys, and always sort my results by another field (any other field). If necessary, I add an index for that other field.
I guess I'll have to change this strategy if I get to the kind of scale where the append time matters. I hope I get to that scale, it seems like a good problem to have :)
I also have another metric asstonne of problems I'll need to solve at that scale. But then, premature optimisation is the devil, so I'm happy I'm not dealing with any of that until I need to :)
Why do you use UUIDs for keys? They can be worth using if you have many data sources that you need to aggregate, but otherwise a 64-bit auto-increment key is probably the best solution; 64 bits is enough for almost everything.
1. I have parallel processes that create records on the same table at the same time.
2. In some cases, I want to create the key in code and send it to the database, rather than fetching the key after the insert.
3. I want to know that if I mess up and send a document_id where I meant to send a user_id and vice versa, it's going to report "no match" rather than create a security breach by exposing one user's records to another user.
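Points 2 and 3 above can be sketched with plain dictionaries standing in for tables (the `users`/`documents` names are just illustrative, not from any real schema):

```python
import uuid

# Toy stand-ins for two tables keyed by client-generated UUIDs.
users = {}
documents = {}

# Point 2: the key is minted in application code, so we can build the
# row (and anything referencing it) before a single INSERT runs,
# instead of fetching the generated key back after the insert.
user_id = uuid.uuid4()
users[user_id] = {"name": "alice"}

document_id = uuid.uuid4()
documents[document_id] = {"owner": user_id}

# Point 3: sending a document_id where a user_id belongs can't
# accidentally match some other user's row; it simply finds nothing.
assert users.get(document_id) is None
```

With small sequential integer keys, that same mix-up would very likely match *some* row in the wrong table, which is exactly the security-breach scenario being avoided.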
Another common reason is that on distributed databases you often don't want your data stored in the same chunk/shard as the last one. You want to spread out your writes.
This of course is not good design if your access patterns involve a lot of sequential reads, so you have to be aware of what you are going to do.
Please see my reply above where I disagree, and please watch for willvarfar's response where he's likely to point out that for his case I'm probably wrong.
I've got experience with 'large' datasets (it's very relative! Mine have been tiddlers in industry terms!) and I'm happy to pass on anything that may help. Even if you don't scale up, smaller data is likely to improve response times, all else being equal. Anyway, feel free to ask.
I've tried a few different key strategies. A UUID per row per table seems to work best. Where I define "best" as "suiting my data better".
I get the thing about deriving keys from data, but it always seems messy to me. I've seen too many weirdnesses in "real" data to be comfortable that there will not be any collisions. (I used to play around with a database of every person and their address in the UK. There were several cases of people having the same name, the same birthdate, and the same street name, but different towns. Not to mention the hundreds of cases of twins with only a middle initial to tell them apart.)
UUIDs might be clunky and a little slower, but at least I will never get a collision, and never cause a problem if I use one key instead of another. I'm totally happy to spend more time optimising my database in exchange for those guarantees.
> Where I define "best" as "suiting my data better".
Can't argue! You're in the best place to decide.
> UUIDs [...] but at least I will never get a collision
Hmm, please do some googling for that. I've heard complaints (on the web, not in person) of this happening surprisingly easily. This may not apply for you though. Personally I only trust what I understand, or what I have written guarantees for. Anyway FYI.
ULIDs, and all other attempts to solve this at the taxonomy level, either force a concept onto your data (what if date is irrelevant?) or give you false assumptions on distributed systems (just because you can get IDs up to a certain date, does that mean all the shards have had a chance to sync back?).
IMO most data in DBs has a natural key. Why add a fake key? If it's to enforce uniqueness, then that uniqueness is spurious, because a GUID has no meaning to the data to which it's attached, e.g. you could have the same row 10000000 times but falsely uniquified with a different GUID.
The only exceptions I've encountered to my 'natural key' claim are addresses, which are a sod to canonicalise (in fact seemingly impossible; I'd welcome any pointers).
Also UUIDs are bloody big keys, so fattening up tables and impairing disk/memory bandwidth. This matters if you have gazillion-row DBs, surely? I dealt with a near-billion row DB and shaving a few bytes off a row would matter to us bigly.
For the sake of argument I'll assert that you should never use UUIDs as keys, but please feel free to point out where I'm wrong.
> This is because the inserts are going to happen all over the place, and the way tables are stored definitely prefers appends.
This is opposite to my understanding, in that an append to a table causes an insertion hotspot, which can be a bottleneck for insertion-heavy work. If it's inserting at random places there's no hotspot. I suppose updates scattered all over the btree could cause much more random disk activity though; do you mean that?
> So a common word of wisdom is to have an auto-increment primary key, and the uuid indexed! Ugh what a work around.
Oh god no! Quite agree.
[stuff about ULIDs] I'd really like to understand what you're doing overall. It just sounds wrong, though I'm sure that's down to me not understanding your use case.
It's easy to create plausible scenarios where things have UUID IDs.
For example, your classic customer database. Each customer has an ID.
It's hard to find a natural key for something as simple as a customer...
If you have a very narrow customer base then perhaps the company has a VAT number or something that seems natural. But get an international customer, or simply have trouble getting people to enter those kinds of details, and it gets unworkable fast.
And then I suppose the 'trading name' is another natural identifier, except they can change.
Hell, we've all experienced systems that have used the email address as a natural key, and all the resulting pain as those change.
Just today, in fact, I had trouble with a jetbrains subscription when I used my company's new domain name to register something and it created a new account for me. Ugh. Can't work out how to link them, just because I changed the email address on the pay page instead of the account page. Jetbrains, we expect better of you!
So, anyway, it's normal to go with an auto-increment.
But, then, the modern web site you've just built exposes that id to the customer, and you've just leaked how many customers you have!
So modern best practice is to only expose UUIDs in public APIs.
And I'm saying, go with ULIDs instead. Your database will thank you.
I'm going to clarify something: I said most data has a natural key, then when presented with this and other examples I seem to have backed away from that. I hope this makes sense...
In the real world of atoms etc. there's no natural abstraction, including unique IDs (keys). These are all creations of ours.
In my line(s) of work, someone's invented and allocated a key for most items by the time it's reached me - people have SSN/NI numbers, planes will have some identifier, ships have IMOs, machine parts have unique part numbers (when combined with their suppliers' ID), geographical areas have postcodes/zipcodes, and all that.
So for me, someone's done that work already; a 'natural key' is available and I can rely on it. It's always been there, except for a few things such as addresses.
For you, willvarfar, customer numbers have to be created. So I made a wide assertion without thinking it through based just on my experience.
> IMO most data in DBs has a natural key. Why add a fake key?
In our application users generate declarations. On the declarations users need to enter invoice numbers (with amount and date). So a natural key would be a compound key consisting of declaration id and invoice number (which has to be unique per declaration anyway).
Declarations also contain goods items, and they need to be linked to an invoice. No problem there, just make a foreign key pointing to the invoice on the declaration.
However, users sometimes type the wrong thing. They finish up a declaration and, while double-checking, notice that they typed the wrong invoice number. So they want to change the invoice number. With the natural key you then need to update all the goods items linked to it as well.
Much easier to just have a "fake" key (autoinc or otherwise) and use that to link the goods items and invoices.
Similar issues crop up all over our application, so unless there's a really good reason we avoid natural keys.
That said, we don't have a single UUID key. We do not have fully disconnected clients, so we use sequence generators for "non-leaf" keys such as declaration IDs.
Thanks for a fine example of the discrepancy between my world (the data being canonical, normalised and all-round platonically ideal) and the real world ("but I never meant to do thaaaaat!"). Yup, point taken.
Time-sortable UUIDs can be useful when you have many different data sources that you aggregate into one table, and these data sources all need to assign IDs to the events. With a 128-bit ID whose upper bits are a timestamp and whose lower bits are random, the risk of collisions is low, and because the first bits are a timestamp, inserts are quite often done in sequential order, which is good for btrees.
There are other solutions for this problem, and I do not think I would pick UUIDs or ULIDs myself, but it is a decent solution for this specific problem and does not waste much space compared to the other solutions.
Then perhaps store the time as a separate field? Makes them indexable and trivially extracted.
> you ensure that inserts are quite often done in sequential order, which is good for btrees.
I've just explained why that may not be the case under high insertion rates. This from https://www.sqlpassion.at/archive/2014/04/15/an-ever-increas... "With the best practice of an ever-increasing Clustered Key value you have a single hotspot at the end of your Clustered Key. The smaller your records are, the more contention you are introducing here. How can you solve that problem? Easy: spread your INSERT statements across the whole B-Tree structure of the Clustered Index."
@willvarfar:
> Its hard to find a natural key for something as simple as a customer...
OK, completely agree, and you go further - I can't dispute any of that. The only thing I'd say is perhaps don't use a UUID. A 4-byte int will give you plenty of space and uses 1/4 of the memory of a GUID. But I may just be splitting hairs. Point accepted!
@heavenlyblue:
> A billion rows require a gigabyte per each extra byte in a row. That's nothing.
I can do the maths, thanks. But if you need real-time responses out of a single DB machine serving many users, some heavy, then you have to cache 98% to 100% inside the RAM. Each gig of table read may be an extra 1-5 seconds, depending on disks & setup. Contention increases as more users pile in. Contention increases horribly if queries are greatly slowed by needing to read off the disk. I was fighting for that space to get, and keep, responsiveness.
"buy more machines" is a good response but only after you've optimised to the point of minimal returns. We were a small company, we had to make it work.
I have 200-billion-row MySQL DBs and have measured the impact of the UUID and ULID keys they're forced to deal with. In my high-insert-rate scenarios, the extra bytes of these long keys equate to losing about 2000 inserts/sec, just in all the network overhead, compared to if we'd had a 64-bit auto-inc.
And then, for storage, I'm using TokuDB (when you have tables this big there really isn't any other viable alternative, although MyRocks is newer and I haven't evaluated it). TokuDB compresses things really well, and the result is in the tens of terabytes compressed. And the problem with UUIDs and the random part of ULIDs? They are, by definition, incompressible! Again, a lot of that storage footprint is because of the damn things.
Sometimes all you need is a dictionary, where you take a key and access a value. Both hash tables and binary-searchable trees are valid ways to accomplish this, with different speed tradeoffs. Binary search works just fine for this even when the keys are completely meaningless and random.
Databases especially have tons of operations where rows have unique meaningless keys that are used to represent relationships. And at the same time a hash table would mean going offline for hours when it resizes, so they use a tree and apply binary search over it to access rows.
> Both hash tables and binary-searchable trees are valid ways to accomplish this, with different speed tradeoffs.
Isn't a hashtable faster than a binary tree for basically all purposes? The only reason to use a tree is if you want to keep the data in order, but the only reason to do that is if the data has a meaningful order!
> Isn't a hashtable faster than a binary tree for basically all purposes?
Hash tables are heavily used in query optimization, because there it's great to build a throwaway hash table no one else is using. For a shared index, it all comes down to resizing the underlying buckets.
If you're doubling each time, that cost amortizes across updates; hence the claim there's a constant cost per update.
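That amortization claim is easy to verify with a toy model: count how many element copies a doubling table performs over n inserts (the function name here is just illustrative).

```python
def copies_for_doubling_growth(n):
    """Count element copies performed while inserting n items into a
    table that doubles its capacity whenever it fills up."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n):
        if size == capacity:
            copies += size      # rehash/copy every existing element
            capacity *= 2
        size += 1
    return copies

# Total copy work stays under 2 copies per insert no matter how large
# n gets (the copies sum to 1 + 2 + 4 + ... < 2n): that's the
# amortized-constant-cost-per-update claim.
for n in (10, 1_000, 100_000):
    assert copies_for_doubling_growth(n) < 2 * n
```

The catch, as described next, is that the cost is constant only *on average*; the individual resize that copies all n elements at once is exactly the pause you have to hide or do live.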
But while resizing, you have a choice: make the index inaccessible while it's updating, or fuck it, we'll do it live.
To do it live, you need to implement a fallback strategy that tries both copies. If the new buckets fill up before the first set has finished transferring, okay, we'll set up a third, but now the job has become that much more disk-intensive.
And that's the real problem: on a server, you already have all kinds of processes and jobs running. Extra processes that can chew up hard-to-predict amounts of disk and CPU are more potential causes of instability. Plus, all these jobs need to be recoverable in case of a crash, etc.
Tree structures, OTOH, give you very predictable performance because you're only dealing with log N points on a tree. (B-trees just let you use a very big base for that log.)
A hash table doesn't always have the best constant factors, and often wastes more space.
Also, when you're in a situation where you need GUIDs to guarantee unique keys and you use a GUID variant with timestamps, many access patterns will go roughly in order, and will be far faster on a tree than on a hash table, even though the GUID itself is meaningless.
Binary search is a cheap way of getting an O(log n) .contains() without the additional logic of proper AVL/RedBlack trees (or hashmaps). If you need to check whether an element is a member of a set many times, the once-off O(n log n) setup cost is no worse than any other trees (it is worse than hashmaps but those have other downsides).
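That cheap `.contains()` can be sketched with Python's stdlib `bisect` module: sort once up front, then every membership check is an O(log n) binary search, and the keys themselves can be perfectly meaningless.

```python
import bisect

def make_sorted_index(items):
    """One-off O(n log n) setup: sort once, then every membership
    check is an O(log n) binary search over the sorted list."""
    return sorted(items)

def contains(index, key):
    # bisect_left finds where key *would* go; it's a member only if
    # the element already sitting there equals the key.
    i = bisect.bisect_left(index, key)
    return i < len(index) and index[i] == key

# The keys can be completely meaningless (random hex strings, GUIDs);
# binary search only needs *some* consistent ordering, not a
# meaningful one.
index = make_sorted_index(["c0ffee", "deadbeef", "0badf00d"])
assert contains(index, "deadbeef")
assert not contains(index, "facade")
```

Compared with an AVL/red-black tree there's no rebalancing logic at all, at the cost of not supporting cheap inserts after setup.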
Not strictly related to this article or your question, but generating ordered UUIDs can sometimes bring some performance improvements (at least in Postgres) on index reads/writes [1].
Similarly, for SQL Server, sequential UUIDs can reduce fragmentation, which can save some IO, storage space, and memory load, particularly if the UUID is used as a table's clustering key (most tables cluster by their PKs by default).
Unfortunately, without a bit of a hack, NEWSEQUENTIALID() can only be used in T-SQL as a default constraint, which can be an irritation.
And you are often better off using something else than a UUID for a table's clustering key, to help common range queries and reduce clustering key size, even if it isn't the PK or otherwise particularly unique.
But also, their ordering doesn't need to be meaningful for data structures to work. It doesn't matter which of the 4-5 orderings showcased there you pick for your GUIDs; the binary search will work more or less the same.
None of these orderings really have any meaning either.
Is there another use case here that I'm missing?