
Definitely, I've run into this problem many times. UUIDs are bulky in database terms: usually larger than any other column in a typical table, and sometimes larger than the rest of the row once you include index overhead, which creates cache pressure. Logical columns are commonly stored physically as vectors (a vector of UUIDs in this case) expressly so the representation can be compressed, which saves both disk cache and CPU cache. For queries, the benefit of scanning compressed vectors is that it reduces the average number of page faults per query, which is one of the major bottlenecks in databases.
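As a rough illustration of how key size adds up, here is a simplified back-of-the-envelope model (my own sketch, not any particular engine; it assumes the primary key is duplicated into every secondary index entry, as in clustered-index designs like InnoDB):

```python
import struct
import uuid

# Binary sizes: a UUID is 16 bytes; a 64-bit integer key is 8.
assert len(uuid.uuid4().bytes) == 16
assert len(struct.pack("<Q", 1)) == 8

def key_bytes(n_rows, n_secondary_indexes, key_size):
    """Bytes spent on primary-key copies: one in the row plus one
    per secondary index entry (clustered-index style)."""
    return n_rows * (1 + n_secondary_indexes) * key_size

# 100M rows, 3 secondary indexes: switching from 16-byte to 8-byte
# keys saves 3.2 GB of key storage alone, before cache effects.
saved = key_bytes(100_000_000, 3, 16) - key_bytes(100_000_000, 3, 8)
```

The absolute numbers depend heavily on the engine, but the ratio (half the key bytes everywhere the key appears) is what drives the cache-pressure argument.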

Also, some data models (think graphs) tend to be not much more than giant collections of primary keys. Using the smallest primary key data type that will satisfy the requirements of the data model is a standard performance optimization in databases. It is not uncommon for the UUID that the user sees to be derived from a stored primary key that is a 32-bit or 64-bit integer.



UUIDs are bulky if you store them in text form, but in binary form they are only 128 bits.

The main feature of a UUID is that it allows distributed generation. 32-bit or 64-bit integer keys are almost always sequential numbers. The sequential nature allows efficient page filling and index creation, but the contention involved in generating a sequence grows rapidly with scale.

So while a 128-bit UUID is larger than a 64-bit integer, this version allows for the bulk of the benefits of sequential integers while reducing the biggest drawback: contention at the point of creation.
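The version being described is presumably a time-ordered UUID along the lines of UUIDv7. A minimal sketch of the trick, assuming the RFC 9562 v7 layout (48-bit millisecond timestamp up front, random bits after; not a compliant implementation):

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """Time-ordered UUID sketch: 48-bit millisecond Unix timestamp
    in the high bits, random bits below, version/variant as in UUIDv7."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    value = (ms << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # variant = RFC
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, IDs created later compare greater (at millisecond granularity), so B-tree inserts land near the right edge of the index instead of at random leaf pages.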


I was assuming binary format. 128 bits is a pretty heavy data type in many data models, with measurable performance impact versus something smaller.

You also do not need 128 bits to decentralize unique key generation, even though it is quite reasonable if you design your keys well. Many massive-scale systems do it with 64-bit unique keys.

A subtle point that you may be overlooking is that while large-scale distributed databases, including all the ones I work on, export globally unique 128-bit keys, in most systems I’ve worked with they are internally represented and stored as 64 bits or fewer even if the key space is much larger than 64 bits. There are many techniques for key space elision and inference used inside distributed databases to save space. The 128-bit value is only materialized when sending it over the wire to an external client system; you don’t need to store it.

Literally storing a primary key in a distributed system as a 128-bit value is mostly downside with few benefits. For small systems the performance and scaling cost may not matter much, but in very large systems it matters a lot. It can quite literally cost you millions of dollars per year.


You can generate a number of any size in a distributed fashion. The only difference is that 128 bits gives you enough key space that collisions are practically impossible when generating randomly.
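The "practically impossible" claim can be made concrete with the standard birthday-bound approximation (the workload of one billion IDs below is a hypothetical number of my choosing):

```python
import math

def collision_probability(n_ids, free_bits):
    """Birthday bound: p is approximately 1 - exp(-n^2 / 2^(bits+1))."""
    return 1.0 - math.exp(-(n_ids ** 2) / 2.0 ** (free_bits + 1))

# One billion randomly generated IDs:
p_uuid4 = collision_probability(10 ** 9, 122)  # UUIDv4: 122 random bits
p_64bit = collision_probability(10 ** 9, 64)   # random 64-bit values
# p_uuid4 is astronomically small; p_64bit is already a few percent,
# which is why random generation needs the larger key space.
```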

Unless you need to be completely disconnected, a little coordination can drastically improve things. At past companies, I used a simple counter of 64-bit integers, and each distributed process would claim a billion-number range to use for IDs. Fast, efficient, compatible with everything, naturally ordered, and guaranteed never to collide.
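A sketch of that scheme (names are mine; the in-memory coordinator stands in for whatever actually holds the counter, e.g. a single DB row updated atomically):

```python
import threading

RANGE = 1_000_000_000  # each process claims a billion IDs at a time

class RangeAllocator:
    """Central coordinator: hands out disjoint ID ranges."""
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def claim(self, size=RANGE):
        with self._lock:
            start = self._next
            self._next += size
            return start, start + size

class IdGenerator:
    """Per-process generator: contacts the coordinator once per range,
    then hands out IDs locally with zero contention."""
    def __init__(self, allocator):
        self._alloc = allocator
        self._lo, self._hi = allocator.claim()

    def next_id(self):
        if self._lo >= self._hi:  # range exhausted: claim a new one
            self._lo, self._hi = self._alloc.claim()
        nid = self._lo
        self._lo += 1
        return nid
```

Two processes get disjoint ranges, so collisions are impossible by construction, and the coordinator is only touched once per billion IDs.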


And, to be clear, those benefits come from the placement of new records in the DB.

If a UUID is completely random, it can be inserted anywhere, which may require the DB to reshuffle records and pages to make room for the new record.

Having a sequential element in the UUID makes it much easier to maintain an index where each record is inserted at the end. Which, like you said, makes page usage more efficient and decreases the amount of work the DB has to do on insertion.

All of this compounds if you have a DB with frequent writes, a lot of indexes, or both.


Twitter snowflakes are unsigned 64-bit IDs designed to be generated in a distributed fashion.
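The snowflake layout packs a millisecond timestamp, a worker ID, and a per-worker sequence into one 64-bit value, so IDs sort roughly by creation time. A sketch of the packing, using the bit widths from Twitter's published design:

```python
EPOCH_MS = 1288834974657  # Twitter's custom epoch (November 2010)

def make_snowflake(ts_ms: int, worker_id: int, sequence: int) -> int:
    """Pack a snowflake ID: 41-bit millisecond timestamp since the
    custom epoch, 10-bit worker ID, 12-bit per-millisecond sequence.
    The top bit stays 0, so the value is positive as a signed int64."""
    assert 0 <= worker_id < 1024 and 0 <= sequence < 4096
    return ((ts_ms - EPOCH_MS) << 22) | (worker_id << 12) | sequence
```

Each worker generates IDs locally without talking to anyone; the coordination cost is in assigning unique worker IDs, which is where the separate service mentioned below comes in.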


But they require a separate globally distributed, highly available clustered service just for ID generation. That makes no sense from a cost or complexity perspective unless you’re Twitter-scale.



