
Definitely, I've run into this problem many times. UUIDs are bulky in database terms: usually larger than any other column in a typical table, and sometimes larger than the rest of the row once you include index overhead, which creates cache pressure. Logical columns are commonly stored physically as vectors (a vector of UUIDs in this case) expressly so the representation can be compressed, which saves both disk cache and CPU cache. For queries, the benefit of scanning compressed vectors is that it reduces the average number of page faults per query, which is one of the major bottlenecks in databases.
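As a rough illustration of how key size adds up, here is a simplified back-of-the-envelope model (my own sketch, not any particular engine; it assumes the primary key is duplicated into every secondary index entry, as in clustered-index designs like InnoDB):

```python
import struct
import uuid

# Binary sizes: a UUID is 16 bytes; a 64-bit integer key is 8.
assert len(uuid.uuid4().bytes) == 16
assert len(struct.pack("<Q", 1)) == 8

def key_bytes(n_rows, n_secondary_indexes, key_size):
    """Bytes spent on primary-key copies: one in the row plus one
    per secondary index entry (clustered-index style)."""
    return n_rows * (1 + n_secondary_indexes) * key_size

# 100M rows, 3 secondary indexes: switching from 16-byte to 8-byte
# keys saves 3.2 GB of key storage alone, before cache effects.
saved = key_bytes(100_000_000, 3, 16) - key_bytes(100_000_000, 3, 8)
```

The absolute numbers depend heavily on the engine, but the ratio (half the key bytes everywhere the key appears) is what drives the cache-pressure argument.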

Also, some data models (think graphs) tend to be not much more than giant collections of primary keys. Using the smallest primary key data type that will satisfy the requirements of the data model is a standard performance optimization in databases. It is not uncommon for the UUID that the user sees to be derived from a stored primary key that is a 32-bit or 64-bit integer.



UUIDs are bulky if you store them in text form, but in binary form they are only 128 bits.

The main feature of a UUID is that it allows distributed generation. 32-bit or 64-bit integer keys are almost always sequential numbers. The sequential nature allows efficient page filling and index creation, but the contention involved in generating a sequence grows rapidly with scale.

So while a 128-bit UUID is larger than a 64-bit integer, this version allows for the bulk of the benefits of sequential integers while reducing the biggest drawback: contention at the point of creation.
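The version being described is presumably a time-ordered UUID along the lines of UUIDv7. A minimal sketch of the trick, assuming the RFC 9562 v7 layout (48-bit millisecond timestamp up front, random bits after; not a compliant implementation):

```python
import os
import time
import uuid

def uuid7_like() -> uuid.UUID:
    """Time-ordered UUID sketch: 48-bit millisecond Unix timestamp
    in the high bits, random bits below, version/variant as in UUIDv7."""
    ms = int(time.time() * 1000) & ((1 << 48) - 1)
    value = (ms << 80) | int.from_bytes(os.urandom(10), "big")
    value = (value & ~(0xF << 76)) | (0x7 << 76)  # version = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)  # variant = RFC
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, IDs created later compare greater (at millisecond granularity), so B-tree inserts land near the right edge of the index instead of at random leaf pages.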


I was assuming binary format. 128 bits is a pretty heavy data type in many data models, with measurable performance impact versus something smaller.

You also do not need 128 bits to decentralize unique key generation, even though it is quite reasonable if you design your keys well. Many massive-scale systems do it with 64-bit unique keys.

A subtle point that you may be overlooking is that while large-scale distributed databases, including all the ones I work on, export globally unique 128-bit keys, in most systems I’ve worked with they are internally represented and stored as 64 bits or fewer even if the key space is much larger than 64 bits. There are many techniques for key space elision and inference used inside distributed databases to save space. The 128-bit value is only materialized when sending it over the wire to an external client system; you don’t need to store it.

Literally storing a primary key in a distributed system as a 128-bit value is mostly downside with few benefits. For small systems the performance and scaling cost may not matter much, but in very large systems it matters a lot. It can quite literally cost you millions of dollars per year.


You can generate a number of any size in a distributed fashion. The only difference is that 128 bits gives you enough key space that collisions are practically impossible when generating randomly.
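The "practically impossible" claim can be made concrete with the standard birthday-bound approximation (the workload of one billion IDs below is a hypothetical number of my choosing):

```python
import math

def collision_probability(n_ids, free_bits):
    """Birthday bound: p is approximately 1 - exp(-n^2 / 2^(bits+1))."""
    return 1.0 - math.exp(-(n_ids ** 2) / 2.0 ** (free_bits + 1))

# One billion randomly generated IDs:
p_uuid4 = collision_probability(10 ** 9, 122)  # UUIDv4: 122 random bits
p_64bit = collision_probability(10 ** 9, 64)   # random 64-bit values
# p_uuid4 is astronomically small; p_64bit is already a few percent,
# which is why random generation needs the larger key space.
```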

Unless you need to be completely disconnected, a little coordination can drastically improve things. At past companies, I used a simple counter of 64-bit integers, and each distributed process would claim a billion-number range to use for IDs. Fast, efficient, compatible with everything, naturally ordered, and guaranteed never to collide.
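A sketch of that scheme (names are mine; the in-memory coordinator stands in for whatever actually holds the counter, e.g. a single DB row updated atomically):

```python
import threading

RANGE = 1_000_000_000  # each process claims a billion IDs at a time

class RangeAllocator:
    """Central coordinator: hands out disjoint ID ranges."""
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def claim(self, size=RANGE):
        with self._lock:
            start = self._next
            self._next += size
            return start, start + size

class IdGenerator:
    """Per-process generator: contacts the coordinator once per range,
    then hands out IDs locally with zero contention."""
    def __init__(self, allocator):
        self._alloc = allocator
        self._lo, self._hi = allocator.claim()

    def next_id(self):
        if self._lo >= self._hi:  # range exhausted: claim a new one
            self._lo, self._hi = self._alloc.claim()
        nid = self._lo
        self._lo += 1
        return nid
```

Two processes get disjoint ranges, so collisions are impossible by construction, and the coordinator is only touched once per billion IDs.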


And, to be clear, those benefits come from the placement of new records in the DB.

If a UUID is completely random, it can be inserted anywhere, which may require the DB to reshuffle records and pages to make room for the new record.

Having a sequential element in the UUID makes it much easier to maintain an index where each record is inserted at the end. Which, like you said, makes page usage more efficient and decreases the amount of work the DB has to do on insertion.

All of this compounds if you have a DB with frequent writes, a lot of indexes, or both.


Twitter snowflakes are unsigned 64-bit IDs designed to be generated in a distributed fashion.
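The snowflake layout packs a millisecond timestamp, a worker ID, and a per-worker sequence into one 64-bit value, so IDs sort roughly by creation time. A sketch of the packing, using the bit widths from Twitter's published design:

```python
EPOCH_MS = 1288834974657  # Twitter's custom epoch (November 2010)

def make_snowflake(ts_ms: int, worker_id: int, sequence: int) -> int:
    """Pack a snowflake ID: 41-bit millisecond timestamp since the
    custom epoch, 10-bit worker ID, 12-bit per-millisecond sequence.
    The top bit stays 0, so the value is positive as a signed int64."""
    assert 0 <= worker_id < 1024 and 0 <= sequence < 4096
    return ((ts_ms - EPOCH_MS) << 22) | (worker_id << 12) | sequence
```

Each worker generates IDs locally without talking to anyone; the coordination cost is in assigning unique worker IDs, which is where the separate service mentioned below comes in.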


But they require a separate globally distributed, highly available clustered service just for ID generation. That makes no sense from a cost or complexity perspective unless you’re Twitter-scale.



