Choosing a Postgres primary key (supabase.com)
220 points by awalias on Jan 20, 2023 | 160 comments


Honestly that's a poor blog post.

Randomly concludes "the best time-based ID seems to be xid" without saying why or comparing to others, e.g. ksuid, UUIDv7, etc. ("xid" is only mentioned twice in the entire blog, first in the above statement and second in a link to the reference implementation). Equally unfortunate that they picked "xid" as their supposed "best", because Postgres has an internal identifier that is also called "xid" (the transaction ID) and is very much NOT to be used as a primary key!

Downplays the many issues with "serial", including somehow thinking the word "might" has a place next to the words "not want to expose them to the world though" .... you DON'T, full stop. Exposing predictable identifiers to the world is never a good thing.

I'm not really sure what that blog post is supposed to achieve. I didn't learn anything.


I really hate this trend away from basic IDs. I feel like it's driven by folks who've never actually worked in the real world. I got account paperwork recently where the company ID, account ID and invoice ID were all uuids. 100% this company, if I call them, will not use this BS to look up my account and will instead use something easier to key in, like a company phone number.

I also had to do some support tickets recently and because of the issues copying uuids from the screen they insisted on screenshots that included the url bar showing the uuid. 100% trying to tell them the ID over the phone, or even rekeying it, would run into issues.

Give me back my sequential invoice ids!


At some point those sequential integer ID's will become so long they might as well be GUIDs. Of course you could "compress" the integer using some kind of encoding scheme similar to base64. Then your long integer becomes a few characters.

... Maybe though. Even ID's in the millions are probably easier to read than a long ass GUID.
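Rough sketch of that kind of encoding, stdlib-only (base62 here; the alphabet and function names are made up):

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # base62

    def encode_id(n: int) -> str:
        # "compress" a sequential integer into a short string
        if n == 0:
            return ALPHABET[0]
        out = []
        while n:
            n, rem = divmod(n, 62)
            out.append(ALPHABET[rem])
        return "".join(reversed(out))

    def decode_id(s: str) -> int:
        n = 0
        for ch in s:
            n = n * 62 + ALPHABET.index(ch)
        return n

    print(encode_id(4_294_967_295))  # 6 characters covers the whole uint32 range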


> At some point those sequential integer ID's will become so long they might as well be GUIDs

Worth bearing in mind that an unsigned 32-bit integer sequence can uniquely identify 4,294,967,295 records. If you're really storing that many records (let alone enough to exhaust an even bigger integer sequence), the length of the identifier is probably the least of your worries :)


Standard accounting practices require that invoices/receipts have sequential numbers. So it has to be integer, or some kind of sequential alphanumeric.

Even if one uses UUID as primary key, it should have separate invoice_no (in whatever formatting they require such as 2023/001, or 1001,1002...) which is the human readable and referenced number.

This is especially important if you are developing a multi-tenant system where the invoice number, say 2023/001, may exist for more than one tenant.


Are sequential id’s a security risk?

In one of our systems we've seen customers guess at other accounts by just incrementing the sequence.

The rule of thumb I used to use is: if an Id is going to be used for lookups or exposed externally, use uuid; otherwise use sequential.

The hard thing about the above rule is that it’s hard to tell when you are designing the db if the id will be used externally/for lookups or not. Requirements change later.


"Protecting" records by making IDs hard-to-guess just seems like putting the responsibility in the wrong place. If you're so worried about people getting their hands on the wrong records, I'd be more worried about your lack of trust in the application that queries that database in the first place: remember, even if you do make the IDs hard to guess, your "untrusted application" might at some point decide to simply leak the entire table instead.

There are much better solutions for this, like server-side prepared queries that do not simply return on an ID but rather as the result of a join, or just proper security practices in general, rather than reactively making something that was once guessed simply harder to guess.

Also, something like an ID is in my experience something that will be referred to orally between two human beings when discussing a problem, like referring to user 6201 or discussing invoice 540567. When you switch to UUIDs you're basically also saying "no human will ever have to say this out loud".


> "Protecting" records by making IDs hard-to-guess just seems like putting the responsibility in the wrong place.

It's a pretty powerful implementation of capability based security.


I would hope that a capability-based security system entails considerably more than just knowledge of an ID.


Many actual capability systems explicitly construct their capabilities to be unforgeable by not making them be mere secret byte strings, but objects protected by the kernel or other environment. For example, https://www.cs.cornell.edu/courses/cs513/2000SP/L08.html

And then people naively copy operating system design methodology into web APIs... Unrevokable leakable know-it-and-you're-in secrets are a bad idea.


> In one of our systems we’ve seen customer guess at other accounts by just incrementing the sequence.

What can they do with that guess? A sensible access control would not even let them see if the account is active, cancelled, or invalid.


UUID does not protect your records, that is, it is not a security measure against what you describe.


UUID is a part of the protection. If your access controls fail then preventing an attacker from enumerating through data by incrementing an integer is protection from Insecure Direct Object Reference. [1]

Even if there are no vulnerabilities in access controls it can also prevent competitors from knowing how busy your platform is. If I register a new account on your system and I get ID 57854 and a week later I register another and get ID 57978 then I know 124 new users have signed up in that time.

1 - https://cheatsheetseries.owasp.org/cheatsheets/Insecure_Dire...


Thank you so much for this.


It is if you don't expose a 'list' api, only a 'get'.


integer/year works almost flawlessly as invoice ID. It is unique and readable.

If the company has more than one unit that creates invoices, then just add a unit ID, so it becomes invoice_id/unit_id/year. Still unique and very much readable.


To avoid the German Tank Problem and leaking your business vitals to competition, I'd recommend making invoices be scoped to the client, e.g. <client_id>-<number>. And then client_id is either random or for small business e.g. unique slug derived from client name, for example you could get PERAK-0001.

https://en.wikipedia.org/wiki/German_tank_problem#Historical...


This is a very worthwhile point. You definitely need to prepare for readability if these ids will be customer facing.


While not a perfect solution to it, we use UUIDs throughout our application as IDs. But when we expose them in URLs or other places, we provide them as an encoded id (https://docs.crunchybridge.com/api-concepts/eid/). It at least makes them a little more compact, easier to copy/paste than UUIDs, and generally a cleaner look.


As it happens in at least some jurisdictions invoice numbers have to be sequential by law.


That's orthogonal to having a programmatic ID for them


Collaborative databases (Wikidata, TheMovieDB, VNDB, etc.) all use serial identifiers. What is the problem with this? These websites don't want to hide how many entries they have (they tend to promote them), and it doesn't really matter if you iterate through all the numbers – the data is available through open licences anyway.

I think there are many situations where you don't want to expose predictable identifiers, but there are also examples where predictable identifiers may actually be beneficial.


If your data is public and can be scraped anyway, obviously a serial identifier doesn't matter. If your data is sequestered between accounts or has tenants all sharing the same database or API, that's just one accidental permission error away from being able to scrape every single record. If your customer's data is meant to be private, it's simply easier to generate unique IDs for each record. Plus it makes it incredibly easy for competitors to see the size of your business simply by signing up for an account and looking at the IDs.


> What is the problem with this?

It causes meetings with people who think all predictable identifiers are a problem.


Yes to me xid has some pretty significant downsides. It is composed of a timestamp, a machine identifier, a process id, and then a sequence that starts from a random value. It can be a bit faster because it doesn't have to generate any random values to make a new ID, but I don't see how it would ever work client-side. I much prefer the simplicity of UUIDv6 and ULID, and they both perform nearly as well. UUIDv6 basically is ULID but without the standard base32 representation, which I think is unfortunate.


>> Exposing predictable identifiers to the world is never a good thing.

Sometimes it doesn't matter. Example below:

https://news.ycombinator.com/item?id=34451344


> Sometimes it doesn't matter. Example below:

There is a saying for the examples you and others are posting: "the exception rather than the rule".

Posting contrived examples in order to attempt to prove a point doesn't change that. For the majority of cases, a random ID remains the better option.

But unfortunately developers still treat security as an afterthought. They continue to use "serial" because of what can only be described as sheer ignorance, i.e. demonstrably false misunderstandings about database technology.

Case in point, I placed an order on an e-commerce site a couple of weeks ago. I was not best pleased to be given a tracking URL that read: "https://example.com/order/WEB-nnnnn", where nnnnn was clearly an incrementing number. Such a thing is inexcusable in 2023!

It took a lot of strength to resist the temptation!


Changing the nnnnn should not magically give you an invoice.

If your argument is that a guid is better used here then you're wrong. If the endpoint is not secure it doesn't matter if you used a sequential id or a random id/guid. You could brute force a discovery.

You should always validate the input and verify the accessed invoice belongs to the person requesting it.


For sensitive information, identifiers should not be trivially guessable. Naturally, we need encryption and auth, but using non-guessable identifiers multiplies the (already small) probability of a successful attack by 1/2^128, so it is a good idea.


There's always going to be valid reasons to have non-guessable identifiers. But it shouldn't be used for security. It's not a replacement for checking resource access.

Sometimes it's not 'sensitive', and even when it is, it doesn't really matter.

An invoice id doesn't matter. You need to be logged in to access it, and when logged in, you can only access your own invoice. If you're going to try to discover other invoices, well, you know which user is trying to hit random urls to find an invoice that doesn't belong to him. If you want to prevent random signups from doing it, block users who don't have >= 1 paid order from accessing the page entirely.

Claiming "Exposing predictable identifiers to the world is never a good thing." tho is just FUD. There are use cases. But it's not 'never' a good thing.


>> It's not a replacement for checking resource access.

It isn't a replacement but it does complement security. Imagine that your service accidentally hands out the wrong jwt for a short period of time. The attacker has the jwt and a known url but not the url to access another user's information. It will take the attacker a long time to find the url if it is a 128bit uuid. By that time the jwt has expired. I'm not suggesting that this will always work but it does reduce the probability of a successful attack - and security is 100% about probability.

>> Claiming "Exposing predictable identifiers to the world is never a good thing." tho is just FUD

Fully agree, this post is a great example: I can increment the id in the url and see the next post - no harm done.


> It's not a replacement for checking resource access.

If I had a hundred bucks every time I've seen a website fail to do this access check... I'd be a thousand-aire.


Look, I'm not taking a stance on this argument in general, but your statement that you can brute force a guid seems misguided. If you can brute force a guid url parameter you might as well just brute force the guid session token.


Just write a test case that "user 2" can't access "/order/WEB-nnnn" from "user 1" and that works too. You should have this test case anyway, even with random IDs. They can provide an extra "defence in depth" bonus, but they're of course no replacement for authentication checks.

You're going to need some type of readable, relatively small ID anyway, because things like "Hi there, I have a question about order b1a354c5-ac2b-4990-a189-1f2b4f537b09 I placed on your website" don't really work.


The canonical solution for that is to encode the ID appropriately. Your example with common choices:

as base64url (RFC 4648 §5): saNUxawrSZChiR8rT1N7CQ (22 octets)

as base85 (RFC 1924 §4): p&bTO@+boru-j3)#beDJ (20 octets)

as QR code: data:text/plain;charset=utf-8;base64,4paI4paA4paA4paA4paA4paA4paIIOKWiOKWhOKWiCDilogg4paIDQrilogg4paI4paI4paIIOKWiCAg4paE4paA4paI4paA4paI4paADQrilogg4paA4paA4paAIOKWiCAg4paI4paI4paA4paA4paI4paIDQriloDiloDiloDiloDiloDiloDiloAg4paA4paA4paE4paA4paIIOKWgA0K4paA4paI4paA4paA4paAIOKWhOKWiOKWhCDiloDiloDiloTiloDiloQNCuKWgCDiloDilogg4paI4paAICDiloQgIOKWgOKWiOKWhA0K4paA4paA4paI4paA4paIIOKWiOKWhOKWhOKWiCDilojiloDiloDiloANCuKWgOKWgOKWgCDiloDiloDiloAg4paA4paAICDiloAg4paA
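For anyone who wants the base64url version, it is a couple of lines of stdlib Python (helper names made up; the padding is stripped and re-added so it stays reversible):

    import base64
    import uuid

    def uuid_to_short(u: uuid.UUID) -> str:
        # 16 bytes -> 22-character base64url string, padding stripped
        return base64.urlsafe_b64encode(u.bytes).decode("ascii").rstrip("=")

    def short_to_uuid(s: str) -> uuid.UUID:
        return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))

    u = uuid.UUID("b1a354c5-ac2b-4990-a189-1f2b4f537b09")
    print(uuid_to_short(u))  # saNUxawrSZChiR8rT1N7CQ (the example above)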


The problem with base64 is that it's still long and ugly, and also case-sensitive. I wouldn't want to say "order saNUxawrSZChiR8rT1N7CQ" over the telephone. Is it better than "order b1a354c5-ac2b-4990-a189-1f2b4f537b09"? Practically speaking, not really: both only really work when copy/pasted on a computer.

I described another scheme I've used in the past in another comment: https://news.ycombinator.com/item?id=34454430


My example is not contrived, it is literally this exact post (and HN isn't the only site with publicly available information).

However, I agree with you - sensitive information should not be easily guessable even if other security mechanisms are in place (and they absolutely should be of course).

What I am really reacting to is rules of thumb - "always do X and you will be ok". The issue is, it is often easier (and far more valuable) to understand the real reason behind things than to remember the rule of thumb.


> For the majority of cases, a random ID remains the better option.

If your ID is going to be exposed to a human you have to admit that random UUID's are kinda clumsy.

One other pro to UUID's that I don't see discussed is you can generate them client side and be 99.9999999% assured that they won't collide when stored in the DB. I kinda forget our use case for this but I think it was to make "create" and "update" REST calls a lot easier.


But if you’re handing out a random ID, you still need to have an index over that ID to have efficient lookups of random ID -> internal ID right?

One advantage is that you only need to resolve it once at the edge and then internally you use the external facing value. Are there any others I’m missing?

I agree, I would have loved a deeper dive in xid since it seemed to clearly outperform everyone else.


Not necessarily. The idea here is that the id can be exposed publicly because it is random. The problem it solves is someone sees a page /accountdetails?id=123 and can easily look for /accountdetails?id=124 and assume it is likely to be valid.

If you use a random id, you cannot quickly know what other ids exist which makes looking for unauthorized access to objectids much harder.


If there needs to be an auth check for other IDs, then it shouldn't matter whether the IDs are random or not.

Depending on how highly aesthetic URLs are valued, it's not unlikely that after being in business for a while, the density of your keyspace will mean that even random IDs are found.

A better rationale for disconnecting public and private IDs is to make certain types of database migration a little bit easier. I don't think it's a huge win though, it's a chunk of work to maintain multiple IDs, do the required indirections all the time, ensure public IDs are always sent to the front end, and so on, while adding a translation layer as part of a database migration delays this kind of work until it's actually needed, if ever.


https://en.m.wikipedia.org/wiki/German_tank_problem

Preventing competitors from estimating the size of your business (or of your customer's businesses, if generating sequential IDs on their behalf) is one big reason for having unguessable public IDs.


This of course assumes that all the records backed by these IDs are not public information. In a lot of cases, this may not be true.


Unauthorized access is dealt with by implementing correct authorization procedures, not by obfuscating IDs; valid ones can still be discovered through brute force or other means.

In your example those simple, sequential IDs should have no impact on security.


You really care about exposing a serial number scheme for a list of books your company publishes, or the identifier for each of the various hotels you own, or the cities you have an office in or something?


It's not only about predictability.

Exposing a serial number can also be competitive intelligence. Number of users, transactions, etc.

Sometimes this matters.


Yes of course. Which is why I picked a few examples where it doesn’t matter. Stuff you’d just always have a public catalog of.


Absolutely (once your business reaches a certain scale).

It can give competitors the ability to estimate the size/success/etc of various aspects of your business.

This was a major motivation for a certain online retailer to generate non-sequential IDs.

Some other interesting examples in an old HN thread: https://news.ycombinator.com/item?id=7278198


Probably not, but you definitely should care about not issuing sequential credit card numbers. Probably not the best example, but I don't think it's hard to imagine some scenario in between the two that still presents a concern.


You can't just use sequential numbers for credit cards because they have a built-in checksum validation called the Luhn algorithm.

https://www.creditcardvalidator.org/articles/luhn-algorithm
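The check itself is tiny; a sketch (not any particular library): double every second digit from the right, subtract 9 from anything above 9, and the total has to be divisible by 10.

    def luhn_valid(number: str) -> bool:
        digits = [int(d) for d in number if d.isdigit()]
        total = 0
        for i, d in enumerate(reversed(digits)):
            if i % 2 == 1:      # every second digit from the right
                d *= 2
                if d > 9:
                    d -= 9
            total += d
        return total % 10 == 0

    print(luhn_valid("4111 1111 1111 1111"))  # True, the well-known test number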


Sure, that wasn't really my point though. My point was that there are some cases where sequential identifiers would cause security concerns and some cases where they would not.


While this is a good overview of the options for primary key generation, there's no silver bullet here. Most projects that are using SQL should just use the gold standard: an auto-incrementing integer for an internal primary key. And then decouple the public-facing primary key from it into a separate column, whether it be ULID, UUID, or a random-project-slug-123.

Also, during debugging, it's a lot nicer to look at short primary keys than to have UUIDs flooding the screen.
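For what it's worth, the split looks something like this in SQLAlchemy (table and column names are made up, just to show the shape):

    import uuid

    from sqlalchemy import BigInteger, Column, Text
    from sqlalchemy.dialects.postgresql import UUID
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Invoice(Base):
        __tablename__ = "invoice"

        # compact key used for joins and foreign keys, never shown to users
        id = Column(BigInteger, primary_key=True)
        # random identifier that is the only value exposed in URLs and APIs
        public_id = Column(UUID(as_uuid=True), unique=True, nullable=False,
                           default=uuid.uuid4)
        customer_ref = Column(Text, nullable=False)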


Why would you use an integer primary key and a public facing UUID? That seems like it's the worst of both worlds: ugly externally visible identifiers, record bloat, a database that you can't easily merge in the event of backups or DR, and having to roundtrip to the DB before you know the ID of a record.

I personally stick to UUIDs in pretty much all cases, with the exception of where there are justified and benchmarked performance reasons not to.


> That seems like it's the worst of both worlds: ugly externally visible identifiers

Sometimes the external identifier is needed due to interaction with external systems, so it isn't really your choice as the DB/app designer.

>, record bloat,

Depending on the DB, the opposite can be true. In SQL Server, if the integer key is the clustering key, which is usually the case for a table's primary key, then you may get a smaller DB than using a UUID alone, because the clustering key is included in all non-clustered indexes on the table (so the 4 or 8 bytes saved by just having the UUID and not also an integer ID is quickly lost to a larger amount of bloat).

> and having to roundtrip to the DB before you know the ID of a record.

For internal use you should just use the integer ID for the most part, the UUID or similar being for external references.

Unless of course the UUID is for security purposes (making enumerating records impractical for instance) in which case you'll be using the UUID in your own application directly. In these case you shouldn't look up the ID, just query by the UUID. Usually you are wanting to access the base record anyway so that isn't an extra JOIN, and if you are just referring to child tables (so wouldn't need to reference the main entity if not looking up the internal ID) the extra JOIN is usually insignificant, pulling back a single row, compared to the latency hit of a full round-trip to lookup the ID separately.

> I stick to UUIDs … with the exception of where there are justified … performance reasons not to.

A perfectly valid approach.

Though it does vary by DB, and what looks like a smell to someone with mostly postgres experience is often the more efficient method elsewhere, so you need to take that into account when looking at other projects (or working on your own if support for varied DBs in the backend is desired or a requirement).


What I mean is that in most cases, BOTH autoinc int and UUID is wasteful.

For example if you have a situation where you really need the high performance of an integer ID like in your SQL Server example, why introduce a UUID into the equation at all? If you are in the unusual situation of needing such extreme performance that you're worrying about 4 vs 8 bytes on the PK, but also need to obfuscate the public facing ID, _potentially_ int+UUID makes sense. But in my experience that is a pretty rare situation and there are other things you can do such as using a shorter randomly generated ID such as snowflake, random integers with collision detection (depending on write patterns) or encrypting your ID at the application layer (caveat emptor, but unless you're relying on IDs for absolute security they shouldn't be a problem).

However, defaulting to autoinc int as FK and publicly visible UUID for "user friendly" ID seems like an odd thing to do. It seems like one of the least useful ID schemes.

> For internal use you should just use the integer ID for the most part, the UUID or similar being for external references.

I disagree. UUIDs are very useful even as internal identifiers in any area where performance isn't your top concern.

UUIDs are better than autoincs in almost every way except being slightly less performant. You could argue for a strictly internal ID that the security and uniqueness advantages don't matter very much, but I think it's better to default to the safest option in case your ID inadvertently becomes public at some point, or indeed in case you want to one day make public a previously internal-only record. But even if you know, for sure, that your ID will only ever be internal, they still make it possible to merge datasets easily, they're easy to correlate across multiple systems, are easier to find in log files, and can be generated at the application layer, which can make a big difference in transaction time in some cases.

The only reasons not to use UUIDs are that they are ugly and marginally slower, which for the vast majority of entities doesn't particularly matter. If you're at the point where the rest of your code is so optimal that the UUID is causing you problems, or your records are so tiny that the UUID is a big overhead, then that is an exception that I am very happy to make, but it rarely applies. Too often people assume that the overhead of a UUID is worse than it is because of how long they look, but then when you actually benchmark it, it's swings and roundabouts.

Besides, in most cases where I've defaulted to integers, I've come to regret that decision a few years down the line, where some complex system migration or new business requirement would have ended up much easier if we'd just bit the UUID bullet earlier on.


> What I mean is that in most cases, BOTH autoinc int and UUID is wasteful.

> I disagree. UUIDs are very useful even as internal identifiers in any area where performance isn't your top concern.

If performance isn't your top concern, having both seems fine. If it is, the int is probably faster anyway.


If you don't care about performance, then obviously it's fine from a performance perspective to have both... but what does it actually give you other than problems?


Not having to import a DB extension or get a lib to generate the UUIDv4s. Not exposing your primary keys. Keeping an ordering of insertion, if you need that. Not having to read huge strings during debug. Overall it's kinda minor as UUID is at least a much better option than semantic PKs, bigger concern is performance.

Performance-wise, bigserial is probably a lot faster than UUID as a PK, even if you also have a UUID secondary index. PKs are used in tons of places in the DB and in your server-side code. The DBMS is also gonna be optimized around the regular way of doing things.

What problem do you get from bigserial? The one thing I can think of is, if you're trying to merge two databases together for whatever reason, you can't just copy the entire rows. So you copy all cols except the ID, let them get new IDs, and use the secondary identifiers when copying in related tables. It's more work, but you don't do this often, and if you do, you can automate it.


What I said was what's the point of having both? If you have both, you still need a UUID generation library. As I said previously, I can see the performance advantages of using ints in some cases, but in my experience they're relatively rare so I don't see the point to default to them.

As for the other advantages of UUID, I and others have covered many of them above: security, fewer roundtrips, shardable, easier to find in logs, data warehousing, backups, disaster recovery, etc. etc. The advantages of UUIDs are so great that my view is, in any serious app, you actually need to justify _not_ using them with concrete performance data that shows why using ints is a worthwhile trade-off. There are cases where ints make sense.

However my main point is, as I said, that having both ints AND uuids is of very limited usefulness.


> If you have both, you still need a UUID generation library

Oops, yeah, that's true.

The point of having both is that you're already using the bigserial as the PK, and you also have opaque, external-facing identifiers that refer to particular rows in some tables without you having to expose any PKs. Might only be some tables, might be some other kind of string rather than a UUID. There may even be multiple ways for users to refer to something in your system; you keep that all separate from your PKs.

> security, fewer roundtrips, shardable, easier to find in logs, data warehousing, backups, disaster recovery

I see security pros/cons on both sides; exposing PKs to users does feel wrong to me though. I don't see how UUID PKs reduce roundtrips; if your API takes a UUID, you don't have to convert it to a row ID right away, only when you're actually querying what data the client wants. If you're printing row keys in logs, you ought to prefix them either way (like "user:35" or "user:deadbeef-..."). For backups/recovery, I haven't found UUIDs helpful, maybe cause I never want to just copy whole rows.

A sharded DB takes special consideration and could go many different ways, so defaulting to UUIDs in anticipation of sharding one day is probably not going to help when that day comes. In some setups, the PK is just for that one node, and you have a global ID across nodes (which may be a composite of node-local PK + shard ID). Or you're switching to a specialized, not-so-relational DBMS for horizontal scaling. Like, serial IDs are a terrible idea in Google Spanner.

> you actually need to justify _not_ using them with concrete performance data

For what it's worth, I encountered this situation in a DB with millions of rows. UUID PKs were significantly increasing our overall application latency, so I switched us to bigserials. I'd rather not put newer systems on a track to hit that hurdle later on. It can start being a noticeable problem well before you're thinking about sharding.


You're right to point at performance as the main motivator for this setup.

The primary key is included in all indexes, including non-clustered indexes, so in some cases there can be quite a large difference between UUID and integer PKs in terms of index size.

UUID PKs are also more susceptible to fragmentation.


That's not how PostgreSQL works. The primary key is only included in every secondary index in MySQL (InnoDB). PostgreSQL secondary indexes point directly at the page and row id.


It works this way in SQL Server too, and some other DBs, if you have a clustered index (usually recommended). The clustering key is included in all non-clustered indexes on the table. Note that this doesn't mean NCIs inherit any extra fragmentation potential from the clustering key, as it is effectively INCLUDEd and not considered part of the key of the supporting index.

Postgres tables are more like what SQL Server calls a heap table (one without a clustering key). Some of the issues that make clustered tables the standard recommendation in SQL Server are very similar to those that make VACUUM a requirement in postgres. IIRC postgres tables are more efficient than SQL Server's heap tables in most cases because they are the only option so are actively optimised for, where in SQL Server heap tables are generally (in all but the few circumstances where they are more efficient) considered a second class type.


IMO use a UUID + a "type code" so something like:

xxxx-xxxx-xxxxxxxx-xxxx-CUST

xxxx-xxxx-xxxxxxxx-xxxx-ADDR

It makes seas of UUIDs much easier to reason about

Depends on your tolerance of wasted disk space for binary vs char, but you can shorten the binary to base64 or use a record as a primary key if you want.
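A minimal sketch of the type-code idea (codes and helper names made up):

    import uuid

    def typed_id(type_code: str) -> str:
        # e.g. "....-CUST" or "....-ADDR"
        return f"{uuid.uuid4()}-{type_code}"

    def type_of(identifier: str) -> str:
        return identifier.rsplit("-", 1)[-1]

    cust = typed_id("CUST")
    print(type_of(cust))  # CUST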

Other advantages of UUIDs:

- they can be generated by clients or by the server safely

- they can be concurrently and distributedly generated without a central sequence blocking/locking ID generation

- they are distinct/unique across system migrations and mergers of systems / data / table

- time UUIDs can encode some info about when generated which can help with forensics / debugging in production

- no database-specific behavior for sequence generation, no extra database object for the sequence

- probably helps with data warehouse / data oceans for keeping data distinct and tracing back to source system

- similarly to that, for integrating systems, also makes the ids unique across system boundaries

- they are a bit more secure as stated elsewhere


For interacting with humans, I rarely expose a UUID directly. I'm not particularly bothered about ugly URLs but I don't work in an industry where SEO is relevant.

As you suggest, I sometimes incorporate type information into the ID to convey a bit more context, often in the form of a URL (org.com/customer/xxxxx). If you do need to put a UUID in the UI, depending on the constraints of the application, it might make sense to just display the first group of characters from the UUID and separately handle the very rare collisions you may encounter, similar to git and its short SHAs.

For any situation where there will be transcription or copying IDs between systems manually, I will typically add another group that incorporates some metadata about where the ID came from (similar to the type code you mention) and a check digit, but obviously I try to avoid any situation that involves transcribing a UUID.


One thing I don’t see being mentioned in this thread (I only skimmed the article, so I don’t know if it’s mentioned there) is that you can run out of numbers when using serial, as they have a max, so if you are planning to have a table which will have over 2147483647 rows, then you might look into other types to use as a unique identifier.


You can use bigserial instead of serial, which goes up to 9223372036854775807.


If your DB supports unsigned integers, or starting sequences from -2,147,483,648, you can double the address range.

But if you are at all worried that you'll get within a couple of orders of magnitude of MAXINT32 in the lifetime of your application then you should immediately jump to 64-bit values. Doubling is often just noise, and the cost of refactoring if you approach MAXINT32 much faster than expected is more of a problem than the extra storage cost of bigger keys.


PostgreSQL doesn't have unsigned integers.


Nor does MS SQL Server, though you can start IDENTITY from a maximally negative value. But any time I'd consider that I'd just jump up to a larger integer type instead.


Just use biginteger everywhere, still more efficient than even UUIDv4.


For anyone wondering, that's around 68 records per second, sustained for a year, before you max out a 32-bit serial


> you can run out of numbers when using serial

Yes, that's why you use bigserial.


But the comment I replied to was mentioning serial though.


I think you can just make that dependent on the table's data. For a table of user posts use bigserial but for a list of companies, prices or names the serial will surely be sufficient.


When I first started using Postgres, I sat and thought forever about which PKs to use, and looking back, I was way overthinking it. Combination of unique fields, UUIDs, hashed data... Now I always use bigserial without thinking about it. When my DBA hat is on, it's none of my concern how the user-facing IDs will look; I just know it's gonna be a string of some kind.

What's the common use case for the others? I can imagine for weird performance reasons you might want to pick special PKs, but that implies you're exposing them to clients, which you almost never want. The only more reasonable thing I can think of is a UUIDv4 for a special sharded database.


Two benefits of UUIDs:

1. You can create a series of related rows without hitting the DB to obtain the next integer (see the sketch after this list).

2. During development, if you mix up ids you will get an error/empty result. With integer keys you might get a row you didn't intend to get, hiding the error.
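A tiny sketch of benefit 1 (the structures and names are made up):

    import uuid

    # parent and child rows get their keys in the application, so the whole
    # graph can be built and inserted in one batch without asking the DB for IDs
    order_id = uuid.uuid4()

    order = {"id": order_id, "customer": "ACME"}
    lines = [
        {"id": uuid.uuid4(), "order_id": order_id, "sku": "A-1", "qty": 2},
        {"id": uuid.uuid4(), "order_id": order_id, "sku": "B-7", "qty": 1},
    ]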


I hate UUIDs with a passion currently - not in postgres, but I recently spent so much extra time on a relatively small table (35 mil records in a few columns), doing some queries and updating a subset of it.

UUIDs there are stored in Oracle's 'raw' datatype, which to me is the worst combination possible: basically a string stored in a small binary blob, and due to the binary nature everything needs to be converted to hex all the time for matching and readability, with atrocious performance in stored procedures. Absolutely the worst DB design I've seen in the past 20 years, and we're talking about the expensive core anonymization service of a top banking package.


> basically a string stored in a small binary blob, and due to the binary nature everything needs to be converted to hex all the time for matching and readability, with atrocious performance in stored procedures

But this gives a significant performance benefit for storage (storing in the display format gives 4 bits per byte, or less if you include decorations like the '-' characters, rather than 8) and more importantly when joining (it doesn't need to convert for this, so on a 64-bit architecture each comparison is a pair of 64-bit compares in the CPU) rather than a more complex string comparison.

If you are converting between string and binary representations more often than on input to your stored procedures or for output, then something is very wrong (the query planner is likely not able to use indexes that it otherwise could, so you get scans instead of seeks).


AFAIK, the Oracle way is to store them as Number. The type is made just large enough for them.

But if you want to talk about Oracle's usability, there are much larger fish to fry. I wouldn't recommend anybody to use that database.


more often than not one doesn't have any choice of DB, especially when it comes bundled in a product, like in my case


There is an unmentioned security aspect that you should be aware of for adopting timestamp-based ids: you are leaking information about time, and this information could be sensitive.

This is how I would summarize a security perspective.

    * autoincrement id: leaks information about the system as a whole. Users can attack each other. Might be suitable for an internal-only application or an application that doesn't care about leaking this information and goes to great effort to be resilient to users attacking each other.
    * timestamp + random id: leaks information about the time the individual record was created. An attacker can attempt to learn sensitive information about an individual. Suitable for a record that is already publicly shared with its time (e.g. a tweet). Might be suitable otherwise if ids are not public. That is, only the record creator can view the id and you don't send out links with the ids to the user (particularly over insecure channels such as email).
    * random id: does not leak information. Suitable for any use case that is okay with the performance implications (of a non-sortable fragmented index).

I am wary of how they call xid the best time-based id. It just removes all (run-time) randomness and thus performs the best. xid seems to be the same as MongoDB's oid. It is designed to be a conflict-free timestamp that can be used in a distributed system, and it is good at that. But in terms of protecting users, for some use cases it could be worse than an auto-increment id because cross-user attacks are still possible (they will take many, many more attempts though) and it leaks information about time.


Another thing about autoincrements, they can leak rate information. You can do something like create a new user, wait a day, create a second user, then the difference in the userId tells you the rate at which new users are being created.


I've had good success with using auto-incrementing BIGINTs as internal IDs and creating an additional BYTEA field as external IDs. Foreign keys would be based on the internal IDs, anything user-facing would use external IDs. I think it's a good compromise as it keeps foreign key size small and still allows hiding internal structure from users.


I agree, this is the way to do it. Anything originating from the outside is bound to change, be it email addresses, social security numbers, ...

The post also completely ignores foreign keys.

It is an absolute advantage for size and speed to have foreign keys be int4/int8 and not an email address or UUID.


Rather than use an extra column, I’ve taken to hashing the internal key (with a salt based on the entity type and some secret) to create the external facing ID.


If it's hashed, how do you get back to the internal ID if you only have the external ID?

I used encryption instead because I can reverse it.


One thing I've done is summing the codepoints of every character of the customer name + the object ID, formatted as base-36.

So with a customer email of 'martin@arp242.net' and an object ID of 52 you end up with 1563 + 52 = 1615, or 18v in base-36. You can add a "base number" to make it a bit larger, e.g. 50,000 so it becomes "13tr".

I'm sure people can figure this scheme out with enough effort; it's certainly not cryptographically secure, but it's "hidden enough" for many purposes, not much longer than numeric IDs (shorter in many cases), doesn't require any special DB-fu, and is reversible if you know the customer (which you usually do).
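In case it helps, a sketch of the scheme (helper names made up; reproduces the example above):

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def to_base36(n: int) -> str:
        out = ""
        while n:
            n, rem = divmod(n, 36)
            out = ALPHABET[rem] + out
        return out or "0"

    def short_id(customer_email: str, object_id: int, base: int = 0) -> str:
        # sum of codepoints + object ID (+ optional base offset), in base-36
        return to_base36(sum(ord(c) for c in customer_email) + object_id + base)

    print(short_id("martin@arp242.net", 52))          # 18v  (1563 + 52 = 1615)
    print(short_id("martin@arp242.net", 52, 50_000))  # 13tr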


That's a good idea but doesn't it need 2 round trips if you use an auto-increment primary key? First insert and then update by hashing the new id.


Not the poster you’re replying to, but with this approach you generally don’t store the hashed identifier. Just encode/decode at the application boundaries.


How do you decode a hash?


ah sorry, I was not very precise about "hashed".

What I meant (and have done in the past) is to encrypt/decrypt the auto-incrementing ID at the application boundaries.

If the OP is really using a one-way hash then yes, you would have to store both IDs.


Interesting approach, will consider that in the future! Though not sure if it's a good idea as that couples internal & external IDs, i.e. it's not possible to change one without also changing the other (but also not sure if that's really an issue).


This seems expensive and requires you to really know what you're doing with the cryptography. Why not just use a random external facing ID?


That's really clever. Have you encountered any problems with it in practice?


Not the person you replied to, but I have had some issues with this in the past. Though it is more with how it was done than the approach itself.

In this project I do not believe the IDs were always encrypted when being sent to the user. So we sometimes had to guess whether we received an encrypted ID vs a regular integer ID because it is possible for the encryption algo that was used to return a sequence of numbers.


This is the standard way of doing it.


Disclaimer: not a dba so my terms might not be appropriate

I've seen uuid4 where the first 4 bytes are replaced with a timestamp. It was mentioned to me that this strategy allows postgres to write at the end of the index instead of at arbitrary positions. I also presume it means it has some decent sorting.

[inspiration](https://github.com/tvondra/sequential-uuids/blob/master/sequ...)
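An application-side sketch of the same idea (the linked extension generates these inside Postgres; the layout here is made up and doesn't bother setting the UUID version bits):

    import os
    import time
    import uuid

    def timestamp_prefixed_uuid() -> uuid.UUID:
        # first 4 bytes: seconds since the epoch, so values generated around the
        # same time land next to each other in a b-tree; the rest stays random
        prefix = int(time.time()).to_bytes(4, "big")
        return uuid.UUID(bytes=prefix + os.urandom(12))

    print(timestamp_prefixed_uuid())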


I use ULID, 128 bits, time and great sorting

https://github.com/ulid/spec


Wrote about my experience using ulids in Postgres if people are considering it: https://blog.lawrencejones.dev/ulid/


Good post thanks :)


Is there any way to have the database generate these automatically vs your application?


The common databases don't natively support generating ULIDs to my knowledge. You can usually find extensions if you prefer generating them in the database instead of the application. I generate them in the application, and store them as a UUID in PostgreSQL to avoid needing any database extensions.


Yea, there are a few extensions for PG, in C and Go that give a ulid_create() function that can be used as column default, just like serial.


It also has the advantage that the page being written to, the right most leaf at the end of the index, is likely to always be available in the page cache. With random you may need to constantly go to disk to fetch the page.


In this sequential UUIDs idea, I wonder how big of a deal it is if the prefix part wraps around often? E.g. using a timestamp-based prefix with 2 bytes, if you increase the prefix every 60 seconds, the prefix will be reset every 45 days or so (60 * 1000 * 2^16) according to that README. Does it make sense to fine tune this value based on the use case or what?


Are there any clear downsides to sequential prefixes on UUIDs? I would imagine if you're producing new objects at a high enough rate, you'd have a lot of prefix collisions, which would hinder search times. I've never benchmarked to confirm that though.


If the prefix is incremented for every new ID, you essentially have the same problem as you do with serial: you leak information about the amount of rows in some timeframe.

As the link posted above mentions, you can alternatively use a timestamp-based prefix that wraps around after all the bits have been used. This one still leaks possible creation times of the record, so it's on par or better compared to UUIDv6, ULID, etc. (because here the exact creation time can't necessarily be deduced).

In all of these UUID solutions apart from the fully random v4, you are trading off the better index performance against some level of information leakage about the record the ID is associated with.


The author misses one advantage of UUIDs: if you’re working in high-throughput distributed systems, serial IDs create a bottleneck and single point of failure in the service handing out IDs.

With UUIDs any service can generate an ID itself and tell downstream services about it in parallel—even if one of them is down, slow, or needs retrying.


This can also be achieved in distributed systems by having each node skip IDs equivalent to the number of nodes in the cluster. E.g. node 1 in a 5 node cluster assigns ids 1, 6, 11 and node 2 assigns 2, 7, 12 and so on.


You can also hand out ranges. Instead of asking the system for one ID, you ask it for 1000 contiguous IDs, and you ask for another range when you run out. This will reduce the load on the ID-creating-system by 1000x (or your choice of number) and has the advantage that systems don't need to know how many peers they have. (beware of the herd though, if you don't persist it every system will come asking at startup)
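A sketch of that allocator pattern (the reserve call is a placeholder for, say, an UPDATE ... RETURNING on a counter row):

    class IdAllocator:
        def __init__(self, reserve_block, block_size=1000):
            self._reserve_block = reserve_block   # returns the start of a fresh range
            self._block_size = block_size
            self._next = 0
            self._end = 0                         # empty, so the first call fetches

        def next_id(self) -> int:
            if self._next >= self._end:
                start = self._reserve_block(self._block_size)  # one round trip per block
                self._next, self._end = start, start + self._block_size
            value = self._next
            self._next += 1
            return value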


True, but then you have to plan ahead quite a bit.

The other advantage of UUIDs is in completely decoupled environments that need to be able to share entities with each other. In this situation, serialization of activity is not a concern at all - we simply wish to prevent collisions of keys across the way.


Another great advantage of UUIDs is that it can help prevent you shooting yourself in the foot when you accidentally join the wrong tables. E.g. `DELETE FROM users WHERE id IN (SELECT id FROM user_orders);`


This also enables treating inserts as upserts, allowing safe retries without risk of creating duplicates.

Speaking of. Does pg have any mechanism to protect double creates while using auto increments? Is there any way to provide a request-id from the client?
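One common way to get that (a sketch, with made-up table and connection details): have the client pick the ID, or send a request-id with a unique constraint on it, and insert with ON CONFLICT DO NOTHING so a retried request is a no-op.

    import uuid

    def create_order(conn, customer: str, request_id: uuid.UUID) -> uuid.UUID:
        # psycopg2-style connection assumed: the 'with' block commits on success
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO orders (id, customer)
                VALUES (%s, %s)
                ON CONFLICT (id) DO NOTHING  -- replaying the same request changes nothing
                """,
                (str(request_id), customer),
            )
        return request_id

    # the client generates request_id once and reuses it on every retry:
    # create_order(conn, "ACME", uuid.uuid4())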


What does "SORT terribly" mean? That there is no semantically useful ordering? Well of course not, that's not what they are designed for. If you want ordering by time, then include a time-based column and sort on it. Does it mean that sorting performance is bad on UUID columns? Why?

And what does "index terribly" mean? You can index UUID columns just fine, so is it a performance concern? What is the concern?


Because B-tree indexes are ordered, rows likely to be adjacent on disk (written in time order) are not at all adjacent in the index and vice-versa. There is no "hot page" in the cache representing recent records; the index node you need for any given uuid is random and makes your internal index pages effectively uncacheable. The result is increased IO and cache thrashing.


> Because B-tree indexes are ordered, rows likely to be adjacent on disk (written in time order) are not at all adjacent in the index and vice-versa.

But this still doesn't matter, right? If you want time ordering you'd prolly have some field like `created_at`


Rows are generally written to disk in the order they're inserted, whether you have a timestamp or not. It's just how database IO works.

If your index happens to be in a different order (the position in the index is uncorrelated to the position of the row), you're gonna be thrashing pages. This isn't unique to UUID keys - text keys or anything other than current timestamp or serial id are going to have the same problem.

It's really only an issue once your index size exceeds available shared RAM. If you can fit your table's index completely in memory, you probably won't notice. But once you exceed that threshold, performance starts falling off a cliff.


Yes and no. Now if your index fits in the L2 cache, then you've got something. Otherwise, you have to keep loading the cache from RAM and performance starts falling off a cliff.


That is a great point! This would be fun to benchmark


Consider inserting 1000 rows.

If those rows have keys that sort near each other, you're changing a few pages.

If those rows have keys that are all over the keyspace, you're changing roughly 1000*k pages.


Not a fantastic post, which is a shame as Supabase is quite an interesting company.

The answer is almost always "use biginteger identity", and almost never "use integer serial".

UUIDs have a place but are often better suited in larger, distributed and more complex data stores than postgres.

Using `xid` is such a poor choice I'm surprised it was even mentioned.

The "key" thing to remember is you don't have to expose your primary key to the world. Use UUIDs or shortcodes or whatever for external representations. Use bigints internally. This will prevent a world of pain.


For times when you need distributed generation, I’ve worked with a system that I liked. Server kept track of a sequence. Clients pull out batches of 1000 or so and then use them up. When the client starts to get low on available numbers it fetches another batch.

The ids generated are nice readable integers. Generally in sorted order, though not a guarantee, and you end up with gaps sometimes if a client doesn’t give out all its numbers before it’s restarted.

Would anyone be interested in a super robust version of that as a service?


Good intro article. I'd always heard that serial ints aren't guaranteed to be ordered but never knew why (because they are generated non-transactionally..so if an INSERT transaction rolls back the id that would have been used is effectively consumed/skipped).

What I see a lot in practice is a bigint numeric id for internal use (better for joins, FKs) and also a textual token for public use, perhaps with a typed prefix to indicate the type of record it's identifying (U-AS234FDS for User, etc)


I guess you mean "consecutive" or "sequential" instead of "ordered"


Oops yes I sure did. Thanks


My opinion: always, if in any way possible, pick a semantic key. There is usually something defining the thing you are working on. If there isn't, work on your normalisation.

Main benefits to this: Avoids accidental duplication (happens so much). Avoids additional round trips to fetch the id to make a mutation.

Of course if you work on something where you don't know what it is yet (actually humans are a good example of that) uuid or int might make sense, but I hear so many times people picking non-semantic keys as a default.


I'd heavily push for the exact opposite. Every single time I've seen a primary key being defined with a natural key, it turned out that this set of attributes wasn't as immutable as we thought actually and it caused a world of pain.

I find that there actually rarely is something defining the thing you're working on. The concept of "immutable identity" is rarely a useful thing in digitalized systems:

- being able to create a new digital entity for the same real-life entity is almost always useful and expected ("the setup of this user is all messed up, just disable it and create a new one")

- attributes that you thought were immutable actually are not ("surely the 'originally scheduled time' of an event is an immutable property" - except when you have a bug and events are scheduled at the wrong time and you need to fix data)

- the concept of "immutable identity" is often pretty subjective in the real world. We generally agree that a person has an immutable identity, sure, but is a 9am appointment that's moved to a week later the same appointment, or a new one? Depends on who you ask, depends on what purposes you need this concept of "identity" for.


> Every single time I've seen a primary key being defined with a natural key, it turned out that this set of attributes wasn't as immutable as we thought actually and it caused a world of pain.

Yep, happens to the best of us. Never mess around with this, just use a bigserial.


(Or other things like UUIDs, which I would not use, are still way better than semantic keys.)


For one of our customers we integrate heavily with their ERP system.

They had nice sub-10 million integers as order numbers, so we used that as primary key for the order table (and as fk for 10+ child tables). Last year they changed ERP system, and now order numbers are much longer and can contain dashes.

Not the worst change, as converting integer to varchar is lossless, but we had to go over all the views and our code to make sure it could handle it.


I'm strongly against these, I've always regretted it.

SKUs, emails, etc. all seemed like good keys. They're always unique, right?

Until one day they decide to rename some SKUs, or they suddenly want family accounts, or you realize you really do want the ability to have duplicates so you can keep historical copies without ripping up your entire database.

Semantics change. A UUID/whatever does not.

I've learned you should never ever use a natural key. PKs are extremely difficult or near impossible to replace depending on your application, and if that really means using serial id's, or UUIDs, or extra lookups - it's worth doing that instead of using natural keys.


Almost every time I picked a natural key I have regretted it in the long run.

Requirements change, or index sizes get bloated and hamper performance, or you need a nice, short ID for URLs , or foreign keys get more complicated with compound primary keys, or ...

Natural keys are nice in theory, but not so much in practice.

Especially if we are talking about active databases where large migrations are a burden.


To add to the other comments mentioning why this is difficult in practice:

It's the "Ship of Theseus" paradox [1]. Choosing a semantic key means mixing identity and attribute, while a synthetic key solves by assuming "constitution is not identity".

Since a digital system is a model of the world, a synthetic key allows the system to address objects in this internal model without assuming a particular interpretation of identity in the real world. E.g., it's often the case that you do need to have two "customer" entries in your system that represent the same physical "person" in the world, and this is ok because the concept of "customer" is useful and sufficient for your model, and the physical person isn't.

More often than not, people get caught on this trap in relational databases and object oriented modelling. This can be seen in books and lectures that use Customer-Order-Product relations to teach databases, or Car-Engine to teach OO.

[1] https://en.wikipedia.org/wiki/Ship_of_Theseus


Ship of Theseus is the thing I was looking for. Our dept has repeatedly suffered from this same mistake. Our team was stuck with this damn three-part PK that also gets exposed to partner teams using our service as an identifier, and is now baked into their code and their clients' code. Now we took it out as the PK, but it's still used as an identifier. We're making a bunch of hacks to ensure that ID stays unique.

The analogy went further when the partner team asked, can we have a "stable ID" that doesn't change even if that identifier changes. Our team was close to exposing our new row-level keys again. I asked, if the name changes, is it the same thing still? What if the name and the other attributes change? This isn't like a social media post that obviously has an identifier; we're modeling physical objects. Why do you need this feature again? Turns out they didn't need the feature, or even quite understand what they were asking.

The way our application is, really we didn't have to expose any identifier with guarantees about uniqueness.


The existence of a synthetic key contradicts the idea that we're modeling the "world". Of course, this is a debate that’s been around forever, and I understand the advantages of synthetic keys, but I've found that the intuitive elegance of a natural key will often flow into the business logic, untangling nests of code dedicated to id look-ups, id-matching, filtering, mapping, and the general slicing and dicing and shaping of data. Queries that once referenced obscure ids now point directly to fields which intuitively make sense: yes, if I want to update the user_apps table, I know or can reasonably infer pk(user_id, app_id), and I know those values and I have them right here in my pocket, don't need to look them up, and they mean something very tangible and real and I don’t have to say “…WHERE id = ‘…’”

Of course, nothing is ever perfect and natural keys have their issues, especially when migrating data, but there’s something about them that’s always “clicked” with how I reason about software systems.


I originally thought the same - I manage the backend/database for a small team that works on our warehouse/website integration with a legacy ERP system.

My experiences with semantic keys have been awful. I've been told "Oh, this [property] will never need to change and never have duplicates" about properties you'd think really should never change, several times.

Somehow a different department decided to change the SKUs. I've seen emails that needed to either have duplicates or be changed. I've seen a few instances of order numbers from the ERP system having duplicates under certain circumstances. There have been cases like GTINs, where every item should already have a unique one assigned - until we get an item that is missing one and will never be assigned one. The big one is needing to archive things or keep histories of objects: if you use a synthetic key, it's super easy to just add an "active" flag, and you have to change very little code.

I totally get wanting to avoid extra queries and extra steps, but if the need to change a PK _ever_ occurs, it ranges from terrible to near impossible (in the case of external dependencies). Many frameworks like Django make it fairly easy to add that extra step naturally, without needing to write spaghetti to do your lookups.
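
A bare-bones sketch of that pattern (table and column names are made up): the surrogate key is what everything references, the "natural" values only get unique constraints, and archiving is a flag rather than a key change.

    CREATE TABLE items (
        id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        sku    text NOT NULL UNIQUE,
        gtin   text,                          -- may legitimately be missing
        active boolean NOT NULL DEFAULT true
    );

    -- A department renames a SKU: foreign keys elsewhere are untouched.
    UPDATE items SET sku = 'NEW-SKU-001' WHERE sku = 'OLD-SKU-001';

    -- Archiving is a flag flip rather than a migration.
    UPDATE items SET active = false WHERE id = 42;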


I have to disagree with you there, for all the reasons mentioned above, but also because natural keys make generating queries and tooling so much harder.

If you see a user_id column, you know it's going to link to user.id; if you see email_id, you know it's going to link to email.id, etc. There's a lot of value in having a predictable schema.


Semantics tend to change over time, so you will now have to change the meaning of your key.

Did you know that sometimes the same social security number is assigned to multiple people? When that happens, a person can end up with a changed social security number. Good luck updating all of your database foreign keys then.


Updating foreign keys in a database is at least doable; updating external systems that have a reference to this entity is impossible


For relational purists, e.g. Joe Celko, this approach is pretty much mandatory. However as others have stated in reply it's very difficult to find a genuinely immutable identifying attribute in many domains, including human beings.


Semantic keys have downsides, and it is also important to note that there is not always a single unique semantic key.

Relational theory has the concept of superkeys, which are sets of columns that uniquely determine a row. But these are not useful on their own, since, for example, the set of all columns should uniquely determine a row. What is useful is "candidate keys", which are minimal superkeys, with any columns not necessary for uniqueness removed.

There are two ways to determine candidate keys. One is empirically by analyzing the data. If you do it that way, then the "candidate" naming is appropriate, since it is possible that some sets of columns are unique simply by chance, not by fundamental nature, and unique by chance is not what we want.

Alternatively you can use domain knowledge and logic to determine candidate keys. Candidate keys determined by logic are true keys, since no duplicates should ever occur unless the requirements or fundamental nature of the data changes. This means that ideally, all such keys should have a unique constraint placed on them (although the implicit unique constraint from marking one of them as the primary key works for that key). Adding unique constraints for all logically determined candidate keys is the ideal way to avoid accidental duplication.
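
As a sketch of that advice (names invented): reference the surrogate key everywhere, but keep a unique constraint per logically determined candidate key so accidental duplicates are still rejected, and dropping one later is cheap if the requirements change.

    CREATE TABLE employees (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        employee_no text NOT NULL,
        email       text NOT NULL,
        UNIQUE (employee_no),  -- candidate key from domain logic
        UNIQUE (email)         -- candidate key; drop the constraint if it stops being unique
    );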

Within the database and within the application, you ideally want to only use small keys that are unlikely to change, and are unlikely to ever become non-unique. Keys that change tend to cause headaches with updates if referenced elsewhere, and you can have undesirable race condition issues with application logic on changing keys.

Similarly, for keys likely to become non-unique under changing requirements, referencing them within the database means a much bigger refactor later if they stop being unique. But if you never use those values as references within the database, then simply dropping the unique constraint is easy. The impact on application code may vary, from potentially no change needed at all to much more significant changes, depending on the data in question and how the application uses it.

Large keys that don't change and are extremely unlikely to ever become non-unique are conceptually fine, but they have the practical problem of being large, and thus undesirable to reference from all over the database from a storage-size perspective. This is especially true of multi-column keys, which also tend to be inconvenient from a query-writing perspective.

Another important issue is that many identifiers that are supposed to be universally unique, like UPCs, ISBNs etc, are not actually always unique. These things do end up getting occasionally reused, usually accidentally. If you are using that everywhere as your primary key, and eventually come across such a scenario, it is a real nightmare to refactor everything to use a different key in order to be able to handle this. While if you are using some surrogate key almost everywhere, it becomes a lot more feasible to handle this with things like having "lookup by UPC" screens show a list of options when you stumble upon one that happens to have a duplicate.


Pretty uninformative post. Goes from count(*) to some uuids, but fails to see the bigger picture.

Depending on the data, but assuming most data isn't big data:

- Use integers internally, maybe suffixed by a shard-id to prevent collision, but keep order.

- Use (random) external ids to access from the outside (see the sketch below).

Note that certain uuids will still leak some information: time between records, number of machines, etc.
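
A minimal sketch of that split, assuming a single node (no shard suffix) and Postgres 13+ for the built-in gen_random_uuid():

    CREATE TABLE accounts (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,  -- internal, ordered
        public_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,   -- exposed externally
        name      text NOT NULL
    );

    -- Joins and foreign keys use the small ordered id; the API only ever sees public_id.
    SELECT id, name FROM accounts WHERE public_id = $1;  -- parameter supplied by the application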


One major aspect of primary keys not mentioned in the article is query performance. I did some benchmarking recently where serial (32 bits) was significantly faster than the native Postgres UUID type. The worst scheme you can use is a string: if you encode something like a ksuid as a plain string rather than an efficiently packed byte sequence, query performance becomes significantly worse. I didn't benchmark sorting, but I assume it's similarly impacted.
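
Not a substitute for a real benchmark, but the per-value size gap is easy to see (gen_random_uuid() assumes Postgres 13+; the text literal is a made-up 27-character ksuid-like string):

    SELECT pg_column_size(1::integer)                          AS int_bytes,     -- 4
           pg_column_size(1::bigint)                           AS bigint_bytes,  -- 8
           pg_column_size(gen_random_uuid())                   AS uuid_bytes,    -- 16
           pg_column_size('2QKN6YoXH7virHSz1LUSvYcDqUL'::text) AS text_bytes;    -- 28

Every index entry carries that difference, so wider keys mean fewer entries per page.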


Funny this keeps coming up!

I wrote about my experience using ulids the other day, specifically with Postgres and some of the dis/advantages you get with it.

It's a deeper dive into ulids than this article is, and shows some real world issues that crop up:

https://blog.lawrencejones.dev/ulid/

That said, and spoiler alert: I'd probably go with bigint-sequence backed text IDs if I were choosing this over again.


There's some skepticism in the comments around the recommendation for xid. I'm curious if anyone here is using it in production at scale, and can comment on the practical realities.

I saw xid make the rounds about a year ago, and the promise of a pseudo-sortable 12-byte identifier that is "configuration free" struck me as a bit far-fetched.

In particular, I wondered if the xid scheme gives you enough entropy to be confident you wouldn't run into collisions. UUIDv4 doesn't eat a full 16 bytes of entropy for nothing. For example, if you look at the machine ID component of xid, it does some version of random assignment (either pulling the first three bytes from /etc/machine-id, or from a hash of the hostname). 3 bytes is 16,777,216 possible values, i.e., with around 600 hosts you already have roughly a 1% chance of a machine-ID collision. Probably too close for comfort?
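
Roughly, by the usual birthday approximation with n hosts drawing from 2^24 machine IDs:

    P(\text{collision}) \approx 1 - e^{-n(n-1)/(2 \cdot 2^{24})},
    \qquad n = 600 \;\Rightarrow\; P \approx 0.011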

There are settings where you can build some defense-in-depth against ID collisions, like a uniqueness constraint in your DB (effectively a centralized ticketing system). But there are many settings where that kind of thing wouldn't be practical. Off the top of my head, I'm thinking of monitoring-type applications like request or trace IDs.


One more note: a uuid is not easily copy-pastable, due to the dashes `-` in it. I prefer to take its raw bytes and encode them with base32, which is copy-pastable.


If the goal is copy-pastable, why not base36 or base62?


Because typing them or reading them back over the phone is much less human-friendly.

So far https://philzimmermann.com/docs/human-oriented-base-32-encod... has the best trade-offs I've ever seen.

(Though for encoding numbers, where small numbers deserve a shorter representation, Base58 can be nice too, although z-base-32's alphabet is still more user-friendly: https://github.com/tv42/base58 )


Base32 is already implemented in most stdlibs, and I'm too lazy to write it myself :)


The Hybrid Logical Clock [1] could be of interest to readers. It is a monotonically increasing logical clock that tracks a physical clock.

Combined with a machine identifier you can obtain globally unique identifiers that are totally ordered.
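
For a flavour of how it stays monotonic, the local/send update from the paper is roughly (l is the HLC's logical component, c its counter, pt the physical clock):

    l' = \max(l,\, pt), \qquad
    c' = \begin{cases} c + 1 & \text{if } l' = l \\ 0 & \text{otherwise} \end{cases}

On receive, the max additionally includes the sender's l, so causality is preserved while l never drifts far from pt.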

[1] https://cse.buffalo.edu/~demirbas/publications/hlc.pdf


Probably worth revising this post to include BIGINT and BIGSERIAL, and then changing the entire Supabase schema to use them as well.


One note regarding uuid: it doesn't have to be random. That's specific to v4.

V5 UUIDs are deterministic: they combine a namespace UUID and a string via a SHA-1-based one-way mapping, resulting in a UUID.
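
In Postgres this is available through the uuid-ossp extension (assuming it's installed); the namespace and name below are just examples, and the same inputs always produce the same UUID:

    CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

    -- Deterministic: re-running this yields the identical value.
    SELECT uuid_generate_v5(uuid_ns_url(), 'https://example.com/customers/42');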


Given the advantages of sortable UUIDs and this post's conclusion that XID is the best option right now, is Supabase planning to add pg_idkit to their list of supported extensions?


>pg_idkit to their list of supported extensions

For some of these simpler extensions, we're looking at using AWS's TLE (https://github.com/aws/pg_tle), which would allow user-contributed extensions. If we can pull that off, we'll probably look again at the current set of extensions we offer and then see which ones can be ported to a TLE instead.


Size also matters. If you are running cheaper cloud instances, they have limited I/O and can be short on memory.

Smart data types for keys, enums instead of varchars, etc., help keep indexes small.
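
For example (status values invented), an enum value is stored as 4 bytes regardless of the label length, so the column and its indexes stay compact:

    CREATE TYPE order_status AS ENUM ('pending', 'shipped', 'cancelled');

    CREATE TABLE orders (
        id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        status order_status NOT NULL DEFAULT 'pending'
    );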


I’ve been using HiLo for so long this isn’t something I think about.

I don’t see a point in trying to hide the ID. Either it’s public or it’s private and should be verified before being accessed.


> I don’t see a point in trying to hide the ID.

Exposing sequential numbers tells your competition the size of your company's user base, user activity levels, and growth.

https://en.wikipedia.org/wiki/German_tank_problem#Historical...
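
Concretely, if a competitor collects k of your invoice numbers and the largest is m, the standard frequentist estimate of how many you've issued is:

    \hat{N} = m + \frac{m}{k} - 1

e.g. five invoice numbers topping out at 1000 suggest roughly 1199 issued in total.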


As for DB ids, I prefer an internal integer ID used only for foreign keys, plus something like a UUID for the client side.


I only use uuid as the PK in Postgres. So many benefits: it's easy to migrate data between environments, the client can generate its own ids, etc.

I don’t know about performance but I think in most cases that is not a big concern anyway.



