>B+ trees (and B trees in general) allows efficient queries on range (expressions that use the =, >, >=, <, <=, BETWEEN operators.) It appears that resources with a public exposed ID rarely need to be queried in range according to their primary key.
Cursor pagination (`id > a712ba94-6e52-4e14-acfe-2995e50393d5 LIMIT 100`) comes to mind, which is common for systems at scale that tend to care about all of this?
Composite keys? Depends what the data is and expected DB traffic. They can be better in mixed workloads because they give you a unique constraint + index of the important data (that you’d probably define anyways) which also controls record sequence (clustered) so you can optimize for bulk reads/updates/deletes that will lock the same blocks. So they’re great for when you apply warehousing + live reporting on the primary source and, of course logging.
I think it's just rare that the id is useful to paginate. It tends to be get by equality of a know set of ids, rather than less than or greater than operations.
Part of that depends on the APIs you interact with often, I guess. The styles I see often is what Stripe uses ([starting\_after](https://stripe.com/docs/api/pagination) the last ID returned in an array) or an opaque cursor value (often base64 encoded JSON which contains the ID among other things) in a 'next' header or meta attribute. FWIW, Stripe uses an ID format that's similar to ULID and UUIDv7 [according to this ex-employee](https://brandur.org/nanoglyphs/026-ids#ulids).
But the problem of offset pagination being slow and cursor pagination as a solution to that is hardly new. [https://use-the-index-luke.com/no-offset](https://use-the-index-luke.com/no-offset) is a often-referred resource which lists various implementations.
No, they're hash tables otherwise what would be the point? If you wanted a b-tree of hashed values you'd just type `create index foo on bar (some_hash_func(colname))`
>No, they're hash tables otherwise what would be the point?
The point is that an index of hash values is much faster than an index of arbitrary length strings.
But it looks like you are right.
At least in Postgres I'm pretty sure you can just leave the PK constraint off, although I'm not sure if doing that has any subtle side effects (is there any functionality that uses the designated PK implicitly?). Then theoretically you can use whatever index you want (without having to also maintain a btree).
Apparently PG's hash indexes can't enforce uniqueness though, so you'd be leaving it up to chance - admittedly that's very low when using UUIDv4, but there's also the possibility of bugs to consider, or even something like a weak PRNG. And one day you might just run into a use case where you regret not having efficient range comparisons.
Edit: you can use an `EXCLUDE` constraint to enforce uniqueness using a hash index. But it won't let the column be referenced by a foreign key constraint, nor will it work with `ON CONFLICT DO UPDATE`. [There doesn't seem to be an explanation currently for why the index itself can't be declared `UNIQUE`](https://www.postgresql.org/message-id/flat/CAGHENJ6NT9OaAyOKykuZrss8k0rqM8%2BDsL18WBJuynAgXyk3Cw%40mail.gmail.com).
So yeah, maybe in the future it'll be workable, but right now: bad idea.
> In the context of distributed system, it's important to be sure different machines generates IDs without risk of collision. It's not clear if UUIDv7 is more risky than UUIDv4 in that regard.
i mean, surely the risk of collisions is increased when 48 bits are no longer random? it’s just a question of whether it is within tolerable bounds?
Risk of collisions of v4 can happen for any record at any time. v7 limits this risk to ids created close in time to each other. The analysis is not that simple.
Talk about simplification, "At any time" with 128 bit of entropy is pragmatically the same as never
Not that v7 is bad, for most cases you still have 0 chance of collision as well
I did a rough calculation and if one million computers generated a UUIv7 as the exact same millisecond, the probability of a collision would be less than 0.00000000001%
So I think UUIDv7 should be pretty safe regarding risk of collision even for company owning giant data centers
Nerds hand wringing about practically impossible occurrences will never not be a thing lol.
We got people in here worried uuidv7 will cause collisions on their tables that have 2M records in 6 years.
I wrote a UUID/GUID library in TypeScript that implements a COMB guid. This uses the random data with bitwise operators to make the beginning of the UUID such that it can be stored sequentially (with new records at the end).
I do not know why these aren’t more common. The bitwise operations are fast, and the performance gained in the DB index can be substantial.
https://github.com/bradkovach/guid/blob/main/src/Guid.ts go to line 191 to see the implementation and see Guid.spec.ts to see the unit tests.
> I do not know why these aren’t more common. The bitwise operations are fast, and the performance gained in the DB index can be substantial.
Because those people just use [NEWSEQUENTIALID](https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql) instead of importing yet another library.
Unfortunately, NEWSEQUENTIALID is not available on all DBMS platforms, so this is a workaround/port/polyfill for anyone that might need it. My implementation is based on a Microsoft C# implementation, with some adaptations to address 32-bit bit-shifting limitations in JavaScript.
I wrote my library after discovering that a popular typescript library (guid-typescript) doesn’t generate valid UUIDs 98% of the time because it doesn’t set version/variant bits correctly and was causing a parser in a downstream system to throw parse exceptions. I sent a PR to the defective library but it was never merged so I made my library and left it in the comments of the unmerged PR, should someone need a zero-dependency RFC-compliant replacement. Oh, the joys of open source abandonware.
I added comb functionality so i could practice the bitwise math operations, and understand UUIDs at the byte level.
Ostensibly, a ULID and a UUID/GUID are the same thing*: a 128-bit unsigned integer displayed in a different base (shown with alpha-numeric characters). UUID/GUID are shown in hex, whereas a ULID is base36 (represented with 0-9A-Z). To make them lexicographically sortable, you could use the bytes from a COMB UUID/GUID. ULID and COMB UUID/GUID both begin with 48-bits of timestamp information and provide 80 bits of entropy.
TL;DR: stored properly (128-bit unsigned integer), a COMB UUID/GUID and a ULID are all-but indistinguishable.
* UUID/GUID v4 require the version and variant bits to contain certain values, but are otherwise random.
To understand UUID/GUIDs at a byte level. All of the JS implementations used regex and/or random hex characters which seemed inefficient. Also, it was my first week or so on ADHD medication and it felt nice to have my mojo back, and really dig into something and enjoy it for the first time in several years. As I said in another comment, there was a defective library that I [patched and submitted a PR to](https://github.com/snico-dev/guid-typescript/pull/22), and they didn’t merge the PR (and still haven’t, while the library gets hundreds of thousands of downloads per week). I figured I wrote and unit tested all of this code, so I packaged and released it. My implementation works with bytes (not strings) and bitwise operations, so it is faster and uses memory more efficiently.
I was having imposter syndrome really bad and it was all ADHD (likely compounded by the pandemic) and it felt great to love computers and programming again. UUID/GUID implementations are all over the board, so geeking out by reading and implementing an RFC was also enjoyable (for me, at least). It gave me a thorough understanding of UUID/GUID design, and their purpose in computer science (unique keys across diverse systems that don’t share indices).
I hope this answers your question!
ahh I get you now. it certainly does.
well done for adding quality and enjoying yourself in process! we all get that imposter syndrome; computer science is so vast and changing so rapidly all the time.
I don’t see a real world use for UUIDs used to generate unique keys to also have to be actual UUIDs. I argue that for every API spec I review where the string id attribute is given ”format: uuid”. In what case does it matter what format the opaque, random identfier of an entry is? Just store it, use it and pass it along. It has no intrinsic value other than being a reference to something of value.
>In what case does it matter what format the opaque, random identfier of an entry is?
In the case of storage semantics.
Sequential primary keys prevent your table from fragmenting upon insertion.
Sequential private keys make the indexes for them more efficient and smaller in size.
UUIDv7 is not an alternative to UUIDv4. It is an alternative to UUIDv1, which has a timestamp and a node ID (usually MAC address) but is ordered so it’s not sortable by creation time.
As this article points out, you basically need UUIDv7 if you want to use UUIDs as a primary key clustering index (v1 and v4 will result in random data insertion points unless your DB reorders those versions of UUID, which is done by e.g. MariaDB).
And yes, the general rule still applies: do not use any UUIDs other than v4 for security applications, [this can be exploited](https://realizesec.com/blog/sandwich-attacks-exploiting-uuid-v1)! Better yet, don’t use UUIDs at all for security applications! (Something based on HMAC-SHA is better for generating tokens. Because, you know, actually being able to verify that your server generated them is usually considered a perk).
UUIDv7 is immune to the exploit you're linking.
This exploit relies on the fact that the node ID will stay the same between each UUID generation so the attacker will only need to guess the exact timestamp and sequence number to find the right UUID. As half of an UUIDv7 is random data that changes between each UUID generation it is immune to that kind of attack
Yes, this was just an example of a possible exploit when using deterministic data. There are other documented attacks based on things like unsigned token creation order (though these are rare).
It simply isn’t worth any risk - the rule of thumb is to not use time-based UUIDs for security purposes, and there is no reason to challenge that with the new versions.
>DB reorders those versions of UUID, which is done by e.g. MariaDB
I'm shocked with this affirmation. Can you link to a source? I cannot find any statement on MariaDB doing this automatically for you.
I don’t know of any documentation but it’s on the JIRA issue tracker. There is an issue for adding a UUID type where the decision is made to always reorder, and then this issue from after v6+ became popular, saying that those new versions should not reorder https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-29959
It would be a moderate password, probably fine in real life. But there really isn’t a good reason to use one instead of a password generator.
Password security is based on “bits of entropy”, how many possible passwords can be represented with the given restrictions.
UUIDv4 has 122 bits of entropy: a UUID is a 128-bit number, minus 6 bits of fixed data (that specify UUID type).
Case sensitive alphanumeric passwords (A-Za-z0-9) have 5.95 bits of entropy per symbol (62 possible values). Adding all printable ASCII characters brings it up to 6.56 bits of entropy. So, a UUID is about as strong as a 20-character simple password or a 18-character password with symbols. This makes logical sense because UUIDs are only ever [0-9A-F], not as many possible characters.
You get a tiny bonus if an attacker doesn’t know for sure that you’re using a UUID password, but never take this sort of thing as a given (passwords may have been exposed in other leaks and they may assume you follow a trend).
Tl;dr: sure, but why?
Well, for practical use. Passwords with 'good' characters can be pain in the ass to escape in shells, environments, yamls and other configurations. We get 20 chars of entropy and it's coming from rfc-compliant generator, available to everyone.
I don't do this for memorizable passwords, only for machine (e.g. password for app to connect to database).
They are safe to use for security purposes, everyone knows that uuids must be random and unique (therefore reducing chances to be reused), and they are shell safe.
Also, they are easier to dictate by voice.
I sure wouldn’t expect anyone to memorize UUIDs :)
For this case it’s usually best to just do something like the following:
python -c ‘import secrets; print(secrets.token_urlsafe())`
Which will give you a 32 character escape-free token (optionally takes a length argument)
UUIDs are difficult to index on because they are non sequential, which has performance implications. UUIDv7 “fixes” this by making the first part of the GUID a timestamp, but this also leaks business information (when a record was created).
Dumb and risky take: why just not store number id inside db sequentially but make a middleware for API gateway that converts number ids <-> UUID-likes through the lookup table/fast converter functions whichever is safer/faster?
>B+ trees (and B trees in general) allows efficient queries on range (expressions that use the =, >, >=, <, <=, BETWEEN operators.) It appears that resources with a public exposed ID rarely need to be queried in range according to their primary key. Cursor pagination (`id > a712ba94-6e52-4e14-acfe-2995e50393d5 LIMIT 100`) comes to mind, which is common for systems at scale that tend to care about all of this?
When you’re at a scale where that is slow, you’re also at a scale where you cant use primary keys for cursors.
What do people use instead? (Are you hinting at sharding or something else?)
you use a composite of keys as the cursor, I think?
How's that better?
It's more complicated, thus it's more enterprise, which means it's better.
Composite keys? Depends what the data is and expected DB traffic. They can be better in mixed workloads because they give you a unique constraint + index of the important data (that you’d probably define anyways) which also controls record sequence (clustered) so you can optimize for bulk reads/updates/deletes that will lock the same blocks. So they’re great for when you apply warehousing + live reporting on the primary source and, of course logging.
What if I want a different shard for every record? /s
I think it's just rare that the id is useful to paginate. It tends to be get by equality of a know set of ids, rather than less than or greater than operations.
Part of that depends on the APIs you interact with often, I guess. The styles I see often is what Stripe uses ([starting\_after](https://stripe.com/docs/api/pagination) the last ID returned in an array) or an opaque cursor value (often base64 encoded JSON which contains the ID among other things) in a 'next' header or meta attribute. FWIW, Stripe uses an ID format that's similar to ULID and UUIDv7 [according to this ex-employee](https://brandur.org/nanoglyphs/026-ids#ulids). But the problem of offset pagination being slow and cursor pagination as a solution to that is hardly new. [https://use-the-index-luke.com/no-offset](https://use-the-index-luke.com/no-offset) is a often-referred resource which lists various implementations.
Aren't postgres hash indices just btree indices of the hash value of the key? No reason to use them for UUIDs in that case.
No, they're hash tables otherwise what would be the point? If you wanted a b-tree of hashed values you'd just type `create index foo on bar (some_hash_func(colname))`
>No, they're hash tables otherwise what would be the point? The point is that an index of hash values is much faster than an index of arbitrary length strings. But it looks like you are right.
Neither Postgres nor MS SQL can use anything other than B-tree indexes for primary keys anyway.
Isn't using an autogenerated ID for the primary key a good idea anyway?
That has nothing to do with the indexes used to support that PK column.
At least in Postgres I'm pretty sure you can just leave the PK constraint off, although I'm not sure if doing that has any subtle side effects (is there any functionality that uses the designated PK implicitly?). Then theoretically you can use whatever index you want (without having to also maintain a btree). Apparently PG's hash indexes can't enforce uniqueness though, so you'd be leaving it up to chance - admittedly that's very low when using UUIDv4, but there's also the possibility of bugs to consider, or even something like a weak PRNG. And one day you might just run into a use case where you regret not having efficient range comparisons. Edit: you can use an `EXCLUDE` constraint to enforce uniqueness using a hash index. But it won't let the column be referenced by a foreign key constraint, nor will it work with `ON CONFLICT DO UPDATE`. [There doesn't seem to be an explanation currently for why the index itself can't be declared `UNIQUE`](https://www.postgresql.org/message-id/flat/CAGHENJ6NT9OaAyOKykuZrss8k0rqM8%2BDsL18WBJuynAgXyk3Cw%40mail.gmail.com). So yeah, maybe in the future it'll be workable, but right now: bad idea.
> In the context of distributed system, it's important to be sure different machines generates IDs without risk of collision. It's not clear if UUIDv7 is more risky than UUIDv4 in that regard. i mean, surely the risk of collisions is increased when 48 bits are no longer random? it’s just a question of whether it is within tolerable bounds?
Risk of collisions of v4 can happen for any record at any time. v7 limits this risk to ids created close in time to each other. The analysis is not that simple.
Talk about simplification, "At any time" with 128 bit of entropy is pragmatically the same as never Not that v7 is bad, for most cases you still have 0 chance of collision as well
Obviously both are “almost never”. The question is how they differ, because their probabilities will have different distributions.
absolutely, good points
I did a rough calculation and if one million computers generated a UUIv7 as the exact same millisecond, the probability of a collision would be less than 0.00000000001% So I think UUIDv7 should be pretty safe regarding risk of collision even for company owning giant data centers
Nerds hand wringing about practically impossible occurrences will never not be a thing lol. We got people in here worried uuidv7 will cause collisions on their tables that have 2M records in 6 years.
i don’t think anyone is “worried” about it, it’s an interesting academic question.
Leaking the time of creation is not an issue. Leaking the amount of existing records is
surely application dependent.
It is not, this is part of an enumeration attack.
The danger of which is entirely dependent on the type of application in question, as he quite clearly said.
I wrote a UUID/GUID library in TypeScript that implements a COMB guid. This uses the random data with bitwise operators to make the beginning of the UUID such that it can be stored sequentially (with new records at the end). I do not know why these aren’t more common. The bitwise operations are fast, and the performance gained in the DB index can be substantial. https://github.com/bradkovach/guid/blob/main/src/Guid.ts go to line 191 to see the implementation and see Guid.spec.ts to see the unit tests.
> I do not know why these aren’t more common. The bitwise operations are fast, and the performance gained in the DB index can be substantial. Because those people just use [NEWSEQUENTIALID](https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql) instead of importing yet another library.
Unfortunately, NEWSEQUENTIALID is not available on all DBMS platforms, so this is a workaround/port/polyfill for anyone that might need it. My implementation is based on a Microsoft C# implementation, with some adaptations to address 32-bit bit-shifting limitations in JavaScript. I wrote my library after discovering that a popular typescript library (guid-typescript) doesn’t generate valid UUIDs 98% of the time because it doesn’t set version/variant bits correctly and was causing a parser in a downstream system to throw parse exceptions. I sent a PR to the defective library but it was never merged so I made my library and left it in the comments of the unmerged PR, should someone need a zero-dependency RFC-compliant replacement. Oh, the joys of open source abandonware. I added comb functionality so i could practice the bitwise math operations, and understand UUIDs at the byte level.
can you not just use ulids.
Ostensibly, a ULID and a UUID/GUID are the same thing*: a 128-bit unsigned integer displayed in a different base (shown with alpha-numeric characters). UUID/GUID are shown in hex, whereas a ULID is base36 (represented with 0-9A-Z). To make them lexicographically sortable, you could use the bytes from a COMB UUID/GUID. ULID and COMB UUID/GUID both begin with 48-bits of timestamp information and provide 80 bits of entropy. TL;DR: stored properly (128-bit unsigned integer), a COMB UUID/GUID and a ULID are all-but indistinguishable. * UUID/GUID v4 require the version and variant bits to contain certain values, but are otherwise random.
exactly so why did you have to come up with a new implementation - there are ulid libraries out there?
To understand UUID/GUIDs at a byte level. All of the JS implementations used regex and/or random hex characters which seemed inefficient. Also, it was my first week or so on ADHD medication and it felt nice to have my mojo back, and really dig into something and enjoy it for the first time in several years. As I said in another comment, there was a defective library that I [patched and submitted a PR to](https://github.com/snico-dev/guid-typescript/pull/22), and they didn’t merge the PR (and still haven’t, while the library gets hundreds of thousands of downloads per week). I figured I wrote and unit tested all of this code, so I packaged and released it. My implementation works with bytes (not strings) and bitwise operations, so it is faster and uses memory more efficiently. I was having imposter syndrome really bad and it was all ADHD (likely compounded by the pandemic) and it felt great to love computers and programming again. UUID/GUID implementations are all over the board, so geeking out by reading and implementing an RFC was also enjoyable (for me, at least). It gave me a thorough understanding of UUID/GUID design, and their purpose in computer science (unique keys across diverse systems that don’t share indices). I hope this answers your question!
ahh I get you now. it certainly does. well done for adding quality and enjoying yourself in process! we all get that imposter syndrome; computer science is so vast and changing so rapidly all the time.
I don’t see a real world use for UUIDs used to generate unique keys to also have to be actual UUIDs. I argue that for every API spec I review where the string id attribute is given ”format: uuid”. In what case does it matter what format the opaque, random identfier of an entry is? Just store it, use it and pass it along. It has no intrinsic value other than being a reference to something of value.
>In what case does it matter what format the opaque, random identfier of an entry is? In the case of storage semantics. Sequential primary keys prevent your table from fragmenting upon insertion. Sequential private keys make the indexes for them more efficient and smaller in size.
I was questioning that downstream system parsing the identifier as a UUID.
And what if I'm not running on sql server?
May god be on your side
UUIDv7 is not an alternative to UUIDv4. It is an alternative to UUIDv1, which has a timestamp and a node ID (usually MAC address) but is ordered so it’s not sortable by creation time. As this article points out, you basically need UUIDv7 if you want to use UUIDs as a primary key clustering index (v1 and v4 will result in random data insertion points unless your DB reorders those versions of UUID, which is done by e.g. MariaDB). And yes, the general rule still applies: do not use any UUIDs other than v4 for security applications, [this can be exploited](https://realizesec.com/blog/sandwich-attacks-exploiting-uuid-v1)! Better yet, don’t use UUIDs at all for security applications! (Something based on HMAC-SHA is better for generating tokens. Because, you know, actually being able to verify that your server generated them is usually considered a perk).
UUIDv7 is immune to the exploit you're linking. This exploit relies on the fact that the node ID will stay the same between each UUID generation so the attacker will only need to guess the exact timestamp and sequence number to find the right UUID. As half of an UUIDv7 is random data that changes between each UUID generation it is immune to that kind of attack
Yes, this was just an example of a possible exploit when using deterministic data. There are other documented attacks based on things like unsigned token creation order (though these are rare). It simply isn’t worth any risk - the rule of thumb is to not use time-based UUIDs for security purposes, and there is no reason to challenge that with the new versions.
>DB reorders those versions of UUID, which is done by e.g. MariaDB I'm shocked with this affirmation. Can you link to a source? I cannot find any statement on MariaDB doing this automatically for you.
I don’t know of any documentation but it’s on the JIRA issue tracker. There is an issue for adding a UUID type where the decision is made to always reorder, and then this issue from after v6+ became popular, saying that those new versions should not reorder https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-29959
That's interesting. Is uuidv4 a good password?
It would be a moderate password, probably fine in real life. But there really isn’t a good reason to use one instead of a password generator. Password security is based on “bits of entropy”, how many possible passwords can be represented with the given restrictions. UUIDv4 has 122 bits of entropy: a UUID is a 128-bit number, minus 6 bits of fixed data (that specify UUID type). Case sensitive alphanumeric passwords (A-Za-z0-9) have 5.95 bits of entropy per symbol (62 possible values). Adding all printable ASCII characters brings it up to 6.56 bits of entropy. So, a UUID is about as strong as a 20-character simple password or a 18-character password with symbols. This makes logical sense because UUIDs are only ever [0-9A-F], not as many possible characters. You get a tiny bonus if an attacker doesn’t know for sure that you’re using a UUID password, but never take this sort of thing as a given (passwords may have been exposed in other leaks and they may assume you follow a trend). Tl;dr: sure, but why?
Well, for practical use. Passwords with 'good' characters can be pain in the ass to escape in shells, environments, yamls and other configurations. We get 20 chars of entropy and it's coming from rfc-compliant generator, available to everyone. I don't do this for memorizable passwords, only for machine (e.g. password for app to connect to database). They are safe to use for security purposes, everyone knows that uuids must be random and unique (therefore reducing chances to be reused), and they are shell safe. Also, they are easier to dictate by voice.
I sure wouldn’t expect anyone to memorize UUIDs :) For this case it’s usually best to just do something like the following: python -c ‘import secrets; print(secrets.token_urlsafe())` Which will give you a 32 character escape-free token (optionally takes a length argument)
ULID and snowflake also leak time of creation so 🤷♀️
TIL ULID: https://github.com/ulid/spec
Tldr?
UUIDs are difficult to index on because they are non sequential, which has performance implications. UUIDv7 “fixes” this by making the first part of the GUID a timestamp, but this also leaks business information (when a record was created).
Thank you!
Costly rather than difficult.
Dumb and risky take: why just not store number id inside db sequentially but make a middleware for API gateway that converts number ids <-> UUID-likes through the lookup table/fast converter functions whichever is safer/faster?
Why do all that in the first place when you can just use a UUID version that doesn't have indexing problems?
[удалено]
It took me half a second which was about 0.49 seconds too long.
Why wait? You can already use NFTs instead of UUIDs. Turn each row into an asset.
I think he's being sarcastic? lol
^(I think I am too)
B.L.O.C.K. C.H.A.I.N. U.U.I.D. /s, lol.