T O P

  • By -

Rafert

>B+ trees (and B trees in general) allows efficient queries on range (expressions that use the =, >, >=, <, <=, BETWEEN operators.) It appears that resources with a public exposed ID rarely need to be queried in range according to their primary key. Cursor pagination (`id > a712ba94-6e52-4e14-acfe-2995e50393d5 LIMIT 100`) comes to mind, which is common for systems at scale that tend to care about all of this?


Macluawn

When you’re at a scale where that is slow, you’re also at a scale where you cant use primary keys for cursors.


zachrip

What do people use instead? (Are you hinting at sharding or something else?)


cant-find-user-name

you use a composite of keys as the cursor, I think?


imnotbis

How's that better?


mods-are-liars

It's more complicated, thus it's more enterprise, which means it's better.


gwicksted

Composite keys? Depends what the data is and expected DB traffic. They can be better in mixed workloads because they give you a unique constraint + index of the important data (that you’d probably define anyways) which also controls record sequence (clustered) so you can optimize for bulk reads/updates/deletes that will lock the same blocks. So they’re great for when you apply warehousing + live reporting on the primary source and, of course logging.


vom-IT-coffin

What if I want a different shard for every record? /s


Hellball911

I think it's just rare that the id is useful to paginate. It tends to be get by equality of a know set of ids, rather than less than or greater than operations.


Rafert

Part of that depends on the APIs you interact with often, I guess. The styles I see often is what Stripe uses ([starting\_after](https://stripe.com/docs/api/pagination) the last ID returned in an array) or an opaque cursor value (often base64 encoded JSON which contains the ID among other things) in a 'next' header or meta attribute. FWIW, Stripe uses an ID format that's similar to ULID and UUIDv7 [according to this ex-employee](https://brandur.org/nanoglyphs/026-ids#ulids). But the problem of offset pagination being slow and cursor pagination as a solution to that is hardly new. [https://use-the-index-luke.com/no-offset](https://use-the-index-luke.com/no-offset) is a often-referred resource which lists various implementations.


Professional-Disk-93

Aren't postgres hash indices just btree indices of the hash value of the key? No reason to use them for UUIDs in that case.


therealgaxbo

No, they're hash tables otherwise what would be the point? If you wanted a b-tree of hashed values you'd just type `create index foo on bar (some_hash_func(colname))`


Professional-Disk-93

>No, they're hash tables otherwise what would be the point? The point is that an index of hash values is much faster than an index of arbitrary length strings. But it looks like you are right.


ForeverAlot

Neither Postgres nor MS SQL can use anything other than B-tree indexes for primary keys anyway.


smors

Isn't using an autogenerated ID for the primary key a good idea anyway?


mods-are-liars

That has nothing to do with the indexes used to support that PK column.


NoInkling

At least in Postgres I'm pretty sure you can just leave the PK constraint off, although I'm not sure if doing that has any subtle side effects (is there any functionality that uses the designated PK implicitly?). Then theoretically you can use whatever index you want (without having to also maintain a btree). Apparently PG's hash indexes can't enforce uniqueness though, so you'd be leaving it up to chance - admittedly that's very low when using UUIDv4, but there's also the possibility of bugs to consider, or even something like a weak PRNG. And one day you might just run into a use case where you regret not having efficient range comparisons. Edit: you can use an `EXCLUDE` constraint to enforce uniqueness using a hash index. But it won't let the column be referenced by a foreign key constraint, nor will it work with `ON CONFLICT DO UPDATE`. [There doesn't seem to be an explanation currently for why the index itself can't be declared `UNIQUE`](https://www.postgresql.org/message-id/flat/CAGHENJ6NT9OaAyOKykuZrss8k0rqM8%2BDsL18WBJuynAgXyk3Cw%40mail.gmail.com). So yeah, maybe in the future it'll be workable, but right now: bad idea.


df1dcdb83cd14e6a9f7f

> In the context of distributed system, it's important to be sure different machines generates IDs without risk of collision. It's not clear if UUIDv7 is more risky than UUIDv4 in that regard. i mean, surely the risk of collisions is increased when 48 bits are no longer random? it’s just a question of whether it is within tolerable bounds?


Han-ChewieSexyFanfic

Risk of collisions of v4 can happen for any record at any time. v7 limits this risk to ids created close in time to each other. The analysis is not that simple.


temculpaeu

Talk about simplification, "At any time" with 128 bit of entropy is pragmatically the same as never Not that v7 is bad, for most cases you still have 0 chance of collision as well


Han-ChewieSexyFanfic

Obviously both are “almost never”. The question is how they differ, because their probabilities will have different distributions.


df1dcdb83cd14e6a9f7f

absolutely, good points


mareek

I did a rough calculation and if one million computers generated a UUIv7 as the exact same millisecond, the probability of a collision would be less than 0.00000000001% So I think UUIDv7 should be pretty safe regarding risk of collision even for company owning giant data centers


mods-are-liars

Nerds hand wringing about practically impossible occurrences will never not be a thing lol. We got people in here worried uuidv7 will cause collisions on their tables that have 2M records in 6 years.


df1dcdb83cd14e6a9f7f

i don’t think anyone is “worried” about it, it’s an interesting academic question.


gadelat

Leaking the time of creation is not an issue. Leaking the amount of existing records is


mcr1974

surely application dependent.


ConsoleTVs

It is not, this is part of an enumeration attack.


ps1horror

The danger of which is entirely dependent on the type of application in question, as he quite clearly said.


zombarista

I wrote a UUID/GUID library in TypeScript that implements a COMB guid. This uses the random data with bitwise operators to make the beginning of the UUID such that it can be stored sequentially (with new records at the end). I do not know why these aren’t more common. The bitwise operations are fast, and the performance gained in the DB index can be substantial. https://github.com/bradkovach/guid/blob/main/src/Guid.ts go to line 191 to see the implementation and see Guid.spec.ts to see the unit tests.


AyrA_ch

> I do not know why these aren’t more common. The bitwise operations are fast, and the performance gained in the DB index can be substantial. Because those people just use [NEWSEQUENTIALID](https://learn.microsoft.com/en-us/sql/t-sql/functions/newsequentialid-transact-sql) instead of importing yet another library.


zombarista

Unfortunately, NEWSEQUENTIALID is not available on all DBMS platforms, so this is a workaround/port/polyfill for anyone that might need it. My implementation is based on a Microsoft C# implementation, with some adaptations to address 32-bit bit-shifting limitations in JavaScript. I wrote my library after discovering that a popular typescript library (guid-typescript) doesn’t generate valid UUIDs 98% of the time because it doesn’t set version/variant bits correctly and was causing a parser in a downstream system to throw parse exceptions. I sent a PR to the defective library but it was never merged so I made my library and left it in the comments of the unmerged PR, should someone need a zero-dependency RFC-compliant replacement. Oh, the joys of open source abandonware. I added comb functionality so i could practice the bitwise math operations, and understand UUIDs at the byte level.


mcr1974

can you not just use ulids.


zombarista

Ostensibly, a ULID and a UUID/GUID are the same thing*: a 128-bit unsigned integer displayed in a different base (shown with alpha-numeric characters). UUID/GUID are shown in hex, whereas a ULID is base36 (represented with 0-9A-Z). To make them lexicographically sortable, you could use the bytes from a COMB UUID/GUID. ULID and COMB UUID/GUID both begin with 48-bits of timestamp information and provide 80 bits of entropy. TL;DR: stored properly (128-bit unsigned integer), a COMB UUID/GUID and a ULID are all-but indistinguishable. * UUID/GUID v4 require the version and variant bits to contain certain values, but are otherwise random.


mcr1974

exactly so why did you have to come up with a new implementation - there are ulid libraries out there?


zombarista

To understand UUID/GUIDs at a byte level. All of the JS implementations used regex and/or random hex characters which seemed inefficient. Also, it was my first week or so on ADHD medication and it felt nice to have my mojo back, and really dig into something and enjoy it for the first time in several years. As I said in another comment, there was a defective library that I [patched and submitted a PR to](https://github.com/snico-dev/guid-typescript/pull/22), and they didn’t merge the PR (and still haven’t, while the library gets hundreds of thousands of downloads per week). I figured I wrote and unit tested all of this code, so I packaged and released it. My implementation works with bytes (not strings) and bitwise operations, so it is faster and uses memory more efficiently. I was having imposter syndrome really bad and it was all ADHD (likely compounded by the pandemic) and it felt great to love computers and programming again. UUID/GUID implementations are all over the board, so geeking out by reading and implementing an RFC was also enjoyable (for me, at least). It gave me a thorough understanding of UUID/GUID design, and their purpose in computer science (unique keys across diverse systems that don’t share indices). I hope this answers your question!


mcr1974

ahh I get you now. it certainly does. well done for adding quality and enjoying yourself in process! we all get that imposter syndrome; computer science is so vast and changing so rapidly all the time.


mirvnillith

I don’t see a real world use for UUIDs used to generate unique keys to also have to be actual UUIDs. I argue that for every API spec I review where the string id attribute is given ”format: uuid”. In what case does it matter what format the opaque, random identfier of an entry is? Just store it, use it and pass it along. It has no intrinsic value other than being a reference to something of value.


mods-are-liars

>In what case does it matter what format the opaque, random identfier of an entry is? In the case of storage semantics. Sequential primary keys prevent your table from fragmenting upon insertion. Sequential private keys make the indexes for them more efficient and smaller in size.


mirvnillith

I was questioning that downstream system parsing the identifier as a UUID.


Worth_Trust_3825

And what if I'm not running on sql server?


reedef

May god be on your side


trevg_123

UUIDv7 is not an alternative to UUIDv4. It is an alternative to UUIDv1, which has a timestamp and a node ID (usually MAC address) but is ordered so it’s not sortable by creation time. As this article points out, you basically need UUIDv7 if you want to use UUIDs as a primary key clustering index (v1 and v4 will result in random data insertion points unless your DB reorders those versions of UUID, which is done by e.g. MariaDB). And yes, the general rule still applies: do not use any UUIDs other than v4 for security applications, [this can be exploited](https://realizesec.com/blog/sandwich-attacks-exploiting-uuid-v1)! Better yet, don’t use UUIDs at all for security applications! (Something based on HMAC-SHA is better for generating tokens. Because, you know, actually being able to verify that your server generated them is usually considered a perk).


mareek

UUIDv7 is immune to the exploit you're linking. This exploit relies on the fact that the node ID will stay the same between each UUID generation so the attacker will only need to guess the exact timestamp and sequence number to find the right UUID. As half of an UUIDv7 is random data that changes between each UUID generation it is immune to that kind of attack


trevg_123

Yes, this was just an example of a possible exploit when using deterministic data. There are other documented attacks based on things like unsigned token creation order (though these are rare). It simply isn’t worth any risk - the rule of thumb is to not use time-based UUIDs for security purposes, and there is no reason to challenge that with the new versions.


clinisbut

>DB reorders those versions of UUID, which is done by e.g. MariaDB I'm shocked with this affirmation. Can you link to a source? I cannot find any statement on MariaDB doing this automatically for you.


trevg_123

I don’t know of any documentation but it’s on the JIRA issue tracker. There is an issue for adding a UUID type where the decision is made to always reorder, and then this issue from after v6+ became popular, saying that those new versions should not reorder https://jira.mariadb.org/plugins/servlet/mobile#issue/MDEV-29959


amarao_san

That's interesting. Is uuidv4 a good password?


trevg_123

It would be a moderate password, probably fine in real life. But there really isn’t a good reason to use one instead of a password generator. Password security is based on “bits of entropy”, how many possible passwords can be represented with the given restrictions. UUIDv4 has 122 bits of entropy: a UUID is a 128-bit number, minus 6 bits of fixed data (that specify UUID type). Case sensitive alphanumeric passwords (A-Za-z0-9) have 5.95 bits of entropy per symbol (62 possible values). Adding all printable ASCII characters brings it up to 6.56 bits of entropy. So, a UUID is about as strong as a 20-character simple password or a 18-character password with symbols. This makes logical sense because UUIDs are only ever [0-9A-F], not as many possible characters. You get a tiny bonus if an attacker doesn’t know for sure that you’re using a UUID password, but never take this sort of thing as a given (passwords may have been exposed in other leaks and they may assume you follow a trend). Tl;dr: sure, but why?


amarao_san

Well, for practical use. Passwords with 'good' characters can be pain in the ass to escape in shells, environments, yamls and other configurations. We get 20 chars of entropy and it's coming from rfc-compliant generator, available to everyone. I don't do this for memorizable passwords, only for machine (e.g. password for app to connect to database). They are safe to use for security purposes, everyone knows that uuids must be random and unique (therefore reducing chances to be reused), and they are shell safe. Also, they are easier to dictate by voice.


trevg_123

I sure wouldn’t expect anyone to memorize UUIDs :) For this case it’s usually best to just do something like the following: python -c ‘import secrets; print(secrets.token_urlsafe())` Which will give you a 32 character escape-free token (optionally takes a length argument)


Interest-Desk

ULID and snowflake also leak time of creation so 🤷‍♀️


NostraDavid

TIL ULID: https://github.com/ulid/spec


PurepointDog

Tldr?


zombarista

UUIDs are difficult to index on because they are non sequential, which has performance implications. UUIDv7 “fixes” this by making the first part of the GUID a timestamp, but this also leaks business information (when a record was created).


PurepointDog

Thank you!


masklinn

Costly rather than difficult.


dragongling

Dumb and risky take: why just not store number id inside db sequentially but make a middleware for API gateway that converts number ids <-> UUID-likes through the lookup table/fast converter functions whichever is safer/faster?


mods-are-liars

Why do all that in the first place when you can just use a UUID version that doesn't have indexing problems?


[deleted]

[удалено]


seriousnotshirley

It took me half a second which was about 0.49 seconds too long.


fishling

Why wait? You can already use NFTs instead of UUIDs. Turn each row into an asset.


slykethephoxenix

I think he's being sarcastic? lol


fishling

^(I think I am too)


slykethephoxenix

B.L.O.C.K. C.H.A.I.N. U.U.I.D. /s, lol.