Date: | October 1, 2009 / year-entry #312 |
Tags: | code |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20091001-00/?p=16533 |
Comments: | 36 |
Summary: | A customer was encountering a problem with lots of duplicate GUIDs. How is that possible? The whole point of the GUID generation algorithm is to work hard to avoid duplication. Was one of the fundamental assumptions of the algorithm broken? Maybe there was a duplicate MAC? Was the clock regressing? One of my colleagues pointed... |
A customer was encountering a problem with lots of duplicate GUIDs. How is that possible? The whole point of the GUID generation algorithm is to work hard to avoid duplication. Was one of the fundamental assumptions of the algorithm broken? Maybe there was a duplicate MAC? Was the clock regressing? One of my colleagues pointed out that in the search for the subtle source of the problem, you sometimes overlook the obvious one. In fact, this is the most common source of problems with so-called duplicate GUIDs. As he so tersely puts it: "A GUID can easily be duplicated by simply copying it." In other words, you have a duplicate GUID because you duplicated the GUID. This can happen, for example, if you have a Boom, instant duplicate. |
Comments (36)
Comments are closed. |
We have some GUIDs in a SQL database. We have more than one development database, and a QA database, and a user-acceptance-testing database, and a user-training database (with some "anonymized" data), and a production database.
Some of the GUIDs are the same across these databases, even when they are used as record IDs, since they mean the same things from database to database. And some of the GUIDs are created on the fly and are not meant to be the same anywhere.
It’s tempting to dismiss using GUIDs as database keys, since they are "so long" (128 bits), but once you have realized that if someone passes you an ID and they claim it’s a customer number, and you can tell them no, it’s not a customer number, it’s an order number, then you may realize their value. You can’t do that if all tables have IDs that are auto-incrementing numbers.
Reminds me of the time somebody copy-pasted some code that he considered "boiler plate". It contained a GUID that really had to be unique for multiple of these "boiler plate" sections to coexist. This was, of course, realized much later when it actually became a problem.
So what you’re saying is that the subtle source of the problem is that dummy in front of the computer, otherwise known as a PEBKAC error.
There’s at least a few different "versions" of the globally (allegedly) unique ID – and thanks to privacy concerns, the MAC address is no longer a popular choice.
And unless you have a source of uniqueness, it actually is possible (though unlikely) to get duplicate IDs generated.
Dumb developers just use "the GUID generation algorithm" and expect it to work. Good developers understand it, and know the tradeoffs involved in the many different GUID algorithms they’re choosing from.
Of course, the subtle problems aren’t the only ones – really dumb developers blindly copy GUIDs in all sorts of crazy places.
@bob: "the MAC address is no longer a popular choice"
I don’t believe you. I think the MAC address is part of the algorithm. But I could be wrong.
@David Walker: probably instead of saying "I don’t believe you", it would be good to check: http://en.wikipedia.org/wiki/Globally_Unique_Identifier
@David:
http://msdn.microsoft.com/en-us/library/aa379322.aspx
"For security reasons, UuidCreate was modified so that it no longer uses a machine’s MAC address to generate UUIDs. UuidCreateSequential was introduced to allow creation of UUIDs using the MAC address of a machine’s Ethernet card."
@Ken Cox:
Ok, you’re probably just kidding, but there are of course ways to prove if a GUID is, in fact, globally unique. The exact proof depends on the method used.
And just to add: Using Random() four times to generate the 4 parts of a GUID is NOT a good method. And yes, I’ve seen code for GUID generating that did this. Actually, I’ve seen many really bad "GUID" generators as code for languages that don’t have anything built in.
Hey guys,
Don’t try to convince me that GUIDs are unique until you test *all* of them against each other. Until you do, their uniqueness is merely idle speculation! <grin>
Ken
I remember having to write a GUID generator from scratch – there were very few facilities I could use – I barely had a C library.
I used the MAC as it was unique, and the onboard fast-running counter – that’s all I had.
Coding PXE support into a bootloader…
Addenda to above:
No RTC – the clock ticked from zero every power-on. Presumably it ran NTP or something once the OS booted.
PXE was used to load in a bigger bootloader with more support – the boot flash was tiny and we only had around 64kB to be able to load the big loader from (much larger) NAND storage, validate it, and if not (e.g., corrupted or not present), do PXE to find a serve that could serve us the right bootloader, and write it into NAND.
Mail account GUIDs in Exchange are not unique? Why?
Several stupid people I know are copying project files (vbproj). VS doesn’t warn about this when loading av multiproject solutions (doesn’t check for dupes i suppose), but behaves strangely/erroneously when debugging. Why?
Hogwash. A GUID has 128 bits. If you generate 2^129 GUIDS you will definitely have duplicates. How are you going to prove that nobody on the planet has ever generated a certain GUID?
What’s really scarey is when a lazy developer just randomly increments one of the guid’s digits.
@Thomas:
With the right method, the sun will likely burn out before you run out of GUIDs. Do you have any idea how big a number 2^129 is??
@someone else: You’re practically right, there are so many possibilities that the probability of a duplicate is low. However this doesn’t mean that it can’t happen.
That’s why Thomas is right: you can’t prove that there won’t ever be duplicates or that anybody else has not already generated the same GUID. Try to prove it and you’ll get:
Step 1: In total there are 2^128 possible GUIDs.
Step 2: Thus as soon as 2^128+1 GUIDs have been created (by all people on all computers on the whole world) then at least one of them has to be a dup (pigeonhole principle).
OK, I may have been wrong about the MAC address. However, that link also says: "Computers with ethernet/token ring addresses generate UUIDs that are guaranteed to be globally unique". How does it guarantee them to be globally unique if some portion of the Ethernet or token ring hardware is not part of the equation?
As for the size of the GUID address space, while I’m sure some commenters were kidding, it’s claimed to be large enough to assign a unique GUID to every atom in the universe.
This universe, of course; we don’t consider multiverses here.
Which means, obviously, that you can’t even store them all. The address space is many, many orders of magnitude larger than you can even imagine.
While it is true that the pigeonhole principle doesn’t kick in until we hit 2^128 GUIDs, probabilistic methods suggest we’ll hit our first (non-trivial) duplicate GUID after 2^64 or so (birthday paradox.)
Ok, let’s look at this GUID algorithm that I just made up (because it works):
The first 48 bits will be the generators MAC address. The second will be a timestamp.
An 80-bit counter is enough for 38,334,786,263 years if you generate a GUID every nanosecond. Per computer. That should do.
Of course, there are still privacy concerns. Just think of a machine hooked up to the Internet where everybody can get one if needed and which is not supervised by anyone (we hope).
“[Don’t forget to handle the case when somebody asks for two GUIDs in rapid sucession (less than 1 microsecond apart). -Raymond]”
Um… Wait for a microsecond? Actually, I was thinking more along the lines of “the GUID generator generates GUIDs as fast as possible (and necessary; if none are needed, none are generated), and even if it generates a million per second, it will still last <insert long time span here>.” If you need GUIDs faster than it can generate, use a second one.
Birthday paradox indeed. But you should also consider the probability that two GUID’s will ‘meet’.
Pesonaly I think for GUID to implement Clone() is a bug. It doesn’t makes sense and is wrong. Clone is about creating a different object, so that when I poke in one the other stays unaffected. The objects have their own identity from the cloning on.
Come to think of it, you can abuse clone() to implement transactional behaviour.
-Raymond]
Oh, that should have been microsecond, not nano. Damn prefixes …
A GUID generator should be single threaded. And failsafe (first increase the counter, *then* generate the GUID). And even a 64-bit counter would last long enough (584,942 years at one GUID per microsecond).
(I didn’t look up the exact algorithm because that wasn’t necessary for the proof.)
@someone else:
That doesn’t even begin to guarantee global uniqueness. What it tries to do is guarantee uniqueness between different users of the same software. But consider what happens when someone without an Ethernet card generates a GUID. Or someone else’s GUID software uses (timestamp, MAC) instead of (MAC, timestamp). You can easily have collisions with other algorithms. Which is why there cannot be a proof of uniqueness.
Don’t forget that “when someone without an Ethernet card generates a GUID”, it’s not likely that anyone will ever know, hence there won’t be a collision in a meningful sense. The computer without an Ethernet card is pretty much limited to living in its own little world.
Unless the computer communicates with the outside world in some non-Ethernet or non-Ethernet-like way.
You’re right, Raymond, I overlooked that possibility.
GUIDs are not absolutely unique across all situations. They have 128 bits, not an infinite number of bits. The appropriate question is not to ask if there can be collisions (there can be) but how likely they are in a given situation. No matter how crafty your algorithm, you could always be tripped up by some clown with another algorithm designed to collide with yours. In particular, a trivial way of doing this is by repeating already generated GUIDs. Which is the subject of this post, so that’s coming full circle for you.
That reminds me of a debate I had on how to create a temporary file (ignoring the existing library functions for a moment). I proposed using a GUID for the file name and failing if a file by that name couldn’t be created, and was chewed out by two people who pointed out that GUIDs, despite the name, are not necessarily unique, not even on the same machine. In particular, it was pointed out that this didn’t take into account the existing files in the directory that might clash, despite the fact that this is much less likely to happen with GUIDs than with most other schemes (like GetTempFileName(), which is easily thwarted by a mere 65,536 pre-existing files of a particular name), even taking malice into account.
Instead they unequivocally agreed that using an algorithm that would deterministically fail only when the available names ran out was a better choice, even if that algorithm was less efficient. My argument that the *likelihood* of the GUID not being unique was so incredibly small that you would much sooner need to worry about your file system, your computer or your solar system breaking down was scoffed at; the possibility of failure by the algorithm itself, no matter how jawdroppingly minuscule, was unacceptable. They went to great lengths to point out my idea was "playing the odds" and fundamentally flawed.
I learned from that that some (most?) programmers do not appreciate probabilistic arguments and will only respect "certainty", even if the "certain" algorithm will "certainly" fail in reality much sooner than the "unpredictable" algorithm will, and even if the "certain" algorithm has other undesirable properties. This distrustful attitude towards small probabilities is healthy in general (you don’t want to hear someone argue that "this race condition is so unlikely") but elevating it to dogma is going too far.
“[“If you need GUIDs faster than it can generate, use a second one.” If you have two GUID generators, how do you ensure that they don’t both spit out the same GUID? Indeed the simplified algorithm virtually guarantees that they will! -Raymond]”
Because they have different network cards, and hence, different GUIDs. By “generator,” I meant an actual physical thingy, not a program.
“[Oh, good luck getting people to buy a GUID generator device whenever they need a GUID. -Raymond]”
That’s what we have the Internet for. Have some trusted parties set up public GUID generators. Yes, I firmly believe in (flat, reliable and reasonably fast) Internet access for everyone.
Using my patent-pending GUID generator physical thingy internet serivce, you can get all the unique GUIDs you want for the low, low price of US$0.05 each.
Remember that each GUID is completely unique – no two are alike. Therefore their value as collectables can only increase over time.
To get you started with your own GUID collection, here are 6 GUIDs *absolutely free*!!!.
– 1084b315-e21f-4fed-86e4-b2550c054b31
– 9a02559b-af6d-41b2-9e85-f25d0440d732
– fded0036-69da-498d-a000-24f04b47027c
– ed6fae15-a98c-4375-b359-b53e55334bcd
– 3e4f612b-9e8d-4894-b4e3-128b006db38d
– bce26c02-ea8f-4c64-9cd3-0b8eeda827be
You’ll never find an offer like this anywhere else or at any other time. Hurry before they run out!!!
Visit http://www.guid-ducks.com
Operators are waiting!
“[I’m pretty sure those public GUID generators would be asked to generate GUIDs faster than 1 per microsecond. "Your GUID requests are important to us, and they will be serviced in the order they were received." (I can see the bug reports now. "Your program runs slowly because the GUID generator is overloaded.") -Raymond]”
Cache, distribute, compress. Instead of transmitting every single GUID, you get the OK from the generator to get GUIDs [fixed part].N to [fixed part].(N+1000) (that dot is concatenation). And you can query several generators at the same time.
(And since pseudorandom GUIDs can be predicted reasonably well, there is little reason not to use consecutive numbers).
Then what if you want to genereate 2^128+1 uuids in 1 second? The whole concept with generating "unique" values with a fixed bitlength is flawed.
"Then what if you want to genereate 2^128+1 uuids in 1 second?"
A system fast enough to generate them would be only six orders of magnitude away from doing operations in Planck time, which is the smallest meaningful unit of time under current known natural laws. There is no conceivable use for that many unique identifiers in such a short timespan. By the time hardware arrives that’s capable of this and scenarios that require it, though (if ever) we should have the GUID2, which will have 256 bits.
I agree that a fully general solution (with truly unique values) needs an infinite number of bits. Obviously, this approach comes with its own implementation issues, but in many cases it can be approximated by using the binary representation of whatever you want to assign a unique identifier to.
Not only that, but it also requires ridiculous amounts of memory to store all that IDs. Something like 500000 times the weight of Earth with today’s technology.
But it doesn’t matter, after all. Variable-length string IDs won’t be duplicated by Clone() method, just the same as fixed-length GUIDs will not be. No matter how long the identifier is, if it’s blindly propagated into copies without change, it won’t be unique in those copies.
Menno, I think it would be a bug if GUID didn’t implement Clone(). Don’t forget that sometimes (most of the time) you WANT the GUID to be copied. It’s only when you’re generating a new object from an old object with a GUID that you want a new one to be generated.
For example, if I have an object with GUIDs for identity and parentIdentity, I might want a copy of the object to have a new identity but I want it to always have the same parentIdentity.
“Not only that, but it also requires ridiculous amounts of memory to store all that IDs. Something like 500000 times the weight of Earth with today’s technology.”
And at least 6.9 zettajoule of energy. That’s 128 * 2^128 * 1 eV or almost 2 exawatt-hours or over 275,000,000,000 years of Three Mile Island’s reactor output.