Date: | May 19, 2004 / year-entry #196 |
Tags: | other |
Orig Link: | https://blogs.msdn.microsoft.com/oldnewthing/20040519-00/?p=39263 |
Comments: | 18 |
Summary: | There are a variety of message digest algorithms out there, MD5 being a particularly popular one. These generate a "message digest" (essentially, a hash) so you can detect whether somebody has tampered with a file, the theory being that it's hard to tamper with a file without changing its hash. But make sure you record... |
There are a variety of message digest algorithms out there, MD5 being a particularly popular one. These generate a "message digest" (essentially, a hash) so you can detect whether somebody has tampered with a file, the theory being that it's hard to tamper with a file without changing its hash. But make sure you record the file size as well as the digest.

Not that collisions are necessarily easy to create by mistake. (I've heard a rumor that the deployment team has seen an MD5 collision, but it's just a rumor. I have no evidence. Heck, maybe what really happened was that somebody on the deployment team got their MR2 into a car accident...)

Anyway, the possibility of a "reset attack" makes collisions trivial to create. Hash generators typically operate on a stream: the hash engine maintains some state, the file to be hashed is broken up into chunks, and each chunk is combined with the engine's state variables in some complex way. When you have passed all the data through the engine, you push a button on the engine and out pops the hash value (which is typically a copy of the state variables, or possibly a subset of them).

Now suppose somebody came up with a way of "resetting" the engine, that is, returning it to its initial state. Here's how they could make any document match your digest: First, create an alternate message and send it through the hash engine. Next, generate the bytes necessary to "reset" the engine. Finally, append the original message. In other words, the fake file looks like this:

[alternate message][garbage][original message]

where "garbage" is the reset. This fake file has the same hash as the original message, since the "garbage" resets the hash engine to its initial state, at which point the replay of the original message regenerates the hash. Result: a file with the same hash as the original, but with different content!

In a proper attack, of course, the "alternate message" would be crafted so the garbage and the original message would be ignored. You might end it with a marker that means "ignore everything after this point." (For HTML, you can just say <NOFRAMES> and everything after that point will be largely ignored by all modern browsers.) Many other file types encode the expected file length in the header, in which case you can append whatever garbage you want without having any effect.

But if you also store the file size with the hash, then the reset attack fails, because a reset attack always generates a file bigger than the original. To create a collision, the attacker would have to create a shorter alternate message than the original, and then fiddle with the extra bytes to get the desired target hash to come out. This is significantly harder than just resetting.

(I'm not aware of anybody who has successfully been able to reset MD5, mind you. This is a protective measure: if somebody figures out how to reset MD5, a small bit of work on your side will prevent you from falling victim.)
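To make the mitigation concrete, here's a minimal sketch of a verifier that checks both the size and the digest. It assumes OpenSSL's classic MD5_* API (deprecated in OpenSSL 3.0, but fine for illustration); the record layout and names are made up for this example.

```c
/* Minimal sketch: record (and later verify) the file size alongside
 * the MD5 digest.  Assumes OpenSSL's MD5 API; the record layout and
 * names are illustrative, not any particular product's format. */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

struct digest_record {
    unsigned long long size;                 /* original file size in bytes */
    unsigned char md5[MD5_DIGEST_LENGTH];    /* original MD5 digest */
};

/* Returns 1 if the file matches the recorded size and digest, else 0. */
int verify_file(const char *path, const struct digest_record *rec)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;

    MD5_CTX ctx;
    MD5_Init(&ctx);

    unsigned char buf[4096];
    unsigned long long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        MD5_Update(&ctx, buf, n);
        total += n;
    }
    fclose(f);

    unsigned char digest[MD5_DIGEST_LENGTH];
    MD5_Final(digest, &ctx);

    /* The size check is what defeats the reset attack: the forged file
     * [alternate][garbage][original] is always longer than the original. */
    return total == rec->size &&
           memcmp(digest, rec->md5, MD5_DIGEST_LENGTH) == 0;
}
```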
Comments (18)
Comments are closed.
Very interesting, I wonder how often people check the file size along with the hash result to confirm the validity of their document.
MD5 (and most other hash algorithms I’ve studied) pads the message to an integral number of blocks. This padding includes the length of the original message, so I really don’t see how this attack could possibly work.
{Replying to my own post…}
If len(alternate) + len(reset garbage) were exactly 2^64 bits, then the reset could work, but I think people would notice that much extra data.
If the length is just encoded in the padding, then just act as if the original message were pre-padded with its original length (so MD5 won’t add its own padding). Pad your replacement message, and (due to the block nature) the garbage will necessarily be a multiple of the block size, so no padding necessary there.
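For reference, MD5's padding rule (RFC 1321) appends a 0x80 byte, then zero bytes until the length is congruent to 56 mod 64, then the original length in bits as a 64-bit little-endian integer, so the message length is always part of what gets hashed. A rough sketch of that rule (function name and buffer handling are our own):

```c
/* Sketch of MD5's padding rule (RFC 1321), showing that the original
 * bit length is always mixed into the hash.  Illustrative only. */
#include <stddef.h>
#include <stdint.h>

/* Fills 'pad' with the padding for a message of 'len' bytes and returns
 * the padding length (between 9 and 72 bytes).  'pad' must hold 72 bytes. */
size_t md5_padding(uint64_t len, unsigned char pad[72])
{
    size_t n = 0;
    pad[n++] = 0x80;                      /* mandatory '1' bit */
    while ((len + n) % 64 != 56)          /* zero-fill to 56 mod 64 */
        pad[n++] = 0x00;
    uint64_t bits = len * 8;              /* original length in bits... */
    for (int i = 0; i < 8; i++)           /* ...appended little-endian */
        pad[n++] = (unsigned char)(bits >> (8 * i));
    return n;
}
```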
Here are some interesting notes on MD5.
http://www.rsasecurity.com/rsalabs/node.asp?id=2253
To my knowledge, no MD5 collisions have been found yet. Apparently in 1994, researchers estimated that brute-forcing a hash collision would take $10 million worth of equipment 24 days. Assuming Moore's law and inflation, that would be about $200,000 worth of equipment today — and that's to find A collision, not a particular collision that has evil properties.
But of course, that’s conjecture, not proof. When I did the hashing algorithm for code signing on Windows Script Files I included the length of the file in the hash — it’s the right thing to do, because we don’t know for SURE whether the algorithm is effectively unbreakable.
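One way to fold the length into the digest itself, along the lines the comment above describes, is to hash a length prefix ahead of the content. This is our own illustration under that assumption (again using OpenSSL's MD5 API), not the actual Windows Script Files signing code:

```c
/* Sketch: include the length in the hashed data by hashing a length
 * prefix before the content.  Illustrative only; not the actual
 * Windows Script Files implementation. */
#include <stdint.h>
#include <openssl/md5.h>

void hash_with_length(const unsigned char *data, uint64_t len,
                      unsigned char digest[MD5_DIGEST_LENGTH])
{
    unsigned char prefix[8];
    for (int i = 0; i < 8; i++)                       /* encode length... */
        prefix[i] = (unsigned char)(len >> (8 * i));  /* ...little-endian */

    MD5_CTX ctx;
    MD5_Init(&ctx);
    MD5_Update(&ctx, prefix, sizeof prefix);   /* length goes in first */
    MD5_Update(&ctx, data, len);               /* then the content */
    MD5_Final(digest, &ctx);
}
```

Because the length is hashed first, a reset-style forgery can't fix it up after the fact: changing the total length changes the very first bytes fed to the engine.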
The reset is still possible (in theory), but only by wrapping the 64-bit length counter: the alternate message plus the reset garbage would have to total exactly 2^64 bits, which is 2^61 bytes, roughly 2.3 exabytes of data.
It seems to me that the possibility of something in the data stream causing the hash to reset would be a major flaw in the code. The only case where that might be reasonable is if you had multiple files in one stream, and then you'd get two files and two hash values out, which defeats the point of this attack. Presumably whatever else you're operating on the stream with will "reset" at this point too, or you have a bug there.
So how can this possibly happen?
True, resettability would be a flaw in the algorithm, but algorithms have been found to be flawed years after their development. My point is that with a tiny amount of additional work, you can significantly reduce your exposure if such a flaw were to be found.
Hmmm… I don’t think this would work. MD5 *always* pads, and *always* includes the message bit length in the padding, even if the message is already a multiple of the block length.
You’re trying to reset MD5 back to its initial state. Part of that state is the bit length of the message (stored in a 64-bit integer).
Are you saying to store the file length as part of the stream (so the algorithm hashes it also), to store the original file length in addition to the hash, or to do something else with the hash and file length?
Ah, you’re right, Keith; I missed that the size is *always* appended.
Save both the hash and the original file size.
I'd like to think that would be noticeable.
"Say… this Half-Life mod is **HUGE**."
I wrote a bit about this back in March (http://blogs.msdn.com/shawnfa/archive/2004/03/05/84799.aspx) — it's an interesting attack, but as you pointed out, it's very easily mitigated by simply knowing anything else about the data being hashed.
Oops, sorry, Shawn. Looks like my article is pretty much identical to yours!
The original context of the internal discussion was around proper use of hash functions and what you should and shouldn’t use them for.
The secure hash functions in use were designed to ensure that messages could not be easily tampered with in transmission.
One of the big differentiating factors between the secure hashes and other digest methods like checksums and CRCs is that they're intended to be "cryptographically hard" to "break". I'll leave "cryptographically hard" mostly undefined as some sort of reference to an NP-complete type of problem like curve fitting or factoring. Specifically, it's easy to stay ahead of the brute-force solutions by extending sizes. I will define "break" as the ability to, with little or no effort, produce another byte stream with the same hash value.
Checksums, parity, CRCs and simple hashing algorithms like the ones you could find in any textbook are /insecure/ hashes. That is, they vary sufficiently for a typical data set, but they explicitly don't attempt to make "breaking" the hash difficult. My personal favorite string hash function { hash = 0; for (i = 0; i < len; i++) hash = hash * prime + string[i]; return hash; } is trivially breakable.
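To see just how trivially breakable that kind of hash is, here's a small sketch (the function name is our own). With prime = 31, the strings "Aa" and "BB" already collide, since 65·31 + 97 = 66·31 + 66 = 2112:

```c
/* Demonstration that the multiplicative string hash above is trivially
 * breakable: with prime = 31, "Aa" and "BB" collide, because
 * 'A'*31 + 'a' = 65*31 + 97 = 2112 and 'B'*31 + 'B' = 66*31 + 66 = 2112. */
#include <stdio.h>
#include <string.h>

unsigned long simple_hash(const char *s)
{
    const unsigned long prime = 31;
    unsigned long hash = 0;
    for (size_t i = 0; i < strlen(s); i++)
        hash = hash * prime + (unsigned char)s[i];
    return hash;
}

int main(void)
{
    printf("%lu %lu\n", simple_hash("Aa"), simple_hash("BB")); /* 2112 2112 */
    return 0;
}
```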
Back to the original topic, there are a lot of people who want to have a pseudokey/hash they can use in a searching algorithm /instead/ of actually comparing the data. It's true that if the hash values differ, there's no possible way the byte streams are equal; but equal pseudokeys in no way imply that the byte streams are equal.
Which then comes back to the point. What are you going to do with this? I personally feel that this falls into the category of "if it doesn’t actually have to work right, I can make it as fast as you like".
Secure hashes are great for their intended use. Given a message which must conform to some syntactic requirements (readable English, XML, ASN.1, whatever), it is computationally difficult to construct a message which still fits the syntactic requirements, has the same hash value and is different from the original message /in a useful way/.
If you can change the message from "send the troops through the northern pass" to "send the troops via the southern river valley" easily, man, you’re in trouble. If the end result looks more like "send the troops via the southern river valley <big honking string of unreadable text>send the troops through the northern pass", well, it’s pretty obvious that there was tampering and the hash has accomplished its task.
Expecting these algorithms to stand up for arbitrary unformatted data is actually not reasonable in the long run. The birthday effect is likely to rear its head before too long.
Expecting these algorithms to give you a unique enough number that you don’t have to worry about collisions in a context which requires correctness is also unreasonable.
Which then leads me to the point that if all you want is statistical uniqueness in order to decrease the number of times you have to compare byte streams that are different, you can use a significantly (computationally) cheaper hashing algorithm than one of the secure hash algorithms.
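A sketch of that pre-filter pattern, under the assumption that the cheap hash is only ever used to rule comparisons out (names and the hash function here are illustrative):

```c
/* Sketch: a cheap hash used only as a pre-filter.  Different hashes
 * prove the data differs; equal hashes prove nothing, so we fall back
 * to a full byte comparison.  Names are illustrative assumptions. */
#include <stddef.h>
#include <string.h>

unsigned long cheap_hash(const unsigned char *p, size_t n)
{
    unsigned long h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 31 + p[i];
    return h;
}

int blocks_equal(const unsigned char *a, size_t alen,
                 const unsigned char *b, size_t blen)
{
    if (alen != blen)
        return 0;                      /* sizes differ: not equal */
    if (cheap_hash(a, alen) != cheap_hash(b, blen))
        return 0;                      /* fast path: cannot be equal */
    return memcmp(a, b, alen) == 0;    /* slow path: verify for real */
}
```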