Beware the hash reset attack

Date: May 19, 2004 / year-entry #196
Orig Link:
Comments: 18
Summary:There are a variety of message digest algorithms out there, MD5 being a particularly popular one. These generate a "message digest" (essentially, a hash) so you can detect whether somebody has tampered with a file, the theory being that it's hard to tamper with a file without changing its hash. But make sure you record...

There are a variety of message digest algorithms out there, MD5 being a particularly popular one. These generate a "message digest" (essentially, a hash) so you can detect whether somebody has tampered with a file, the theory being that it's hard to tamper with a file without changing its hash.

But make sure you record the file size as well as the digest.

Not that collisions are necessarily easy to create by mistake. (I've heard a rumor that the deployment team has seen an MD5 collision, but it's just a rumor. I have no evidence. Heck, maybe what really happened was that somebody on the deployment team got their MR2 into a car accident...)

Anyway, the possibility of a "reset attack" would make collisions trivial to create.

Hash generators typically operate on a stream. The hash engine maintains some state. The file to be hashed is broken up into chunks, and each chunk is combined with the engine's state variables in some complex way. When you have passed all the data through the engine, you push a button on the engine and out pops the hash value (which is typically a copy of the state variables, or possibly a subset of them).
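This streaming model is easy to see in practice. Here is a minimal Python sketch (not from the original post) using the standard `hashlib` module; the 64-byte chunk size is arbitrary, and any way of splitting the data produces the same digest:

```python
import hashlib

data = b"attack at dawn" * 500

# Feed the engine chunk by chunk; the engine keeps internal
# state between update() calls.
engine = hashlib.md5()
for i in range(0, len(data), 64):
    engine.update(data[i:i + 64])

# "Pushing the button": the digest is derived from the final state.
streamed = engine.hexdigest()
one_shot = hashlib.md5(data).hexdigest()
assert streamed == one_shot
```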

Now suppose somebody came up with a way of "resetting" the engine; that is, returning it to the initial state. Here's how they can make any document match your digest:

First, create an alternate message and send it through the hash engine.

Next, generate the bytes necessary to "reset" the engine.

Finally, append the original message.

In other words, the fake file looks like this:

[alternate message][garbage][original message]

where "garbage" is the reset.

This fake file has the same hash as the original message, since the "garbage" resets the hash engine to the initial state, at which point the replay of the original message regenerates the hash.

Result: A file with the same hash as the original, but with different content!
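Nobody has shown how to reset MD5 itself, but the shape of the attack is easy to demonstrate against a deliberately weak toy hash. The sketch below (my own illustration, not a real algorithm) uses a hypothetical 8-bit streaming hash whose state can be driven back to zero with a single garbage byte:

```python
def toy_hash(data, state=0):
    # A deliberately weak streaming hash: one byte of state,
    # updated byte by byte like a real engine.
    for b in data:
        state = (state * 31 + b) % 256
    return state

original = b"pay alice one dollar"
target = toy_hash(original)

alternate = b"pay mallory a million dollars"
state_after_alternate = toy_hash(alternate)

# The "reset": one byte chosen so the state returns to 0,
# the engine's initial state.
garbage = bytes([(-state_after_alternate * 31) % 256])
assert toy_hash(alternate + garbage) == 0

# [alternate message][garbage][original message]
forged = alternate + garbage + original
assert toy_hash(forged) == target
```

Note that the forgery is necessarily longer than the original, which is exactly why recording the file size defeats it.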

In a proper attack, of course, the "alternate message" would be crafted so that the garbage and original message would be ignored. You might end it with a marker that means "ignore everything after this point." (For HTML, you can just say <NOFRAMES>, and everything after that point will be largely ignored by all modern browsers.) Many other file types encode the expected file length in the header, in which case you can append whatever garbage you want without having any effect.

But if you also store the file size with the hash, then the reset attack fails, because a reset attack always generates a file bigger than the original. To create a collision, they would have to create a shorter alternate message than the original, and then fiddle with the extra bytes to get the desired target hash to come out. This is significantly harder than just resetting.
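A minimal sketch of the mitigation (function names are my own, not from any particular library): record the length together with the digest and check it first, so any reset-style forgery, being necessarily longer, is rejected before the digest is even compared.

```python
import hashlib

def fingerprint(data):
    # Record the size alongside the digest.
    return (len(data), hashlib.md5(data).hexdigest())

def verify(data, expected):
    size, digest = expected
    # A reset-style forgery is always longer than the original,
    # so the size check alone defeats it.
    if len(data) != size:
        return False
    return hashlib.md5(data).hexdigest() == digest

original = b"important contract"
fp = fingerprint(original)
assert verify(original, fp)
assert not verify(original + b"appended garbage", fp)
```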

(I'm not aware of anybody who has successfully been able to reset MD5, mind you. This is a protective measure: If somebody figures out how to reset MD5, a small bit of work on your side will prevent you from falling victim.)

Comments (18)
  1. Nick Parker says:

    Very interesting, I wonder how often people check the file size along with the hash result to confirm the validity of their document.

  2. Keith Moore [exmsft] says:

    MD5 (and most other hash algorithms I’ve studied) pads the message to an integral number of blocks. This padding includes the length of the original message, so I really don’t see how this attack could possibly work.

  3. Keith Moore [exmsft] says:

    {Replying to my own post…}

    If len(alternate) + len(reset garbage) was exactly 2^64 (bits), then the reset could work, but I think people would notice this extra data.

  4. Raymond Chen says:

    If the length is just encoded in the padding, then just act as if the original message were pre-padded with its original length (so MD5 won’t add its own padding). Pad your replacement message, and (due to the block nature) the garbage will necessarily be a multiple of the block size, so no padding necessary there.

  5. Eric Lippert says:

    Here are some interesting notes on MD5.

To my knowledge, no MD5 collisions have been found yet. Apparently in 1994 researchers estimated that brute-forcing a hash collision would take 24 days on $10 million worth of equipment. Assuming Moore’s law and inflation, that would be about $200,000 worth of equipment today — and that’s to find A collision, not a particular collision that has evil properties.

    But of course, that’s conjecture, not proof. When I did the hashing algorithm for code signing on Windows Script Files I included the length of the file in the hash — it’s the right thing to do, because we don’t know for SURE whether the algorithm is effectively unbreakable.

  6. Keith Moore [exmsft] says:

    The reset is still possible (in theory) — it would require 2 exabytes of data to do it.

  7. Cooney says:

    Heck, maybe what really happened was that somebody on the deployment got their MR2 into a car accident…

    Not that hard to do in Seattle – midengine car + rain + some more rain = hours of fun.

  8. josh says:

    istm that the possibility of something in the datastream causing the hash to reset would be a major flaw in the code. The only case where that might be reasonable is if you had multiple files in one stream, and then you’d get two files and two hash values out, which defeats the point of this attack. Presumably whatever else you’re operating on the stream with will "reset" at this point too, or you have a bug there.

    So how can this possibly happen?

  9. Raymond Chen says:

    True, resettability would be a flaw in the algorithm, but algorithms have been known to be found flawed years after their development. My point is that with a tiny amount of additional work, you can significantly reduce your exposure if such a flaw were to be found.

  10. Keith Moore [exmsft] says:

    Hmmm… I don’t think this would work. MD5 *always* pads, and *always* includes the message bit length in the padding, even if the message is already a multiple of the block length.

    You’re trying to reset MD5 back to its initial state. Part of that state is the bit length of the message (stored in a 64-bit integer).
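Keith's point can be made concrete. Per RFC 1321, MD5's padding appends a 0x80 byte, then zeros until the length is 56 mod 64 bytes, then the original message's bit length as a 64-bit little-endian integer, so the length is baked into the final block. A sketch of just the padding construction:

```python
import struct

def md5_padding(msg_len: int) -> bytes:
    # Per RFC 1321: append 0x80, then zeros until the length is
    # 56 mod 64 bytes, then the original bit length as a 64-bit
    # little-endian integer.
    zeros = (55 - msg_len) % 64
    return b"\x80" + b"\x00" * zeros + struct.pack("<Q", (msg_len * 8) % (1 << 64))

# The padded length is always a multiple of the 64-byte block size,
# and messages of different lengths get different final blocks.
assert (20 + len(md5_padding(20))) % 64 == 0
assert md5_padding(20)[-8:] != md5_padding(21)[-8:]
```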

  11. asdf says:

    Are you saying to store the file length as part of the stream (so the algorithm hashes it also), to store the original file length in addition to the hash, or to do something else with the hash and file length?

  12. Raymond Chen says:

    Ah, you’re right, Keith; I missed that the size is *always* appended.

  13. Raymond Chen says:

    Save both the hash and the original file size.

  14. Jim Davis says:

I’d like to think that would be noticeable.

    "Say… this Half-Life mod is **HUGE**."

  15. Shawn says:

I wrote a bit about this back in March — it’s an interesting attack, but as you pointed out, it’s very easily mitigated by simply knowing anything else about the data being hashed.

  16. Raymond Chen says:

    Oops, sorry, Shawn. Looks like my article is pretty much identical to yours!

  17. Next, generate the bytes necessary to "reset" the engine.

Finding these bytes is as difficult as finding a message which will match the original.

Start with a hash algorithm that is preimage and second-preimage resistant; otherwise it isn’t useful in the first place. Given this, you want to find a string that will "reset" the algorithm to its initial state. Assume that you can do this; then you have found a message which will hash to the algorithm’s initial state. I.e., you have found a message which hashes to a given hash value. I.e., your algorithm is not preimage resistant, which contradicts the initial statement. Therefore your assumption is false. Therefore you cannot find the string that will "reset" the algorithm.

I only see this attack as being useful against an algorithm that is not already preimage resistant — which means that it was broken in the first place.

    Am I missing something here? Or are you allowing for broken implementations of the algorithm which can be reset?

  18. The original context of the internal discussion was around proper use of hash functions and what you should and shouldn’t use them for.

    The secure hash functions in use were designed to ensure that messages could not be easily tampered in transmission.

One of the big differentiating factors between the secure hashes and other digest methods like checksums and CRCs is that they’re intended to be "cryptographically hard" to "break". I’ll leave "cryptographically hard" mostly undefined as some sort of reference to an NP-complete type of problem like curve fitting or factoring. Specifically, it’s easy to stay ahead of the brute-force solutions by extending sizes. I will define "break" as the ability to, with little or no effort, produce another byte stream with the same hash value.

    Checksums, parity, CRCs and simple hashing algorithms like the ones you could find in any textbook are /insecure/ hashes. Meaning that they vary sufficiently for a typical data set but they also explicitly don’t attempt to make "breaking" the hash difficult. My personal favorite string hash function { hash = 0; for (i=0; i<l; i++) hash = (hash * prime) + string[i]; return hash; } is trivially breakable.
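That favorite string hash is indeed trivially breakable: in hash = hash*prime + c, raising one character by 1 and lowering the next by prime changes the total by prime^k minus prime times prime^(k-1), which is zero. A quick Python sketch of a second preimage:

```python
def string_hash(s: str, prime: int = 31) -> int:
    # The textbook polynomial string hash from the comment above.
    h = 0
    for ch in s:
        h = h * prime + ord(ch)
    return h

# Bump one character up by 1 and the next down by `prime`:
# the two changes to the polynomial cancel exactly.
original = "abc"
forged = "b" + chr(ord("b") - 31) + "c"   # "bCc"
assert forged != original
assert string_hash(forged) == string_hash(original)
```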

    Back to the original topic, there are a lot of people who want to have a pseudokey/hash where they can use it in a searching algorithm /instead/ of actually comparing the data. Well it’s true that if the hash values are different there’s no possible way that the bytestreams are equal, but that in no way implies that the pseudokeys being equal means that the bytestreams are equal.

    Which then comes back to the point. What are you going to do with this? I personally feel that this falls into the category of "if it doesn’t actually have to work right, I can make it as fast as you like".

Secure hashes are great for their intended use. Given a message which must conform to some syntactic requirements (readable English, XML, ASN.1, whatever), it is computationally difficult to construct a message which still fits the syntactic requirements, has the same hash value and is different from the original message /in a useful way/.

    If you can change the message from "send the troops through the northern pass" to "send the troops via the southern river valley" easily, man, you’re in trouble. If the end result looks more like "send the troops via the southern river valley <big honking string of unreadable text>send the troops through the northern pass", well, it’s pretty obvious that there was tampering and the hash has accomplished its task.

Expecting these algorithms to stand up to arbitrary unformatted data is actually not reasonable in the long run. The birthday effect is likely to rear its head before too long.

    Expecting these algorithms to give you a unique enough number that you don’t have to worry about collisions in a context which requires correctness is also unreasonable.

Which then leads me to the point that if all you want is statistical uniqueness in order to decrease the number of times you have to compare byte streams that are different, you can use a significantly (computationally) cheaper hashing algorithm than one of the secure hash algorithms.

Comments are closed.

*DISCLAIMER: I DO NOT OWN THIS CONTENT. If you are the owner and would like it removed, please contact me. The content herein is an archived reproduction of entries from Raymond Chen's "Old New Thing" Blog (most recent link is here). It may have slight formatting modifications for consistency and to improve readability.
