PGP Corporation Logo
select United States productsPurchasedownloadssupportpartnersnewsroomcompanycareerscontact
.
.
 

Much ado about hash functions

03 Sep 2004
At this year's Crypto conference (held at the University of California , Santa Barbara), there was a good deal of excitement about mathematical functions called hash functions. Normally, Crypto is a pretty sedate conference; the mathematical papers presented there do not affect the industry for at least a few years. However, this one was exciting because of the flaws found in a number of hash functions and because of the way that news spreads on the Internet-fast, broadly, and often only partially correct. Now that the conference is over, I wanted to share my thoughts about what happened and how it affects you.

Before we get to that story, however, there's something very important to keep in mind: OpenPGP is an encryption standard for a reason. No other cryptographic system has been proven to be stronger and more reliable. No other crypto system undergoes the same level of scrutiny and peer review. No other crypto system. Period.

So what are hash functions?
Hash functions take an arbitrary chunk of data and produce from it a fixed-length shorter chunk. Consider both of these to be strings of letters or numbers. A number of them produce 16-byte strings, others produce 20-byte strings, and the newest ones produce strings up to 64 bytes long. The "hash" produced from the original string is supposed to be maximally different from any other hash. [A hash is a number generated from a string of text. The hash is much smaller than the text and is generated in such a way that it is unlikely some other text will produce the same hash value.] The hash function itself has its complexity in how it does that.

Hash functions are supposed to be one-way as well. This characteristic means it should be impossible to produce the original string given only the hash.

Hashes have a number of uses. For example, they're used in authentication protocols to avoid sending an actual secret such as a password. Instead of the password, you could send a hash of the password plus some sequence of numbers or a time stamp-all "hashed" together. The one-way nature of the function is important in the above application. Perhaps most important, hashes are used in digital signatures. Digital signatures are actually a signature of the hash of the document. We do this for speed, but we also do this for size. By signing a hash, the resultant digital signature has a small, fixed length no matter how large the signed document is. That means they can be processed quickly and are practical to use. (If digital signatures were as large as the original documents, they would not be used because they would double the size of the document that is being sent and signed.)

Hash function collisions
In a branch of mathematics called combinatorics, there is a basic axiom called The Pigeonhole Principle. At its simplest, The Pigeonhole Principle states that if you have 13 pigeons and only 12 pigeonholes, then at least one hole must contain at least two pigeons. Pretty obvious.

If you apply this principle to hash functions, consider 16-byte hashes. Also consider the entire set of 17-byte strings. According to The Pigeonhole Principle, there are going to be at least two original strings that product the same 16-byte hash. As a matter of fact, there have to be a whole lot of them. We call the situation where two pigeons (the original strings) end up in the same hole (have the same hash) a "collision."

Collisions are strange creatures because they have to exist. However, an ideal hash function should have no better way to find a collision than just keeping trying until you get one . Furthermore, colliding strings should be as different as possible. Let me explain that concept another way. You wouldn't want the string, "I promise to pay US$100," to hash to the same value as , "I'll buy you dinner at a Five Star restaurant the next time you're in town," or even to "The King is a fink." If any of those strings hash to the same value, then a digital signature of one is the same as a signature of the other-and we can't tell what the signature really belongs to. If an attacker can produce a hash collision, then the attacker can effectively forge a signature. The attacker can claim that the two pigeons are the same, as it were.

It might also be possible for an attacker who can make collisions at will to use the colliding string as an alternate password in an authentication protocol. Most hash-based authentication protocols have countermeasures to prevent this, however, I won't bore you with the details.

How hard is it to collide?
Ideally, if you take two pigeons and arbitrarily assign them each a hole, the odds that they are in the same hole is 1 in the number of holes. With 12 holes, that's a 1:12 chance. With a 16-byte hash function, the odds of a collision should be 1 in 2^128; with a 20-byte hash function, 1 in 2^160.

If you have a flock of pigeons that you assign to holes at random, the odds that you will get a collision is equal at approximately the square root of the number of holes. This is the same as a famous statistical problem called "The Birthday Problem." Simply stated, what is the probability that two people in a room full of people have the same birthday? If there are only two people in the room, it's pretty close to 1 in 365. I say "pretty close" because of leap year. In a room of 367 people, it's a certainty because of The Pigeonhole Principle.

The odds of two people having the same birthday are even when there are a mere 23 people in the room. The crossover point of when the odds are a wash is about at the square root of the total number of possibilities, birthdays, or pigeonholes.

This concept is important to hash functions because although the odds of any two given strings resulting in the same hash value is quite small, if you have a set of a lot of strings, the odds of any of them hashing to the same value is larger than one might expect: For a 16-byte hash (one with 128 bits), there's an even chance of a collision when you have a set of a mere 2^64 (or 18,446,744,073,709,551,616) strings. I hope you can tell I'm smiling when I say "mere." For a 20-byte (160-bit) hash, that's at 2^80, or 1,208,925,819,614,629,174,706,176.

Both of those are such large numbers that we cryptographers presume they're not going to happen often. 2^80 is about the number of water molecules in an ounce of water. It's a lot more than all the interesting things written down in the world. We don't care if uninteresting strings ("sd5iujhfkw473q0r") have the same hash as interesting strings ("I will gladly pay you Tuesday for a hamburger today"). We just want to make sure interesting strings don't collide.

This is why finding collisions is both not news and disturbing-at least to people like me. There's no nice mathematical definition of "interesting." It's one of those squishy human things that make the world , uh, well , interesting.

We cryptographers call "a birthday attack" an attack on a system that relies on this aggregation of large numbers of potential solutions and just hoping one of them works.

What happened at Crypto?
Eli Biham and Rafi Chen presented a paper in which they found near-collisions in the compression part of the hash function SHA-0. (Near collisions are ones in which some of the output bits are the same-they got 142 of 160.)

SHA-0 is a 20-byte function produced by the U.S. government in 1993. It was replaced by SHA-1 in 1995. We've never been told why it was replaced, but we all presume it is because SHA-0 had flaws. In 1998, Florent Chabaud and Antoine Joux published a paper describing some mathematical flaws in SHA-0, and we presume these are the ones that led to SHA-1.

Many early reports coming from Crypto mistook SHA-0 for SHA-1, and there was a buzz that this was an attack on SHA-1. No one uses SHA-0, so any flaws in it are not a direct threat. We've known it has problems for a decade-we're merely learning more about what they are.

Late-breaking news from Biham and Chen presented at a rump session (an informal session in which participants give short presentations on recent results, work in progress, and other topics of interest) was that they had produced collisions and near-collisions in reduced-round versions of SHA-1. SHA-1 has a total of 80 rounds, and they found collisions after only 42 rounds and near-collisions after 45 rounds. This revelation may have been what led to more confusion in the reports.

Finding results in reduced-round versions of functions is actually good news. In the life of a function, you expect to have reduced-round attacks, and the absence of them is an indication that a function hasn't received enough analysis. There is, for example, a reduced-round attack on the Advanced Encryption Standard (AES) at 7 of its 10 total rounds.

Antoine Joux, one of the pair who first found flaws in SHA-0, presented a paper on finding collisions in hash functions that are composed of multiple hash functions, first one than another. His paper is not yet available on the International Association for Cryptologic Research's (IACR's) eprint archive.

Joux's paper says that cascading hash functions do not give added security. He concludes, "In this paper, we have shown that multi-collisions in iterated hash functions are not really harder to find than ordinary collisions." The importance of this finding will be apparent in a moment.

In the rump session, Xiaoyun Wang presented results by herself, Dengguo Feng, Xuejia Lai, and Hongbo Yu. The results, but no paper, are available via the IACR website. They found collisions for the 16-byte functions MD4, MD5, HAVAL-128, and RIPEMD. (RIPEMD is not to be confused with RIPEMD-160. A number of reports from Crypto confused the two, and the researchers stress that they have no attack on RIPEMD-160.) They also note that they compute the work to find a collision in SHA-0 to be 2^40, and for HAVAL-160 to be 2^ 32.

Because we don't have their paper yet, we don't know how they're finding collisions. We do know that it's a good mechanism, however. In their MD4 break, they said that they can do it with a work factor of 2^2 to 2^6.

Yes, you saw that right, that was 2^2, commonly called 4 (or in English "four"), and 2^6, better known as 64 (sixty-four). They noted that in this case, the break can be done by hand-no computers required. We have known MD4 is broken for years, but that is an impressive result, indeed. They can't find collisions on an arbitrary starting string yet, maybe not ever, but they can throw pigeons into holes together now. It's conceivable that they'll be able to aim for a specific hole someday.

The last interesting rump session talk was by Philip Hawkes of Qualcomm, who discussed some of his results about SHA-256 that show some parts of it are weaker than one would expect; however, there is currently no way to turn this situation into collisions or near-collisions.

At dinner before the rump session, Adi Shamir (the "S" in RSA) said that he had always thought hash functions were the best-understood part of cryptography. Ron Rivest (the "R" in RSA and the author of MD4 and MD5) said that writing secure hash functions is easy-the trick is to write ones that are both secure and fast. He said he always knew MD4 was an aggressive design. We've known MD5 had flaws since Hans Dobbertin found them in 1996, but no break existed before Wang's.

All in all, these discussions made for quite the exciting week.

What does this mean to PGP users?
In PGP software, there are two types of keys: legacy RSA keys and new keys. The legacy keys use MD5 for their fingerprints as well as signatures. The new keys use SHA-1. We introduced the new key format in 1997 because of a number of issues with the old format, one of which was Dobbertin's paper on weaknesses in MD5. Now that Wang and her team can find collisions easily, that means the pigeons have come home to roost, so to speak.

If you have a legacy key, it's time to retire it. We show legacy keys in your keyring with an old-fashioned, silver-toothed key. We did this for a reason, and we mean it: Retire them. Now. Really.

Churchill said that there's no greater thrill than being shot at and missed. A number of attacks have whizzed by SHA-1, which we use for the present keys, but it hasn't been hit. It's still safe, and so are you.

Nonetheless, it's not unreasonable to start thinking now about what to do if SHA-1 gets hit.

We're working on that. In Joux's new paper, he says that cascading hash functions isn't safe. He also notes that some PGP software developers experimented with this technique and discarded it. This occurred in '96 and '97, and it never went into any products or the OpenPGP standard. We're feeling pretty good about that. With Phil Hawkes's work on SHA-256, however, it isn't clear that moving there immediately wouldn't be jumping from one issue that will benefit from more research to another issue that will benefit from more research. It's still safe where we are, so we're going to expend a lot of care and thought on providing you with new options.

I'll steal one last metaphor from cryptographer Susan Langford. The big bad wolf has blown down the house of straw. We're sitting in the house of sticks now. It might be strong enough-it's withstood everything the wolf's done so far. Besides, the wolf might have asthma. We're looking at brick construction, but some of the experts in brick construction are still debating about how to make bricks that are stronger than sticks, and other housing experts point out that brick construction isn't necessarily safe in an earthquake.

You will see changes from us soon. We'll offer options for people. We will all know more in a year or two, when there will no doubt be advances in the making and breaking of hash functions. Two things are clear, however: PGP products remain safe and we are now living in a very interesting time.

Background reading
Biham, Eli and Rafi Che, "Near-Collisions of SHA-0," International Association for Cryptologic Research (IACR)

Chabaud, Florent and Antoine Joux, "Differential Collisions in SHA-0"

Hawkes, Philip, Michael Paddon, and Gregory G. Rose, "On Corrective Patterns for the SHA-2 Family," International Association for Cryptologic Research (IACR)

.