Unless you have been living under a rock, you have probably heard about the first public SHA-1 collision. If not, go to the page describing the collision and enjoy.
In this blog post I’ll try to explain what this actually entails for most practical purposes.
A primer for newbies
SHA-1 is a cryptographic hash algorithm. A hash algorithm is basically a function that converts whatever you enter as input into an integer of a fixed size. For example, a hash function with an output of up to 10 could map “Hello world” to 5 and “Hi world” to 2.
It isn’t hard to see that our example hash function has a problem: at some point we’ll find two inputs returning the same result. Two such inputs are what cryptographers call a collision.
In this regard, cryptographic hashes have two properties which make them more interesting. On the one hand, they make it difficult to find a collision, that is, two inputs that result in the same output. On the other, it is also hard to find a preimage, i.e. an input that results in a specific output.
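To make the collision idea concrete, here is a small sketch (my own illustration, not part of the real attack): if we deliberately weaken a hash by keeping only the first 2 bytes of SHA-1, a brute-force “birthday” search finds two colliding inputs almost instantly, because the output space is tiny. Real SHA-1’s 160-bit output makes the same search astronomically more expensive, which is why the published collision required clever cryptanalysis instead.

```python
import hashlib
from itertools import count

def tiny_hash(data: bytes) -> bytes:
    """A deliberately weak 16-bit hash: just the first 2 bytes of SHA-1."""
    return hashlib.sha1(data).digest()[:2]

def find_collision():
    """Birthday search: with only 65536 possible outputs, a repeat appears fast."""
    seen = {}
    for i in count():
        msg = f"message-{i}".encode()
        h = tiny_hash(msg)
        if h in seen and seen[h] != msg:
            return seen[h], msg  # two different inputs, same hash
        seen[h] = msg

a, b = find_collision()
assert a != b and tiny_hash(a) == tiny_hash(b)
```

Finding a preimage (an input mapping to one *specific* 16-bit value) already takes roughly 65536 tries here instead of a few hundred, which hints at why preimage attacks are much harder than collision attacks.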
Finally, there is one last concept you should know about hashes: they are usually state based. This means that instead of magically combining the whole input in one go, they keep a small amount of data (called the state), split the input into small chunks of a fixed size, and then merge the chunks one after another into the state to produce a new state. Going back to our previous example function, suppose we divide “Hello world” into groups of 4 letters: first we’ll process “Hell”, then “o wo”, then “rld”, but since “rld” is only three characters long we need to pad the chunk so that it is 4 characters long. A simple way of doing this would be, for example, to always add a dot followed by as many spaces as needed to complete a block, so in our case “rld” would become “rld.”, while a single-character final chunk like “d” would become “d.  ”.
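The chunk-by-chunk processing described above can be sketched in a few lines. This is a toy model of the idea (the mixing step and padding rule are made up for illustration; real SHA-1 uses 512-bit blocks and a far more involved compression function):

```python
def pad(chunk: str, size: int = 4) -> str:
    """Pad a short final chunk with a dot plus spaces, as in the example above."""
    if len(chunk) < size:
        chunk = chunk + "." + " " * (size - len(chunk) - 1)
    return chunk

def toy_hash(message: str, size: int = 4) -> int:
    """Split the input into fixed-size chunks and fold each one into a small state."""
    state = 0  # the initial state
    for i in range(0, len(message), size):
        chunk = pad(message[i:i + size], size)
        for ch in chunk:
            # merge the chunk into the state, producing a new state
            state = (state * 31 + ord(ch)) % 997
    return state

digest = toy_hash("Hello world")  # processes "Hell", "o wo", "rld." in order
```

The important takeaway is that the output depends only on the final state, which is built up block by block; this property is what the published collision exploits, as explained below.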
If you have managed to follow the explanation up to now, it should be easy to explain what has happened.
It’s not just Google’s public collision
First of all, I’d like to note that this is not simply Google’s collision. They worked together with CWI, as you’ll find out in the paper they published. In fact, although you may not know who Marc Stevens is, you may be interested to know that he is one of the cryptographers behind the attack which later made it possible to create a fake CA using MD5.
Now that we have clarified the attribution, let’s try to see what this collision entails.
The published collision
The published collision consists of a PDF file with an embedded JPEG image. The image is designed so that the JPEG parser will jump over a different number of bytes depending on which of the two colliding blocks is present, which makes it possible to embed two different images together by telling the parser to ignore the data of one of them. You can see this in more graphical detail in marcan’s image. In fact, you can even create your own colliding PDFs using, for example, this tool.
After the collision was published, other ways to use it started appearing; you can, for example, see two different colliding HTML pages here and here. And I guess that if you look around you may find some other interesting examples.
So far, these techniques build on the colliding PDFs published by Google and can therefore be easily detected by looking for any files beginning with either of the two published prefixes (including the blocks causing the collision).
What can and can’t be done with the published collision
For the published collision to work, the SHA-1 hash has to apply from the beginning of the published PDF header up to the published collision blocks; after that you can append whatever data you want. This works because once SHA-1 has processed the prefix and the blocks causing the collision, it ends up in the same internal state in both files, given the way in which SHA-1 processes its input. This requires, though, that SHA-1’s internal state is the initial state when processing the first chunk of the prefix, which is why you can’t put arbitrary data before the prefix (well, you actually can, but calculating such data is harder than just calculating a new collision yourself), whereas you can put any data you want after the collision.
As previously said, this means that detecting the published collisions is easy, as all colliding files have to start with the currently published prefix and collision blocks. It also limits the ability to produce other colliding files, as you need the file’s parser to ignore, or parse in a meaningful way, the published prefix and collision blocks, and then use them so that two different results can be obtained. As you have already seen, it is possible to use this to generate colliding “broken” HTML files; it may also be possible to use the attack with other formats that allow back references to their own input and whose parsers ignore invalid input.
But, as said, the colliding prefix and blocks can’t be preceded by other data, and the data following the colliding blocks has to be the same in both files. The combination of both conditions makes it hard to extend the attack arbitrarily, and it also makes detection reasonably easy, as both continuations have to be included in both files.
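The identical-suffix property can be demonstrated with the kind of toy state-based hash described earlier (this is my own miniature model, not SHA-1; with its tiny state, colliding prefixes can simply be brute-forced). Once two different prefixes drive the hash into the same internal state, appending any common suffix preserves the collision:

```python
from itertools import count

def compress(state: int, block: bytes) -> int:
    """Toy compression function: folds one block into the running state."""
    for b in block:
        state = (state * 31 + b) % 997
    return state

def toy_hash(data: bytes, iv: int = 0) -> int:
    """Process the input block by block, starting from a fixed initial state."""
    state = iv
    for i in range(0, len(data), 4):
        state = compress(state, data[i:i + 4])
    return state

# Brute-force two different 4-byte prefixes that reach the same internal state
# (only feasible because the toy state space has just 997 values).
seen = {}
for i in count():
    p = f"p{i:03d}".encode()[:4]
    s = toy_hash(p)
    if s in seen and seen[s] != p:
        p1, p2 = seen[s], p
        break
    seen[s] = p

# Once the internal states match, ANY identical suffix keeps the collision.
suffix = b"any data you want to append"
assert p1 != p2
assert toy_hash(p1 + suffix) == toy_hash(p2 + suffix)
```

Prepending data instead would change the state *before* the colliding blocks are processed, destroying the collision, which is exactly the limitation described above.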
The main contribution made by the collision
Until this collision was published, only theoretical estimations of the cost of finding a SHA-1 collision existed, like the “at least $75,000” figure published on this webpage. Now we instead have practical figures based on the real cost in CPU and GPU years that finding the collision took the team, which makes it easier to estimate the risk that an adversary will try to exploit such a collision. We also now know that carrying out such an attack is indeed practically possible, and it is therefore something that has to be acted on.
How could new collisions be exploited?
A few different ways do exist. So far you have seen a way to exploit this in PDF files using some JPEG parsing features, which means a similar attack is also possible against plain JPEG files. In general, any file format where raw binary data can be introduced, and where the parser can either be made to jump over a user-defined number of bytes or use that chunk as data on which to base a decision, can be abused in such a way. This includes not only some image formats like JPEG or PNG; other formats like AVI may be abused, and even executable files could be. An example of executable files (what you most likely understand as programs) being abused (with MD5 collisions instead of SHA-1) can be seen here.
Despite that, the effects are still pretty limited. As said before, both files have to be interleaved in some way for the attack to work. This may make detection possible in cases like images or videos, but for interpreted formats like executable files it may be harder to detect, because the payloads may be encrypted using the colliding blocks as keys. Also, tools exist to detect malicious SHA-1 collisions, which makes it possible to spot files trying to exploit the known SHA-1 weaknesses.
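To see why “jump over a user-defined number of bytes” enables this, consider a hypothetical minimal format (entirely made up for illustration): the first byte says how many bytes to skip, followed by a length-prefixed payload. Two files can then share the exact same body, carrying both payloads, while a single differing byte — the role played by the colliding blocks in the real attack — selects which one is displayed:

```python
def parse(blob: bytes) -> bytes:
    """Toy parser: first byte = bytes to skip, then a length-prefixed payload."""
    skip = blob[0]
    rest = blob[1 + skip:]
    return rest[1:1 + rest[0]]

image_a = b"first picture"
image_b = b"second picture"
rec_a = bytes([len(image_a)]) + image_a
rec_b = bytes([len(image_b)]) + image_b
body = rec_a + rec_b                     # identical in both files, carries both images

file_one = bytes([0]) + body             # skip nothing   -> parser shows image_a
file_two = bytes([len(rec_a)]) + body    # jump over rec_a -> parser shows image_b
assert parse(file_one) == image_a
assert parse(file_two) == image_b
```

In the real attack the two files differ only inside the colliding blocks, so they hash to the same SHA-1 value while displaying different content.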
How can’t new collisions be exploited (yet)?
Unless a Certification Authority accepts arbitrary parameters in Certificate Signing Requests (i.e. requests to prove the authenticity of a certificate) and uses predictable serial numbers in its certificates, it is unlikely that an attack like the one used to create a false CA will be able to use this technique; that attack depended on being able to create collisions with different initial texts, which is not possible (for now). That said, such an attack became possible around 4 years after the first MD5 collision was published, so it may become feasible in the future (and, assuming state-level agents are 4 years ahead of us, it may even be possible for them now). Also, for now it is still safe (although not a good idea) to use SHA-1 for CAs where you fully trust the people requesting certificates, for example a VPN where you issue both the certificates and keys for your users; still, you should consider moving towards alternatives like SHA-256.
Also, algorithms like HMAC hash a secret key before the message, making collisions harder to exploit, as the attacker can’t know the hash state in advance. That said, other kinds of cryptographic attacks against this construction might appear in the future.
Finally, attacks on constructions for hashing passwords still depend on the weakness of the password. Even if the attacker could figure out the salt (or there was none, which is a bad idea), the worst thing that could be done with this attack is generating two valid passwords for the attacker’s own user, which in most cases won’t present a threat.
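For reference, using HMAC from Python’s standard library looks like this (shown with SHA-256 rather than SHA-1, in the spirit of the migration advice below; the key and message are of course placeholders):

```python
import hashlib
import hmac

key = b"secret signing key"
message = b"message to authenticate"

# The key is mixed into the hash state before the message is processed,
# so an attacker without the key cannot predict the internal state.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

# Verify by recomputing; compare in constant time, never with ==.
expected = hmac.new(key, message, hashlib.sha256).hexdigest()
assert hmac.compare_digest(tag, expected)
```

The constant-time comparison matters because a plain `==` can leak how many leading characters matched through timing differences.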
What to do now?
As a user, there isn’t much you can do other than apply updates and distrust any uses of SHA-1 you see for, for example, file integrity or signatures.
As a developer, you should avoid SHA-1, just as you have probably avoided MD5 by now (unless you are a certain CommitStrip programmer). Instead, use options like SHA-2 or the emerging SHA-3. Some people recommend BLAKE, but I feel it is attracting less attention from cryptanalysts than the other two (correct me if you know otherwise) and therefore won’t recommend it for things where hashes can’t be changed easily (and if they can, you have most likely migrated away from SHA-1 already).
Also, even if the attacks don’t affect your specific use case, this is probably still a good excuse to move away towards better alternatives.
On one side, for integrity checking you can upgrade the hash algorithm you use to, for example, SHA-2 (use at least 256 bits) or SHA-3 (or BLAKE or some other secure hash algorithm). Alternatively, if you encrypt your data, consider using an AEAD cipher. Personally I like AES in SIV mode, but it may not be the most performant option.
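Upgrading integrity checks is usually a one-line change with `hashlib`; a minimal sketch for hashing a file with SHA-256 (reading in chunks so large files don’t have to fit in memory, which mirrors the block-by-block processing described earlier):

```python
import hashlib

def file_digest(path: str, algorithm: str = "sha256") -> str:
    """Stream a file through the chosen hash in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Switching algorithms later (say, to `"sha3_256"`) then only requires changing the `algorithm` argument, which is exactly the kind of agility you want after an event like this.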
On the other, for password hashing I strongly recommend the Argon2 algorithm, as it is designed to make attacks hard even with advanced hardware like GPUs or FPGAs.
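Argon2 itself needs a third-party library in Python, so as a standard-library sketch of the same idea here is scrypt, another deliberately memory-hard construction (the cost parameters shown are a common baseline, not a one-size-fits-all recommendation):

```python
import hashlib
import hmac
import os

def hash_password(password: str, salt: bytes = None) -> tuple:
    """Derive a key with scrypt (memory-hard); store the salt next to the hash."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for every password
    key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, key

def verify_password(password: str, salt: bytes, key: bytes) -> bool:
    """Recompute with the stored salt and compare in constant time."""
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, key)

salt, key = hash_password("correct horse battery staple")
assert verify_password("correct horse battery staple", salt, key)
assert not verify_password("wrong guess", salt, key)
```

The per-user random salt is what makes the “figure out the salt” scenario above the attacker’s best case rather than the default.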
Also, as usual, avoid writing your own cryptography; instead use well-established libraries, and make sure you understand the security implications of the algorithms (even high-level ones like NaCl’s boxes) before you use them, to avoid undesirable results.
Errors and typos
This article may contain errors, typos, misconceptions and lots else. I haven’t had time to proofread it yet, and there may be a significant distance between what I’m trying to explain and what you understand. All corrections and clarifications are well appreciated, through any channel that can reach me 🙂
Aside from the usual suspects, like my parents and family and most of the nice security folks out there, I’d like to thank Tommock, whose house hosted the LAN party at which most of this article was written.