Lately I have found myself reading the paper on Poseidon hash function available here. In essence, this came from the need to understand why this particular hash function is used so widely in the domain of zero-knowledge - what makes it special, and why others just don't cut it (or maybe they do)?
What happened to all the well known hashing algorithms such as MD5, SHA256, Keccak256 (the ethereum one)? Somewhere I also heard of something called MiMC. What really is that?
The text here is basically my findings while I was struggling with the above question. And if you are trying to find some info on the same, maybe just read on.
PSA: While attempts will be made to understand "WHY" such constructions are used, they are not investigated via any deep cryptographic analysis. I am not a cryptographer, nor do I want to sound like one.
A cryptographic hash function is a mathematical algorithm that takes an input "message" (of variable length possibly) and produces a fixed-size string of characters, which is typically a sequence of numbers and letters. The output, often referred to as the hash value or hash code, is unique to the input data. Even a small change in the input data should result in a significantly different hash value.
In a very general sense, we want the cryptographic hash functions to be practically one-way. This means that given some message known such that after application of hashing function , the hash generated is such that finding via reversing the function is practically impossible.
When we say something is "practically impossible", we mean that there is a significant amount of effort that needs to go into finding the pre-image from in terms of computational resources that we are "practically safe". We can measure this for different hashing systems by their defined "bits of security". If some hash function advertises bits of security, it means that on average to find some (or even some other , not necessarity ) that gives value after application of as takes effort in terms of compute. This hardness of not being able to find is called pre-image resistance and the hardness of finding such that is called the collision-resistance property of a hash function.
Wikipedia says:
MD5 is one in a series of message digest algorithms designed by Professor Ronald Rivest of MIT (Rivest, 1992). When analytic work indicated that MD5's predecessor MD4 was likely to be insecure, Rivest designed MD5 in 1991 as a secure replacement. (Hans Dobbertin did indeed later find weaknesses in MD4.)
Further, the same article says:
In 1996, a flaw was found in the design of MD5. While it was not deemed a fatal weakness at the time, cryptographers began recommending the use of other algorithms, such as SHA-1, which has since been found to be vulnerable as well. In 2004 it was shown that MD5 is not collision-resistant.
Clearly, this is not suitable for use in modern times. But still, this may provide us with insights on how such hashing systems were envisioned in the minds of cryptographers of the day. Furthermore, these would lead us to contrast such constructions with new constructions that we tend to see in use today.
More on MD5 can be found here: https://www.okta.com/identity-101/md5
However, the highlight in terms of security remains that in modern times, MD5 hash can be broken, as in some or can be found from within seconds.
However, we find a nice information from MD5, something called Merkle–Damgård construction.
The Merkle–Damgård construction is a general structure of creating a hash function which works as follows:
With the above in perspective, have a look (shamefully copied from wikipedia) image below:
A very naive approach towards padding may be just adding a bunch of zeros at the end of message such that the length becomes a multiple of . This can be though of as follows:
However imagine what happens when we get two messages
and pad it using the above setup. Both will result in the same value. In other words, any message having extra zeros at the end makes it indistinguishable from the one without them
How should one fix this? We can simply use a random value (let's just say 0x67
) as the starting value of padding bits. This would result in following:
However, modern implementations also encode the length of original message in final byte (In real life, this maybe too limiting, a single byte can at max accomodate message length not exceeding 255; but, we can always use modulo operation ;) ). In our case, if we just accomodate a single byte for such encoding, we get the following:
In essence due to above considerations, there are reasons we moved from padding bytes 00 00 00
to 67 00 05
.
Merkle–Damgård construction as described above shows us an algorithm that is inherently sequential. We can make it parallel by in turn having a merkle-tree like construction with each leaf being composed into final hash as below:
This is premise of a SHA-256 based parallel hashing function called PARSHA-256 (Parallel SHA-256). You can read more about it here.
MD5 processes a variable-length message into a fixed-length output of 128 bits. The rate used is 512 (bits). The message hence, is padded to make the input length divisible by 512. Similar to how the padding steps are described above, first, a single bit, 1, is appended to the end of the message. This is followed by as many zeros as are required to bring the length of the message up to 64 bits fewer than a multiple of 512. The remaining bits are filled up with 64 bits representing the length of the original message, modulo .
To understand the setup better, have a look at the following image:
MD5 maintains a "state" of 128-bits at all times. These 128-bits can be broken down into "words" of size 32-bits. Hence, there are 4 of them in state. We call them through which you see above.
Messages as we know are padded and split into chunks of sizes bits. These are essentially 16 32-bit words.
For every single message chunk of size a 4-staged algorithm is run where message chunk of size is used to affect "state" of 128-bits.
Each of the four stages use a different permutation of the 16 32-bit words in the padded message, and also a different equation for box above.
Since there are four stages, and every stage has to consume 16 32-bit words, in total the setup shown in above image is ran 64 times per chunk. are the round constants and hence 64 in numbers. They are calculated using the following equation:
where is in radians.
At the very beginning, the following values are used for , , and in the spirit of initialization-vector (IV) of Merkle–Damgård construction.
Afterwards, the algorithm uses the basic setup:
All the different rounds round1
, round2
, round3
and round4
do the simple shift as follows:
This is evident in the figure given above. However, the way they are different is the definition of function being used and the permutation of message chunk's words.
To explain permutations better, recall that each message chunk was 16 32-bit long words. These could be layed out as follows c[0]..c[15]
. This sequential layout may be called a "trivial" permutation. Another permutation may look like: c[0], c[2], c[4], .., c[14], c[1], c[3], .., c[15]
, basically all even indexed elements before any odd indexed ones. This is another permuation.
Given this context, the different rounds use the following:
That's it, that's the whole of MD5 algorithm.
Flame story: https://en.wikipedia.org/wiki/Flame_(malware) counterfeit