Getting apples, bananas or cherries from hash functions !

This article is a follow-up to the [excellent blog post](https://crypto.junod.info/posts/recursive-hash/) written last year by Pascal Junod. This explains the strange title. The former post was about flaws regarding the lack of domain separation when hashing different types of data. In this new post, we explore related flaws we have found in the wild regarding implementations of hash functions when the result needs to lie in a specific range. # Refresh on domain separation As explained in Pascal Junod's post, domain separation is a way to construct different hash functions from the same primitive. It allows to avoid collisions when the same hash function is used to hash different data types or structured types. For example, if someone wants to hash the array **\[104, 101, 108, 108, 111\]** they may encode it as the bytes **\"68656c6c6f\"** and pass it to a standard hash function like SHA-3. The result will collide with the hash of the string **\"hello\"** or the array **\[6841708, 27759\]** which is undesirable for some applications. The usage of domain separation should avoid such flaws. This problem is still regularly found in deployed solutions; for example we described previously some flaws we have found during audits of [io.finnet](https://research.kudelskisecurity.com/2023/03/23/multiple-cves-in-threshold-cryptography-implementations/) and [Multisig Labs](https://uploads-ssl.webflow.com/62f90a8443126c2ee50f4c4e/643ee88bf91b375b39ba2613_Kudelski_multisig_labs_report_1.1.pdf) threshold cryptography implementations. In this post, we study the dual problem: What happens if we would like to hash values to a specific range like numbers less than a value $n$, to elliptic curve points or even to more complex types? # Hashing to a specific range Some modern constructions like [Identity-based](https://eprint.iacr.org/2001/090) encryption, [Verifiable Delay Functions](https://eprint.iacr.org/2018/623) (VDF), [e-voting](https://gitlab.com/swisspost-evoting/crypto-primitives/crypto-primitives), the [BLS digital signature](https://en.wikipedia.org/wiki/BLS_digital_signature) or the post-quantum [McEliece](https://classic.mceliece.org/spec.html) cryptosystem need a primitive to hash a value into a specific set. This function is usually not defined in the research paper but leaves the responsibility to the implementer of choosing it properly. In addition, the construction usually assumes that the hash function has the usual security properties of a standard hash function, namely, pre-image, second pre-image and collision resistance; additionally, for some constructions, the hash function should emulate a random oracle and thus output uniformly distributed random values. For example the function **SHA3-384** will output values which can be interpreted as an integer between $0$ and $2^{384}-1$. For any power of 2, an Extendable Output Functions (XOF) can be used, like [SHAKE256](https://www.nist.gov/publications/sha-3-standard-permutation-based-hash-and-extendable-output-functions) or even [cSHAKE](https://csrc.nist.gov/publications/detail/sp/800-185/final) to guarantee domain separation. The XOF generates any integer between 0 and $2^{d}-1$. For example in Python, if we want to hash a string to a number less than $2^{16}$, we can run: ```python= >>> import hashlib >>> s = b"Nobody inspects the spammish repetition" >>> int(hashlib.shake_256(s).hexdigest(2), 16) 17520 ``` However, how to hash a value to the interval $[0,q[$ with $q$ not beeing a power of two, for example being the BLS12-381 curve order ? Those problems have been well studied in the past and we will explain how some constructions made with secure hash functions can lead to non-secure results and which construction to use to solve this problem. # Modular bias (again) The naive approach would be to hash the value and take the result modulo $q$. However, the problem is that we often need values to be uniformly distributed over the whole range. Doing the modular reduction results in values not uniformly distributed as explained in a [previous blog post](https://research.kudelskisecurity.com/2020/07/28/the-definitive-guide-to-modulo-bias-and-how-to-avoid-it/), some values will be more probable than others. Depending of the size of the bias, it may be inadequate to some protocol using a hash function and assuming a random behavior. This is the case for protocols relying on the Fiat-Shamir heuristic. In this transformation, the hash function is assumed to behave as a random oracle. In addition, it does not solve the problem of hashing to a more complex set like the group of points on an elliptic curve. # Flawed Hunt-and-Peck methods A method similar to rejection sampling when generating random value is called the "Hunt-and-Peck method" or sometimes "Try-and-Increment method". It consists in hashing a value and if the digest does not fit to the desired output constraints, to hash it again until a satisfying value is found. A naive implementation of such method would be: ```python s = b"Nobody inspects the spammish repetition" q = 4002409555221667393417789825735904156556882819939007885332058\ 13612403165049083786444268762912901566403789427255978740024095552\ 21667393417789825735904156556882819939007885332058136124031650490\ 837864442687629129015664037894272559787 while True: h = hashlib.sha3_384(s).digest() h_int = int. from_bytes(h, "big") if h_int < q: break s = h print(f"hash: {h_int}") ``` We hash the input string with SHA3-384, we transform it into an integer, we compare it with $q$ (the BLS12-381 curve order) and if the value is lesser we output the hash value. At the end we have an integer within the desired range: ``` hash: 466802949991240959638695195289112782003214451371371177487434259700\ 090952460729664879919264417968882381626697562991 ``` However this new function is not second pre-image resistant even though SHA3-384 is secure. Indeed, if we print all the intermediate results of SHA3-384 in the *while* loop we obtain: ``` wp-block-syntaxhighlighter-code hash: 231372203699734843772658872435691913461004831297711563871351445474567697\ 12871740599682813528628879944752756968974946 hash: 479933895132222170400126679210260727863686076173104800018784427314421671\ 3087340391657809976969889746113658783263501 hash: 317169941799967137807456031128642415931560661070087672103355957468914115\ 79808647586409712124449922745051863682425891 hash: 330570197324021711673178918410949420892414812357689606202030224805207232\ 35635591889485122017075342314445605653261792 hash: 686813695028802607523226341653986065177792020510283281815506823883364444\ 6416649005650016224862851787880832921908269 hash: 227480650128345970455069049982372822326884577969625285091838702234552580\ 79719138292929491465597511441781142510793740 hash: 848766522283142174620564882709083272999040620310857079582168691849301258\ 5585156403884755993275197374472206949316440 hash: 118451297423370358310315505593349181773706597080155947819617869495610294\ 88376363251397691522368459978971541344556850 hash: 168369715510320221241132895080565397002290769099621706796604315076030674\ 22230552915330374164223401170221136111112554 hash: 466802949991240959638695195289112782003214451371371177487434259700090952\ 460729664879919264417968882381626697562991 ``` We have generated 9 different values larger than $q$ before getting a good one. It means that all these values result in the same hash results in our new hash function! Thus, we have generated 9 second pre-images of our initial input. This clearly violates the security of the construction. Some variant of this approach have been used in practice. For example, the Swiss Post E-voting system uses such primitive. This E-voting system was tested for a [real vote](https://www.admin.ch/gov/en/start/documentation/media-releases/media-releases-federal-council.msg-id-93455.html) in three Swiss cantons the 18th of June 2023 and is meant to be used in the whole country. All the specifications and source code have been [published](https://gitlab.com/swisspost-evoting) and the security has been studied since a long time through their bug [bounty program](https://yeswehack.com/programs/swiss-post-evoting). Here is the algorithm **RecursiveHashToZq** defined in the cryptographic primitives specification [version 1.2.0](https://gitlab.com/swisspost-evoting/crypto-primitives/crypto-primitives/-/blob/0e400d822991f6e3926a03ff97c556903d4afd01/Crypto-Primitives-Specification.pdf): <figure class="wp-block-image size-large"> <a href="https://cybermashup.files.wordpress.com/2023/06/screenshot-from-2023-06-06-18-18-36.png"><img src="https://cybermashup.files.wordpress.com/2023/06/screenshot-from-2023-06-06-18-18-36.png?w=1024" class="wp-image-18004" /></a> </figure> Basically, The algorithm takes the input message value **v**, hashes it with the function **RecursiveHashOfLength**, tests if the result value $h$ is less then $q$ and if not, it computes the hash of the value $h$ \|\| **v** and so on. Here again, we can construct pre-images for the hash function. If the first value of $h$ is bigger than $q$ then the value **v** and $h$\|\|**v** will give the same hash result. Here is a simple proof-of-concept demonstrating how to obtain a second pre-image: ```python q = 208646720849048248258460097566261786663962005742350872771529628203895944\ 3036820638293295694502177731152986839166585104319969544127560843565893181180\ 1075557016755326547964848188403580046184153559050760320954269217722044553380\ 7165266743597593043396227721772060174659057989924288170330155257675085227388\ 3121832393022719532676466024007375501293923559973411411023159932967006195736\ 6360354521094083860575767783432781332512368526927335941981975258740543166925\ 7568982557851651823658787952806752210181195048969155557292556412046938972855\ 0429711100348611756762750454111218085668726108805692250282799700614637315446\ 1039512873887585231304733406450360631940594775064432042460686829457515735478\ 1183156097956271090007493001241978406291955313717008646071415613450162553884\ 7778843271835472842295866285214247365533616992759565443813739393030477433853\ 7147836855560658006900761829400899136774168048282821143800293635878415814816\ 85094500000648811 v = b64decode("q83vASNFZ4k=") h1 = recursive_hash_zq(q, v) step = 253811875940797317181114679136866713124195493549505954863002468974765\ 5732678862557749208435969481107380814459665436536211726574891524612823116096\ 1310549001646156389668527396025834795621648234958460895562146538794149744985\ 5290258820883135053087103807884805497226677159521032185270960528528109452790\ 4412339617311242945847373101364946953130610527817606716680070281102317872706\ 0553122455780161831918273763460926323370771662133355863076347557139484603501\ 7568567545221519074605608969026404545964365360321688095492738943423567387085\ 8140902530600337552548992841814248314799389339609146507565655896887086919578\ 2974788304374575034741712296951406365410600835587476854120948254880853333532\ 8272301740066828867921849772378630759137955147663936340740986104556406524888\ 5305686707532686471051481169924771998589593761370455327660724258467435909252\ 0742024101839238933613804236523612949290743289457031885316690870363257458804\ 47022845110306798617 v2 = [step, v] h2 = recursive_hash_zq(q, v2) print(h1 == h2) ``` We chose **v** to be a message used in the tests of the library encoded in Base64 and the **step** value is an intermediate step we have found from the first computation of the hash **h1** from the function *recursive_hash_zq*. Then, we obtained a second hash **h2** from a different input **[step,v]** which collides with the first hash. Thus we have the same problem here as before. This [problem](https://gitlab.com/swisspost-evoting/e-voting/e-voting-documentation/-/issues/46) has been reported to Swiss Post and has been [patched](https://gitlab.com/swisspost-evoting/crypto-primitives/crypto-primitives-ts/-/commit/0aeb7f1c570d4e3b9ac8bdd7bb771a9018025334) quickly. Now, the function uses a XOF to hash the message to an integer between $[0,2^{\log_2(q)+256}[$ and then reduces it modulo $q$. There is still a bias as explained before but this is a bias of order $\frac{1}{2^{256}}$ which is too small to be exploited and coherent with the level of security of the whole system. This method is described in the IETF Draft [Hashing to Elliptic Curves](https://www.ietf.org/archive/id/draft-irtf-cfrg-hash-to-curve-16.html#name-hashing-to-a-finite-field) with the method called **hash_to_field** and in "Appendix B—Hashing into a Range (Informative)" of the [SHA-3 Derived Functions](https://tsapps.nist.gov/publication/get_pdf.cfm?pub_id=922422&p=22#%5B%7B%22num%22%3A81%2C%22gen%22%3A0%7D%2C%7B%22name%22%3A%22FitH%22%7D%2C729%5D) specifications. This problem have also be [found](https://github.com/drand/kyber/blob/300068b1379269909855c8383c5258d766209798/group/mod/int.go#L433) in the Kyber Crypto Library for Go during our audit of the timelock encryption. It has been corrected with a solution described after. This is also used in the [Classic McEliece](https://classic.mceliece.org/) public-key cryptosystem to generate a private key. In this case, a 256-bit seed is used to generate the private key which is much larger. If such generation fails, the seed is hashed again until the private key is properly generated. It means that different seeds will generate the same private key at the end. However this issue have been [found](https://classic.mceliece.org/mceliece-security-20221023.pdf#page=24) to be not harmful by the Classic McEliece team since it reduces only the entropy of the private key to 254 bits. # Non-constant time method The previous "Hunt-and-Peck" method can be implemented in a more robust way. For example, the BLS signature uses an algorithm called **MapToGroup** to map a message to a point of an elliptic curve subgroup. It is defined as followed: <figure class="wp-block-image size-large"> <a href="https://cybermashup.files.wordpress.com/2023/06/screenshot-from-2023-06-14-14-24-01.png"><img src="https://cybermashup.files.wordpress.com/2023/06/screenshot-from-2023-06-14-14-24-01.png?w=941" class="wp-image-18120" /></a> </figure> Basically a hash value is computed from the message $M$ and the iteration number $i$ until a valid point on the elliptic curve is found. Since the iteration number is concatenated with the message, this prevents a second pre-image attack as before. However, the iteration number has to be encoded on **I** bits; otherwise the previous problem of domain separation may arise. This construction works and is proven to be secure in the random oracle model in the paper. However, this approach is not constant time since the total number of iterations depends on the message to hash, and it may lead to timing attacks. For example, the Dragonfly handshake used by WPA3 uses this way of mapping a password to an elliptic curve point. This leads to [side channel leakage](https://wpa3.mathyvanhoef.com/) of the password used to authenticate a client on a WiFi network. This is why the usage of the Hunt-and-Peck method is not recommended by the [IETF draft](https://www.ietf.org/archive/id/draft-irtf-cfrg-hash-to-curve-16.html#name-hashing-to-a-finite-field). # Conclusion Getting a uniformly distributed value in a specific range from a hash function can be tricky and often, custom solutions have flaws leading to insecure constructions. If you need to implement such methods, a constant time solution to hash value to a finite field is to use the method called **hash_to_field** defined in the IETF draft "Hashing to Elliptic Curves". To hash values to a group of elliptic curve points, the [determinsitic mappings](https://www.ietf.org/archive/id/draft-irtf-cfrg-hash-to-curve-16.html#name-deterministic-mappings) defined in the same IETF draft is a good option. To obtain hash results in a more complex set, the **MapToGroup** solution with the iteration number hashed together with the value is an option but this construction may suffer from timing attacks, depending on the security context. *I would like to thank Nils Amiet and Pascal Junod for their valuable comments on this blog post.*