# Poseidon in Filecoin
**Review by ADBK Consulting**
*Mikhail Vladimirov and Dmitry Khovratovich*
## Notation
Let $\mathbb{F}$ be the scalar field of the BN254 curve prime subgroup. Let $x_{i:j}$ denote a $(j-i+1)$-element vector $(x_i,x_{i+1},\ldots,x_j)$.
The Poseidon [paper](https://eprint.iacr.org/2019/458.pdf) defines (for various $\mathbb{F}$) a certain bijective transformation from $\mathbb{F}^n$ to $\mathbb{F}^n$ called the *Poseidon permutation*, which we denote by $\mathcal{P}_n$. Let us denote by $\mathcal{P}_n[2]$ the second element of the $\mathcal{P}_n$ output. Based on that, several hash functions have been defined in the paper.
## Poseidon Hash Functions in Filecoin
The Filecoin protocol makes use of several hash functions based on the Poseidon permutation $\mathcal{P}_n$ with different $n$.
### Poseidon Permutation
The permutations $\mathcal{P}_n$ are implemented correctly in Filecoin, which we checked thoroughly. In particular, optimized versions of $\mathcal{P}_n$ with sparser matrices in linear transformations are implemented, and we have verified that all the optimizations are correct and follow the strategy outlined in the original paper. We also checked that the nonlinear transformation and the constant addition are implemented correctly. The constant generation is implemented differently from the reference implementation, since the original paper is ambiguous about the endianness of constant bits.
### Poseidon Hash Functions
The Filecoin implementation defines several hash functions, which we name according to the code:
1. The long-input function $H_{MD}(x_{1:t})$ which operates as follows:
a. Pad the input with zero $\mathbb{F}$ elements until its length $t$ satisfies $t \equiv 1 \pmod{35}$.
b. Return
$$
\mathcal{P}_{37}[2](36,\ldots(36, \mathcal{P}_{37}[2](36,\mathcal{P}_{37}[2](36,x_{1:36}),x_{37:71}),\ldots),x_{(t-34):t}).
$$
2. Fixed-length-input hash functions $H_t$:
$$
H_t(x_{1:t}) = \mathcal{P}_{t+1}[2](t,x_{1:t})
$$
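To make the two definitions concrete, here is a minimal Python sketch of $H_t$ and $H_{MD}$. It is only an illustration: `toy_permutation` is a hypothetical stand-in for $\mathcal{P}_n$ (not the real Poseidon permutation, and with no security claim), the modulus is the BN254 scalar field order assumed from the Notation section, and the handling of inputs shorter than 36 elements (padding up to one full block) is our assumption, since the definition above leaves it open.

```python
# Toy model of H_t and H_MD. NOT the real Poseidon permutation.
# Assumption: BN254 scalar field order, following the Notation section.
P_MOD = 21888242871839275222246405745257275088548364400416034343698204186575808495617


def toy_permutation(state):
    """Hypothetical stand-in for P_n: add-constant, x^5, simple mixing."""
    n = len(state)
    s = [v % P_MOD for v in state]
    for r in range(4):
        # Made-up round constants r*n + i + 1, followed by the x^5 S-box.
        s = [pow(v + r * n + i + 1, 5, P_MOD) for i, v in enumerate(s)]
        # Mixing layer: add the sum of the state to every element.
        total = sum(s) % P_MOD
        s = [(v + total) % P_MOD for v in s]
    return s


def h_fixed(xs):
    """H_t(x_{1:t}) = P_{t+1}[2](t, x_{1:t})."""
    return toy_permutation([len(xs)] + list(xs))[1]


def h_md(xs):
    """H_MD: zero-pad to length 1 mod 35, then chain P_37 calls."""
    xs = list(xs)
    # Assumption: short inputs are padded up to one full 36-element block.
    while len(xs) < 36 or len(xs) % 35 != 1:
        xs.append(0)
    # First call absorbs 36 elements behind the constant tag 36.
    h = toy_permutation([36] + xs[:36])[1]
    # Each later call absorbs the chaining value and 35 new elements.
    for i in range(36, len(xs), 35):
        h = toy_permutation([36, h] + xs[i:i + 35])[1]
    return h
```

Note that every call in `h_md` uses the same tag 36 and carries only a single chaining element forward.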
## Issues in Poseidon-Based Hash Function Design
We have identified the following design issues:
1. The $H_{MD}$ function has a **length extension problem**. Concretely, given $h = H_{MD}(x)$, where the length of $x$ is congruent to $1 \pmod{35}$, it is possible to compute $H_{MD}(x,y)$ for any $y$ even when $x$ is unknown. This is a security issue when $x$ contains secret information and a user is not supposed to be able to compute the hash without knowing it. This might not be a problem if the protocol requires all messages to have the same length, but this is not enforced and might not hold in future versions of the code or protocol.
It is possible to fix this by reusing the first output element of $\mathcal{P}$ instead of setting it to 36, but we recommend a different approach (see below).
2. The $H_{MD}$ function is **not collision resistant**. Indeed, for any $x$ longer than 36 elements we have
$$
H_{MD}(x_{1:36}x_{37:\ldots}) = H_{MD}(wx_{37:\ldots}),
$$
where $w=H_{MD}(x_{1:36})$. The same fix as above would apply here.
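Both issues can be reproduced with a short Python sketch. As before, `toy_permutation` is a hypothetical stand-in for $\mathcal{P}_{37}$ and the BN254 modulus is assumed from the Notation section; the attacks depend only on the chaining structure of $H_{MD}$, not on the permutation itself. The helper name `length_extend` is ours.

```python
# Toy demonstration of the two H_MD design issues. NOT real Poseidon.
# Assumption: BN254 scalar field order, following the Notation section.
P_MOD = 21888242871839275222246405745257275088548364400416034343698204186575808495617


def toy_permutation(state):
    """Hypothetical stand-in for P_n: add-constant, x^5, simple mixing."""
    n = len(state)
    s = [v % P_MOD for v in state]
    for r in range(4):
        s = [pow(v + r * n + i + 1, 5, P_MOD) for i, v in enumerate(s)]
        total = sum(s) % P_MOD
        s = [(v + total) % P_MOD for v in s]
    return s


def h_md(xs):
    """H_MD as described above (short inputs padded to one block: assumption)."""
    xs = list(xs)
    while len(xs) < 36 or len(xs) % 35 != 1:
        xs.append(0)
    h = toy_permutation([36] + xs[:36])[1]
    for i in range(36, len(xs), 35):
        h = toy_permutation([36, h] + xs[i:i + 35])[1]
    return h


def length_extend(h, ys):
    """Compute H_MD(x, y) from h = H_MD(x) alone, without knowing x."""
    ys = list(ys) + [0] * (-len(ys) % 35)  # same zero padding H_MD applies
    for i in range(0, len(ys), 35):
        h = toy_permutation([36, h] + ys[i:i + 35])[1]
    return h


# Issue 1: length extension, with a 36-element "secret" prefix.
secret = list(range(1000, 1036))
suffix = [7, 8, 9]
forged = length_extend(h_md(secret), suffix)
print(forged == h_md(secret + suffix))

# Issue 2: collision between a 71-element and a 36-element message.
x = list(range(1, 72))
w = h_md(x[:36])
print(h_md(x) == h_md([w] + x[36:]))
```

In this sketch the forger only ever sees the digest of `secret`, and the two colliding messages differ in length as well as content.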
## Issues in Poseidon-Based Hash Function Usage
We have identified the following usage issues:
1. The $H_{MD}$ function is too wide. This results in high memory usage and increased computation time, as well as a large number of round constants and matrix elements in the code. We suggest a generic $\mathcal{H}$ construction (see below).
2. The data sent to $H_{MD}$ is padded twice, first in the [caller code](https://github.com/filecoin-project/rust-fil-proofs/blob/tnet2/storage-proofs/post/src/election/vanilla.rs#L204) and then in the [Poseidon internals](https://github.com/filecoin-project/rust-fil-proofs/blob/tnet2/storage-proofs/core/src/hasher/poseidon.rs#L269). The first padding is probably redundant.
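3. An instance $H_{11}$ of Poseidon of width 12, used for column hashing, is also too wide. We recommend implementing it via the $\mathcal{H}$ construction (see below), which would require only 3 calls to the width-5 permutation $\mathcal{P}_5$, since the 11 input elements fit into 3 chunks of 4 elements.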
4. Merkle trees of arity greater than 4 are used, in particular arity 8. We expect this is not optimal in terms of performance due to the large width of the transformation $\mathcal{P}$. We advise using a tree of the same arity everywhere, for instance 2 or 4. Note that a leaf can be a hash of an arbitrarily long object, for which the $\mathcal{H}$ hash can be used (or the fixed $H_{MD}$).
## Optimal Variable- and Fixed-Length Poseidon Hash
In order to process an arbitrarily long string $\mathbf{x}$ of $\mathbb{F}$ elements, the following function $\mathcal{H}$ can be used:
1. Use permutation $\mathcal{P}_5$.
2. Let $l$ be the length of $\mathbf{x}$.
3. Pad $\mathbf{x}$ with zero elements up to a multiple of 4, then split it into chunks $\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_k$ of 4 elements each.
4. Compute iteratively:
$$
(h^1_1,h^2_1,h^3_1,h^4_1,h^5_1) = \mathcal{P}_5(2^{64}+l,\mathbf{w}_1)
$$
$$
(h_i^1,h_i^2,h_i^3,h_i^4,h_i^5) = \mathcal{P}_5(h_{i-1}^1,h_{i-1}^2+w_i^1,h_{i-1}^3+w_i^2,h_{i-1}^4+w_i^3,h_{i-1}^5+w_i^4)
$$
where $\mathbf{w}_i=(w_i^1,w_i^2,w_i^3,w_i^4)$ and addition is in the field. Note that we set the `arity tag` field to $2^{64}+l$. This guarantees that messages of different lengths have different inputs to the first $\mathcal{P}_5$ call.
5. Output $\mathcal{H}(\mathbf{x})=h_k^2$, the second element of the final state.
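The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: `toy_permutation` is a hypothetical stand-in for $\mathcal{P}_5$ (not the real Poseidon permutation), the modulus is the BN254 scalar field order assumed from the Notation section, and the function names are ours. The empty input is rejected because the steps above do not say how to treat it.

```python
# Toy model of the variable-length hash H. NOT the real Poseidon permutation.
# Assumption: BN254 scalar field order, following the Notation section.
P_MOD = 21888242871839275222246405745257275088548364400416034343698204186575808495617


def toy_permutation(state):
    """Hypothetical stand-in for P_5: add-constant, x^5, simple mixing."""
    n = len(state)
    s = [v % P_MOD for v in state]
    for r in range(4):
        s = [pow(v + r * n + i + 1, 5, P_MOD) for i, v in enumerate(s)]
        total = sum(s) % P_MOD
        s = [(v + total) % P_MOD for v in s]
    return s


def split_chunks(xs, rate=4):
    """Step 3: zero-pad to a multiple of `rate` and split into chunks."""
    if not xs:
        raise ValueError("empty input is not covered by this sketch")
    padded = list(xs) + [0] * (-len(xs) % rate)
    return [padded[i:i + rate] for i in range(0, len(padded), rate)]


def h_variable(xs):
    """Steps 1-5: absorb 4 elements per P_5 call, output the second element."""
    chunks = split_chunks(xs)
    # First call: the tag 2^64 + l sits in the first position.
    state = toy_permutation([(2**64 + len(xs)) % P_MOD] + chunks[0])
    # Later calls: keep the first element, add the chunk to the other four.
    for w in chunks[1:]:
        absorbed = [(s + wi) % P_MOD for s, wi in zip(state[1:], w)]
        state = toy_permutation([state[0]] + absorbed)
    return state[1]


# An 11-element column needs ceil(11 / 4) = 3 permutation calls.
print(len(split_chunks(list(range(11)))))
```

Because the length tag enters the first call, `h_variable([1, 2, 3])` and `h_variable([1, 2, 3, 0])` start from different states even though their padded chunks coincide.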
This construction has several advantages:
1. It works well for messages of all lengths, including short messages, where the overhead is minimal. It is thus possible to have a single set of constants for the entire code.
2. One may hash the first chunks of a long message before the last ones are available.
## Other Issues
1. The [`new_with_preimage`](https://github.com/filecoin-project/neptune/blob/v0.5.4/src/poseidon.rs#L158) method does not set the `pos` field properly.
2. The code contains three alternative implementations of the Poseidon hash function. Only one of these implementations is actually used, but all three are reachable. The unused implementations should probably be removed or moved to tests, to make the code simpler and easier to read.