# Encoding and Hashing of Long Objects with Poseidon
Poseidon is a hash function which takes a set of elements of certain field $\mathbb{F}$, called *scalars*, as inputs and outputs one scalar.
## 1. Inner Tree Functions
There are defined several instances of Poseidon to handle Merkle trees with arity $r$. Such an instance, denoted $H_r$, uses a single call to the Poseidon bijective transformation $\mathcal{P}$ of width $r+1$, assigning a fixed value to the first, also called *capacity*, element. To produce a parent hash in a tree, we take child hashes $h_1,h_2,\ldots, h_r$, and output $H_r(h_1,h_2,\ldots,h_r)$, which equals the second element of $\mathcal{P}(2^r-1,h_1,h_2,\ldots,h_r)$.
## 2. Computing the Tree
Suppose we want to compute a Merkle tree of size $r^t$, i.e. with $t$ layers and arity $r$ over a set of $r^t$ objects $A_1,A_2,\ldots,A_{r^t}$ from some object family $\mathcal{A}$ (for example, XML files, JSON objects, etc.). We proceed as follows:
1. **Encoding**: Represent each object $A_i$ as a vector of $k_i$ scalars $W_i=(W_i^1,W_i^2,\ldots, W_i^{k_i})$. Note that $k_i$ varies among objects. This encoding is deterministic (i.e. involves no randomness or counters) and is called *encoding schema* for $\mathcal{A}$.
2. **Hashing into leafs**: Apply a certain Poseidon hash function $H_{long}$, capable of processing arbitrary long scalar vectors to $(W_i^1,W_i^2,\ldots, W_i^{k_i})$ and obtain leaf value $L_i$.
3. **Computing the tree** Compute the Merkle tree over leafs $L_1,L_2,\ldots,L_{r^t}$ using $H_r$ for hashing $r$ leafs into parents, then $r$ parents into their parents, and so on.
## 3. Computing a Zero-knowledge Proof
A zero-knowledge proof about $A_i$ is always calculated and verified against its encoding $W_i$. A generic procedure is the following:
1. Formulate the assertion in terms of family $\mathcal{A}$.
2. Using the encoding schema for $\mathcal{A}$, translate the assertion about scalars $W_i^j$, for example
$$
W_i^1>0 \;AND \; \;AND \; W_i^2<10 \;AND \;W_i^3+W_i^4=5.
$$
3. Calculate the proof using some ZK proof system about $W_i$ and the leaf value $L_i$ that it is the hash of $W_i$.
4. To prove that $A_i$ is inserted in some leaf in the tree, additionally prove the knowledge of the path in the tree starting with $L_i$ and ending with the root.
## 4. Encoding Schemas
### 4.1 Principles
An encoding schema should follow these principles:
1. It should be injective: different objects should be converted to different scalar vectors.
2. It does not have to be cryptographically secure.
3. It should be portable.
4. It should take into account potential zero-knowledge proofs about the objects from $\mathcal{A}$. Concretely, elements that are asserted in proofs should go to distinct scalars and should not be mixed with elements that are not used in proofs.
`Example: a JSON object of form {month="January", day="31", year = "2010",name ="John"} can be represented as a tuple of at least 4 scalars, where the first scalar encodes the month from 1 to 12, the second one encodes the date from 1 to 31, the third one encodes the year from 0 to 7000, and the other represent the name. The name length in scalars can be quite long and vary among objects. Thus in order to assert that the date is between 10 and 20 of some month, it suffices to range-prove the second scalar. However, if we expect to prove something about a name, a more elaborate schema should be used in order to guarantee that the the name is contained in certain scalars.`
### 4.2 Schemas for Binary Objects
For objects that consist of bits we suggest the following simple schema:
1. Pad the binary string with bits 011, then with zero bits up to the multiple of 224.
2. Split the string into 224-bit chunks.
3. Assign each chunk to the scalar.
For bytes:
1. Pad the byte string with byte `0x7`, then with zero bytes up to the multiple of 28.
2. Split the string into 28-byte chunks.
3. Assign each chunk to the scalar.
For words:
1. Pad the word string with word `0xf`, then with zero bytes up to the multiple of 28 bytes.
2. Split the string into 28-byte chunks.
3. Assign each chunk to the scalar.
### 4.3 Schemas for Scalars
#### Same Field
Scalars of the same field $\mathbb{F}$ and vectors of scalars from $\mathbb{F}$ are left intact by the schema.
#### Another Field
Scalars of another field $\mathbb{F}'$ are treated as byte strings of fixed length $l$. For integer fields $\mathbb{F}' =\mathbb{Z}_s$ we define
$$
l = \lceil |s|/8 \rceil,
$$
where $|s|$ is the bitlength of $s$ (the minimum power of 2 not smaller than $s$).
The conversion to the byte string is little-endian. Then the schema for the byte strings is applied (see above).
### 4.4 Schemas for Complex Objects
We also outline a possible schema for 1-level objects that consists of elements of different types, including scalars and variable-length arrays. Consider the object of type $\mathcal{T}$ which contains fields $\{V_i\}$ where each $V_i$ is of type $T_i$. The rules are the following:
1. The first element of the final encoding is a scalar value of unique ID for the object type. It can be computed as
$$
ID=Scalar(SHA224(T_1,T_2,\ldots,T_i)).
$$
2. For each $i$ compose a tuple of scalars $\mathbf{W}_i = (W_{i_1},W_{i_2},\ldots, W_{i_{k_i}})$:
a. If $T_i$ has variable length, set $W_{i_1}\leftarrow length(V_i)$. Then apply the rules in Sections 4.2, 4.3 to get a scalar tuple $(W_{i_2},\ldots, W_{i_{k_i}})$.
b. If $T_i$ has fixed length, apply the rules in Sections 4.2, 4.3 to get a scalar tuple $(W_{i_1},\ldots, W_{i_{k_i}})$.
Example:
$$
\mathcal{T} = \mathrm{Struct}\{x: \mathrm{uint8}; y: \mathrm{Scalar}[]; z: \mathrm{uint256}; w: \mathrm{Scalar}; v: \mathrm{bytes}[33]\}
$$
Then the encoding is a variable length scalar tuple $E = (W_1,W_2,W_{31},W_{32},\ldots W_{3l_3}, W_{41},W_{42},W_{51},W_{61},W_{62})$, where
* $W_1 = Scalar(SHA224(``uint8``,``Scalar[]``,``uint256``,``Scalar``,``bytes[33]``))$;
* $W_2 = \mathrm{Scalar(Pad(Pad(byte(x),0x7)},0,0,\ldots,0));$ (26 zero bytes).
* $W_{31} = length(y)$;
* $l_3 = length(y)+1$;
* $W_{3j} = y[j-1], j>1$;
* $X =\mathrm{Pad(Pad(bytes[32](z),0x7)},0,0,\ldots,0));$ (23 zero bytes)
* $(X_1,X_2) = SplitChunk28(X)$.
* $W_{41} = Scalar(X_1)$;
* $W_{42} = Scalar(X_2)$;
* $W_{51} = w$.
* $Y=\mathrm{Pad(Pad((v),0x7)},0,0,\ldots,0));$ (22 zero bytes).
* $(Y_1,Y_2) = SplitChunk28(Y)$.
* $W_{61} = Scalar(Y_1)$;
* $W_{62} = Scalar(Y_2)$.
Note that this encoding is injective and invertible given $\mathcal{T}$. Note also that scalars in $\mathcal{T}$ are placed intact in specific locations, whose positions are fully defined by $W_{31}$, where the variable length of $y$ is stored.
## 5. Function for Leaf Hashing
We suggest the following function $H_{long}$ to hash an arbitrary long vector of scalars into a leaf value. Let $W=(W_1,W_2,\ldots,W_t)$ be scalar of length $t$. Then we proceed as follows:
1. Use the bijective Poseidon transformation $\mathcal{P}_5$ of width 5.
2. Pad $W$ with zero elements up to the multiple of 4, then split it into chunks $\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_{\lceil t/4\rceil}$ of 4 elements each.
4. Compute iteratively:
$$
(h^1_1,h^2_1,h^3_1,h^4_1,h^5_1) = \mathcal{P}_5(2^{64}+l,\mathbf{w}_1)$$
$$
(h_i^1,h_i^2,h_i^3,h_i^4,h_i^5) = \mathcal{P}_5(h_{i-1}^1,h_{i-1}^2+w_i^1,h_{i-1}^3+w_i^2,h_{i-1}^4+w_i^3,h_{i-1}^5+w_i^4)
$$
where $\mathbf{w}_i=(w_i^1,w_i^2,w_i^3,w_i^4)$ and addition is in the field. Note that we set the capacity field into $2^{64}+l$. This guarantees that messages of different length have different input to the first $\mathcal{P}_5$ call.
5. Output $\mathcal{H}(\mathbf{x})=h_{last}[2]$.