Status | Stable |
---|---|
Version | 2021-03-22 |
Permalink | https://hackmd.io/@samoht/tezos-context-hash |
Contact | https://github.com/tarides/tezos-context-hash/issues |
This document provides a specification for computing the encoding of each kind of objects present in the Tezos context. Tezos then applies 32-bytes BLAKE2B-256 to compute context hashes from these encodings. Moreover, hashes are displayed using a base 58 representation and are prefixed by Co
and suffixed by a 4-byte checksum. For instance, CoVGWKM7Ufu6dk74CEQz3MgffhUPFyeaMCD6eS3Q8o7mDis8n1Vi
is a valid context hash for Tezos.
Notations:
blake2b(s)
is the BLAKE2B-256 hash of the string s
.encode(v)
is the raw sequence of bytes representing the encoding of v
.hash(v) = blake2b(encode(v))
is the cryptographic hash of the object v
.len(s)
is the length of the string s
(in bytes).This documents uses a graphical notation to represent encodings:
s
an encoding and n = len(s)
be its size in byte. This is represented as:
n |
---|
s |
s
and t
be 2 encodings and n = len(s)
and m = len(t)
be their respective sizes (in bytes). The concatenations
and t
is represented as:
n | m |
---|---|
s | t |
The Tezos context has two modes for encoding integers:
Notations:
\x
instead of the binary representation of the integer x
, e.g. \32
instead of \000\000\000\000\000\000\000\032
for fixed-length encoding or \032
for variable-length encoding.Example: encoding the integer 1298532
.
8 |
---|
\000\000\000\000\000\019\208\100 |
(LEB128) |
---|
\228\160\079 |
Contents are raw sequences of bytes. Encoding contents is done by concatenating their length, encoded on 8 bytes, along with their value:
8 | len(v) |
---|---|
\len(v) |
v |
Example: encoding the contents "delphi_007"
.
8 | 10 |
---|---|
\000\000\000\000\000\000\000\010 |
delphi_007 |
Commits are immutable objects that contain metadata about changes done to the context, as well as pointers to previous commits and to the current Merkle trees root. These objects allow Tezos to track the history of all the changes done to the context.
Commits metadata are triplet: date int64
. Author is usually the string "Tezos"
but could change in the future. Message is a string containing information about the corresponding Tezos block such as its level, fit, priority and number of operations.
A commit metadata is encoded as follows:
8 | 8 | len(author) |
8 | len(message) |
---|---|---|---|---|
\date |
\len(author) |
author |
\len(message) |
message |
Example: encoding the commit metadata with date 1612521119 and message "msg"
.
8 | 8 | 5 | 8 | 3 |
---|---|---|---|---|
\000\000\000\000\039\029\030\159 |
\000\000\000\000\000\000\000\005 |
Tezos |
\000\000\000\000\000\000\000\03 |
msg |
Commits are triplet: tree
8 | 32 | 8 | 8 | 32 | … | 8 | 32 | m |
---|---|---|---|---|---|---|---|---|
\32 |
tree |
\k |
\32 |
… | \32 |
encode(metadata) |
where m = len(encode(metadata)).
Example: encoding the commit with tree hash <tree>
, one parent <parent>
and the metadata encoded by <m>
.
8 | 32 | 8 | 8 | 32 | len(m) |
---|---|---|---|---|---|
\000\000\000\000\000\000\000\032 |
<tree> |
\000\000\000\000\000\000\001 |
\000\000\000\000\000\000\000\032 |
<parent> |
<m> |
The Tezos context contains Merkle trees. Contents encoding has already been explained in a previous section. This section presents the encoding for internal nodes of the tree. In Tezos, tree nodes are equivalent to a list of entries, pointing to the immediate children of the node.
A tree entry is a triplet: name
The complexity of verifying (and computing) the hash of tree nodes depends on the number of its entries. To reduce that cost on large nodes, Tezos distinguishes between tree nodes holding at most 256 entries and tree nodes with more than 256 ones. The small nodes are represented as a list of entries and have a linear complexity, while large nodes are using an optimised representation – similar to inode tables implemented by filesystems – to handle large directories with a logarithmic complexity.
Until now in Tezos (i.e. Edo), all the directories have at most 256 entries and so do not use this optimised representation. This might change with sappling.
Tree nodes representations
The encoding of a tree node depends on the number n
of entries present in that tree node.
Directories with at most 256 entries are encoded as a flat list of entries.
As already described above, a tree entry is a triplet: name
8 | (LEB128) |
len(name) |
8 | 32 |
---|---|---|---|---|
kind |
\len(name) |
name |
\32 |
hash |
kind
is either:
"\255\000\000\000\000\000\000\000"
if the entry is pointing to a contents."\000\000\000\000\000\000\000\000"
if the entry is pointing to a node.Example: encoding an entry whose name is "protocol"
, pointing to some contents with hash <hash>
.
8 | 1 | 8 | 8 | 32 |
---|---|---|---|---|
\255\000\000\000\000\000\000\000 |
\008 |
protocol |
\000\000\000\000\000\000\000\032 |
<hash> |
A node is a list of at most 256 entries, sorted by lexicographic order of names. A node with
8 | … | ||
---|---|---|---|
\n |
encode(e_1) |
… | encode(e_n) |
where
Because of storage optimisations implemented in Tezos, the encoding of directories with more than 256 entries is computed in a different way, and requires some transformations prior to its computation. In fact, inodes are themselves represented as trees. There are two kinds of inode objects: values (leafs) and trees.
An inode value contains a flat list of tree entries.
As already described, a tree entry is a triplet: name
(LEB128) |
len(name) |
1 | 32 |
---|---|---|---|
\len(name) |
name |
kind |
hash |
where kind is \000
for nodes, and \001
for contents.
Example: encoding an entry whose name is "protocol"
and pointing to some contents with hash <hash>
.
1 | 8 | 1 | 32 |
---|---|---|---|
\008 |
protocol |
\001 |
<hash> |
An inode value is a list of at most 32 entries, sorted by lexicographic order of names. A value with
1 | 1 | … | ||
---|---|---|---|---|
\000 |
\n |
encode(_1) |
… | encode(e_k) |
where
Inode trees contain a list of inode pointers
An inode pointer is a pair: index
1 | 32 |
---|---|
index |
hash |
An inode tree is a triplet: depth len(entries)
aggregates the total number of entries that are reachable via the pointers. This information can for instance be used to decide whenever an inode value has to be converted into a inode tree (or the reverse) when a new entry is added (or removed) from the tree.
An inode tree with k
pointers
1 | (LEB128) |
(LEB128) |
1 | 33 | .. | 33 |
---|---|---|---|---|---|---|
\001 |
depth |
len(children) |
\k |
… |
where
Example: encoding an inode tree of depth 0
, length 50000
and a list of pointers encoded by
1 | 1 | 3 | 1 | 33 | .. | 33 |
---|---|---|---|---|---|---|
\001 |
\000 |
\160\194\030 |
\032 |
… |
So far we have described only how to encode inodes but not how to construct them. This section explain how to do this and turn a list of entries into tree-like structure of inode trees and values.
The goal is thus to partition a list of entries of size greater than 256 into disjoint groups of entries of at most 32 elements, corresponding to inode values described above; and to create intermediate inode trees to efficiently dispatch names to the right leaf value using inode pointers.
The crux of the algorithm is to compute indices. Given an entry with name n
, its index at depth d
is defined as:
index(d, n) =
ocaml_hash(d, n) mod 32
where ocaml_hash
is a custom, non-crypto hash function defined by the the OCaml's runtime and based on MurmurHash3 and defined as follows:
mix(h: <uint32>, w: <uint32>) {
w *= 0xcc9e2d51;
w = w << 15 | w >> 17;
w *= 0x1b873593;
h ^= w;
h = h << 13 | h >> 19;
h = h * 5 + 0xe6546b64;
}
uint32 ocaml_hash(seed: <uint32>, s: <string>) {
uint32 h = seed;
for (i = 0; i < len(s); i = i + 4) {
uint32 w;
if big_endian_machine
w = rev(s[i:i+4])
else
w = s[i:i+4];
mix(h, w);
}
// At this point there might be up to 3 bytes left to read.
if i <> len {
// Bytes that are out of range should be set to \000.
uint32 w = rev(s[i:i+4]);
mix(h, w);
}
h ^= (uint32) len(s);
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h & 0x3FFFFFFFU;
}
First, let us define a few notations. Let
Let
The following function partitions recursively a list of entries
The encoding of a tree node with