Understanding RLP

# Understanding RLP

Resources:

1. [rlp decoding by derao](https://medium.com/coinmonks/ethereum-under-the-hood-part-3-rlp-decoding-df236dc13e58)
2. [rlp encoding by derao](https://medium.com/coinmonks/ethereum-under-the-hood-part-2-rlp-encoding-ver-0-3-c37a69781855)
3. [rlp by Phan Son Tu](https://medium.com/coinmonks/data-structure-in-ethereum-episode-1-recursive-length-prefix-rlp-encoding-decoding-d1016832f919)

> Why on earth do blockchains need brand new encoding schemes? We are already familiar with fast and efficient serialization schemes such as JSON, protobuf and MessagePack. One of the important issues with the existing serialization libraries is that they don’t guarantee deterministic serialization. For example, object fields in JSON are ordinarily serialized in an arbitrary order. This is very problematic because blockchains calculate the block hash from the serialized bytes of a block. Indeterministic serialization can give us multiple hash values for the same block.

> Why can't we define our own hashcode() function per object type (e.g. for Transaction, Block, BlockHeader, etc.)?

# RLP encoding rules

# Rule 1

When the payload is a single byte whose value falls within [0x00, 0x7f] (that is, [0, 127]), the byte is its own RLP encoding.

```
note: 1 byte can store values from [0..255]
```

# Rule 2

When encoding a string (as a byte array) that is between 0 and 55 bytes long and is not already covered by Rule 1, apply the following logic.

```
0x80+length(string),string
```

note:

```
0x80 = 128
```

eg1:

```
string "dog" to byte-array in hex is [0x64,0x6f,0x67]
and size of the byte array is 3,
so the prefix is 0x80 + 3 = 0x83 and the encoding is [0x83,0x64,0x6f,0x67]
```

eg2:

```
$ "hello world" to rlp
[0x8B 0x68 0x65 0x6C 0x6C 0x6F 0x20 0x77 0x6F 0x72 0x6C 0x64]

analysis:
0x8B = 139 = 128 + 11, where 11 is the number of bytes in "hello world"

>>> x="hello world"
>>> from binascii import hexlify
>>> hexlify(x.encode())
b'68656c6c6f20776f726c64'

note the 68, 65, 6c, 6c, 6f ...
```

# Rule 3

When encoding a string (as a byte array) longer than 55 bytes, apply the following rule.

```
1) 0xb7 + length_in_bytes(byte_size(string))
2) byte_size(string), as big-endian bytes
3) the string bytes
Join 1,2,3
```

```
0xb7 = 183
```

eg1:

```
s = "Hello there, I am a very very long string, and I am going get encoded in RLP!"
len(s) == 77
byte_size(s) = 77
length_in_bytes(byte_size(s)) => length_in_bytes(77) => 1

explanation:
1. 0xb7 + 1 => 184
2. 77
3. b"Hello there, I am a very very long string, and I am going get encoded in RLP!"

1|2|3 => 184 | 77 | b"Hello there, I am a very very long string, and I am going get encoded in RLP!"
      => '0xb8' | '0x4d' | b'48656c6c6f2074686572652c204920616d206120766572792076657279206c6f6e6720737472696e672c20616e64204920616d20676f696e672067657420656e636f64656420696e20524c5021'
      => B84D48656C6C6F2074686572652C204920616D206120766572792076657279206C6F6E6720737472696E672C20616E64204920616D20676F696E672067657420656E636F64656420696E20524C5021
```

# Rule 4

When encoding a list whose encoded payload (the concatenation of its encoded items) is between 0 and 55 bytes long, apply the following encoding rule.

```
1) 0xc0 + length(encoded payload)
2) encoded payload
Join 1,2
```

eg1:

```
l = ["dog", "mouse", "tigers", 127]

"dog"    : 0x83, 0x64, 0x6F, 0x67
"mouse"  : 0x85, 0x6D, 0x6F, 0x75, 0x73, 0x65
"tigers" : 0x86, 0x74, 0x69, 0x67, 0x65, 0x72, 0x73
127      : 0x7F

concatenate all of them:
[ 0x83 0x64 0x6F 0x67 0x85 0x6D 0x6F 0x75 0x73 0x65 0x86 0x74 0x69 0x67 0x65 0x72 0x73 0x7F ]
and the length of this payload is 18 bytes (0x12)

so the final output of this rule is:
1) 0xc0 + 0x12 : 0xd2
2) [ 0x83 0x64 0x6F 0x67 0x85 0x6D 0x6F 0x75 0x73 0x65 0x86 0x74 0x69 0x67 0x65 0x72 0x73 0x7F ]

Join 1 and 2
Final output : 0xd2 0x83 0x64 0x6F 0x67 0x85 0x6D 0x6F 0x75 0x73 0x65 0x86 0x74 0x69 0x67 0x65 0x72 0x73 0x7F
```

# Rule 5

When encoding a list whose encoded payload is longer than 55 bytes, apply the following encoding rule.
```
1) 0xf7 + length_in_bytes(item 2)
2) length(payload), as big-endian bytes
3) encoded payload
Join 1,2,3
```

eg1:

## Notes

* RLP encodes positive integers in big-endian format and discards the leading zeros. The integer value zero is encoded the same as an empty byte array.
* There are fixed encodings for the empty list, the empty string, and the integer 0:

```
1) Empty List [] encoded to 0xC0
2) Empty String "" encoded to 0x80
3) Integer 0 encoded to 0x80
4) Boolean, true encoded to 0x01, false encoded to 0x80
```

* In short, the first byte tells the type:

```
[0x00, 0x7f]: byte   : [0..127]
[0x80, 0xbf]: string : [128..191]
[0xc0, 0xff]: list   : [192..255]
```

> q1) Why don’t we use a fixed prefix instead of a dynamic prefix?

The main reason is to save space. With a fixed prefix, we would have to add it to every single input we want to encode, and in some situations the data itself is even shorter than the prefix. Of course, the encoding would be easier to read if a fixed prefix were used to denote the type.

> q2) Why did they choose 0x7f, 0x80, 0xbf, 0xc0 as checkpoints?

## Recap

* RLP is a set of rules to encode an item or a list of items.
* RLP applies a different rule depending on the size of the payload.
* A string is a byte array.
* The empty string and the empty list have predefined encodings.
* RLP is used because it encodes data compactly and is simple.

# RLP decoding rules

# Rule 1

Look at the first byte. It should fall in one of the following ranges: [0x00 .. 0x7f], [0x80 .. 0xb7], [0xb8 .. 0xbf], [0xc0 .. 0xf7], [0xf8 .. 0xff]. Determine the data type from the range the byte falls within:

```
[0 .. 127] => String type, a single byte, decode as it is
[128..183] => String type and it is a short string
[184..191] => String type and it is a long string
[192..247] => List type and it is a short list
[248..255] => List type and it is a long list
```

# Rule 2

Get the length of the payload byte array: for short strings and lists it is encoded in the first byte itself, and for long ones in the length bytes that follow it.

# Rule 3

Repeat steps one and two until the end of the byte array is reached.
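The encoding and decoding rules above can be tried out with a short Python sketch. This is only an illustration under stated assumptions, not a reference implementation: the names `rlp_encode`, `rlp_decode`, `to_minimal_bytes`, `encode_length` and `_decode_at` are made up for this note, text strings are assumed to be UTF-8 encoded before applying the rules, and the decoder does not reject non-canonical input.

```python
def to_minimal_bytes(n: int) -> bytes:
    """Big-endian bytes of a non-negative int, no leading zeros (0 -> b'')."""
    return n.to_bytes((n.bit_length() + 7) // 8, "big")


def encode_length(length: int, offset: int) -> bytes:
    """Build the prefix; offset is 0x80 for strings and 0xc0 for lists."""
    if length <= 55:                           # Rule 2 / Rule 4
        return bytes([offset + length])
    length_bytes = to_minimal_bytes(length)    # Rule 3 / Rule 5
    return bytes([offset + 55 + len(length_bytes)]) + length_bytes


def rlp_encode(item) -> bytes:
    if isinstance(item, int):                  # ints (and bools) become byte strings
        item = to_minimal_bytes(item)
    elif isinstance(item, str):                # assumption: text is UTF-8
        item = item.encode()
    if isinstance(item, bytes):
        if len(item) == 1 and item[0] <= 0x7F:  # Rule 1: byte is its own encoding
            return item
        return encode_length(len(item), 0x80) + item
    if isinstance(item, list):
        payload = b"".join(rlp_encode(x) for x in item)
        return encode_length(len(payload), 0xC0) + payload
    raise TypeError(f"cannot RLP-encode {type(item).__name__}")


def _decode_at(data: bytes, pos: int):
    """Decode one item starting at pos; return (item, position after it)."""
    prefix = data[pos]
    if prefix <= 0x7F:                         # single byte, decode as it is
        return data[pos:pos + 1], pos + 1
    is_list = prefix >= 0xC0
    base = 0xC0 if is_list else 0x80
    if prefix <= base + 55:                    # short string / short list
        start, length = pos + 1, prefix - base
    else:                                      # long string / long list
        start = pos + 1 + (prefix - base - 55)
        length = int.from_bytes(data[pos + 1:start], "big")
    end = start + length
    if end > len(data):
        raise ValueError("truncated input")
    if not is_list:
        return data[start:end], end
    items, cur = [], start
    while cur < end:                           # decoding Rule 3: repeat until the end
        child, cur = _decode_at(data, cur)
        items.append(child)
    return items, end


def rlp_decode(data: bytes):
    item, end = _decode_at(data, 0)
    if end != len(data):
        raise ValueError("trailing bytes after item")
    return item


print(rlp_encode("dog").hex())                              # 83646f67
print(rlp_encode(["dog", "mouse", "tigers", 127]).hex())    # d283646f67856d6f757365867469676572737f
print(rlp_decode(rlp_encode(["dog", ["cat"], 127])))        # [b'dog', [b'cat'], b'\x7f']
```

The decoder returns raw byte strings and nested lists only, because RLP itself does not record whether a byte string was originally text or an integer; interpreting the bytes is left to the caller.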