# Draft

# Directory Hierarchy Suggestion

- `codec_generator`
  - `test` (with the tests)
    - `rust.ml`: a script that uses `Codec_generator.Rust.generate encoding`, where `encoding` is defined above
    - `main.ml`: run it to run the tests (`dune exec src/test/main.exe`); it:
      - calls the `rust.exe` binary (compiled from `rust.ml`) to produce a Rust source file
      - then calls the Rust compiler on the produced Rust source file
      - then calls the compiled Rust binary to execute the generated code
  - `lib` or nothing or `src`
    - `rust`
      - `ast.ml` (with combinators that are often used)
      - `pretty.ml` (pretty-printing)
      - `generator.ml` (from `Simplified.t` to `Ast.t`)
    - `codec_generator.ml` if you want an entrypoint
    - `codec_generator.mli` (the doc for the lib)
    - `simplified.ml`
  - `rust_runtime` (in place of `rust/kernel`) or `runtime/rust`
    - some `.rs` files used by the generated code
  - `demo_generated_code` or `generated_code_TO_REMOVE` (in place of `rust/gen`)
    - some `.rs` files generated by `test`

In `ast.ml`, you can use combinators like:

```ocaml
let ( @-> ) a b = Arrow (a, b)
```

For formatting:

- Use `Format` directly.
- Try an existing library if it becomes too heavy, e.g., [PPrint](https://github.com/fpottier/pprint) ([documentation](http://cambium.inria.fr/~fpottier/pprint/doc/pprint/PPrint/index.html)).

An alternative layout, with the same `test` directory:

- `codec_generator`
  - `test` (as above)
  - `common` or nothing or `src`
    - `codec_generator.ml` if you want an entrypoint
    - `codec_generator.mli` (the doc for the lib)
    - `simplified.ml`
  - `rust`
    - `runtime`
    - `lib_generator`
    - `demo_generated_code`

# Codec Compiler (new)

## Goal
1. add code generation to `data_encoding` (the project) as a side library
2. use it for Tezos

It is actually a loop: do 1. for one backend (e.g. Rust), test it on Tezos (2.), then do 1. for another backend, test it on Tezos (2.), etc.

## Roadmap

### Checkpoint 0: choose the target language

Probably TypeScript. Or Rust 1.52.0 (the version used by Octez).

### Checkpoint 1: one trivial case

- start with one test case for the first type you want to encode (it can be `unit`; maybe `int` makes more sense)
- write the intermediate language, with only one type at first (the one needed for that test case)
- write the conversion function from `Data_encoding.t` to this intermediate language
  - `| _ -> assert false (* TODO *)` for the other cases of `Data_encoding.t` when converting to the intermediate language
- write (or pick an existing library for) the AST of the target language (only what you actually need for now; it should be very small at the start)
- write the code generation from the intermediate language to the target language (for the small part that you currently have)
- write the pretty-printer for the target language (for the small part that you currently have)
- write the test framework for this; for each test case:
  - start with a JSON value (it is more readable in the code)
  - encode it with data-encoding
  - decode it in the target language (e.g. Rust)
  - encode it back (still in the target language)
  - check that the two binary values are equal
- ensure the test passes

### Checkpoint 1.5: run the tests in the CI

### Checkpoint 2: add all the easy cases

- Primitives: integers, booleans, (fixed-length) strings, etc.

### Checkpoint 3: add the semi-easy cases

- Not-quite primitives: enums
- Dynamically sized: (dynamically sized) strings
- Dynamically sized with looping: lists, arrays, etc.

### Checkpoint 3.5: add the semi-hard cases

- Variably sized with looping: `N`, `Z`

### Checkpoint 4: add the harder cases

Product types (tuples and records), sum types (unions).

### Checkpoint 5: recursive types

`Mu`.
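The per-test-case round trip from Checkpoint 1, seen from the generated Rust side, could look like the following sketch for the `int32` case (the `decode_int32`/`encode_int32` names are invented; the sketch assumes data-encoding's binary layout for `int32`, i.e. 4 bytes, big-endian):

```rust
// Sketch of a round trip for the int32 case (hypothetical names).
// Assumption: data-encoding serializes int32 as 4 bytes, big-endian.

fn decode_int32(bytes: &[u8]) -> Option<(i32, &[u8])> {
    if bytes.len() < 4 {
        return None;
    }
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[..4]);
    Some((i32::from_be_bytes(buf), &bytes[4..]))
}

fn encode_int32(value: i32, out: &mut Vec<u8>) {
    out.extend_from_slice(&value.to_be_bytes());
}

fn main() {
    // Bytes as data-encoding would produce them for the int32 value 42.
    let input: &[u8] = &[0, 0, 0, 42];
    let (value, rest) = decode_int32(input).expect("decoding failed");
    assert!(rest.is_empty());
    let mut output = Vec::new();
    encode_int32(value, &mut output);
    // The round trip must reproduce the original bytes exactly.
    assert_eq!(input, output.as_slice());
}
```

Comparing the re-encoded bytes against the original ones (rather than comparing decoded values) is what catches layout bugs in the generated code.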
### Checkpoint 6: test with Tezos test cases

Find those test cases on-chain, for instance.

### Checkpoint 7: make sure it is very robust

Use PBT to generate more test cases, at least for the base types.

---

## Detailing 1 (Data-Encoding Part)

#### Set Up the Test "Framework"

To get the tight/fast feedback loop when you code.

#### Intermediate Language

Probably useful:

- to get rid of the `'a` parameter of `Data_encoding.t`
- and to simplify the ADT (remove some useless cases, keep only the binary part (for now), etc.)
- but make sure to keep all the relevant information needed to produce correct binaries
  - this is why testing with `tezos-codec` is important, and also why you want a tight 1.-2. feedback loop

#### Test Cases

- have test cases (i.e. maybe binary values, maybe PBT-generated)
  - maybe use PBT only for the base cases (ints in particular), if you want to
  - but you need (many) example cases written by hand for the rest
- start with very few test cases and add more as you add more features
- use the chain data (> 1M blocks with plenty of actual real-life data with a registered encoding)

## Peter's New Solution

- start from the encoding registrations
- but this uses the description, so => same bugs
- RP: but we can actually get the full encoding from the ID here, instead of using the description
  - and we can change it to add what we need fairly easily

# Codec Compiler (old)

## C Encoding Branch

- Data-Encoding's GADT (branch `romain-cencoding`)
- Intermediate Representation
- C backend, for the binary encoding
- JS backend?
- Java backend?
- Rust backend?
- same for the JSON encodings??
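To make the "Rust backend?" idea concrete, here is a sketch of what such a backend might emit for an encoding like `tup2 int32 bool` (all names are invented; it assumes data-encoding's binary layout: `int32` as 4 bytes big-endian, `bool` as one byte, `0xFF` for true and `0x00` for false):

```rust
// Hypothetical output of a Rust backend for `tup2 int32 bool`.
// Assumed layout: int32 = 4 bytes big-endian; bool = 1 byte
// (0xFF = true, 0x00 = false).

#[derive(Debug, PartialEq)]
struct Int32AndBool {
    first: i32,
    second: bool,
}

fn decode_int32_and_bool(bytes: &[u8]) -> Option<Int32AndBool> {
    // Fixed-size encoding: exactly 4 + 1 bytes.
    if bytes.len() != 5 {
        return None;
    }
    let mut buf = [0u8; 4];
    buf.copy_from_slice(&bytes[..4]);
    let first = i32::from_be_bytes(buf);
    let second = match bytes[4] {
        0x00 => false,
        0xFF => true,
        _ => return None, // reject malformed booleans
    };
    Some(Int32AndBool { first, second })
}

fn encode_int32_and_bool(v: &Int32AndBool, out: &mut Vec<u8>) {
    out.extend_from_slice(&v.first.to_be_bytes());
    out.push(if v.second { 0xFF } else { 0x00 });
}

fn main() {
    let v = decode_int32_and_bool(&[0, 0, 1, 0, 0xFF]).unwrap();
    assert_eq!(v, Int32AndBool { first: 256, second: true });
    let mut out = Vec::new();
    encode_int32_and_bool(&v, &mut out);
    assert_eq!(out, vec![0, 0, 1, 0, 0xFF]);
}
```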
To generate the code:

- link with all the encodings (just like the node)
- this means the encodings have been `register`ed
- so you can list them all and iterate on them to generate all the encodings

## Peter's Solution

- `tezos-codec dump encodings`
- read the JSON output
- convert it to an intermediate representation
- TypeScript: waiting for feedback => binary encoding
- started GoLang => binary encoding
  - use unions??? classes??
  - basically:
    - flatten everything
    - dump all the fields together in a struct
    - maybe a tag field to distinguish the cases
    - *or* a struct with one field per case

## Differences

Two major differences:

- different intermediate representations
- starts from `tezos-codec`

`tezos-codec` links with all the encodings (like the node) and uses the `register` mechanism to list them.
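The "flatten with a tag field" versus "one field per case" alternative can be contrasted in Rust terms (all type and field names are invented for this sketch; in Rust the second option maps naturally onto an `enum`):

```rust
// Sketch of the two union representations discussed above
// (type and field names are invented).

// Option 1: flatten everything into a single struct, with a tag
// field to distinguish the cases and one optional field per case.
#[derive(Debug, PartialEq)]
struct FlatUnion {
    tag: u8,
    int_case: Option<i32>,
    string_case: Option<String>,
}

// Option 2: one alternative per case. In Rust this is naturally an
// enum: the tag is implicit, and ill-formed values (two cases set
// at once, or none) cannot be constructed at all.
#[derive(Debug, PartialEq)]
enum SumUnion {
    IntCase(i32),
    StringCase(String),
}

// Converting from the sum representation to the flat one is total;
// the other direction would be partial (the tag and fields can
// disagree in a FlatUnion).
fn flatten(u: &SumUnion) -> FlatUnion {
    match u {
        SumUnion::IntCase(n) => FlatUnion {
            tag: 0,
            int_case: Some(*n),
            string_case: None,
        },
        SumUnion::StringCase(s) => FlatUnion {
            tag: 1,
            int_case: None,
            string_case: Some(s.clone()),
        },
    }
}

fn main() {
    let flat = flatten(&SumUnion::IntCase(7));
    assert_eq!(flat.tag, 0);
    assert_eq!(flat.int_case, Some(7));
}
```

For a target language without sum types (e.g. Go), the flat struct with a tag is the pragmatic choice; where the target has native sum types, the second representation rules out invalid states by construction.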