# Directory Hierarchy Suggestion
- `codec_generator`
  - `test` (with tests)
    - `rust.ml`: a script that uses `Codec_generator.Rust.generate encoding`, where `encoding` is defined above
    - `main.ml`: run it to execute the tests (`dune exec src/test/main.exe`)
      - it calls the `rust.exe` binary (compiled from `rust.ml`) to produce a Rust source file
      - then calls the Rust compiler on the produced Rust source file
      - then calls the compiled Rust binary to execute the generated code
  - `lib` or nothing or `src`
    - `rust`
      - `ast.ml` (with combinators that are often used)
      - `pretty.ml` (pretty-printing)
      - `generator.ml` (from `Simplified.t` to `Ast.t`)
    - `codec_generator.ml` if you want an entrypoint
    - `codec_generator.mli` (the doc for the lib)
    - `simplified.ml`
  - `rust_runtime` (in place of `rust/kernel`) or `runtime/rust`
    - some `.rs` files used by generated code
  - `demo_generated_code` or `generated_code_TO_REMOVE` (in place of `rust/gen`)
    - some `.rs` files generated by `test`
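The runtime `.rs` files could contain small helpers shared by all generated code. A hypothetical sketch (the `Reader` name and its API are assumptions, not the actual runtime):

```rust
use std::convert::TryInto;

/// Hypothetical runtime helper: a cursor over a byte slice with fallible
/// reads, which generated decoders could call into.
pub struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Reader { buf, pos: 0 }
    }

    /// Read a single byte, failing if the input is exhausted.
    pub fn read_u8(&mut self) -> Result<u8, String> {
        let b = *self.buf.get(self.pos).ok_or("unexpected end of input")?;
        self.pos += 1;
        Ok(b)
    }

    /// Read a big-endian u32 (data-encoding writes integers big-endian).
    pub fn read_u32(&mut self) -> Result<u32, String> {
        let bytes = self
            .buf
            .get(self.pos..self.pos + 4)
            .ok_or("unexpected end of input")?;
        self.pos += 4;
        Ok(u32::from_be_bytes(bytes.try_into().unwrap()))
    }
}
```

Keeping such helpers in the runtime keeps the generated files small: they only need to emit calls like `reader.read_u32()?`.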
In `ast.ml`, you can use combinators like:
```ocaml
let ( @-> ) a b = Arrow (a, b)
```
For formatting:
- Use `Format` directly.
- Try an existing library if it becomes too heavy
  - e.g., [PPrint](https://github.com/fpottier/pprint) ([documentation](http://cambium.inria.fr/~fpottier/pprint/doc/pprint/PPrint/index.html))
An alternative layout:
- `codec_generator`
  - `test` (with tests)
    - `rust.ml`: a script that uses `Codec_generator.Rust.generate encoding`, where `encoding` is defined above
    - `main.ml`: run it to execute the tests (`dune exec src/test/main.exe`)
      - it calls the `rust.exe` binary (compiled from `rust.ml`) to produce a Rust source file
      - then calls the Rust compiler on the produced Rust source file
      - then calls the compiled Rust binary to execute the generated code
  - `common` or nothing or `src`
    - `codec_generator.ml` if you want an entrypoint
    - `codec_generator.mli` (the doc for the lib)
    - `simplified.ml`
  - `rust`
    - `runtime`
    - `lib_generator`
    - `demo_generated_code`
# Codec Compiler (new)
## Goal
1. add code generation to `data_encoding` (the project) as a side library
2. use it for Tezos
It's actually a loop: do 1. for one backend (e.g. Rust), test it on Tezos (2.), then do 1. for another backend, test it on Tezos (2.), etc.
## Roadmap
### Checkpoint 0: choose target language
Probably TypeScript. Or Rust 1.52.0 (the version used by Octez).
### Checkpoint 1: one trivial case
- start with one test case, covering the first type you want to encode (it can be `unit`; maybe `int` makes more sense)
- write the intermediate language, with only one type at first (the one needed for that first test case)
- write the conversion function from `Data_encoding.t` to this intermediate language
  - `| _ -> assert false (* TODO *)` for the other cases of `Data_encoding.t` when converting to this intermediate language
- write (or pick an existing lib for) the AST of the target language (only what you actually need for now; it should be very small at the start)
- write the code generation from the intermediate language to the target language (for the small part you currently have)
- write the pretty-printer for the target language (for the small part you currently have)
- write the test framework for this: for each test case:
  - start with a JSON value (it's more readable in the code)
  - encode it with data-encoding
  - decode it in the target language (e.g. Rust)
  - encode it back (still in the target language)
  - check that the two binary values are equal
- ensure the test passes
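The target-language half of that loop could look like this sketch, where `decode`/`encode` stand in for generated Rust code and the payload is assumed to be a single big-endian `i32`:

```rust
use std::convert::TryInto;

// Stand-ins for generated code: here the value is one big-endian i32
// (data-encoding serializes integers in big-endian order).
fn decode(bytes: &[u8]) -> Option<i32> {
    let arr: [u8; 4] = bytes.try_into().ok()?;
    Some(i32::from_be_bytes(arr))
}

fn encode(v: i32) -> Vec<u8> {
    v.to_be_bytes().to_vec()
}

/// The property the harness checks: decoding the bytes produced by
/// data-encoding and re-encoding them must give back the same bytes.
fn round_trips(original: &[u8]) -> bool {
    match decode(original) {
        Some(v) => encode(v) == original,
        None => false,
    }
}
```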
### Checkpoint 1.5: run the tests in the CI
### Checkpoint 2: add all easy cases
- Primitives: Integers, booleans, (fixed-length) strings, etc.
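For instance, a fixed-length string is (as far as I understand `Fixed.string`, to be checked against `tezos-codec` output) written as-is, with no length prefix, since the length is part of the encoding itself. A sketch:

```rust
// Fixed-length string codec sketch: the length `n` comes from the
// encoding, not from the wire, so encoding is just the raw bytes.
fn encode_fixed_string(s: &str, n: usize) -> Result<Vec<u8>, String> {
    if s.len() != n {
        return Err(format!("expected {} bytes, got {}", n, s.len()));
    }
    Ok(s.as_bytes().to_vec())
}

/// Reads exactly `n` bytes and returns the string plus the remaining input.
fn decode_fixed_string(bytes: &[u8], n: usize) -> Result<(String, &[u8]), String> {
    if bytes.len() < n {
        return Err("unexpected end of input".to_string());
    }
    let (head, rest) = bytes.split_at(n);
    let s = String::from_utf8(head.to_vec()).map_err(|e| e.to_string())?;
    Ok((s, rest))
}
```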
### Checkpoint 3: add semi-easy cases
- Not quite primitives: enums
- Dynamically-sized: (dynamically-sized) strings
- Dynamically-sized with looping: lists, arrays, etc.
### Checkpoint 3.5: add semi-hard cases
- Variably-sized with looping: N, Z
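As I understand it, `N` is written as a base-128 varint: 7 value bits per byte, least significant group first, with the high bit set while more bytes follow (`Z` additionally stores the sign in the first byte). The exact byte layout should be checked against `tezos-codec`. A sketch over `u64` (the real type is arbitrary-precision):

```rust
// Base-128 varint encoding for N, sketched over u64.
fn encode_n(mut v: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (v & 0x7f) as u8; // low 7 bits
        v >>= 7;
        if v == 0 {
            out.push(byte); // last byte: high bit clear
            return out;
        }
        out.push(byte | 0x80); // more bytes follow: high bit set
    }
}

fn decode_n(bytes: &[u8]) -> Option<u64> {
    let mut v: u64 = 0;
    for (i, b) in bytes.iter().enumerate() {
        v |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some(v);
        }
    }
    None // input ended before a byte with the high bit cleared
}
```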
### Checkpoint 4: add harder cases
Product types (tuples and records), sum types (unions).
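A union could compile to something like the following: a tag byte selects the case, followed by that case's payload (data-encoding's default tag size is uint8, but tag sizes and values are set per encoding, so treat these as illustrative):

```rust
use std::convert::TryInto;

// Illustrative tag values and a fixed 4-byte big-endian payload per case.
#[derive(Debug, PartialEq)]
enum Shape {
    Circle(u32), // tag 0, payload: radius
    Square(u32), // tag 1, payload: side length
}

fn encode_shape(s: &Shape) -> Vec<u8> {
    let (tag, v) = match s {
        Shape::Circle(r) => (0u8, *r),
        Shape::Square(side) => (1u8, *side),
    };
    let mut out = vec![tag];
    out.extend_from_slice(&v.to_be_bytes());
    out
}

fn decode_shape(bytes: &[u8]) -> Option<Shape> {
    let (&tag, rest) = bytes.split_first()?;
    let v = u32::from_be_bytes(rest.try_into().ok()?);
    match tag {
        0 => Some(Shape::Circle(v)),
        1 => Some(Shape::Square(v)),
        _ => None, // unknown tag: decoding fails
    }
}
```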
### Checkpoint 5: recursive types
`Mu`.
### Checkpoint 6: test with Tezos test cases
These test cases can be found on-chain, for instance.
### Checkpoint 7: make sure it's very robust
PBT-testing to generate more test cases, at least for base types.
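A hand-rolled sketch of the idea: generate pseudo-random inputs and check a round-trip property on each. A real setup would use a PBT library (e.g. quickcheck or proptest on the Rust side, QCheck on the OCaml side); the tiny LCG here just keeps the sketch dependency-free, and the `i64` codec stands in for a generated base-type codec.

```rust
// Minimal linear congruential generator (Knuth's MMIX constants).
fn lcg_next(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state
}

/// Property: encoding then decoding an i64 (big-endian) is the identity.
fn roundtrip_holds_for(cases: u32) -> bool {
    let mut state = 42u64; // fixed seed, so failures are reproducible
    for _ in 0..cases {
        let v = lcg_next(&mut state) as i64;
        if i64::from_be_bytes(v.to_be_bytes()) != v {
            return false;
        }
    }
    true
}
```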

---
## Detailing Step 1 (Data-Encoding Part)
#### Set Up the Test "Framework"
This gives you a tight, fast feedback loop while you code.
#### Intermediate Language
Probably useful:
- to get rid of the `'a` parameter of `Data_encoding.t`
- and to simplify the ADT (remove some useless cases, keep only the binary part (for now), etc.)
  - but make sure to keep all the information needed to produce correct binaries
  - this is why testing against `tezos-codec` is important, and also why you want a tight feedback loop between steps 1 and 2
#### Test Cases
- have test cases (i.e. maybe binary values, maybe PBT-generated)
  - maybe use PBT only for base cases (ints in particular), if you want to
  - but you need (many) example cases written by hand for the rest
- start with very few test cases and add more as you add more features
- use the chain data (> 1M blocks with plenty of actual real-life data with a registered encoding)
## Peter's New Solution
- start from encoding registrations
  - but this uses the description, so it inherits the same bugs
  - RP: but we can actually get the full encoding from the ID here, instead of using the description
  - and we can change it to add what we need fairly easily
# Codec Compiler (old)
## C Encoding Branch
- Data-Encoding's GADT (branch `romain-cencoding`)
- Intermediate Representation
- C Backend, for binary encoding
- JS backend?
- Java backend?
- Rust backend?
- same for JSON encodings??
To generate the code:
- link with all encodings (just like the node)
  - this means the encodings have been `register`ed
  - so you can list them all and iterate over them to generate code for every encoding
## Peter's Solution
- `tezos-codec dump encodings`
- read the JSON output
- convert to an intermediate representation
- TypeScript: waiting for feedback => binary encoding
- started GoLang => binary encoding
- use unions??? classes??
- basically:
  - flatten everything
  - dump all fields together in a struct
  - maybe a tag field to distinguish
  - *or* a struct with one field per case
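The two layouts, sketched as Rust structs for concreteness (the note is about Go/TypeScript, which lack sum types; idiomatic Rust would just use an enum). Field names are illustrative:

```rust
// Layout 1: all fields of all cases flattened into one struct,
// with a tag saying which subset is meaningful.
#[derive(Default)]
pub struct FlatShape {
    pub tag: u8,     // 0 = circle, 1 = rectangle
    pub radius: u32, // meaningful when tag == 0
    pub width: u32,  // meaningful when tag == 1
    pub height: u32, // meaningful when tag == 1
}

// Layout 2: one optional field per case; exactly one should be set.
#[derive(Default)]
pub struct PerCaseShape {
    pub circle: Option<u32>,           // radius
    pub rectangle: Option<(u32, u32)>, // width, height
}
```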
## Differences
Two major differences:
- different intermediate representations
- start from `tezos-codec`
`tezos-codec` links with all encodings (like the node) and uses the `register` mechanism to list the encodings