
[toc]

## References

- [KR 2.1/DAL](https://hackmd.io/HLoGjZyKSeuNXxvbnSEcVQ?view#Key-results-for-DAL-6pw)
- [KR 2.3/P2P](https://hackmd.io/wOcQhCcCRNGWVFHUes4APQ)
- [P2P design](https://hackmd.io/JUHDNI69RSKiuPGwG7-5eQ)

## Objectives

- DAL and P2P without sampling
- Show 1M TPS in a non-adversarial setting on a test network

## Status Update

### General status

- 75% of the way to automatic end-to-end tests without P2P (rollup node + DAL node + L1)
- 75% of the way to a realistic P2P test with producers, endorsers and consumers exchanging massive amounts of data, plus analysis tools to assess the success of the experiment

### Current state (POC/DAL)

- Can post slot headers
- Can attest slot headers
- Part of the refutation code is merged
- The DAL node can use the cryptography: split slots / reconstruct slots
- The DAL node tracks the L1 and monitors the slot headers posted
- The rollup node can store slot headers and slots
- Integration of the cryptographic primitives into the Tezos library and the environment for the protocol (a protocol can change the constants of the design, e.g. size of a slot, number of shards, ...)
- Integration of the SRS into Tezos and the CI (in particular for long tests)
- Several integration tests (via Tezt) with partial end-to-end tests

### Major design change (with respect to May)

- Refutation game: remove the slot subscription from the DAL **and** the sequential ordering

### Current state (P2P)

Implemented functionality:

- Topic subscription
- Publishing with lazy-push on topics (without data validation step)

Implemented large-scale test infrastructure:

- Binary with support for behaviour descriptions on top of the P2P layer
- Infrastructure for collecting the logs of all nodes
- Live monitoring of some metrics of the experiment

#### Experiment

Ran an experiment with 656 nodes:

- 256 naive "slot producers" (shards are just dummy bytes)
- 400 "endorsers" downloading their shard data, with no change in shard assignment
- Shard size: ~512 B
- 5-6 shards per "endorser"
- 256 slots x 2048 shards
- 50 connections per node

256 MB / 30 s -> ~8 MB/s (64 Mb/s)

**Still processing data**, but a first estimate gives about $\frac{1}{4}$ of slots received within 30 s, on average over shards.

Hypothesis:

- Data loss during sending (need force connect)

Retrieving the experiments' logs is currently a bottleneck for experiment analysis.

## Next steps (POC/DAL)

### Short term (< 3 months)

- Refutation game complete (end of September)
- Expose L1 RPCs related to the DAL committee (MR waiting to be reviewed/merged)
- Ensure validity of slot headers (no date yet, not very hard)
- The DAL node should be able to stream slot headers via RPCs
- Export a convenient RPC from the DAL node for the endorser
- Communicate between two DAL nodes via RPCs
- The rollup node can push DAL pages to the PVM and fetch the slots from a DAL node
- Integrate the new FFT, which enables better performance by reducing the padding
- Integrate the DAL node with the endorser

### Mid term (< 6-9 months)

- Cryptographic primitives should be better tested and audited (internally and externally)
- Some design/reflection around sharing code with the other nodes
- Better automation for writing end-to-end tests
- Ensure scalability
- The fast node should support the DAL
- Plug the DAL node into the P2P layer (equivalent of the DDB for the `tezos-node`)

### Long term (< 1.5 years)

- Make the nodes resilient to various scenarios such as attacks, reboots, errors, disconnections, ...
- Handle protocol upgrades from the rollup nodes and the DAL nodes
- Detect corner cases/edge cases
- Prepare integration with the ecosystem
- External documentation

## Next steps (P2P)

### Short term (< 3 months)

- Fix the "slow logs retrieval" problem for experiments
- Experiment with up to $500$ connections per node
- Experiment with mainnet's shard distribution and committee changes at various rates, over long runs
- Implement:
  - "Force connection mechanism": smartly use current topics and force connections to some topics for the time needed to send data
  - A "topics of interest window" in the maintenance process, to focus on the most relevant topics (around the current level)

### Mid term (< 6-9 months)

- Better tooling to analyse/visualize test results
- Integration with the DAL node
- Reuse the test infrastructure with a full DAL node
- Last iteration on the prototype to handle bandwidth needs

### Long term (< 1.5 years)

- Going from prototype to product
- Topic discovery (for data retrieval)
- Ensure P2P layer security (bounded messages, peer scoring, ...)
- Integrate pubsub into tezos-node's lib_p2p, with backward compatibility
- Heavily test the code (unit/integration/large scale)
- Ensure maintainability of the code

## Difficulties

- Several team members are new to the code base
- Few resources (more focus on higher priority projects/SCORU)
- Summer vacation
- Employee leaving the company
- Employees on several tasks (helping higher priority projects)
- DAC

## Alternatives

### For the demo

- Kernel: SC/Tx rollup
- 1000 rollups
- Bandwidth of the DAL: 256 MiB/second. Number of slots to be adapted.

+ 6 months

### Model of the current design

- A cryptographic redundancy of `16` means that we need to **trust only** 20% of the stake (explanations are [here](https://hackmd.io/g8l2M47eR1eN2WNB-m_6Rg)).
  With a redundancy factor of `8`, it will be about 30%.
- P2P bandwidth with this model (factor of $16$) for an endorser with more than $\frac{100}{16}\,\%$ of the stake will be around $\frac{1 \times 8 \times 256}{30} \approx 68$ Mib/s (padding is not accounted for)
- P2P bandwidth for a slot producer will be around $\frac{1 \times 8 \times 16}{30} \approx 4.3$ Mib/s (padding is not accounted for)

### Release on mainnet

- Change the Tenderbake committee every `x` blocks?
  - Benefit: fewer switches for the P2P
- About sampling
  - Benefit: liveness/malicious bakers
  - Breaks pipelining
  - Design is still open
  - Its integration into the code base will take time

## Identified problems

- Which latency is allowed?
- Is it OK to assume that an endorser may run several DAL nodes?
- Is it OK if the design depends on archive DAL nodes?
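
The bandwidth figures quoted in the experiment description and in the model of the current design can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check, not part of any DAL code. It assumes that the leading `1` in the model formulas is a slot size of 1 MiB, that the `8` converts bytes to bits, and that 30 s is the time window in both cases; all function and variable names are hypothetical.

```python
MIB = 1024 * 1024  # bytes per MiB

# Experiment parameters taken from the status update.
SLOTS = 256
SHARDS_PER_SLOT = 2048
SHARD_SIZE_BYTES = 512   # "~512B"
ENDORSERS = 400
WINDOW_S = 30

# Total payload exchanged in one window: 256 slots x 2048 shards x 512 B.
total_bytes = SLOTS * SHARDS_PER_SLOT * SHARD_SIZE_BYTES
experiment_mib_per_s = total_bytes / MIB / WINDOW_S
experiment_mibit_per_s = experiment_mib_per_s * 8

# Average number of shards of a single slot assigned to one endorser.
shards_per_endorser = SHARDS_PER_SLOT / ENDORSERS


def endorser_bandwidth_mibit_s(slot_size_mib, n_slots, window_s):
    """Model: an endorser holding at least 1/redundancy of the stake
    downloads the equivalent of one full slot per slot index."""
    return slot_size_mib * 8 * n_slots / window_s


def producer_bandwidth_mibit_s(slot_size_mib, redundancy, window_s):
    """Model: a slot producer uploads its slot expanded by the
    redundancy factor."""
    return slot_size_mib * 8 * redundancy / window_s


# Assumption: 1 MiB slots, 256 slots, redundancy 16, 30 s window.
endorser_bw = endorser_bandwidth_mibit_s(1, 256, WINDOW_S)
producer_bw = producer_bandwidth_mibit_s(1, 16, WINDOW_S)

print(f"experiment payload: {total_bytes / MIB:.0f} MiB per window")
print(f"experiment rate:    {experiment_mib_per_s:.2f} MiB/s "
      f"({experiment_mibit_per_s:.2f} Mib/s)")
print(f"shards per endorser and slot: {shards_per_endorser:.2f}")
print(f"model endorser bandwidth: {endorser_bw:.2f} Mib/s")
print(f"model producer bandwidth: {producer_bw:.2f} Mib/s")
```

Under these assumptions the experiment payload is exactly 256 MiB per 30 s window, so the experiment and the endorser model describe the same aggregate rate. Padding is ignored here, as in the formulas above.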