# Forest with Narwhal+Tusk/Bullshark
This document contains notes and observations in preparation for using [Narwhal](https://arxiv.org/pdf/2105.11827.pdf) and [Bullshark](https://arxiv.org/abs/2201.05677) as a custom consensus in [Forest](https://github.com/ChainSafe/forest), similar to how [Delegated Consensus](https://github.com/ChainSafe/forest/tree/main/blockchain/consensus/deleg_cns) was added. In particular we are looking at how to integrate the [Mystenlabs Narwhal](https://github.com/MystenLabs/narwhal) implementation, rather than starting from scratch.
## Narwhal + Bullshark Overview
The basic premise of Narwhal is that it is a DAG-based atomic broadcast mechanism that can replace the mempool to radically increase throughput. Each node can have multiple workers ingesting transactions, batching them, and broadcasting the batches to other peers on the network. The atomicity comes from the fact that a worker not only broadcasts batches, it also collects a quorum of signatures from the receivers, forming availability certificates that show the transactions are available on a supermajority of nodes.
Based on the local view of the DAG of transaction batch certificates, Tusk or Bullshark can establish a total order of transactions without further communication.
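The quorum arithmetic behind those availability certificates is the standard BFT supermajority. As a quick illustration (these helpers are ours, not taken from the Narwhal codebase):

```rust
// Quorum sizes for a BFT committee of `n = 3f + 1` nodes. The 2f + 1
// threshold is the supermajority Narwhal collects signatures from before
// forming an availability certificate; the functions below are illustrative.

/// Largest number of Byzantine faults `f` tolerated by a committee of size `n`.
fn max_faults(n: usize) -> usize {
    (n - 1) / 3
}

/// Number of signatures needed on a batch before an availability
/// certificate can be formed.
fn quorum_threshold(n: usize) -> usize {
    2 * max_faults(n) + 1
}
```

For the four-node example setup below, this means one tolerated fault and a quorum of three.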
## Impact on Forest
There are multiple ways to integrate this into Forest and Filecoin. So far we have tried to preserve the basic notion of a Filecoin chain structure by constructing a chain that adheres to the longest chain rule, albeit with custom weights and validation rules.
Now, however, the transactions are ordered outside the system, which also ensures they are eventually available on every node, so a lot of Forest's basic functionality becomes redundant. For example, the mempool gossip and block sync mechanisms can be completely bypassed.
Some proposals on what the end result could look like are discussed [here](https://github.com/protocol/ConsensusLab/discussions/165).
## Mystenlabs Narwhal Overview
The primary use case for the Mystenlabs implementation is currently benchmarking, but it is being developed with production in mind.
It has a single `node` [executable](https://github.com/MystenLabs/narwhal/blob/v0.1.1/node/src/main.rs) which can be `run` either as `primary` or `worker`. Each validator has to run one primary and at least one worker process.
### Services
The nodes use multiple gRPC services to communicate with each other. The interfaces are described in [narwhal.proto](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/types/proto/narwhal.proto).
The following communication channels/services exist:
- `PrimaryToPrimary`: Remote P2P messaging between primaries of different nodes.
- `PrimaryToWorker`: Local messaging between the primary and its workers.
- `WorkerToPrimary`: Local messaging between the worker and its primary.
- `WorkerToWorker`: Remote P2P messaging between workers of different nodes, for transaction batch dissemination.
- `Transaction`: Workers run this service to ingest transactions from clients.
The message payloads are not visible in `narwhal.proto`; they can be found [here](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/types/src/primary.rs#L634).
There is an example [docker setup](https://github.com/MystenLabs/narwhal/tree/bfce59743da7b7631fef7d69a097728452225c59/Docker) with four nodes. It includes a [committee.json](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/Docker/validators/committee.json) file that contains all the different gRPC service addresses under the public key of each validator.
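To give a feel for it, the committee file has roughly the following shape (heavily abbreviated and with placeholder values; check the linked file for the exact field names and address format):

```json
{
  "epoch": 0,
  "authorities": {
    "<validator-public-key>": {
      "stake": 1,
      "primary": {
        "primary_to_primary": "/dns/primary_0/tcp/3000/http",
        "worker_to_primary": "/dns/primary_0/tcp/3001/http"
      },
      "workers": {
        "0": {
          "primary_to_worker": "/dns/worker_0/tcp/3002/http",
          "transactions": "/dns/worker_0/tcp/3003/http",
          "worker_to_worker": "/dns/worker_0/tcp/3004/http"
        }
      }
    }
  }
}
```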
The primary can either run Bullshark as an internal consensus mechanism, or if that's disabled then [start](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/primary/src/primary.rs#L413-L426) a set of consensus related [gRPC services](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/primary/src/grpc_server/mod.rs#L95-L97):
- `Validator`: This is where the validator can query transaction batches and their causal history, and tell the service when it can remove old batches that it has consumed.
- `Proposer`: Not clear what it does.
- `Configuration`: Allows the validator to initiate epoch switches by submitting a new committee composition.
**TODO:** It's not clear how we are supposed to initiate committee changes when internal consensus is used. Reconfiguration notifications can go from worker to primary and vice versa, but I can't find any way to enter the information other than the `Configuration` service.
> As pointed out by Alberto, we can use the [NodeRestarter](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/node/src/restarter.rs#L17) as demonstrated in a [reconfiguration test](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/node/tests/reconfigure.rs#L173-L195), where the custom execution state implementation [triggers the epoch transitions](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/node/tests/reconfigure.rs#L61-L81).
### Execution
The node also starts an [executor](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/node/src/lib.rs#L286-L294), which for benchmarking is just a [dummy one](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/node/src/main.rs#L170), but it is something that can be [passed](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/node/src/lib.rs#L121) to the primary constructor and will receive the notifications about committed batches it can execute. We could use this if we use the `node` as a library, rather than a binary.
The [ExecutionState](https://github.com/MystenLabs/narwhal/blob/bfce59743da7b7631fef7d69a097728452225c59/executor/src/lib.rs#L57) we have to implement is also responsible for remembering the `ExecutionIndices` for crash recovery.
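To make the shape of that integration concrete, here is a minimal sketch of the Forest side. The trait below is a simplified, synchronous stand-in for Narwhal's `ExecutionState` (the real trait is async and has more methods); the `ExecutionIndices` fields mirror the ones in the executor crate, but everything else is illustrative:

```rust
// Simplified, synchronous stand-in for Narwhal's `ExecutionState`; only the
// shape of the interaction is shown, not the real async trait.

#[derive(Clone, Default, Debug, PartialEq)]
pub struct ExecutionIndices {
    pub next_certificate_index: u64,
    pub next_batch_index: u64,
    pub next_transaction_index: u64,
}

pub trait SimpleExecutionState {
    /// Receives every committed transaction, in total order.
    fn handle_consensus_transaction(&mut self, indices: ExecutionIndices, tx: Vec<u8>);
    /// Called on startup so the executor can resume where it left off.
    fn load_execution_indices(&self) -> ExecutionIndices;
}

/// Queues committed transactions until it is time to form a Filecoin block,
/// remembering the indices for crash recovery.
pub struct ForestState {
    pub pending: Vec<Vec<u8>>,
    pub last_indices: ExecutionIndices,
}

impl SimpleExecutionState for ForestState {
    fn handle_consensus_transaction(&mut self, indices: ExecutionIndices, tx: Vec<u8>) {
        self.pending.push(tx);
        // In a real implementation the indices would be persisted atomically
        // with any side effects, so a crash cannot replay transactions.
        self.last_indices = indices;
    }

    fn load_execution_indices(&self) -> ExecutionIndices {
        self.last_indices.clone()
    }
}
```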
# Integration Options
The code is under active development, so we want to do as few modifications to it as possible. It would be nice to use Bullshark, that is, to run internal consensus and just receive the stream of committed transactions, rather than to use the Consensus API to drive it from the outside.
To start, we can add the project as a [dependency on GitHub](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-git-repositories) (because it's not published to crates.io), and do our version of the setup that currently happens in `main.rs` as an initialisation of our custom consensus. If we have to modify something we can fork the project.
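The dependency declaration would look something like the following; the `package` names mirror the workspace members in the Narwhal repository, but they should be checked against the actual manifests before use:

```toml
# Hypothetical Cargo.toml entries, pinned to a specific commit.
[dependencies]
narwhal-node = { git = "https://github.com/MystenLabs/narwhal", rev = "bfce59743da7b7631fef7d69a097728452225c59", package = "node" }
narwhal-executor = { git = "https://github.com/MystenLabs/narwhal", rev = "bfce59743da7b7631fef7d69a097728452225c59", package = "executor" }
```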
## Spawning
It seems non-trivial to abstract out the network from the components, to use anything other than gRPC, so we should just let it do its thing without trying to come up with any adapters that go over Libp2p.
We can start by spawning one `Primary` and one `Worker` task. We need to check whether they are okay to run on the same machine or are intended to be on different ones, because they could end up using the same database store path, which could be problematic.
> The answer is yes, as the `NodeRestarter` demonstrates, the `Primary` and the `Worker` can share the storage instance.
It also doesn't seem straightforward to inject a different database implementation, for example to use the node's IPLD store to cut down on duplication, by making transactions added to the mempool readily available to Narwhal through a shared store.
## Interfaces
Here we can either use gRPC (we'd have to generate our own client) or tweak the code to enable in-process communication.
### Execute Transactions
We'd implement the `ExecutionState` to receive committed transactions that we can put into a pending queue until the time is right to form a Filecoin block.
Unfortunately this is not quite good enough for all our needs: ideally we want to use the data stream to receive full batches, including empty ones, which would indicate the passage of time. This would come in handy for recognising when it's time to create a partially filled or empty Filecoin block. While Narwhal produces empty transaction batches (it needs 2f+1 blocks per round to form certificates), we don't get these; the executor [only passes single transactions](https://github.com/MystenLabs/narwhal/blob/main/executor/src/core.rs#L238), which means we are told nothing about empty batches.
This is a problem because in order to generate blocks deterministically, everyone has to agree on when to generate partially filled blocks (that is, generate a block even if it hasn't hit any gas/size limits, rather than wait indefinitely for more transactions to arrive).
We will probably have to modify the `ExecutionState` interface in some way to let us know about the beginning and the end of the batches. This would be similar to the [Tendermint ABCI](https://docs.tendermint.com/master/spec/abci/abci.html#beginblock) calls `BeginBlock` and `EndBlock`, with calls to `DeliverTx` between them; perhaps it's worth noting, though, that ABCI++ will pass all transactions at once.
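A minimal sketch of what such a batch-aware interface could look like, loosely following the ABCI naming; none of this exists in Narwhal today, and the consumer shown is just one way Forest could use the boundary events:

```rust
// Hypothetical extension of the executor interface with batch boundary
// callbacks, loosely modelled on BeginBlock/DeliverTx/EndBlock.

pub trait BatchAwareExecutionState {
    /// Called when the executor starts delivering a committed batch,
    /// even if the batch turns out to be empty.
    fn begin_batch(&mut self, author: Vec<u8>);
    /// Called once per transaction inside the batch.
    fn deliver_transaction(&mut self, tx: Vec<u8>);
    /// Called after the last transaction. Empty batches still produce a
    /// begin/end pair, which is what signals the passage of time.
    fn end_batch(&mut self);
}

/// Counts empty batches so Forest can decide when to cut a partially
/// filled (or empty) Filecoin block.
pub struct BlockClock {
    pub empty_batches_seen: u32,
    pub pending: Vec<Vec<u8>>,
    pub in_flight: usize,
}

impl BatchAwareExecutionState for BlockClock {
    fn begin_batch(&mut self, _author: Vec<u8>) {
        self.in_flight = 0;
    }
    fn deliver_transaction(&mut self, tx: Vec<u8>) {
        self.in_flight += 1;
        self.pending.push(tx);
    }
    fn end_batch(&mut self) {
        if self.in_flight == 0 {
            self.empty_batches_seen += 1;
        }
    }
}
```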
I will have to understand more of the expectations regarding the `ExecutionIndices` which are passed to `handle_consensus_transaction` and returned by `load_execution_indices`; whether we are expected to persist each of them as is, or can do so at the end of each batch for example.
### Submit Transactions
There seems to be no in-process way to pass transactions to the worker; the only way is through the [Transaction service](https://github.com/MystenLabs/narwhal/blob/main/worker/src/worker.rs#L386-L417) using gRPC. It wouldn't be too invasive to add an extra parameter: a channel we can use internally to send data.
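The extra parameter could be the receiving end of a channel, drained by the worker's batch maker alongside the transactions arriving over gRPC. A sketch, using `std::sync::mpsc` as a stand-in for the tokio channels Narwhal actually uses (the wiring into the worker constructor is hypothetical):

```rust
use std::sync::mpsc;

pub type Transaction = Vec<u8>;

/// Creates the channel pair: Forest's mempool side keeps the sender and
/// submits serialized messages on it; the worker would take the receiver
/// as a (hypothetical) extra constructor parameter.
pub fn transaction_channel() -> (mpsc::Sender<Transaction>, mpsc::Receiver<Transaction>) {
    mpsc::channel()
}
```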
### Epochs and Committees
As I mentioned, we have to investigate how to send committee changes when Bullshark is in use. But epochs also require a lot of development on the Filecoin side to implement on the ledger, so maybe the initial version can start with a static committee and no support for evolving stake. The subnet using this consensus can just ship a version of the `committee.json` file, which can stay static for the duration of the demo.
### Keys and Rewards
Narwhal [currently uses](https://github.com/MystenLabs/narwhal/blob/main/crypto/src/lib.rs#L69) Ed25519 keys, whereas Filecoin miners [have to use](https://spec.filecoin.io/#section-systems.filecoin_nodes.repository.key_store) BLS keys. The other supported key type is Secp256k1, but I'm not sure whether miners can use it.
If the idea is that blocks will be independently and deterministically generated on each Filecoin node, then they can't be signed, and rewards have to be attributed to validators/miners based on other rules. For example a [merit scheme](https://hackmd.io/P59lk4hnSBKN5ki5OblSFg#Merit-block-reward) was suggested to give rewards to whoever batched the transactions.
This will need either:
* some translation between the committee's Ed25519 keys and the BLS keys, or
* changing the keys in the Narwhal codebase to BLS or possibly Secp256k1.
While we have a static committee, the former option seems perfectly fine.
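With a static committee, the translation could be as simple as a lookup table loaded from the same configuration as `committee.json`. A sketch, where all type names are illustrative:

```rust
use std::collections::HashMap;

// Static key translation for the first option: alongside each Narwhal
// Ed25519 committee key we store the Filecoin address that rewards should
// be attributed to. Types and names are illustrative.

pub type Ed25519PubKey = [u8; 32];
pub type FilecoinAddress = String; // e.g. a BLS "f3..." address

pub struct KeyTranslation {
    table: HashMap<Ed25519PubKey, FilecoinAddress>,
}

impl KeyTranslation {
    /// Builds the table, e.g. from the committee configuration.
    pub fn new(entries: Vec<(Ed25519PubKey, FilecoinAddress)>) -> Self {
        Self { table: entries.into_iter().collect() }
    }

    /// Looks up the miner address for a committee member.
    pub fn miner_for(&self, key: &Ed25519PubKey) -> Option<&FilecoinAddress> {
        self.table.get(key)
    }
}
```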
The tricky part is tracking who signed the batches. Only individual transactions are passed to the executor, so we can:
* wrap transactions with provenance information, although it's questionable whether we can trust other validators not to change this during batching; or
* as discussed add begin/end batch callbacks where we can add some kind of actual signatures to prove who the batch is coming from.
The next problem is how to propagate the information in the Filecoin block to the execution engine, so that it can send some reward message to the actors in FVM. We should not modify/wrap the message payloads, because the execution engine won't know about them, but what we could do is insert synthetic reward transactions into the Filecoin blocks at the end of each batch.
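Block assembly would then look something like the following sketch; `BlockEntry` and the reward shape are placeholders, and the real reward would be a message targeting the reward actor in the FVM:

```rust
// Assembling Filecoin block contents with a synthetic reward message
// appended after each batch. Types are placeholders.

#[derive(Debug, PartialEq)]
pub enum BlockEntry {
    /// An ordinary user message delivered by the executor.
    UserMessage(Vec<u8>),
    /// Synthetic message crediting whoever batched the preceding transactions.
    Reward { batcher: String },
}

/// Appends a committed batch to the block under construction, followed by
/// the synthetic reward entry for its batcher.
pub fn append_batch(block: &mut Vec<BlockEntry>, batcher: &str, batch: Vec<Vec<u8>>) {
    for tx in batch {
        block.push(BlockEntry::UserMessage(tx));
    }
    block.push(BlockEntry::Reward { batcher: batcher.to_string() });
}
```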
This isn't exactly what we want, though, because the reward should depend on the gas spent, which is only known at the end of the block. Maybe some modification of the Filecoin blocks is inevitable here.
Or maybe blocks should be homogeneous in terms of who batched them, although this could lead to quite a lot of variation in block size. Then we could just use the batcher as the miner ID, and the single final gas reward message would be trivial to attribute to the validator, because we wouldn't have to worry about how much gas came from which exact transaction.