
ERC-4337 P2P Spec Feedback

Edit Date: 2023/10/12

Spec Commit: 2351f17

The following is a collection of feedback for the ERC-4337 P2P specification. I'll be adding to this as I learn more.

Discovery

There is currently no described mechanism for a node to find peers that support a particular mempool. The discovery process should provide at least some efficiency mechanisms to improve this. mempool_nets is undefined in the spec beyond "mempool subnet subscriptions", which is not given any meaning. There is a partial definition of how this looks in the ENR here.

Status Quo

A reasonable goal for a node is to try to maintain N peers per mempool to improve latency for receiving new UOs. To do this with the current protocol:

  1. Discover any peer (no way to filter)
  2. Dial that peer
  3. Request status/metadata
  4. Determine whether mempools match; either keep the peer or drop it.

Proposal

We are limited to 300 bytes of information that we can pack into an ENR. Thus, we cannot put each supported mempool ID into the ENR directly.

Instead, we can define a "mempool subnet" as a mempool bucket. A mempool is assigned to a bucket by: mempool_id % MEMPOOL_ID_SUBNET_COUNT. The ENR structure is:

Key: mempool_subnets
Value: SSZ Bitvector[MEMPOOL_ID_SUBNET_COUNT]

Where MEMPOOL_ID_SUBNET_COUNT = 1024 (a 128-byte SSZ bitvector).
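For concreteness, a minimal sketch of the bucket assignment and bitvector construction described above (function names are illustrative; mempool IDs are assumed to be 32-byte values interpreted as big-endian integers):

MEMPOOL_ID_SUBNET_COUNT = 1024

def mempool_subnet(mempool_id: bytes) -> int:
    # Bucket assignment: interpret the mempool ID as an integer and take it mod the subnet count.
    return int.from_bytes(mempool_id, "big") % MEMPOOL_ID_SUBNET_COUNT

def build_subnet_bitvector(supported_mempools: list) -> bytearray:
    # 1024 bits -> 128 bytes, matching the ENR size noted above.
    bits = bytearray(MEMPOOL_ID_SUBNET_COUNT // 8)
    for mempool_id in supported_mempools:
        subnet = mempool_subnet(mempool_id)
        bits[subnet // 8] |= 1 << (subnet % 8)
    return bits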

Note: A much more efficient structure would be a "sparse" bitvector; nothing says we have to use SSZ here. We could store only the indexes and values of the non-zero bytes in the bitvector. If MEMPOOL_ID_SUBNET_COUNT >> MAX_SUPPORTED_MEMPOOLS this should compress very nicely. This is worth exploring further, as increasing MEMPOOL_ID_SUBNET_COUNT reduces the chance of a false positive: dialing a peer that advertises the subnet but does not support the particular mempool.
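As a rough illustration of the sparse idea (the encoding below is just an example, not a proposed wire format), only the non-zero bytes of the bitvector are stored as (index, value) pairs:

def sparse_encode(bits: bytearray) -> bytes:
    # Store only (byte_index, byte_value) pairs for non-zero bytes. With
    # MEMPOOL_ID_SUBNET_COUNT >> MAX_SUPPORTED_MEMPOOLS most bytes are zero,
    # so this is far smaller than the full 128-byte bitvector.
    out = bytearray()
    for index, value in enumerate(bits):
        if value:
            out += index.to_bytes(2, "big") + bytes([value])
    return bytes(out)

def sparse_decode(data: bytes, length: int = 128) -> bytearray:
    # Rebuild the dense 1024-bit (128-byte) bitvector.
    bits = bytearray(length)
    for i in range(0, len(data), 3):
        index = int.from_bytes(data[i:i + 2], "big")
        bits[index] = data[i + 2]
    return bits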

Now to discover a peer that supports the mempool ID:

  1. Map a mempool ID to its subnet
  2. Discover peers, filtering for an ENR that advertises the subnet
  3. Dial peer
  4. Request status/metadata
  5. Determine whether mempools match; either keep the peer or drop it.

This should reduce the amount of dialing needed to find a peer that supports a particular mempool by roughly a factor of MEMPOOL_ID_SUBNET_COUNT (assuming mempools are distributed evenly across subnets).
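A sketch of what the filtered discovery loop could look like, reusing mempool_subnet from the sketch above (the discovery calls and ENR accessors are hypothetical placeholders, not a real discv5 API):

def peer_supports_subnet(mempool_subnets: bytes, subnet: int) -> bool:
    # Check the subnet's bit in the peer's advertised bitvector.
    return bool(mempool_subnets[subnet // 8] & (1 << (subnet % 8)))

def find_peer_for_mempool(discovery, mempool_id: bytes):
    target_subnet = mempool_subnet(mempool_id)       # step 1
    for enr in discovery.random_nodes():             # step 2 (hypothetical iterator)
        subnets = enr.get("mempool_subnets")         # hypothetical ENR accessor
        if subnets and peer_supports_subnet(subnets, target_subnet):
            return enr                               # steps 3-5: dial and confirm via status/metadata
    return None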

Request/Response

status/metadata

status should be reserved for more dynamic data, while metadata should be used for more static data. metadata should only be requested upon the initial handshake, and when a ping notices that the metadata seq_number has changed.

We should move supported_mempools to metadata and add syncing data to status.

status:

(
    block_hash: uint256
    block_number: uint256
)

block_hash: Hash of the last processed block.
block_number: Number of the last processed block.

Rationale: status can be used by nodes to determine the syncing progress of a peer. Peers need to keep their mempools up to date or they risk sending invalid user operations. By adding these fields to status, a node can determine when a peer is unhealthy.

This better tracks how both devp2p and the consensus spec define their status endpoints.
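As an illustration, a node could gate peer health on these fields roughly as follows (MAX_BLOCK_LAG is an assumed threshold, not part of the spec):

MAX_BLOCK_LAG = 5  # assumed tolerance, in blocks

def peer_is_healthy(peer_status, local_block_number: int) -> bool:
    # A peer far behind the local head is likely gossiping stale,
    # potentially invalid user operations.
    return local_block_number - peer_status.block_number <= MAX_BLOCK_LAG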

metadata:

(
  seq_number: uint64
  mempool_nets: Bitvector[MEMPOOL_ID_SUBNET_COUNT]
  supported_mempools: List[Bytes32, MAX_SUPPORTED_MEMPOOLS]
)

seq_number: metadata sequence number.
mempool_nets: Mempool subnets as defined above.
supported_mempools: List of supported mempool IDs.

Rationale: Changing the list of supported mempools should be a relatively rare occurrence and should require special processing by the node, including incrementing its metadata sequence number.
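A sketch of the ping-driven refresh, assuming hypothetical request helpers on the peer object:

async def on_ping_response(peer, remote_seq_number: int):
    cached = peer.metadata
    if cached is None or remote_seq_number > cached.seq_number:
        # Only re-request metadata when the peer signals a change,
        # e.g. its supported_mempools list was updated.
        peer.metadata = await peer.request_metadata()  # hypothetical helper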

Current implementation

If we keep the current implementation, we should change the framing for the status endpoint: its request/response should be sent in a single chunk, rather than:

Responses that consist of a single SSZ-list send each list item as a response_chunk.

pooled_user_ops_hashes

Suggestion: Drop the s => pooled_user_op_hashes

Drop mempool from the request

Peers have already negotiated their common mempools during connection setup. Peers should only need to send a single pooled_user_op_hashes request and should receive hashes for each mempool in common. This will limit the number of requests a node needs to send.

It is not important which mempool a UO is associated with. The peer needs to re-validate the UO anyway, and can determine its mempool association at that point.

New request:

(
  offset: uint64
)

offset: Offset into a list of UO hashes. This list should be materialized once by the responding peer and should time out.

Offset list timeout

The offset field only has meaning in the context of a previously sent pooled_user_op_hashes request. We should define it as follows (a responder-side sketch is given after the list):

  1. When offset = 0 a node should retrieve a list of UO hashes for shared mempools, up to a limit (maybe MAX_OPS_PER_REQUEST * 8). If this list is longer than MAX_OPS_PER_REQUEST, it should persist the portion that it hasn't yet sent.
  2. When offset != 0 the node should return UO hashes from the above list, skipping any already sent.
  3. After N seconds a node should time out this list and any request with offset != 0 should be treated as invalid. Suggestion: N = 1.
  4. Any request with offset = 0 causes a new list retrieval and timer restart.
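A minimal responder-side sketch of the above, assuming hypothetical helpers (collect_shared_mempool_hashes, ProtocolViolation) and a placeholder value for MAX_OPS_PER_REQUEST:

import time

OFFSET_LIST_TTL = 1.0        # seconds; the suggested N above
MAX_OPS_PER_REQUEST = 4096   # placeholder; the spec defines the actual constant

def handle_pooled_user_op_hashes(peer, offset: int):
    now = time.monotonic()
    if offset == 0:
        # (1)/(4): materialize a fresh list and restart the timer.
        peer.hash_list = collect_shared_mempool_hashes(limit=MAX_OPS_PER_REQUEST * 8)
        peer.hash_list_created = now
    elif peer.hash_list is None or now - peer.hash_list_created > OFFSET_LIST_TTL:
        # (3): reject offsets against an expired or missing list.
        raise ProtocolViolation("offset refers to an expired hash list")
    # (2): return the next chunk, skipping hashes already sent.
    return peer.hash_list[offset:offset + MAX_OPS_PER_REQUEST]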

pooled_user_ops_by_hash

The size of this response is currently unbounded. We need to cap the maximum size of the response and mark it as a protocol violation if a sender sends more than this size.

Suggested cap: 10MB
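Receiver-side enforcement could look roughly like this (peer.penalize and ProtocolViolation are assumed helpers):

MAX_POOLED_OPS_RESPONSE_BYTES = 10 * 1024 * 1024  # suggested 10MB cap

def check_response_size(response_bytes: bytes, peer):
    # Treat an oversized response as a protocol violation and penalize the sender.
    if len(response_bytes) > MAX_POOLED_OPS_RESPONSE_BYTES:
        peer.penalize(reason="pooled_user_ops_by_hash response over size cap")  # assumed helper
        raise ProtocolViolation("response exceeds 10MB cap")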

Gossip

user_ops_with_entry_point

As defined, user_ops_with_entry_point will introduce significant inefficiency into the gossip network.

Libp2p gossipsub deduplicates messages by message ID. Nodes gossip IHAVE messages to notify peers of the message IDs that they currently have for a topic. Peers then check their local cache of message IDs, and if they have not yet received a message for a given message ID they issue an IWANT message to the peer that gossiped it.

The current spec defines message IDs as:

SHA256(MESSAGE_DOMAIN_VALID_SNAPPY + snappy_decompress(message.data))[:20]

and user_ops_with_entry_point as:

class UserOperationsWithEntryPoint(Container):
    entry_point_contract: Address
    verified_at_block_hash: uint256
    chain_id: uint256
    user_operations: List[UserOp, MAX_OPS_PER_REQUEST]

If two messages have a different list of user ops, they will have different message IDs. Peers will be required to issue IWANT messages for these full lists, and then do the UserOp deduplication internally. This will massively increase network utilization.

Suggestion

The gossip message for user operations should be redefined as verified_user_operation.

class VerifiedUserOperation(Container):
    verified_at_block_hash: uint256
    user_operation: UserOperation

Note: Even this adds inefficiencies as there may be multiple messages for a given UO that was verified at different block hashes. Ideally a UO should only be verified at 1-2 different blocks so this shouldn't add too much overhead to the network.

Gossipsub internally batches message IDs into IHAVE and IWANT messages, so there is no need to do batching in this message itself.

entry_point_contract and chain_id are redundant pieces of information. entry_point_contract should be implied from the mempool definition. chain_id should be implied from the network itself. If we feel we need chain_id somewhere we could add it to the ENR or to the metadata.

Mempool Topic Deduplication

The libp2p Gossipsub specification contains the following.

In all pubsub implementations, we can first check the seen cache before forwarding messages to avoid wastefully republishing the same message multiple times.

This seen cache is shared across topics and caches message IDs.

If a caller attempts to publish the same message ID on two different topics, the 2nd attempt will be deduplicated and fail.

Rust implementation

Go implementation

The definition of message ID above doesn't incorporate the topic string it's being sent on; thus, if the same message is sent to two mempool topics, the second publish will be deduplicated and will not reach subscribers.

If a node has a UO that matches two separate mempools it can only send that UO to the subscribers of a single pool.

More recent definitions of the consensus spec include the message topic in the message ID. See here.
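For reference, a consensus-spec style computation looks roughly like the following; mixing the topic (and its length) into the hash gives the same payload a distinct ID per topic. The domain value here is the one suggested under Misc below:

import hashlib

MESSAGE_DOMAIN_VALID_SNAPPY = bytes.fromhex("0100000000")  # value suggested under Misc below

def message_id_with_topic(topic: str, decompressed_data: bytes) -> bytes:
    # Include the topic length and topic bytes ahead of the message data.
    topic_bytes = topic.encode()
    return hashlib.sha256(
        MESSAGE_DOMAIN_VALID_SNAPPY
        + len(topic_bytes).to_bytes(8, "little")
        + topic_bytes
        + decompressed_data
    ).digest()[:20]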

Problem

If we incorporate the topic into the message ID, peers lose the ability to deduplicate their IWANT messages across topics. Each UO will be requested by each peer once for every mempool that it matches and the peer supports.

Options

  1. The current message ID definition should change to incorporate the topic ID. This will at least ensure that a UO is sent to each mempool it matches on. The downside here will be network inefficiencies.
  2. A small change to Gossipsub would remove this issue: split the seen cache into a read-side and a write-side cache. The read-side cache can be cross-topic, while the write-side cache should be per topic. This would ensure that a message ID is broadcast on all topics it applies to, while a peer knows to only read that message once (see the sketch after this list).

I asked a question about this on the libp2p forum. It would be helpful to talk to the maintainers.

  3. This is a more radical suggestion, but we could switch to an entirely different pubsub implementation, or just use the request/response domain. For now, I'm much more inclined to go with option (1) and push for the change described in (2) before considering this.
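A minimal sketch of the cache split described in option (2) (illustrative only; not the actual data structures used by the Rust or Go implementations):

class SplitSeenCache:
    def __init__(self):
        self.read_seen = set()    # message IDs already delivered to the application (cross-topic)
        self.write_seen = {}      # per-topic sets of message IDs already forwarded

    def should_deliver(self, msg_id: bytes) -> bool:
        # Deliver a given message to the application at most once, across all topics.
        if msg_id in self.read_seen:
            return False
        self.read_seen.add(msg_id)
        return True

    def should_forward(self, topic: str, msg_id: bytes) -> bool:
        # Forward the message on every topic it was published to, but only once per topic.
        seen = self.write_seen.setdefault(topic, set())
        if msg_id in seen:
            return False
        seen.add(msg_id)
        return True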

Misc

MESSAGE_DOMAIN_VALID_SNAPPY is undefined.

Suggestion: 0x0100000000

MESSAGE_DOMAIN_INVALID_SNAPPY is undefined.

Suggestion: 0x0000000000