# ZkitterDB - P2P Storage Layer for Zkitter (Waku + Libp2p)
(note: ZkitterDB might not be the final name)
This document describes the architecture of Zkitter's GunDB service rewrite using Waku and Libp2p. Note that this will not replace the dependence on a SQL database for indexing purposes. However, it will allow nodes and clients to share a library for pub/sub on the same p2p network, and to synchronize historical data by user/group name or by specific post id. Relationships (i.e. threads, like counts, follower counts, search, etc.) are still stored in a SQL database and served over REST.
## Overview
The goal for this data layer is to accomplish the following:
1. Allow users to publish and subscribe to new messages (waku)
2. Allow users to retrieve historical messages (libp2p)
## Using Waku as Pub/Sub
Waku is a peer-to-peer pub/sub network that allows different dapps to publish arbitrary messages in real time under a dapp-specific content topic.
The format for Waku content topic is as follows:
`/{dapp-name}/{version}/{content-topic-name}/{encoding}`
For Zkitter, we will publish one unique topic per user, plus one topic per group for anonymous posting.
For a user identified by their Ethereum address, we will publish messages to:
`/zkitter_u/1/0x1234...6789/proto`
For all anonymous posts without a specific creator, we will publish messages to a specific group id:
`/zkitter_g/1/interep_github_gold/proto`
`/zkitter_g/1/taz/proto`
Using a unique topic for each user and group allows clients to subscribe to only the users and groups they are interested in.
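The topic scheme above can be captured in two small helpers. The function names are illustrative, not part of any existing Zkitter API:

```js
// Build Waku content topics following /{dapp-name}/{version}/{content-topic-name}/{encoding}.
// Helper names are illustrative assumptions, not an existing API.
const userTopic = (address, version = 1) =>
  `/zkitter_u/${version}/${address}/proto`;

const groupTopic = (groupId, version = 1) =>
  `/zkitter_g/${version}/${groupId}/proto`;

console.log(userTopic('0x1234...6789'));        // /zkitter_u/1/0x1234...6789/proto
console.log(groupTopic('interep_github_gold')); // /zkitter_g/1/interep_github_gold/proto
```

A client can then pass the resulting topic string to its Waku subscription call.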
### Protobuf Definition
The current `util/message.ts` implementation allows messages of any type to be represented as a hex string. We can use that as the payload of the message.
To verify that a message was created by its claimed creator, we also need to include a signature, which is 64 bytes long for secp256r1.
```protobuf
message Message {
  string text = 1;
  string signature = 2;
}
```
```protobuf
message ZKMessage {
  enum PROOF_TYPE {
    SEMAPHORE = 0;
    RLN = 1;
  }
  string text = 1;
  string proof = 2;
  string publicSignals = 3; // should include xShare
  PROOF_TYPE type = 4;
}
```
## Libp2p storage
Waku Store is a best-effort storage layer that allows users to retrieve published messages at a later time. Typically, there are enough Waku Store nodes to provide historical messages from the last 30 days. However, without a guarantee of data completeness, nodes that are older and more available have an advantage over nodes that are newer and less available. This asymmetry introduces a point of centralization, where in the future a network of nodes could monopolize discovery.
Although we could use GunDB as the storage layer, we would run into the same scaling issues that initially caused us to migrate to Waku. One particular issue is RAM usage: since GunDB stores everything in one JSON flatfile, a full GunDB node requires as much RAM as it does hard disk space.
We can create a new fixed-size peer-to-peer network that stores a small amount of data per user in order to guarantee better completeness. For example, if we have 100 million users, each with a 1MB data limit, this data layer will take about 100TB at maximum capacity. The estimated cost of 100TB of hard drive space is around 2K USD. ([source](https://www.amazon.com/Seagate-3-5-Inch-Internal-Enterprise-ST14000NM001G/dp/B08K98VFXT/ref=asc_df_B07T63FDJQ/?tag=&linkCode=df0&hvadid=385191927323&hvpos=&hvnetw=g&hvrand=17257758536709678006&hvpone=&hvptwo=&hvqmt=&hvdev=c&hvdvcmdl=&hvlocint=&hvlocphy=9032039&hvtargid=pla-822333688936&ref=&adgrpid=82240853001&th=1))
### Overview
The P2P storage layer assigns a data blob to every registered user and ZK group. As messages are published via Waku, data is appended to the data blob. Once the full 1MB (or 1GB in the case of a ZK group) is reached, the user can reset the data blob by archiving all old messages to IPFS, at which point they will start with a new, empty data blob. We expect regular users to never run into this limit, and only power users to have to perform such an action.
### Protocol Specification
#### Requirements
- access to Arbitrum node
- access to Ethereum node
- waku
- libp2p
- leveldb
LevelDB is chosen because of its compatibility with client environments (web, Android, and iOS).
#### Data Blob
We can visualize a data blob as an append-only log of messages. On average, each message should be ~200 bytes, or ~3.5KB for a ZK message (estimates in the bottom section).
Each blob should contain the following header:
```
{
epoch: NUMBER
blobHash: STRING
archives: STRING[] // ipfs://hash
resets: MESSAGE[] // reset message from user
}
```
`epoch` starts at 0, and increments by one whenever a user resets their blob.
`blobHash` is an incremental hash of all messages in chronological order. For example, if we have the message hashes `[a, b, c]`, the blob hash would be `hash(hash(a, b), c)`.
`archives` is a list of IPFS links stored in the form of `ipfs://hash`. This allows us to support other file sharing protocols in the future.
`resets` is a list of signed messages submitted by the user to reset the data blob and start a new epoch. The lengths of `resets` and `archives` should always match the current `epoch`.
#### On Initialization
When a new node is initialized, it should ALWAYS watch the Zkitter Registrar contract on Arbitrum and write to `/zkitter/user/` using the Ethereum address as key.
e.g.
```js
const db = levelup(leveldown('./zkitter/user'))
await db.put(
'0x1234',
{
address: '0x1234',
pubkey: '0x4567',
txhash: '0x890a',
chainId: 42161 // Arbitrum
},
{ valueEncoding: "json" }
);
```
#### Subscribing New Waku Message
Each time a new user is added, we should add a new Waku topic to our observer, which is simply an array of strings. Note that it would be preferable if Waku could support subscribing to wildcard topics (i.e. `/zkitter_u/1/*/proto`). If this is not possible, we are looking at a 6GB RAM requirement (assuming topics are stored in-memory) just for the topic strings at 100M users, which is not good, but not the end of the world. We can switch to publishing on more generic topics in a future version, such as using one global topic for all messages, or performing a modulo operation on the user address to group all addresses into N topics.
e.g.
```js
waku.relay.addObserver(async wakuMsg => {
// parse protobuf
const decoded = proto.UserMessage.decode(wakuMsg.payload);
// parse hex into zkitter message
const msg = Message.fromHex(decoded.data);
});
```
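The modulo-sharding fallback mentioned above could look like the sketch below. The shard count of 1024 and the `shard_` topic naming are illustrative assumptions:

```js
// Map a user address into one of N shared topics by taking the
// address (as an integer) modulo N. N = 1024 and the topic naming
// are illustrative assumptions, not part of the spec.
const N_SHARDS = 1024;

const shardTopic = (address, version = 1) => {
  const shard = BigInt(address) % BigInt(N_SHARDS);
  return `/zkitter_u/${version}/shard_${shard}/proto`;
};
```

With this scheme a node only keeps N topic strings in memory regardless of user count, at the cost of receiving (and filtering out) messages from other users in the same shard.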
After receiving the message, we need to do three things:
- store the message at path `/zkitter/user/${address}/` using the key `bytewise.encode(msg.date)` (this allows querying in chronological order)
- store the message metadata at path `/zkitter/user/${address}/meta` using the key `msg.hash`
- recalculate blob hashes
e.g.
```js
const db = levelup(leveldown(`./zkitter/user/${msg.creator}`));
const metaDb = levelup(leveldown(`./zkitter/user/${msg.creator}/meta`));
await db.put(
  bytewise.encode(msg.date),
  msg.toJSON(),
  { valueEncoding: "json" }
);
await metaDb.put(
  msg.hash(),
  {
    epoch: 0, // current epoch
    createdAt: msg.date.getTime(),
    blobHash: '0x123' // hash(lastBlobHash, msg.hash())
  },
  { valueEncoding: "json" }
);
db.createReadStream({ gte: bytewise.encode(msg.date) })
  .on('data', function (data) {
    // recalculate forward hashes in case the new message received is not the latest
  });
```
#### On New Peer Connection
Every node SHOULD initialize a full sync request after a new peer is discovered.
```js
libp2pNode.dialProtocol(peerId, ['/zkitter-p2p/1/fullSyncReq']);
```
#### On Full Sync
To handle a full sync, each node MUST follow the steps below:
1. Bob iterates over all users and sends a sync request to Alice for each user. The user sync request should include:
```js
{
  method: 'BLOB_SYNC_REQUEST',
  address: '0x123',
  epoch: 0,
  hash: '', // latest blob hash from Bob's node
  lastUpdated: 1671619119716
}
```
2. Alice will compare Bob's blob hash and epoch to its own record.
3. If they are the same, Alice should submit a `BLOB_SYNC_COMPLETE` to Bob
4. If they are different, Alice will retrieve, from its own record, the message at `1671619119716`, and compare to see if it matches Bob's blob hash and epoch
5. If they are the same, Alice should return all subsequent messages to Bob
```js
{
  method: 'BLOB_SYNC_DATA',
  address: '0x123',
  epoch: 0,
  hash: '0x456', // latest blob hash from Alice's node
  lastUpdated: 1671619119716,
  messages: [] // list of all messages since Bob's last updated time
}
```
- If they are different, Alice should return `BLOB_SYNC_ERROR` to Bob. Upon receiving the error, Bob should iterate in reverse order, and submit new `BLOB_SYNC_REQUEST` messages until `BLOB_SYNC_DATA` is received.
6. Bob should verify and insert the new messages received from Alice. Once all messages are processed, Bob should recalculate the `blobHash` and compare it to Alice's record.
7. If they are the same blob hash, Bob can move on to the next user
8. If they are different, Bob should be able to isolate the messages that Alice doesn't have, and return all of them to Alice using `BLOB_SYNC_DATA`
9. Alice should be able to recompute the new hash using Bob's data, and return `BLOB_SYNC_COMPLETE` to Bob
Every node should disconnect from a peer when they encounter unexpected data, such as the following:
- Whenever a node receives corrupted data in `BLOB_SYNC_DATA`, such as an invalid signature, unserializable data, or duplicate messages
- If a node receives `BLOB_SYNC_ERROR` when requesting a blob sync at the beginning of an epoch. The only reason this could happen is that the peer is purposely misbehaving
- If a peer takes too long to respond
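Alice's side of steps 2–5 can be sketched as a pure handler over an in-memory record. The message shapes follow the request/response examples above; the `record` structure and function name are illustrative assumptions, with storage access stubbed out:

```js
// Sketch of Alice's handler for BLOB_SYNC_REQUEST (steps 2-5).
// `record` is Alice's local state for the address (assumed shape):
// { epoch, blobHash, messages: [{ hash, blobHash, createdAt }] }
function handleBlobSyncRequest(record, req) {
  // Steps 2-3: same epoch and blob hash -> already in sync.
  if (record.epoch === req.epoch && record.blobHash === req.hash) {
    return { method: 'BLOB_SYNC_COMPLETE', address: req.address };
  }

  // Step 4: locate the message at Bob's lastUpdated time and
  // check whether its running blob hash matches Bob's.
  const i = record.messages.findIndex(m => m.createdAt === req.lastUpdated);
  const match = i !== -1 && record.messages[i].blobHash === req.hash;

  if (match) {
    // Step 5: return everything after Bob's last known message.
    return {
      method: 'BLOB_SYNC_DATA',
      address: req.address,
      epoch: record.epoch,
      hash: record.blobHash,
      lastUpdated: req.lastUpdated,
      messages: record.messages.slice(i + 1),
    };
  }

  return { method: 'BLOB_SYNC_ERROR', address: req.address };
}
```

Bob's side (steps 6–9) mirrors this: verify and insert the returned messages, recompute the blob hash, and either move on or send the messages Alice is missing.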
### On Retrieving Content
Once a node completes a full sync of a blob, it advertises that it can now provide content for that user or group. ([source](https://github.com/libp2p/js-libp2p/tree/master/examples/peer-and-content-routing#2-using-content-routing-to-find-providers-of-content))
Message requests can be made over a bidirectional stream using the protocol `/zkitter-p2p/1/messageReq`.
### Full Client vs Light Client
A full client is a node that observes all registered users and replicates data blobs for all users and groups. A full client should always implement the `fullSync` protocol.
A light client is a node that only observes selected user or group blobs. By default, the Zkitter web UI should run a light client that subscribes to the user's own blob plus the blobs of all users they follow. Light clients should implement the `lightSync` protocol.
### Data Usage Estimates
Based on statistics, Twitter has around 250 million daily active users, producing around 500 million tweets per day. Since the distribution is not normal and assuming the curve is skewed to the right, we use 1 tweet per day per user as a baseline for estimating usage on Zkitter. The average tweet is about 28 characters long.
Zkitter Estimated Data Usage:
- per post content: 56 bytes (2 bytes per char)
- per signature: 64 bytes
- per message header (creator, timestamp, type): 61 bytes
- per message: 181 bytes
- all messages per user per year: ~66KB
A conservative estimate puts the data growth rate at about 66KB per user per year. At 1MB, we can accommodate conservative usage for 15 years, or 15x power usage for 1 year.
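The per-user estimate above works out as follows (1 post per day baseline, sizes from the breakdown above):

```js
// Per-message size from the breakdown above.
const content = 28 * 2;   // 28 chars at 2 bytes each = 56 bytes
const signature = 64;     // 64-byte signature
const header = 61;        // creator, timestamp, type
const perMessage = content + signature + header; // 181 bytes

// One post per day as a conservative baseline.
const perUserPerYear = perMessage * 365; // 66065 bytes, ~66KB

console.log(perMessage, perUserPerYear); // 181 66065
```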
For anonymous posts using an RLN proof, the proof has 1350 characters (proof, yShare, root, nullifier, epoch, signal hash, rlnIdentifier, and xShare). This puts the data usage for each anon post at around 3KB. 4chan, the largest anonymous message board, sees about 900K posts per day on average. At that scale, we are looking at around 1TB per year of storage.
We believe we have a long way to go before we reach 4chan's scale. As such, similar to users, each group will also have a data limit, initially set to 1GB. This increased limit will also be extended to custom groups. At ~3KB per message, 1GB should be enough to store about 330K ZK messages per group.