# Challenges with the Use of `call_remote` in Kizuna's P2P Messaging

###### tags: `Kizuna-architecture-challenges`

<style>
.ui-infobar, #doc.markdown-body {
  max-width: 1000px;
}
</style>

---

## Summary of the problem

- Kizuna uses `call_remote` in the following P2P messaging functions:
    - `send_message`
    - `receive_message`
    - `read_message`
    - `receive_read_receipt`
- In a distributed, peer-to-peer network, multiple agents making a `call_remote` to a single agent at the same time is problematic.
- The more agents there are in the network, the more collisions we will have.
- Currently, most of the functions listed above call `create_entry` on the source chains of both the sender and the receiver. This is problematic because we need assurance that the two writes either both succeed or both fail, and that is not guaranteed.

---

## Error instances workflow

- https://miro.com/app/board/o9J_l7OpnXc=/

---

## Issues

### Non-atomicity

*from https://docs.rs/hdk/0.0.101/hdk/index.html*

> HDK is atomic on the source chain ⚛
> - All writes to the source chain are atomic within a single extern/callback call.
> - This means all data will validate and be written together or nothing will.
> - There are no such guarantees for other side effects. Notably we cannot control anything over the network or outside the holochain database.
> - **Remote calls will be atomic on the recipients device but could complete successfully while the local agent subsequently errors and rolls back their chain. This means you should not rely on data existing between agents unless you have another source of integrity such as cryptographic countersignatures.**
> - Use a post commit hook and signals or remote calls if you need to notify other agents about completed commits.

- Although commits are atomic on a single source chain, keeping committed entries consistent across two source chains is hard (e.g. a sender's commit can succeed while the receiver's commit fails).
This introduces issues with rollbacks and with keeping track of unsynchronized entries between chains.
- In practice, this means that until `post_commit` is available, a single extern call should only write to one source chain.

### Non-concurrency

- Consider three agents, `Alice`, `Bob`, and `Charlie`. If `Alice` sends a message to `Bob` at the same time `Charlie` sends a message to `Bob`, `Alice`'s `call_remote` to `Bob` blocks other operations on `Bob`'s source chain, causing `Charlie`'s `call_remote` to `Bob` to fail. In practice, this limits an agent to chatting with one agent at a time, and the agent cannot receive messages from other agents while doing so.

### Scalability

- As the number of users grows, we are more likely to run into collisions and rollback issues.

### Concurrent Writes and Post commit

- If the following is true for `post_commit`:
    > - Allows the guest a final veto to entry commits or to perform side effects in response
    > - Executes after the wasm call that originated the commits so not bound by the original atomic transaction
- Then, in one function, we can:
    1. commit the message and receipt on the sender
    2. `call_remote` to the recipient to send the message
    3. have the recipient invoke the `receive_message` function to commit the received message to their source chain
        - the commit can fail on the recipient, either in the workspace or while actually committing the element to the source chain (e.g. a "source chain head moved" error)
            - in which case the `call_remote` should also fail (we do not know whether that is the case)
            - Will the error from the callee's function be returned to the caller in `ZomeCallResponse::Ok(ExternIO)`?
- If the error is returned from the callee of `call_remote`, `post_commit` should be able to roll back the local commits to avoid inconsistencies between the two source chains.
- And if we veto, we should also be able to return an appropriate error to the host so that the frontend (i.e.
UI) can retry the zome function call or let the user decide what to do.

---

## Workarounds

### General Strategy

- Implement the granular steps for each **act of sending a message** as a temporary workaround
- Use `post_commit` so that the `call_remote` to the receiver can happen in the same extern as the commit to the sender's chain
- Put `read_message` ON HOLD to reduce complexity

### Receipts as source of integrity

- Mechanism:
    - we implement three receipt statuses
        1. **SAVED** (renamed from SENT) - the message is committed to the sender's source chain but not yet to the receiver's source chain
        2. **DELIVERED** - the message is committed to both the sender's and the receiver's source chains
        3. **READ** - the message has been read by the receiver
            - ON HOLD until further notice
        4. Should we add DELETED?
    - note: READ -implies-> DELIVERED -implies-> SAVED
    - if a message is marked as SAVED
        - display in the UI that the message is not yet delivered while retrying the send in the background
        - if the retry fails, prompt the user to either resend or delete the message

### Granular steps for each **act of sending a message**

- Mechanism:
    - The act of sending a message means
        1. alice commits the **message** and the **receipt (saved)** to her source chain.
        2. alice makes a `call_remote` to bob to commit the **message** and **receipt (delivered)** to bob's chain (the `call_remote` returns the delivered receipt to alice)
            - This `call_remote` also uses `emit_signal` to send the message and receipt to bob's UI
        3. alice commits the **receipt (delivered)** to her source chain
    - we implement a separate function for each step in the act of sending a message, which can then be used as a retry function
        1. `save_message`
            - retried when a message fails to commit on the sender's chain
            - note: diagnose why exactly this happens in practice
        2. `send_message`
            - retried when the sender fails to commit the message on the receiver's chain
            - sample case: `call_remote` failure
        3. `mark_message_as_delivered`
            - retried when the sender fails to commit the delivered receipt to their own chain
            - note: diagnose why exactly this happens in practice
    - each function represents one step in the **act of sending a message**
    - the retry functions use exponential backoff (i.e. after every failed retry, the wait time before the next attempt increases)
- cons:
    - even if we successfully handle all cases, handling all collisions as the number of users increases may be troublesome and incurs the cost of extra network calls
    - retry attempts compound the time an agent is unavailable to receive messages from others or send messages to others (???)

### Constraints

- When a message status is `saved`, it could mean one of two things.
    - the message is committed only on the source chain of the sender *but has not been committed to the source chain of the receiver*
        - In this case, we would like to ask the user whether they want to resend the message
    - the message has been committed on both the sender's and the receiver's source chains *but the sender failed to commit the `delivered` receipt on their own source chain*
        - In this case, we need to confirm that the message has been committed on the source chain of the receiver, and then retry committing the delivered receipt.
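
The two meanings of a saved receipt status described above can be told apart with a small decision function. The following is a minimal sketch in plain Python, not HDK code: `ReceiptStatus`, `implies`, `next_retry_step`, and the `receiver_has_message` flag are hypothetical names introduced for illustration, and how the sender actually confirms that the receiver holds the message is left open, as it is in these notes.

```python
from enum import IntEnum
from typing import Optional


class ReceiptStatus(IntEnum):
    """Receipt statuses, ordered so that a higher status implies the lower ones."""

    SAVED = 1      # message committed on the sender's chain only
    DELIVERED = 2  # message committed on both chains
    READ = 3       # message read by the receiver (on hold in these notes)


def implies(status: ReceiptStatus, other: ReceiptStatus) -> bool:
    """READ implies DELIVERED implies SAVED."""
    return status >= other


def next_retry_step(sender_status: ReceiptStatus,
                    receiver_has_message: bool) -> Optional[str]:
    """Return the name of the retry function to call next, or None if done.

    `receiver_has_message` is an assumed input: the result of whatever check
    confirms that the message is on the receiver's source chain.
    """
    if implies(sender_status, ReceiptStatus.DELIVERED):
        return None  # sender already holds the delivered receipt
    if receiver_has_message:
        # Both chains hold the message; only the sender's receipt is missing.
        return "mark_message_as_delivered"
    # The message never reached the receiver; resend it (or ask the user).
    return "send_message"
```

With this split, the UI only needs the sender's latest receipt and one confirmation from the receiver to decide between resending the message and re-committing the delivered receipt.
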
### Post commit

*from https://docs.rs/hdk/0.0.101/hdk/index.html*

> `fn post_commit(headers: Vec<HeaderHash>) -> ExternResult<PostCommitCallbackResult>`:
> - Allows the guest a final veto to entry commits or to perform side effects in response
> - Executes after the wasm call that originated the commits so not bound by the original atomic transaction
> - Guest is guaranteed that the commits will not be rolled back if Ok(PostCommitCallbackResult::Pass) is returned
> - Input is all the header hashes that were committed
> - Only the zome that originated the commits is called
> - Any failure fails (rolls back) all commits

- Questions:
    - Does `post_commit` allow rolling back the commits on the sender's chain if the commits on the receiver's chain fail? If yes, this might address the non-atomicity of the `send_message` functions.

### P2P Message Architecture

| Where is the entry stored? | Source Chains | Pairwise DHT using DNA cloning (trusted) | Global DHT (NOT trusted) |
| -------------------------- | ------------- | ---------------------------------------- | ------------------------ |
| Entry Visibility | Private | Public | Public |
| Pros | no 3rd party peer holds your data, only metadata (headers) | no 3rd party peer holds your data (because there is no other party) | easy to implement. <br> supports concurrent actions |
| Cons | does not support concurrent actions yet | Not supported in Holo. <br> Costly | 3rd party peers hold encrypted data. <br> exposes social graph (if links are used) |

#### Concerns

1. `call_remote` is blocking
    - only allows one write at a time
    - locks the source chain, preventing concurrent actions
    - feature request: concurrent writes on SQLite DB
2. DHT exposes social graph
    - when used with links

### SQLite source chains

- supports concurrent writes
- may remove blocking calls to `call_remote`

### Single DHT for everyone where messages are stored encrypted

- peer-to-peer interaction may be traced even if messages are encrypted
- exposes social graphs through links

### Individual DHTs for each conversant pair

- costly
- not feasible in Holo yet

---

#### send message workflow with retry mechanism (includes multiple zome_fn described in [p2p architecture](https://hackmd.io/kAfxn1GyRXeYNCYmBMfr4g))

##### zome functions involved

- send_message
- receive_message
- read_message
- receive_read_receipt

```mermaid
sequenceDiagram
    participant AUI as Alice_UI
    participant ACC as Alice_Conversation_Cell
    participant BCC as Bobby_Conversation_Cell
    rect rgba(255,0,0,0.3)
        AUI-->>ACC: call send_message
        note over ACC: the following commits may fail atomically
        ACC-->>ACC: commit message to Alice's chain
        ACC-->>ACC: commit receipt "saved" to Alice's chain
        alt error in any of the commits
            note over AUI: message/receipts have not been committed to either sender or receiver
            ACC-->>AUI: return error
            AUI-->>ACC: call send_message again with exponential backoff
        else commits successful
            ACC-->>BCC: call_remote Bob's receive_message
            note over BCC: any of the following commits may fail
            BCC-->>BCC: commit message to Bob's chain
            BCC-->>BCC: commit file to Bob's chain
            BCC-->>BCC: commit receipt "delivered" to Bob's chain
            BCC-->>ACC: return
            alt call_remote Ok
                note over ACC: the following commits may fail atomically
                ACC-->>ACC: commit receipt "delivered" to Alice's chain
                alt commit succeeds
                    ACC-->>AUI: return message + receipt
                    AUI-->>AUI: display message
                else commit fails
                    note over AUI: message/receipts/files have been committed on both sender and receiver <br> but "delivered" receipt has not been committed on sender
                    ACC-->>AUI: return error + receipt details
                    AUI-->>ACC: call retry_commit_receipt_to_sender
                    ACC-->>ACC: commit "delivered" receipt to Alice's source chain
                end
            else call_remote timeout
                note over AUI: message/receipts/files have been committed on the sender but not on the receiver
                ACC-->>AUI: return error
                note over AUI: exponential backoff retry
                AUI-->>AUI: set max retry
                AUI-->>AUI: initialize counter to zero
                loop while less than max retry and zome response is error
                    opt zome call response is a success
                        AUI-->>AUI: break loop
                    end
                    AUI-->>AUI: increase counter by 1
                    AUI-->>AUI: compute for timeout and wait
                    AUI-->>ACC: call retry_commit_to_receiver
                    rect rgba(0,0,150,0.3)
                        ACC-->>BCC: call_remote Bob's receive_message
                        BCC-->>BCC: commit message to Bob's chain
                        BCC-->>BCC: commit file to Bob's chain
                        BCC-->>BCC: commit receipt "delivered" to Bob's chain
                        BCC-->>ACC: return
                        ACC-->>AUI: return
                    end
                end
                opt retry fails
                    AUI-->>AUI: mark message as "failed to send" in UI
                end
            end
        end
    end
```

### Requested feature/Possible solution

- Request/Questions for Post Commit
    - The ability to roll back the original commit if some logic (e.g. `call_remote`) fails in `post_commit`, in line with what the HDK doc says about `post_commit`:
        > Allows the guest a final veto to entry commits or to perform side effects in response
    - If the side effect inside `post_commit` produces an error, will it be returned from the original wasm call? (we want it to, so that we can handle it in the frontend)
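
Separately from the `post_commit` questions, the UI-side retry loop in the send message workflow diagram (set a max retry, count attempts, wait an increasing timeout, call the retry function, and mark the message as failed once retries run out) can be sketched as follows. This is a minimal model in plain Python under stated assumptions, not the Kizuna UI code: `backoff_delays`, `retry_with_backoff`, the doubling factor, and the default values are all hypothetical, and `call` stands in for a zome call such as `retry_commit_to_receiver` that reports success as `True`.

```python
import time
from typing import Callable, List


def backoff_delays(base_seconds: float, max_retries: int,
                   factor: float = 2.0) -> List[float]:
    """Wait times before each retry; each one is `factor` times the previous."""
    return [base_seconds * factor ** attempt for attempt in range(max_retries)]


def retry_with_backoff(call: Callable[[], bool],
                       max_retries: int = 5,
                       base_seconds: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry `call` until it succeeds or `max_retries` attempts are used up.

    Returns True on success. A False result corresponds to the
    'mark message as "failed to send" in UI' branch of the diagram.
    `sleep` is injectable so the loop can be tested without real waiting.
    """
    for delay in backoff_delays(base_seconds, max_retries):
        sleep(delay)   # compute the timeout and wait
        if call():     # e.g. the retry zome call succeeded
            return True
    return False
```

Capping `max_retries` keeps the total time spent retrying bounded, which matters here because every retry is another `call_remote` that can block the receiver's source chain.
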