
Building an in-enclave matching engine with on-chain dependencies - approaches and challenges

This doc outlines the high-level technical design for implementing part of the larger on-chain dark pool, specifically the orderbook and matching engine, inside a secure enclave, a hardware-based trusted execution environment (TEE). While the exact taxonomy is out of the scope of this doc, there are subtle differences between secure enclaves and the generalized notion of TEEs; the major difference is that enclaves offer application isolation through hardware guarantees.

Why enclaves? What enclaves?

The key feature that our on-chain dark pool offers is open-order price shielding: the prices of open orders are shielded by design, so malicious entities/traders who want to manipulate the token price cannot gauge the impact that trade(s) from their treasury would have on the token price.

As such, we make use of a number of primitives and techniques to achieve this:

  1. Orders consist of two portions, transparent and shielded, with the price of the order comprising part of the shielded portion. Whenever this portion is used for public, on-chain commitments, it is done in hashed form, so as not to expose its underlying value while maintaining cryptographic integrity.

  2. All circuits that deal with raw orders take them in as private inputs.

  3. The public output $b$ of the fill() circuit does not expose the price, since it is a free variable both in the case of asks as well as bids (see here for more details).

  4. The raw client order $O_{own}$ is only exposed in the secure enclave, where it is used for order matching against other open orders from the orderbook. In essence, all the open orders in the orderbook are exposed only inside the enclave and are used for order matching, keeping $O_{own}.p$ (the price of $O_{own}$) exposed only to two entities: the user who placed the order, and the enclave.

The enclave is hence a crucial part of keeping price shielded.

Our enclave of choice is AWS Nitro. Nitro enclaves run on dedicated CPU and memory that is not shared with untrusted code, which promises to mitigate the threat of side-channel attacks, one of the major security issues in Intel SGX and other commodity enclaves like ARM TrustZone and AMD SEV that are based on shared CPU resources. However, this is not without tradeoffs. Unlike SGX et al., Nitro enclaves do not support memory encryption, which makes recovery in the case of a crash fault (i.e. data availability) a non-trivial problem to solve.

Overview of the matching engine

Before getting into the details of the matching engine itself, here is a quick recap on how the client-enclave dynamics for placing, matching and filling an order work. For details on what exactly the terms 'commitment', 'backing ZKP' and 'enclave signature' imply, please refer to the "Client functions for interacting with the dark pool" section of this doc.

[Diagram: client-enclave flow for placing, matching and filling an order (image unavailable)]

As is evident from the above, the matching engine has on-chain dependencies, i.e. it matches an order $O$ only when the commitment for that order has been published on-chain. Again, since we want to minimize trust assumptions as much as possible, only matching is done by the enclave. The actual filling of the order is done client-side, which is why the enclave returns to the client crossing orders of up to 3x the client's order volume. The client side picks the best orders out of these to fill the order and provides them as private inputs to the fill() circuit. This exact client-side process is out of the scope of this doc; however, we now provide a description of how the matching engine on the enclave's end works. All code described is in Rust.

Recall that an order $O$ consists of a transparent ($t$) and a shielded ($s$) structure. We define it as:

```rust
// F_q is the scalar field of curve bn128
struct Order {
    t: TransparentStructure,
    s: ShieldedStructure,
}

struct TransparentStructure {
    phi: Fq,     // 0 for bid, 1 for ask
    chi: String, // Token address for the target project
    d: String,   // Denomination token address, set to "0x1" for USDC or ETH
}

struct ShieldedStructure {
    p: Fq,     // Price, scaled by 10^9 with 10^7 precision
    v: Fq,     // Volume, scaled by 10^9
    alpha: Fq, // Access key, randomly sampled from Fq
}
```

Now, since our actual matching engine depends on what happens on-chain (specifically, whether an order placed by a user is committed on-chain), we need a 'staging area' consisting of orders and their commitments, which the enclave then uses to verify on-chain commitments:

```rust
struct Commitment {
    public: TransparentStructure,
    // The private part of the commitment of an order O is defined as H(O.s),
    // where H is a hash function
    private: Fq,
}

struct StagingOrder {
    order: Order,
    commitment: Commitment,
    timestamp: i32,
}
```
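To make the commitment concrete, below is a minimal sketch of how an enclave-side helper might construct a Commitment from an Order. The hash_to_fq helper is a hypothetical stand-in for whatever hash $H$ the protocol actually uses over the shielded structure, and the sketch assumes TransparentStructure derives Clone.

```rust
// Sketch only: `hash_to_fq` is a hypothetical placeholder for the hash H
// applied to the shielded structure (e.g. a Poseidon-style hash of p, v, alpha).
fn commit(order: &Order) -> Commitment {
    Commitment {
        // The transparent portion stays in the clear.
        public: order.t.clone(),
        // The shielded portion (price, volume, access key) is committed to as H(O.s).
        private: hash_to_fq(&order.s),
    }
}
```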

We then define the actual matching engine. Specifically, we employ a Red-Black Tree to organize Limit objects by their limit price. The Orderbook structure bifurcates these trees into bid and ask sides, ensuring the best buy and sell prices are quickly accessible. The levels of the LOB are indexed by price, and each level contains a hashmap of orders keyed by their commitments.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;
use std::sync::{Arc, RwLock};

pub enum Color {
    Red,
    Black,
}

pub struct Data {
    raw_order: Order,
    raw_order_commitment: Commitment,
}

type LimitNode = Rc<RefCell<Node>>;
type LimitNodePtr = Option<LimitNode>;

pub struct Node {
    price: Fq,
    size: u32,
    color: Color,
    value_sum: Fq, // sum of (p*v) at this limit price
    parent: LimitNodePtr,
    left: LimitNodePtr,
    right: LimitNodePtr,
    orders: HashMap<Commitment, Order>,
}

pub struct RBTree {
    root: LimitNodePtr,
    len: usize,
}

pub struct Orderbook {
    bid_tree: Arc<RwLock<RBTree>>,
    ask_tree: Arc<RwLock<RBTree>>,
}
```

Design choices

We have carefully considered the data structures used to construct the 'orderbook' that the matching engine operates on. Since the orderbook will be multithreaded, with many concurrent users/threads accessing it at once, it is imperative to make sure that it is thread-safe. As such, we structure the book into two parts, bid and ask, each behind its own independent RwLock. RwLocks are especially useful since they allow multiple concurrent readers but only a single writer, and they are shared across threads using Arc. Each of the bid and ask sides is a red-black tree (RBTree), and hence writing to either side involves locking the whole tree. This means that once a lock is acquired, we can operate on/write to the tree as if it were single-threaded.

We use the Node struct defined above as the node of the tree, with the LimitNodePtr type being used for the parent, left and right fields of the node. LimitNodePtr is of type Option<LimitNode>, and LimitNode is itself of type Rc<RefCell<Node>>. Utilizing Option<Rc<RefCell<Node>>> for the parent, left, and right fields facilitates shared ownership and interior mutability within a single-threaded context (i.e. while the tree's lock is held). The Rc wrapper allows multiple references to a node, ensuring its longevity as long as it is required within the RBTree. Meanwhile, RefCell enables mutable access to the node, allowing for necessary modifications within the tree structure during operations such as insertions or deletions.
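As a small illustration of this locking discipline (a sketch only, assuming the Orderbook defined above; best_price and insert_limit_order are hypothetical RBTree methods standing in for the real tree operations):

```rust
// Sketch: concurrent readers vs. a single writer on the bid side.
fn read_best_bid(book: &Orderbook) -> Option<Fq> {
    // Many threads may hold read locks on this side simultaneously.
    let bid_tree = book.bid_tree.read().unwrap();
    bid_tree.best_price()
}

fn write_bid(book: &Orderbook, order: Order, commitment: Commitment) {
    // Exclusive access: once acquired, the tree can be mutated as if single-threaded.
    let mut bid_tree = book.bid_tree.write().unwrap();
    bid_tree.insert_limit_order(order, commitment);
}
```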

The main advantage that RBTs offer in terms of efficiency is that insertion, deletion and searching within the tree is done in $O(\log n)$ time, where $n$ is the number of nodes in the tree. We now provide a high-level overview of the data structures and functions used in the system:

  • Two primary data structures are initialized:
    • staging_queue: Holds orders temporarily before on-chain confirmation.
    • order_book: The main structure with two Red-Black Trees (RBT) for bids and asks.

listen_to_events():

  • Continuously monitors on-chain events.
  • On detecting an OrderCommitted event, it verifies if the commitment matches any order in staging_queue. Matched orders are removed from staging_queue and added to order_book.
  • On detecting an OrderFilled event, it checks for the order commitments of the filled orders and removes those orders from the book, if present, using the remove_order() function.
  • On detecting an OrderCancelled event, it checks for the order commitment of the cancelled order and removes that order from the book, if present, using the remove_order() function.
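A minimal sketch of this event loop is shown below. The ChainEvent enum, the event iterator, and the shape of staging_queue (a map from commitment to StagingOrder) are assumptions made purely for illustration; insert_into_book and remove_order are the functions described further down.

```rust
// Hypothetical on-chain events the enclave cares about.
enum ChainEvent {
    OrderCommitted(Commitment),
    OrderFilled(Commitment),
    OrderCancelled(Commitment),
}

// Sketch: drain on-chain events and keep the book in sync with the chain.
fn listen_to_events(
    events: impl Iterator<Item = ChainEvent>,
    staging_queue: &mut HashMap<Commitment, StagingOrder>,
    order_book: &Orderbook,
) {
    for event in events {
        match event {
            ChainEvent::OrderCommitted(c) => {
                // Promote the staged order into the book once its commitment is on-chain.
                if let Some(staged) = staging_queue.remove(&c) {
                    insert_into_book(order_book, staged.order, staged.commitment);
                }
            }
            ChainEvent::OrderFilled(c) | ChainEvent::OrderCancelled(c) => {
                // Filled or cancelled orders leave the book.
                remove_order(order_book, &c);
            }
        }
    }
}
```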

insert_into_book():

  • Determines if an order is a bid or ask using the phi value.
  • Inserts the order into the appropriate RBT in order_book.
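A sketch of this dispatch is below, assuming the Orderbook above and a hypothetical insert_limit_order method on RBTree that wraps the detailed insertion logic; the phi comparison is written schematically.

```rust
// Sketch: route an order to the bid or ask tree based on phi (0 = bid, 1 = ask).
fn insert_into_book(book: &Orderbook, order: Order, commitment: Commitment) {
    let tree = if order.t.phi == Fq::from(0u64) {
        &book.bid_tree
    } else {
        &book.ask_tree
    };
    // Lock only the side we are writing to; the other side stays available.
    let mut tree = tree.write().unwrap();
    tree.insert_limit_order(order, commitment);
}
```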

insert_into_tree():

  • Manages the insertion of an order into the correct Red-Black Tree.
  • The detailed RBT insertion logic is abstracted in the pseudocode.

match_orders():

  • Continually processes orders from order_book to find matches. Note that 'continually' here means as and when a user order is received and find_matching_orders() is called on it, not like a CRON job.
  • Seeks orders matching up to 3 times the volume of the current order.

find_matching_orders():

  • Accepts an order and a volume (3x the order's volume).
  • Searches and returns matching orders up to the given volume.
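A rough sketch of the matching traversal is shown below. It assumes a hypothetical iter_from_best iterator over the opposite side of the book (lowest ask first for an incoming bid, highest bid first for an incoming ask) and a hypothetical volume_of helper; for readability, volumes are treated as plain integers here rather than Fq elements, and the tree-walking code itself is omitted.

```rust
// Sketch: gather crossing orders until we reach `target_volume`
// (3x the incoming order's volume).
fn find_matching_orders(book: &Orderbook, incoming: &Order, target_volume: u128) -> Vec<Order> {
    // Match against the opposite side of the book (phi = 0 is a bid).
    let tree = if incoming.t.phi == Fq::from(0u64) {
        book.ask_tree.read().unwrap()
    } else {
        book.bid_tree.read().unwrap()
    };

    let mut matches = Vec::new();
    let mut gathered: u128 = 0;
    for order in tree.iter_from_best() {
        // The crossing check on shielded prices happens here, inside the enclave.
        matches.push(order.clone());
        gathered += volume_of(order);
        if gathered >= target_volume {
            break;
        }
    }
    matches
}
```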

remove_order():

  • Searches for a specific order in the order_book using its commitment.
  • If found, the order is removed from the respective Red-Black Tree in the book.
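A matching sketch for removal is below, with remove_by_commitment standing in (hypothetically) for the per-level hashmap lookup and the RBT deletion logic abstracted earlier:

```rust
// Sketch: remove an order identified by its commitment from whichever side
// of the book holds it. `remove_by_commitment` is a hypothetical RBTree helper
// that walks the tree's price levels, drops the entry keyed by `commitment`
// from that level's hashmap, and prunes the level if it becomes empty.
fn remove_order(book: &Orderbook, commitment: &Commitment) {
    let removed = {
        let mut bids = book.bid_tree.write().unwrap();
        bids.remove_by_commitment(commitment)
    };
    if !removed {
        let mut asks = book.ask_tree.write().unwrap();
        asks.remove_by_commitment(commitment);
    }
}
```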

Analysis of Sell Order Scenario with and without Block Limit

We analyze the scenario where a Whale holding token A (ticker: TOKEN-A) wants to dump his holdings on unknowing followers on Twitter by tweeting a bullish stance and waiting for the delayed market reaction. We analyze both scenarios: one where there is a block limit for an order commitment to be submitted (i.e., if the order is submitted and the enclave signature is received at block $x$, then the commitment must be received by block $\le x+M$), and one where there is no such block limit (i.e., $M \to \infty$ in the previous case).

Setup:

  • TOKEN-A trades at 0.1 ETH, which is also the Whale's average buying price.
  • The Whale has 1,000 tokens.
  • The Whale's sell order is set at 0.108 ETH.
  • Ethereum block time is ~12 seconds.

Scenario 1: With M Block Limit:

The Whale's Actions:

  1. Receives enclave signature at block x.
  2. Tweets bullish stance on TOKEN-A.
  3. Followers buy TOKEN-A.

Market Reaction:

  • Price rises to 0.105 ETH by block x+M.

Outcome:

  • The Whale must submit the commitment by x+M.
    • If price > 0.108 ETH after commitment: Profit = (price - 0.1) × 1,000.
    • If price ≤ 0.108 ETH, or the price drops after reaching 0.105 ETH: No profit.

Scenario 2: Without M Block Limit:

The Whale's Actions:

  1. Receives enclave signature at block x.
  2. Tweets bullish stance on TOKEN-A.
  3. Waits to see market reaction.

Market Reaction:

  • Price rises to 0.108 ETH by block x+N.

Outcome:

  • The Whale can submit the commitment at any time.
    • If the Whale submits at x+N: Profit = (0.108 - 0.1) × 1,000 = 8 ETH.

Conclusion:

  • With M Block Limit: The Whale's profit potential is constrained by the market's immediate reaction. The Whale can't exploit delayed market reactions.
  • Without Block Limit: The Whale can exploit market reactions by choosing when to submit the commitment, giving them an unfair advantage.

Setting a block limit for valid order commitments

As evidenced by the above scenario, it becomes more difficult for the Whale to dump on followers with precision as $M$ (the block limit for submitting order commitments) decreases. Without a block limit ($M \to \infty$), however, the Whale can easily gauge the delayed market reaction and make a calculated move to dump on his followers. In practice, since order commitments can very well be delayed for reasons beyond maliciousness (e.g. transactions failing), we need to set a reasonable value for $M$. In most cases, 10 minutes, or translated to blocks (with Ethereum block time being 12 seconds), 50 blocks, is a reasonable limit. However, we can also explore setting custom limits based on the market cap/centralization of the token (for example, $M \propto \text{Circulating Market Cap}$, or $M \propto \text{HHI}$, where HHI is the Herfindahl-Hirschman Index of the token).

If this limit $M$ is crossed, the order is removed from the staging_queue. Likewise, we check for stale orders in the staging_queue and remove them if the order commitment is not received. We need to ensure that this is done at a fixed interval, say every $N$ blocks. The 'staleness', i.e. the time from order placement to removal, should be at least $M$ for every order. Consider the snapshot at $x+N$. For removal at this snapshot, an order placed at $x+b$ must satisfy $b + M \le N$. Since $b_{min} = 0$, we get $N \ge M$. Therefore we can set $N = M$ and check for stale orders every $M$ blocks. However, as given by the inequality, we can increase this quantity in case our $N = M$ approach becomes slow/inefficient. As is evident from this approach, all stale orders in the block range $(x + pM,\, x + (p+1)M]$ are removed from the staging_queue at the snapshot $x + (p+2)M$. Say there are a total of $k$ stale orders in this range, where $k = k_{pM+1} + k_{pM+2} + \ldots + k_{(p+1)M}$ and each $k_i$ represents the number of stale orders placed at block $x + i$. The average 'staleness' is then given by $\sum_{i=1}^{M} (2M - i)\, k_{pM+i} / k$. It is easy to see that the minimum value of this is $M$, and the maximum value is $2M - 1$.
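As a sketch of the sweep itself (assuming the staging_queue is keyed by commitment, and that each StagingOrder additionally records the block at which the enclave signature was issued; signed_at_block is a hypothetical field not shown in the struct above):

```rust
// Sketch: every M blocks, drop staged orders whose commitment has not appeared
// on-chain within M blocks of the enclave signature.
fn sweep_stale_orders(
    staging_queue: &mut HashMap<Commitment, StagingOrder>,
    current_block: u64,
    m: u64,
) {
    staging_queue.retain(|_commitment, staged| {
        // Keep the order only if it is still within its M-block window.
        staged.signed_at_block + m > current_block
    });
}
```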

Re-org resistance in the protocol

As seen in the above design, the matching engine is heavily dependent on what happens on-chain (order commitments, filling/cancelling of orders to update the book), since the chain is the source of truth in the protocol. The matching engine can then be said to have an internal 'belief system' which is completely dependent on on-chain activity. A break in the chain will hence break this belief system and, consequently, the entire protocol. It therefore becomes of paramount importance to have safeguards in place to protect against the 'breaking' of the chain, which in most cases can be attributed to the chain undergoing a reorg. We explore one potential approach using light clients to handle reorgs; a sketch of the header check follows the list below:

  • Initialize the Light Client

    • Set up and synchronize a light client with the network.
    • The light client will only download block headers, significantly reducing storage and bandwidth requirements.
  • Store Recent Block Headers

    • Maintain a local cache of the most recent block headers.
    • The depth of this cache should be based on the blockchain's typical re-org depth.
  • Listen for New Headers

    • Use the light client's capabilities to listen for new block headers being broadcasted by the network.
  • Check Parent Header

    • Upon receiving a new block header, compare its parent header hash with the last header hash in your local cache.
    • If they don't match, a potential re-org might have occurred.
  • Determine the Depth of the Re-org

    • If a mismatch is detected, trace back through the local cache to find the last common block header.
    • The number of headers from this common header to the current header represents the depth of the potential re-org.
  • Request Confirmations

    • Before concluding a re-org, request headers from multiple peers in the network to ensure consistency across sources.
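Below is a minimal sketch of the parent-hash check over the local header cache. The header type, hash representation, and cache shape are simplifications made here for illustration and are not tied to any particular light-client library.

```rust
// Sketch: detect a potential re-org by comparing the incoming header's parent
// hash with the tip of a local header cache, and measure its depth by walking
// back to the last common ancestor.
struct Header {
    number: u64,
    hash: [u8; 32],
    parent_hash: [u8; 32],
}

/// Returns Some(depth) if `new_header` does not extend the cached tip,
/// i.e. a potential re-org of that depth; None if the chain simply extended.
fn check_reorg(cache: &[Header], new_header: &Header) -> Option<usize> {
    let tip = cache.last()?;
    if new_header.parent_hash == tip.hash {
        return None; // Normal extension of the chain.
    }
    // Walk the cache backwards to find the last header the new branch builds on.
    for (depth, header) in cache.iter().rev().enumerate() {
        if header.hash == new_header.parent_hash {
            // `depth` cached headers after the common ancestor are no longer canonical.
            return Some(depth);
        }
    }
    // Parent not found in the cache: either a re-org deeper than the cache, or
    // headers arriving out of order; flag for confirmation from multiple peers.
    Some(cache.len())
}
```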

Reverting Process Post Re-org Detection

  • Order Commitment Verification

    • Check the order commitments confirmed in the reorged block.
    • Identify any commitments not present in the new block post re-org.
  • Move to Staging Area

    • Transfer the orders associated with the missing commitments back to the staging_queue.
    • This ensures that these orders are not lost and can be re-processed once they are committed on-chain again.
  • Update Order Book

    • Remove the orders associated with the missing commitments from the order_book.
    • Ensure the order book remains consistent with the current state of the chain.
  • Notification

    • Notify affected users about the re-org.
    • Inform them that their orders have been moved back to the staging area to maintain transparency.
  • Re-processing

    • Once the orders in the staging_queue are committed on-chain again, process them as usual.
    • Add them back to the order_book.
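A sketch of the first three steps (commitment verification, moving orders back to staging, updating the book) is below. It assumes a hypothetical take_order helper that removes an order from the book and returns its raw copy, plus the staging_queue and StagingOrder types from earlier; the timestamp handling is schematic.

```rust
use std::collections::HashSet;

// Sketch: given the commitments from the reorged-out block(s) and the set that
// survived on the new canonical chain, revert the difference.
fn revert_after_reorg(
    reorged_commitments: &[Commitment],
    canonical_commitments: &HashSet<Commitment>,
    order_book: &Orderbook,
    staging_queue: &mut HashMap<Commitment, StagingOrder>,
    current_timestamp: i32,
) {
    for commitment in reorged_commitments {
        if canonical_commitments.contains(commitment) {
            continue; // Still committed on the new canonical chain; nothing to do.
        }
        // The commitment vanished with the re-org: pull the order out of the
        // book and put it back into the staging area to await re-commitment.
        if let Some(order) = take_order(order_book, commitment) {
            staging_queue.insert(
                commitment.clone(),
                StagingOrder {
                    order,
                    commitment: commitment.clone(),
                    timestamp: current_timestamp,
                },
            );
        }
    }
}
```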

Testing framework for reorgs

Reorg resistance is one of the most critical aspects of the protocol, and as such requires extensive testing before being deployed to production. The approach to take for testing is to run a local Ethereum network (through tools like Anvil or Hardhat) and simulate a reorg. However, the tooling for this has unfortunately not been built yet. There are two issues for this functionality (this one in Hardhat and this one in Anvil/Foundry); however, they didn't receive attention and were hence closed.

We would need to build this tooling ourselves in either framework in order to test our approach to handling reorgs.

Say we want to start off with this simple approach (as described initially by Krasimir):

Head block is 1; calling reorg(1) would create block 2 with block 1 as an uncle, including all transactions from block 1.

Matthias outlines the steps to do this in Foundry:

We need:

  • A reorg RPC call that accepts
    • the block where to reorg (default 1)
    • a tx filter object

And we can build arbitrary depth reorgs once we have this basic primitive in place.

Handling failures and ensuring data availability

As is evident, the enclave is a critical portion of our on-chain dark pool protocol, since it is the only trusted entity which handles "raw" orders (trusted since it is not client-side) and hence it becomes paramount to ensure the resilience and safety of this sensitive order data against failures.

The above diagram (courtesy this paper) succinctly encapsulates the four broad types of failures we can expect to see within our system:

  • Crash failures: When the internal state of the system is lost. In our case, this is when an enclave terminates when we do not expect or want it to.

  • Omission failures: These usually happen when the system fails to receive/send messages. In essence, this is tantamount to a timing failure with an "infinite" delay.

  • Timing failures: These happen when the system takes longer than expected to respond (say the limit to receive a response is $t$, with $0 < t < \infty$, and the system takes some time $t_{response} > t$ to respond).

  • Byzantine failures: This is the broadest class of failures, and can occur due to any arbitrary reason, including the system node being actively malicious.

Handling all of these requires a dependable distributed system, and since we deal with raw, sensitive data which is serviced to and from a client, we need a generalized mechanism for efficiently storing and retrieving this data in case of a failure within the system. To demonstrate why this is important, we consider the scenario of a crash failure:

Say there are initially 4 enclaves, Enclaves A through D, servicing clients. Each enclave is servicing the maximum possible number of clients it can. Now, enclave A experiences a crash failure. To handle this, a new enclave, E, is immediately spun up. However, since A has crashed, E cannot communicate with A and hence cannot retrieve the data A was operating on before it crashed.

It is now clear that, in such a scenario, a reliable data availability mechanism is needed, using which E can effectively and efficiently retrieve the data A had been operating on before crashing, and can then start servicing (erstwhile) A's clients.

We explore the Semi-AVID-PR scheme (paper, video) for data availability in our system. Say we have $n$ storage nodes (to be clear, these are dedicated storage nodes and not enclaves) from which we disperse and retrieve our data. At a high level, the protocol consists of the following primitives:

  • Setup: $1^\lambda \rightarrow (pp, sp_1, sp_2, \ldots, sp_n)$, represents the initial setup functionality, where $pp$ are the global public parameters and $sp_i$ are the local secret parameters for node $i$.
  • Commit: $B \rightarrow C$, takes as input a block $B$ of data, and returns a commitment $C$ to the data.
  • Disperse: $B \rightarrow P$, run by a client and all storage nodes, where a block of data $B$ is taken as input and a certificate of retrievability $P$ for the commitment $C = \text{Commit}(B)$ is returned to the client.
  • Verify: $(P, C) \rightarrow a \in \{0, 1\}$, takes as input a certificate of retrievability $P$ and a commitment $C$, and returns true or false, depending on whether the certificate is considered valid.
  • Retrieve: $(P, C) \rightarrow B$, run by all storage nodes and the client. It takes as input a certificate of retrievability $P$ and a commitment $C$ at the client, and outputs null or a block $B$ of data to the client.
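To make the shape of this API concrete, here is a hedged Rust sketch of the five primitives as a trait. All type names and signatures are illustrative assumptions, not the interface of any actual Semi-AVID-PR implementation; in particular, the real scheme operates over erasure-coded chunks and polynomial commitments, which are elided here.

```rust
// Illustrative placeholder types only.
struct PublicParams;
struct SecretParams;
struct Block(Vec<u8>);
struct DataCommitment(Vec<u8>);
struct RetrievabilityCert(Vec<u8>);

trait SemiAvidPr {
    /// Setup: 1^lambda -> (pp, sp_1, ..., sp_n)
    fn setup(lambda: usize, n: usize) -> (PublicParams, Vec<SecretParams>);

    /// Commit: B -> C
    fn commit(pp: &PublicParams, block: &Block) -> DataCommitment;

    /// Disperse: B -> P, run between the client and the n storage nodes.
    fn disperse(pp: &PublicParams, block: &Block) -> RetrievabilityCert;

    /// Verify: (P, C) -> {0, 1}
    fn verify(pp: &PublicParams, cert: &RetrievabilityCert, c: &DataCommitment) -> bool;

    /// Retrieve: (P, C) -> Option<B>, run between the storage nodes and the client.
    fn retrieve(pp: &PublicParams, cert: &RetrievabilityCert, c: &DataCommitment) -> Option<Block>;
}
```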

The protocol also ensures privacy by enabling the client to add a blinding factor $b$ when dispersing the data to the nodes, so that a node learns nothing about the dispersed information. A detailed construction and proof can be found in the paper.

The protocol is also fairly cheap and Byzantine fault tolerant (tolerating up to $f$ Byzantine nodes out of $n \ge 3f + 1$ total nodes, as stated in the paper): distributing a file of 22 MB among 256 storage nodes, up to 85 of which may be adversarial, requires a total of ≈ 70 MB of communication and storage, and ≈ 41 s of single-thread runtime (< 3 s on 16 threads) on an AMD Opteron 6378 processor when using the BLS12-381 curve.

Handling users at scale - approaches to rate limiting and load balancing

We use websockets as the preferred communication protocol between the clients and our system. The main reason is that websockets allow for a persistent, bi-directional connection between client and server, which is exactly what our protocol demands. However, websockets come with a scaling problem: long-running client connections limit the server's ability to scale up or down. We hence need a solution that allows for auto-scaling even in the worst-case scenario (long-running clients).

We containerize our application and deploy it using Amazon EKS, with EC2 instances running Nitro enclaves as worker nodes. The naive approach would have been to write the application to natively handle websocket connections. Instead, we offload this responsibility from EKS to another service (Amazon API Gateway). This provides multiple benefits:

  • Easier scaling up/down of pods without worrying about lingering connections
  • Enables use of async calling models and other event-driven architectures, which would have been limited with native websocket support
  • Allows for advanced traffic management, in particular, throttling

Below is the high-level overview of how this would work (pic credits: AWS Blog):

Here, the websocket logic and connection handling is abstracted away from EKS to the API Gateway. Clients connect to the gateway via a websocket API. The gateway in turn connects with the EKS services using REST APIs (specifically through the enclaves' vsock-proxy, which allows for secure processing of external traffic without passing through the larger EC2 instance), which ensures that there are no lingering open connections blocking the termination of pods.

Throttling:
Since we use API Gateway to manage client websocket connections, we can implement rate limiting by setting custom throttling limits.