# Service Discovery

This is inspired by [Building Lightweight Microservices Using Redis](https://medium.com/flywheel-tech/building-light-weight-microservices-using-redis-23f051624647) and its corresponding [RedisConf18 talk](https://www.youtube.com/watch?v=z25CPqJMFUk).

First, on startup, create an instance ID based on the enclave ID or sealing keys. It must be stable across restarts.

This design uses Redis as the in-memory service database, so the commands below are Redis commands.

## Presence

Run this every second. The key has a TTL (time to live, i.e. it expires after that delay) of 3 seconds, so a service that fails to refresh it for more than 3 seconds is considered dead.

```redis
# Indicate the service is present by creating its entry
# (SETEX takes the TTL in seconds before the value)
setex mithrilcloud:service:blindai:{InstanceID}:presence 3 {InstanceID}
```

## Enclave Info

Serialize the information you need to publish as JSON:

```json
{
  "instanceId": "InstanceID",
  "serviceName": "blindai",
  // in the future, this would allow for "sev", "sev+apm", "nitro"...
  "platform": "sgx",
  "serviceVersion": "0.5.1",
  "policy": { // serialized policy matching the enclave
    "mrenclave": "x" // most importantly, the mrenclave
  },
  "attestation": "", // the attestation, serialized
  "pubkey": "", // public key of the enclave
  "connectionString": "x.x.x.x:xxxxx", // the connection string (ip:port)
  "dataAccess": {
    // List of data this enclave has access to.
    // In BlindAI, these are the sealed model ids that this enclave has access to
    "cchudant/gpt2": {
      "hot": true // this means it's currently in memory
    }
  }
  // add more custom info (which cloud is it located in, hostname, ...)
}
```

Run these commands every 30 seconds.
(TTL: 40 seconds)

```redis
# Publish the enclave info and index the instance by platform and mrenclave
setex mithrilcloud:blindai:service:{InstanceID}:info 40 {InfoJSON}
hset mithrilcloud:blindai:byplatform:{Platform} {InstanceID} '{}'
hset mithrilcloud:blindai:bymrenclave:{MREnclave} {InstanceID} '{}'
```

For each piece of data (model) the enclave has access to:

```redis
hset mithrilcloud:blindai:bydataaccess:{DataAccessID} {InstanceID} '{"hot":true,"mrenclave":"xx"}'
```

where `hot` indicates whether the model is in hot storage (memory) or cold storage.

## Find an enclave

This is the role of the enclave manager (which is the licensing server for now). It is implemented via a gRPC call, `FindEnclave`, in which the user sends the enclave spec it requires.

```protobuf
syntax = "proto3";

// Field numbers are illustrative, not final.

message EnclavePolicy {
  string mrenclave = 1;
}

message FindEnclaveRequest {
  // optional, if set the returned enclaves should have this model loaded
  string dataaccess = 1;
  string platform = 2; // "sgx"
  EnclavePolicy policy = 3; // optional
  // number of enclaves to return, default to 1
  // let's not support any other number than 1 for now
  int32 limit = 4;
  // list of enclave instance ids the user does NOT want the enclave manager to return.
  // See Note 4 below for why this exists.
  repeated string enclaveDenyList = 5;
}

message EnclaveInfo {
  /* This is basically the JSON enclave info above, but with fewer fields */
  string instance_id = 1;
  string serviceName = 2;
  string platform = 3;
  string serviceVersion = 4;
  EnclavePolicy policy = 5;
  bytes attestation = 6; // the attestation, serialized
  bytes pubkey = 7; // public key of the enclave
  string connectionString = 8; // the connection string (ip:port)
}

message FindEnclaveResponse {
  repeated EnclaveInfo found = 1;
}
```

### Step 1

There are three cases, checked in this order:

- The user has sent a `dataaccess` string. In this case, we run

  ```redis
  hgetall mithrilcloud:blindai:bydataaccess:{DataAccessID}
  ```

  Filter the results by `mrenclave` if the user provided one.
- No `dataaccess`, but the user has sent an `mrenclave`.
  In this case, we can query

  ```redis
  hgetall mithrilcloud:blindai:bymrenclave:{MREnclave}
  ```

- Neither `dataaccess` nor `mrenclave`: we could search by arch, but I don't see any use case as of yet, so just return an error.

### Step 2

We now have an array of candidates. Take the first `limit` candidates. For each candidate, get the presence.

```redis
get mithrilcloud:service:blindai:{InstanceID}:presence
```

If an enclave is absent, set it aside and remove it from the respective hashes (bydataaccess, bymrenclave, byplatform). Then check the presence of another candidate in its place, so that we still have an enclave to return to the user.

### Step 3

Then, we get the enclave info using

```redis
get mithrilcloud:blindai:service:{InstanceID}:info
```

Depending on what the user asked for, we may want to do further filtering based on this info.

### Notes

__Note 1__: Caching the results of the gRPC calls in the enclave manager is a quick and easy optimisation that should be considered.

__Note 2__: This scheme enables future use cases such as:

- Finding an enclave based on its arch/policy/mrenclave, for secret sharing and upgradability purposes.
  - This does not require a direct connection with any enclave you want to share secrets with.
- The licensing (/dashboard) server should display the list of enclaves on the cloud, the models the user owns, and the enclaves they are currently on.

__Note 3__: Attestation verification should be done on the client side using the results from `FindEnclaveResponse`, before starting the connection with said enclave.

__Note 4__: In the unlikely but important case where the client tries to contact an enclave that has died within the 3 seconds since it last updated its presence, the client should either try another enclave if the server returned any, or call the enclave manager back for another enclave. The faulty enclave should then be listed in `enclaveDenyList`.
If caching is implemented on the enclave manager, it may be useful to evict the cache entries for enclaves that appear in deny lists, so that we don't return these enclaves on subsequent calls.

__Note 5__: For future scaling, Redis can fairly easily be replicated and sharded using the Redis Cluster features. See this document: [Cluster Spec](https://redis.io/docs/reference/cluster-spec). Redis is made for that!
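
As a rough illustration of the lookup in Steps 1 to 3, here is a minimal Python sketch. It is not the enclave manager's actual code. The function name `find_enclaves` and its parameters are hypothetical. `r` is assumed to be any client exposing `hgetall`, `get` and `hdel` that returns `str` values (for example a redis-py client created with `decode_responses=True`). `FakeRedis` is a hypothetical in-memory stand-in, without TTL handling, so the sketch runs without a server. For brevity, a stale instance is only removed from the index hash that was read, not from all three.

```python
import json


def find_enclaves(r, dataaccess=None, mrenclave=None, limit=1, deny_list=()):
    """Return up to `limit` info dicts for live enclaves matching the request."""
    prefix = "mithrilcloud:blindai"
    # Step 1: choose the index hash that lists the candidates.
    if dataaccess:
        index_key = f"{prefix}:bydataaccess:{dataaccess}"
    elif mrenclave:
        index_key = f"{prefix}:bymrenclave:{mrenclave}"
    else:
        raise ValueError("either dataaccess or mrenclave is required")

    found = []
    for instance_id, raw in r.hgetall(index_key).items():
        if len(found) >= limit:
            break
        if instance_id in deny_list:
            continue
        meta = json.loads(raw)
        # bydataaccess entries carry the mrenclave, so filter on it here.
        if dataaccess and mrenclave and meta.get("mrenclave") != mrenclave:
            continue
        # Step 2: an expired presence key means the enclave is gone.
        if r.get(f"mithrilcloud:service:blindai:{instance_id}:presence") is None:
            r.hdel(index_key, instance_id)  # assumption: clean this index only
            continue
        # Step 3: fetch the full info document.
        info = r.get(f"{prefix}:service:{instance_id}:info")
        if info is not None:
            found.append(json.loads(info))
    return found


class FakeRedis:
    """Hypothetical in-memory stand-in exposing only the calls used above."""

    def __init__(self):
        self.strings, self.hashes = {}, {}

    def get(self, key):
        return self.strings.get(key)

    def hgetall(self, key):
        return dict(self.hashes.get(key, {}))  # copy, so hdel is safe mid-loop

    def hdel(self, key, *fields):
        for field in fields:
            self.hashes.get(key, {}).pop(field, None)
```

Passing the client in as a parameter keeps the lookup logic independent of the Redis connection, which also makes it easy to put a cache in front of it as suggested in Note 1.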