Bootstrapping Private Identity Sets

# <center>Bootstrapping Private Identity Sets</center>

<center>
<image src=https://hackmd.io/_uploads/ryU22ZFVA.png width=250 />
</center>

Introduction
---

[Semaphore](https://semaphore.pse.dev/), a project developed by [PSE](https://pse.dev) at the Ethereum Foundation, exists and it is dope. It is a primitive which allows identity providers to [create groups](https://bandada.pse.dev/) and issue identities within these groups to users. These identities have two important properties:

**Non-attribution:** A user with a Semaphore identity can sign messages in such a way that a verifier can confirm the signer is in a particular group, without learning who in the group signed the message.

**Uniqueness:** A message can come with a *nullifier* which ensures that a user cannot sign two distinct messages within the same scope (a.k.a. context).

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/r1hgtS5E0.png style="max-height:500px; width:auto;" /></td></tr>
<tr><td>Figure 1: Semaphore Overview</td></tr>
</table>
</center>

These two properties make Semaphore very attractive for private voting, messaging and generally any application which would benefit from private authentication. For example, a provider could administer a "resident of city X" group which could be used by residents to privately vote on the best local restaurant, elect the city council, blow the whistle on corruption in local government, or voice political dissent.

However, Semaphore does not solve other critical challenges identity systems face:

**Sybil Attacks:** While it provides the uniqueness property mentioned earlier, it does not prevent a user from [obtaining multiple identities](https://en.wikipedia.org/wiki/Sybil_attack) within a group. Preserving a 1-1 mapping between identities and unique humans is a task left up to the provider.

**Trust:** Applications which choose to delegate their identity system must trust the provider not to issue itself identities which could be used to exploit the app. This is less of an issue when a group is administered by a smart contract using on-chain data for registration. But the most valuable identity information is not typically available on-chain.

Sybil resistance
---

One approach to ensuring a user is not able to obtain multiple identities is to have the provider require the user to disclose personally identifiable information (PII). The provider then verifies and stores this PII in a private database upon registration. For all future registrations the provider can check that the same PII is not used more than once.

Great! Now we have some strong assurances that users cannot obtain multiple identities. But hold on, this approach has some pretty severe drawbacks:

* The user has to trust the provider to both know and *securely store* their sensitive information.
* If the user wishes to register to a group administered by another provider, they have to disclose their PII again. This leads to multiple copies of their PII being stored in separate databases, which increases the probability of data breaches.
* The provider knows exactly who is in each group they administer.
* Storing the PII is not only a liability for the user, but a liability for the provider as well.

There is a long-standing tradition of incumbent identity providers (government, financial institutions, social platforms, etc.) distributing access to sensitive PII, whether by accidentally leaving the front door wide open, falling victim to sophisticated attacks, or simply selling it intentionally via data sharing agreements. A lot can be said about their shortcomings but, as mentioned above, replicating PII to a new set of providers is not an improvement. Unless you're in the business of vetting and storing private information and capable of doing so securely: don't.

Alternative approaches of varying effectiveness do exist, such as charging a fee per identity or peer attestation networks, to mention a couple. Here we'll focus specifically on identities issued by traditional identity providers.

Existing Providers
---

A large number of sybil-resistant identities already exist today. For example, while not perfect, a nation state has a pretty good idea of how many citizens it has. Further, most have assigned unique identifiers to each of their citizens. Some have even gone as far as issuing cryptographic credentials which can [already be leveraged](https://github.com/anon-aadhaar/anon-aadhaar) by applications.

Governments aren't the only ones in the business of keeping track of identities. Here's a non-exhaustive list of others:

* Utility companies.
* Credit agencies.
* Banks.

That's great and all, so why aren't we using them? There are a number of reasons why we don't see more applications leveraging these existing identities.

**Internal:** A lot of the entities which vet and house PII use it for internal purposes and have no interest in being identity providers.

**Non-cryptographic:** Many (most) identity providers do not issue cryptographic credentials. Instead they provide forgery-resistant physical credentials such as cards and passports, or they operate API-based solutions to which they can monitor and restrict access, such as OAuth, OpenID, or bespoke deployments.

Until recently, there simply were no satisfactory methods for people to utilize these in other applications.

Login with ~~Google~~ Anything.
---

TLSNotary is another open-source protocol developed by [PSE](https://pse.dev). It started its life as an independent project, first conceived in 2013 on a [Bitcoin forum](https://bitcointalk.org/index.php?topic=173220.0). Its purpose is to solve one conceptually simple problem:

How can one query a webserver and share the data with another party in a secure way?

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/B1NzlbFVR.png style="max-height:500px; width:auto;" /></td></tr>
<tr><td>Figure 2: Data sharing</td></tr>
</table>
</center>

The fact that the internet largely lacks such basic functionality is seldom noticed, but it has influenced its current architecture to a degree that is hard to overstate (and worth an article of its own). Perhaps you've once thought there must be a better way whilst going through the motions of forwarding someone a screenshot of a page on a website.

Using some fancy cryptography, TLSNotary addresses this issue while having some interesting properties:

- The webserver is not aware this data sharing is taking place, nor does it have to install any new software. Ergo, almost all existing data on the internet can already be proven.
- A Prover can selectively disclose data to a Verifier while hiding private or unnecessary information.
- It is designed to be maliciously secure, requiring no trust between the two parties.
- The protocol can run on relatively low-power devices and even works in the browser and on mobile.

Connecting the dots, TLSNotary can be used to prove any existing identity information on the internet even if the webserver hosting it wasn't designed to be an identity provider. For example, one could log in to a feature-lacking government website and prove their citizen ID to a third party without revealing their login credentials or anything extra.

In combination with Semaphore, it's possible to reuse identity data to bootstrap massive private identity sets. This can enable people to use their existing information to join new systems while preserving both privacy and sybil resistance.

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/HkNw_B9ER.png style="max-height:500px; width:auto;" /></td></tr>
<tr><td>Figure 3: Bootstrapping with TLSNotary</td></tr>
</table>
</center>

Trust minimization
---

By simply combining TLSNotary and Semaphore we're already able to do some really interesting stuff! However, recall from the introduction that the privacy and integrity of the registration process still rely on the party administering the Semaphore group. The Semaphore provider knows who joins the group and can trivially insert fake identities if they so desire. Adding yet another trusted authority to the mix is not satisfactory. Can we do better?

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/SyeGxIc40.jpg style="max-height:400px; width:auto;" /></td></tr>
<tr><td>Figure 4: Patrick Bateman wants better</td></tr>
</table>
</center>

Fortunately, we can! Both the privacy and trust issues can be addressed in tandem by adding more parties and using something called [multi-party computation (MPC)](https://securecomputation.org/).

First, a Semaphore group can be configured such that multiple $(N)$ parties must come to agreement when adding an identifier. With this, a user registers to a group by proving their identity to all $N$ parties. Any application which wants to incorporate a group can then decide for itself whether $N$ is sufficiently decentralized for its needs. To address liveness issues it's also possible to configure thresholds, i.e. requiring an $M$ of $N$ quorum for registration.

Now that we have multiple parties we can address the privacy issue using MPC. A full introduction to MPC is out of scope of this article, but put simply: MPC allows multiple parties to compute some public function $f$ on private inputs $x_i$ provided by each party, such that every party only learns the output $y$.

$$
\Large f(x_0, \dots , x_n) = y
$$

Many MPC protocols can compute arbitrary functions with varying levels of efficiency.

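To make the idea of computing $f$ on private inputs a little more concrete, here is a minimal toy sketch using additive secret sharing, where $f$ is simply a sum. Everything in it is an assumption made for illustration: the modulus `P`, the helper names `share` and `secure_sum`, and the fact that all parties are simulated inside one process. It only models honest-but-curious parties and is not how any of the protocols mentioned in this article are implemented.

```python
import secrets

# Field modulus for the toy example (2**127 - 1 is a Mersenne prime).
# The choice of modulus is an assumption made purely for illustration.
P = 2**127 - 1


def share(x: int, n: int) -> list[int]:
    """Split x into n additive shares that sum to x modulo P."""
    parts = [secrets.randbelow(P) for _ in range(n - 1)]
    parts.append((x - sum(parts)) % P)
    return parts


def secure_sum(inputs: list[int]) -> int:
    """Compute f(x_0, ..., x_n) = sum(x_i) mod P using only shares."""
    n = len(inputs)
    # Party i splits its private input and hands share j to party j.
    dealt = [share(x, n) for x in inputs]
    # Party j adds up the shares it received. A single share is a uniformly
    # random field element, so party j learns nothing about any one input.
    partials = [sum(dealt[i][j] for i in range(n)) % P for j in range(n)]
    # Publishing the partial sums reveals only the output y.
    return sum(partials) % P
```

Calling `secure_sum([3, 5, 7])` yields the same result as adding the inputs directly, but no simulated party ever handles another party's input in the clear.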
Fortunately, in this case we only need to do two very simple things:

1. Generate a secret key $k$ shared among $N$ parties such that each party holds a corresponding share $k_i$, where any subset of $M$ shares is sufficient to recover $k$.
2. Compute a pseudo-random function, for example a secure hash function $\mathsf{H}$ keyed with the shared key $k$, denoted below as $\mathsf{H}^k$.

With those two functionalities we have enough to upgrade the registration process to provide much better privacy assurances.

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/r1hN6_9NA.png style="max-height:500px; width:auto;" /></td></tr>
<tr><td>Figure 5: Private trust minimized registration</td></tr>
</table>
</center>

Now during registration TLSNotary is used to prove a _commitment_[^1] to the identifier, $\mathsf{Com}(id)$, which is subsequently provided as an input to the registration MPC, which outputs a private identifier $\mathsf{H}^k(id)$. This system has several nice properties:

- The private identifiers hide the users' identities from everyone, including the Semaphore providers[^2]. Even the source identity provider can't see who joined the group without actively attacking the system.
- Analogous to a [trusted setup](https://pse.dev/en/projects/p0tion) ceremony, if enough Semaphore providers destroy their key shares all identities in the group will remain hidden indefinitely. However, this also prevents the registration of new ids.
- Many off-the-shelf, concretely efficient MPC protocols capable of realizing this functionality already exist.

An astute reader may wonder why not just use the commitment $\mathsf{Com}(id)$ as the private identifier? That would be a good question and it's true in some cases that would be sufficient. However, the original identifiers more often than not contain little to no entropy and can be easily recovered by brute-forcing a lookup table.

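The two building blocks and the brute-force concern can be sketched in a few lines of Python. This is a toy under loudly stated assumptions: Shamir secret sharing stands in for the $M$ of $N$ key sharing, HMAC-SHA256 stands in for $\mathsf{H}^k$, an unkeyed SHA-256 of the id stands in for any identifier that anyone can compute on their own, and every function name is hypothetical. Most importantly, the sketch reconstructs $k$ in the clear, whereas in the design described here the PRF would be evaluated inside the MPC so that no single party ever learns $k$.

```python
import hashlib
import hmac
import secrets
from typing import Optional

P = 2**127 - 1  # prime field modulus (illustrative assumption)


def split_key(k: int, m: int, n: int) -> list[tuple[int, int]]:
    """Shamir-share k so that any m of the n shares recover it."""
    # Random polynomial of degree m - 1 whose constant term is the key.
    coeffs = [k] + [secrets.randbelow(P) for _ in range(m - 1)]

    def poly(x: int) -> int:
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P

    return [(x, poly(x)) for x in range(1, n + 1)]


def recover_key(shares: list[tuple[int, int]]) -> int:
    """Lagrange-interpolate the polynomial at x = 0 (needs Python 3.8+)."""
    k = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        k = (k + yi * num * pow(den, -1, P)) % P
    return k


def private_identifier(k: int, ident: str) -> str:
    """H^k(id), instantiated here as HMAC-SHA256 (an assumption)."""
    key = k.to_bytes(16, "big")
    return hmac.new(key, ident.encode(), hashlib.sha256).hexdigest()


def brute_force_unkeyed(target: str, digits: int) -> Optional[str]:
    """Try every numeric id of the given length against an unkeyed hash."""
    for candidate in range(10**digits):
        ident = str(candidate).zfill(digits)
        if hashlib.sha256(ident.encode()).hexdigest() == target:
            return ident
    return None
```

With a 3 of 5 sharing, any three shares returned by `split_key` give back the same key through `recover_key`, and therefore the same private identifier for a given id. By contrast, `brute_force_unkeyed` recovers a short numeric id from its plain SHA-256 digest by simple enumeration, while the same search against an HMAC output goes nowhere without $k$.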
Additionally, as mentioned in the points above, there is value in hiding the members of the set from the source identity provider.

Putting it all together
---

With tools that exist today, users can be empowered to convert existing and otherwise inaccessible identity information into private Semaphore identities. Instead of duplicating sensitive PII into yet another database, users can use zero-knowledge proofs to prove more general statements about themselves and link them to their private identifiers.

Under the hood, Semaphore groups are essentially just Merkle trees with some extra cryptographic ornaments. The ideas presented in this article can be framed as "merklizing" existing web databases in a way where the users themselves help port each leaf and privately claim it as their own.

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/HJFUYus4R.png style="max-height:200px; width:auto;" /></td></tr>
<tr><td>Figure 6: Merklizing the Web</td></tr>
</table>
</center>

In the most basic case, as described in the [Login with Anything](#Login-with-Google-Anything) section, it's possible for an off-chain application to trustlessly tap into any existing identity system without requiring direct integration. This can simplify building things like private anonymous chat, voting apps, or perhaps new social platforms.

For cases where we want to unlock this for a broader set of applications, such as on the Ethereum blockchain, things begin to resemble an oracle system (and inherit the associated trust challenges).

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/BJUzZQ2ER.png style="max-height:400px; width:auto;" /></td></tr>
<tr><td>Figure 7: Shared Identity Sets</td></tr>
</table>
</center>

Further, users can aggregate their group memberships into composite proofs such as depicted below. Keep in mind that, thanks to Semaphore, it's possible for applications to limit each to a single use in any given context!

<center>
<table>
<tr><td><image src=https://hackmd.io/_uploads/rkKKvjcVR.png style="max-height:400px; width:auto;" /></td></tr>
<tr><td>Figure 8: Composite Identity Proofs</td></tr>
</table>
</center>

Moving Forward
---

With a high-level overview of how we can bootstrap private identity sets in hand, we can now touch on concrete steps which can be taken to realize this.

A pragmatic start would be to admit that the path to the trust minimized version comes with a lot of engineering and incentive challenges, while the [Login with Anything](#Login-with-Google-Anything) approach could provide an immediate and low-investment solution for off-chain products today. With that in mind, below is a potential plan of attack:

1. Develop a modular authentication service, similar to an [OAuth2 Proxy](https://github.com/oauth2-proxy/oauth2-proxy), which makes it extremely easy for application developers to tap into arbitrary identity sources using TLSNotary.
2. Package the above service configured as an identity provider for [Bandada](https://github.com/bandada-infra/bandada/tree/main/libs/credentials/src/providers).
3. Find a flagship product to adopt the solution:
    - Warpcast can onboard users from existing platforms.
    - Signal/Telegram can replace phone number verification.
    - Gitcoin can integrate it into their Gitcoin Passport product.

If success indicators are looking good for the above track, the trust minimized effort could be approached:

1. Add features into Bandada for group management with multiple parties, and develop the associated contracts.
2. Implement the threshold secret-sharing and PRF MPC protocols into [mpz](https://github.com/privacy-scaling-explorations/mpz).
3. Develop a minimal-config solution for deploying a Bandada identity attestor into a trusted execution environment.
4. Find partners in the space willing to operate attestor instances until we achieve sufficient decentralization.
5. Select a handful of high-value identity sources for a first phase, e.g. government websites, and create corresponding on-chain Semaphore groups. Open the floodgates to onboard millions of users with Semaphore identities, perhaps incentivizing with honorary NFTs. Even just the same level of participation as seen for trusted setup ceremonies would be an excellent start.

[^1]: This commitment is both _binding_ and _hiding_, for example a secure hash of the id concatenated with a nonce provided by the user.

[^2]: If enough Semaphore providers collude they can recover the secret key $k$ and then attempt pre-image attacks to recover all the identifiers in the group.