owned this note
owned this note
Published
Linked with GitHub
# Fighting Simple Sybils: Levenshtein Distance
## Sybils
Quadratic funding - the mechanism that currently determines the value of Gitcoin grant funding -is inherently vulnerable to Sybil attacks. Sybil attacks are individual humans dividing themselves into multiple "virtual humans" in order to gain additional voting weight. In traditional banking and voting systems, Sybil resistance comes from "KYC" (know-your-customer) which links personal identifying information to some action. In Web3, "KYC" is generaly minimized because it undermines the core ethos of censorship resistance and permissionlessness. This means other methods are required to identify which participants in a grant round are real individual humans, and which are not.
<iframe width="706" height="428" src="https://www.youtube.com/embed/v1Dm7FI2AdU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
## Sybil Strategies
The goal of Sybil defense is to increase the investment of time and money required for an attacker to convice a grant review system that they are > 1 person to the extent that as rational attacker would not do it. Defenders constantly attempt to push this cost up while minimizing their own expenses, while attackers constantly try to pull the attack cost down. The greater the size of the exploitable pool of funds, the higher cost an attacker will be willing to pay. At the same time, extremely low-cost Sybil attacks are often worthwhile for attackers because even a low success rate can still be profitable if the attack cost is sufficiently low. This means that a robust Sybil defense structure requires systems that identify cheap, simple attacks very effectively and efficiently as well as more complex defenses against sophisticated attacks.
## Simple Sybils
The simplest, cheapest form of Sybil attack is simply to generate a large number of addresses and try to vote with all of them. Ordinarily, these will be sifted out of the grant review system by human reviewers because they usually fail even basic [proof-of-personhood](https://proofofpersonhood.com/) checks. However, there is a substantial cost associated with these human reviews. To optimize the Sybil defense mechanism for high efficacy and low cost, detection of these cheap Sybil attacks must be automated using computationally inexpensive algorithms.
## Levenshtein Distance
Levenshtein distance is a concept from information theory that can measure the difference between two text strings. It is commonly used in natural language processing. The Levenshtein distance between string `a` and string `b` is the minimum number of edits required to transform `a` into `b`. The fewer changes required to change one string to the other, the smaller the Levenshtein distance. The algorithm, in its naive recursive form, is as follows:
<br></br>
![](https://i.imgur.com/Q7vHOId.png)
<br></br>
*where a and b are the strings to be compared, |a| and |b| are the lengths of the strings*
In the context of Gitcoin grants, this idea can be applied to account names that are suspiciously similar. Very similar names might indicate low effort Sybil attempts, for example following account names stand out as very similar and worth investigating to ensure they are genuine individual voters:
```
david1103920
david1103921
david1103922
david1103923
david1103924
david1103925
david1103926
```
Accounts separated by a single edit might not be quite as obvious as those above. Here are some more examples of some slightly more subtle examples of accounts separated by a single edit:
```
ahmeddle
bhmeddle
chmeddle
hombre
h0mbre
hombr3
j1lly
j2lly
j3lly
```
All of these accounts are easily identified with the constraint `Levenshtein distance <= 1`.
## Algorithm performance
The Levenshtein algorithm is computionally expensive in its naive form because it requires recursion over all possible pairs of accounts. However, it can be made performant using [matrix algebra](https://turnerj.com/blog/levenshtein-distance-part-2-gotta-go-fast). This approach was recently taken by the FDD Sybil defense squad using a list of ~298,000 registered Gitcoin accounts (meaning the Levenshtein distance algorithm was applied to a matrix of 298,0000<sup>2</sup> elements). The result was a file with 129,297,647 rows indicating all pairs with edit distances less than 4. As the distance threshold decreases, the number of accounts flagged also decreases, but the likelihood that the flagged accounts are Sybils increases. The threshold distance therefore acts as a calibration slider that the Sybil defense squad can use to balance Sybil detection rate against cost of computation.
## Use cases
The technique has been applied to the full dataset of all Gitcoin accounts as a way to quantify the bulk number of suspicious accounts, but it could also run as a service that runs periodically to flag accounts that get created in some time window. This could become integrated as a gadget in the grant application workflow.
To demonstrate that the Levenshtein distance really is identifying Sybil-like behaviour, some of the accounts identified by the algorithm can be examined manually. These three accounts were flagged by the Levenshtein algorithm as potential Sybils:
[![](https://i.imgur.com/d8Nm9Mq.png =230x180)](https://i.imgur.com/d8Nm9Mq.png) [![](https://i.imgur.com/LS7E5RC.png =230x180)](https://i.imgur.com/LS7E5RC.png) [![](https://i.imgur.com/jDNzgkO.png =230x180)](https://i.imgur.com/jDNzgkO.png)
Examine these accounts in more detail using these links: [Account 1](https://gitcoin.co/martin1156010), [Account 2](https://gitcoin.co/martin1156011), [Account 3](https://gitcoin.co/martin1156012).
The account names are separated by a single edit (Levenshtein distance ==1). All three accounts have donated a total of precisely $37, divided identically between the same eight projects in the same round. None of the accounts have participated in bounties, hackathons, tips or kudos. All three were created at about the same time and have no activity prior to, or following, those eight votes. These seem to be obvious Sybils that were correctly flagged by the Levenshtein algorithm. Furthermore, the three accounts above were just three of eight accounts that were all separated by a single edit, and all shared the same Gitcoin activity profile and voting record.
## Summary
Creating multiple accounts with very similar names is a very lazy attack, and probably easily identified humans in the grant review system. However, identifying them by implementing a Levenshtein distance algorithm to the entire set of accounts in a grant round removes some of the load on the human reviewers and acts as a cost-optimization because the algorithmic apporoach is fast and cheap.
## Further Reading
[Levenshtein distance for beginners](https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0)
[Online Lev calculator](https://planetcalc.com/1721/)
[Levenshtein distance in Python](https://towardsdatascience.com/text-similarity-w-levenshtein-distance-in-python-2f7478986e75)
[Levenshtein distance in 3 languages](https://people.cs.pitt.edu/~kirk/cs1501/Pruhs/Spring2006/assignments/editdistance/Levenshtein%20Distance.htm)