# A Contributor's Guide to Building Machine Learning Models
The Anti-Sybil Operationalized Process (ASOP), built for the GitcoinDAO Fraud Detection and Defense (FDD) working group, is a community-driven system designed to enable contributions from any skillset. That said, machine learning proper is the domain of data science.
## Flavors of Machine Learning
Machine learning algorithms fall into three broad flavors: supervised, unsupervised, and reinforcement learning. Reinforcement learning algorithms learn by interacting with an environment, adapting their behavior in response to rewards and penalties; automated driving is the most common example. Unsupervised learning uses mathematical similarities among observations (a 'row' of data) to group things together. Market segmentation or customer behavior grouping would be standard uses of unsupervised learning.
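To make the unsupervised flavor concrete, here is a minimal sketch using scikit-learn's KMeans on invented 'customer' data; the feature values and the cluster count are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 'customer' observations: each row is [annual_spend, visits_per_month].
# These values are invented purely for illustration.
rng = np.random.default_rng(42)
low_spenders = rng.normal(loc=[200, 2], scale=[50, 1], size=(50, 2))
high_spenders = rng.normal(loc=[2000, 10], scale=[300, 2], size=(50, 2))
X = np.vstack([low_spenders, high_spenders])

# KMeans groups rows by mathematical similarity -- no labels are involved.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # cluster assignments per row
```

Notice that the algorithm never sees a 'correct' answer; it simply discovers that the rows form two groups.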
Supervised learning is probably the most common kind of machine learning. Its algorithms require a known, factual ground truth: in order for supervised algorithms to work, they need to be able to see how wrong they are. One might be trying to predict the price of a house in Ames, IA, or the species of an iris based on various measurements.
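As a minimal supervised sketch, the iris dataset ships with scikit-learn, and its species labels are exactly the kind of known ground truth described above. The choice of a decision tree here is arbitrary; it is just one of many supervised learners.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# The iris dataset: four measurements per flower, plus a ground-truth species label.
X, y = load_iris(return_X_y=True)

# A supervised learner fits to the labels, then we can measure how wrong it is.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Scoring on the training data is optimistic -- hold-out evaluation comes later.
print(f"Training accuracy: {clf.score(X, y):.2f}")
```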
## The Machine Learning Workflow
Machine learning models (hereafter just 'models') tend to follow a common, repeatable process in their development. We begin with some dataset. We process and clean it, where 'clean' just means that no data point is going to break our code. We explore the data to get a sense of what's in it, and think about the type of algorithm we want to use. The type of algorithm depends on what kind of data we have and what question we're trying to answer.
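As a sketch of the cleaning step, assuming a hypothetical pandas DataFrame whose column names are invented for illustration, this is the minimal hygiene that keeps a bad point from breaking downstream code:

```python
import pandas as pd

# A hypothetical raw dataset; the column names here are invented for illustration.
raw = pd.DataFrame({
    "handle": ["alice", "bob", None, "carol"],
    "num_contributions": ["3", "1", "7", "not_a_number"],
})

# 'Clean' in the minimal sense: no data point should break our code downstream.
clean = raw.dropna(subset=["handle"]).copy()
clean["num_contributions"] = (
    pd.to_numeric(clean["num_contributions"], errors="coerce")  # bad values -> NaN
    .fillna(0)
    .astype(int)
)
print(clean)
```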
In the case of the ASOP, our question is deceptively simple: is this user participating in a Sybil attack on the Gitcoin funding round?
Between the Gitcoin frontend and GitHub.com, the ASOP has a great deal of data at its disposal. Not all of it will be useful, and some will only be useful once it's transformed into something new. You can learn more about this process of [feature engineering here](/Lc0wi0uiQbq4dmMiHMPQ4g).
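Purely as an invented illustration (these fields are not the ASOP's actual schema), feature engineering might look like deriving an account-age feature from raw timestamps:

```python
import pandas as pd

# Hypothetical raw fields -- invented for illustration, not the ASOP's real schema.
users = pd.DataFrame({
    "github_created_at": pd.to_datetime(["2015-03-01", "2021-11-20"]),
    "first_donation_at": pd.to_datetime(["2021-12-01", "2021-12-01"]),
})

# Engineer a new feature: how old was the account when it first donated?
# Brand-new accounts that donate immediately can be a useful signal.
users["account_age_days"] = (
    users["first_donation_at"] - users["github_created_at"]
).dt.days
print(users)
```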
We use these features to train an algorithm. At present, the ASOP uses a [random forest regressor](https://en.wikipedia.org/wiki/Random_forest) that estimates the probability that a particular user is participating in a Sybil attack. But the number of algorithms currently available as part of a standard Python data science stack [is huge](https://scikit-learn.org/stable/supervised_learning.html).
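A minimal sketch of that idea, with synthetic stand-in data replacing the real features: a random forest regressor fit on 0/1 Sybil labels averages its trees' outputs, yielding a score between 0 and 1 that can be read as a probability.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data: 200 users, 5 engineered features, 0/1 Sybil labels.
# Shapes and values are invented for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Fit the forest on 0/1 labels; averaging the trees' outputs yields a
# score in [0, 1] that can be read as a Sybil probability.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(np.round(forest.predict(X[:5]), 2))
```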
Once the algorithm has trained on the data, we have a model. It's now necessary to validate the model, usually by testing it on a 'hold-out' set: a portion of the data we deliberately exclude from the training process so that our model can be evaluated on data it hasn't seen before.
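Here is a sketch of hold-out validation with scikit-learn's train_test_split, continuing with the same synthetic stand-in data; the metric (ROC AUC) is an assumption for illustration, not necessarily the ASOP's choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Same synthetic stand-in data as above -- invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(float)

# Deliberately exclude 25% of the data from training (the 'hold-out' set).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Evaluate only on data the model has never seen. ROC AUC suits a
# probability-like score (an assumed metric, not the ASOP's choice).
auc = roc_auc_score(y_test, forest.predict(X_test))
print(f"Hold-out ROC AUC: {auc:.2f}")
```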
If the model performs well on the unseen data, that is, well compared to some baseline or to the model currently in use, it can be considered as a replacement for the current algorithm.
## The Work Never Ends
It might seem that once the model is built and validated, the work is over. Unfortunately, that's not true. It still remains to document the model, the data, and the results. It's not enough to have a working model: we need to have some idea of why it works. Take some time to go through the code and the model. Examine the features that get used (one concrete starting point is sketched below). Think about why the selected algorithm is the right one for the problem. Remember that this is still a human-in-the-loop process, and it requires good human thinking to complete effectively.
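As one concrete way to examine the features that get used, a fitted random forest exposes impurity-based importances; the feature names below are invented placeholders, not the ASOP's real features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data again; the feature names are invented placeholders.
rng = np.random.default_rng(0)
feature_names = ["account_age_days", "num_repos", "num_followers",
                 "donation_count", "avg_donation"]
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(float)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: a rough first look at which features the
# trees actually split on. Useful for documentation, not a full explanation.
for name, imp in sorted(zip(feature_names, forest.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name:>18}: {imp:.3f}")
```

Write down what you find: importances like these, plus a plain-language account of why the top features make sense, are exactly the documentation the next contributor will need.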