# Feature Engineering for Gitcoin ASOP

Feature engineering is, in some ways, the heart and soul of doing data science and machine learning. It's where we take observations about the world and try to turn them into machine-readable datapoints. Or, where we take machine-readable data and create brand new data.

Feature engineering is also one of the places where people who don't necessarily know a lot of machine learning can make a serious contribution. All it takes is some keen observational skills.

As has been noted, it's very difficult to discern true Sybil attacks from genuine community interest in a project. But it is not impossible. Ultimately, it is about identifying connections among contributors to see which ones may be attempting to collude.

Let's take just one example of how a contributor could use available data to engineer features for the machine learning algorithm.

## An Observation Made, A Feature Created

One FDD contributor, call them Alice, remembers that Sybil attacks are malign collusion. That is, one user pretends to be many users in order to affect the funding algorithm.

They notice in the data that a lot of IP addresses for contributors to one particular project are identical. The usernames are all different, but the actual IP address - the thing that differentiates one computer from another on the internet - is the same.

Identifying identical IP addresses logged to different users seems like a very clear indication that something is fishy. Unless they're using a shared computer or network, a group of Gitcoin contributors should all have different IP addresses.

So Alice has an excellent idea for a feature: finding repeated IP addresses that aren't linked to the same username.

Now that they have the idea, Alice has a couple of options. They can code it up in Python themselves: the ASOP uses Python and Pandas for data manipulation.
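
As a rough illustration of what coding it up could look like, here is a minimal pandas sketch of Alice's idea. It is not the ASOP's actual implementation: the function name `shared_ip_user_counts`, the column names `username` and `ip_address`, and the sample rows are all assumptions made for this example, and the real dataset may be shaped differently.

```python
import pandas as pd


def shared_ip_user_counts(contributions: pd.DataFrame) -> pd.Series:
    """For each username, count the other usernames seen on any of its IPs.

    Assumes hypothetical columns `username` and `ip_address`.
    """
    # Keep one row per (user, IP) pair so repeat contributions from the
    # same user and address are not counted more than once.
    pairs = contributions[["username", "ip_address"]].drop_duplicates()

    # Self-join on IP address: pairs every user with every user on that IP.
    joined = pairs.merge(pairs, on="ip_address", suffixes=("", "_other"))

    # Drop the rows where a user is paired with themselves.
    joined = joined[joined["username"] != joined["username_other"]]

    # Number of distinct other users sharing at least one IP with each user.
    counts = joined.groupby("username")["username_other"].nunique()

    # Users who share no IP with anyone get an explicit 0.
    all_users = pairs["username"].unique()
    return counts.reindex(all_users, fill_value=0).rename("n_users_sharing_ip")


# Made-up sample data: three usernames on one address, one user alone.
contributions = pd.DataFrame(
    {
        "username": ["user_a", "user_b", "user_c", "user_d", "user_d"],
        "ip_address": [
            "203.0.113.7",
            "203.0.113.7",
            "203.0.113.7",
            "198.51.100.4",
            "198.51.100.4",
        ],
    }
)

feature = shared_ip_user_counts(contributions)
print(feature)
```

In this sample, `user_a`, `user_b` and `user_c` each get a value of 2, while `user_d` gets 0. A high value on its own is only a signal to look at more closely, since people on the same office or home network also share an address.
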
Or they can connect with the FDD in the Gitcoin community to find a machine learning engineer or data scientist to help them turn their idea into a usable feature.

The idea demonstrated here is already in use, but fortunately, there is [a backlog](https://github.com/gitcoindao/sybil-detection-backlog) of feature ideas that haven't been developed yet. If you're interested in contributing new features for the machine learning process, this would be a great place to start.

## Workflows for contribution on Feature Engineering

### Observing & Reviewing Sybil Patterns

- **Sybil Pattern Observation (anyone)**
    1. Take notes of any odd behaviour related to sybils, or any suggestion of what definitely is not sybil.
    2. Go to the *["Observation of Sybil Patterns"](https://github.com/gitcoindao/sybil-detection-backlog/projects/1)* board and see if anyone has already taken note of the observed patterns.
    3. If not, add a note to the *"User Story"* column describing what was observed. **Done!**
- **Pattern Observation Filtering (anyone technical)**
    1. Pick a note in the *"User Story"* column on the *["Observation of Sybil Patterns"](https://github.com/gitcoindao/sybil-detection-backlog/projects/1)* board.
    2. Assert that it is not an obvious duplicate of another note in the *"Pending Technical Review"*, *"Reviewed"* or *"Won't do"* columns.
    3. Assert that the description is informative enough to pursue more directed research on the topic.
    4. If both of the above hold, move it to "Pending Technical Review". Otherwise, move it to "Won't do". **Done!**
- **Observed Pattern Review (data analyst / scientist)**
    1. Pick a note in the *"Pending Technical Review"* column on the *["Observation of Sybil Patterns"](https://github.com/gitcoindao/sybil-detection-backlog/projects/1)* board.
    2. Break down the observation into a set of candidate per-user metrics along with any comments.
        - This may require slow work so that the measurements are contextualized given the data sources that are available.
    3. Create one story for each candidate per-user metric on the [*"Measurements for Sybil Behaviour"*](https://github.com/gitcoindao/sybil-detection-backlog/projects/3) backlog.
    4. Move the selected note to the *"Reviewed"* column. **Done!**

### From Measurements to Features

- **Measurement Data Analysis (data analyst / scientist)**
    1. Pick a measurement on [*"Measurements for Sybil Behaviour"*](https://github.com/gitcoindao/sybil-detection-backlog/projects/3) and move it to "Pending".
    2. Perform any required exploratory data analysis and research / validation work, then move it to "Analysis Done" with the conclusions and the rationale for approving or not approving its inclusion in the feature set. **Done!**
- **Feature Peer Review (senior data analyst / scientist)**
    1. Pick a measurement on [*"Measurements for Sybil Behaviour"*](https://github.com/gitcoindao/sybil-detection-backlog/projects/3) in the "Analysis Done" column.
    2. Decide whether to approve / disapprove / ignore the inclusion of the measurement in the feature set. Move it to the respective column.
    3. If the measurement is approved, create a story on the [*Feature Engineering backlog*](https://github.com/gitcoindao/gitcoin_asop/projects/1). **Done!**
- **Feature Implementation (software developer)**
    1. Select a story on the [*Feature Engineering backlog*](https://github.com/gitcoindao/gitcoin_asop/projects/1) and mark it as pending.
    2. Implement the measurement on the feature set according to the existing set of best practices