# Mini Hackathon for Flagging Gitcoin Attackers
## Input
- Training dataset
  - Initially small, but to be bootstrapped through iterations
  - Has the same features as the contributions dataset, plus a 'suspected_attacker' boolean column
- Contributions dataset (target dataset)
  - The same structure as in the repo
## Output
A map that associates each user with a score between 0 and 1.
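
As a rough illustration of the output shape, the sketch below collapses contribution-level scores into one score per user. The function name `user_scores`, the use of the mean as the aggregation, and the 0.7 flagging threshold are all assumptions made for the example; the notes above do not specify them.

```python
from collections import defaultdict
from statistics import mean


def user_scores(contrib_scores):
    """Aggregate (user, score) pairs into {user: score in [0, 1]}.

    Taking the mean is an assumption; max or a weighted mean would also fit.
    """
    per_user = defaultdict(list)
    for user, score in contrib_scores:
        per_user[user].append(score)
    return {user: mean(scores) for user, scores in per_user.items()}


# Hypothetical contribution-level scores for two users.
example = user_scores([("u1", 0.5), ("u1", 1.0), ("u2", 0.25)])

# Threshold-based flagging; 0.7 is an arbitrary illustrative cutoff.
flagged = {user for user, score in example.items() if score >= 0.7}
```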
## Pipeline
(training_user_data) -> (training_contributions_data)
## Features (contrib-level)
Dimensions (non-orthogonal): Contrib, User, Wallet, Grant
- Label (user)
- Success (contrib)
- Median USDT amount (user)
- Median USDT amount (wallet)
- Number of contributions (user)
- Number of contributions (wallet)
- Number of IP addresses associated with the user (user)
- Number of IP addresses associated with the wallet (wallet)
- Number of users sharing the same wallet (user vs wallet)
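
A minimal pandas sketch of how the per-user and per-wallet aggregates listed above could be attached to each contribution row. The column names (`user_id`, `wallet`, `ip`, `amount_usdt`) and the helper `add_group_features` are hypothetical; the real contributions dataset may use different names.

```python
import pandas as pd

# Toy data with hypothetical column names.
contribs = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "wallet": ["w1", "w1", "w1", "w2", "w2", "w3"],
    "ip": ["a", "b", "a", "c", "c", "c"],
    "amount_usdt": [1.0, 3.0, 2.0, 10.0, 20.0, 30.0],
})


def add_group_features(df, key, name):
    """Attach per-group median amount, contribution count and distinct IP count."""
    grouped = df.groupby(key)
    out = df.copy()
    out[f"median_usdt_{name}"] = grouped["amount_usdt"].transform("median")
    out[f"n_contribs_{name}"] = grouped["amount_usdt"].transform("count")
    out[f"n_ips_{name}"] = grouped["ip"].transform("nunique")
    return out


features = add_group_features(contribs, "user_id", "user")
features = add_group_features(features, "wallet", "wallet")

# Number of distinct users that share each contribution's wallet.
features["users_per_wallet"] = (
    features.groupby("wallet")["user_id"].transform("nunique")
)
```

Using `transform` keeps the result at contribution level, so every row carries the aggregates of its user and wallet and can be fed directly to a model.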
## Features (user-level)
## Methodology
The main strategy is to iterate quickly by combining context-specific feature engineering, exploratory data analysis (EDA) and ensemble Machine Learning methods to test threshold-based flagging algorithms.
The approach is as follows:
1. We generate a training dataset by taking a subset of the full data and labelling the grants / contributions manually using context knowledge.
2. New features are added / transformed
3. An ensemble ML model is trained.
   - If the ML model has good enough performance metrics, we proceed; otherwise, we go back to step 2 or 1.
4. The ML model makes predictions on a larger dataset that is still a subset.
5. EDA is performed on the predictions, and the new knowledge is incorporated.
6. We go back to step 1 with a larger amount of data.
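
Steps 3 and 4 could be sketched roughly as below with scikit-learn. The data is synthetic, and the random forest, the ROC AUC metric, the `MIN_AUC` gate and the 0.5 flagging threshold are all illustrative assumptions rather than choices stated in these notes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: a small labelled subset and a larger unlabelled one.
X_labelled = rng.normal(size=(200, 2))
y_labelled = (X_labelled[:, 0] > 0.5).astype(int)  # fake 'suspected_attacker'
X_larger = rng.normal(size=(1000, 2))

MIN_AUC = 0.8  # assumed "good enough" gate

# Step 3: train an ensemble model and check its cross-validated performance.
model = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(
    model, X_labelled, y_labelled, cv=5, scoring="roc_auc"
).mean()

if auc >= MIN_AUC:
    # Step 4: predict on the larger subset; scores feed the EDA in step 5.
    model.fit(X_labelled, y_labelled)
    scores = model.predict_proba(X_larger)[:, 1]
    flagged = scores >= 0.5
else:
    # Back to step 1 or 2: relabel or rework the features.
    scores = None
    flagged = None
```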
- Enriching the contributions through grant properties & graph properties
- Work through subsets of grants, increasing the number of grants at each iteration
- Decision trees and LASSO logistic regressions
- Random forest (RF) for measuring feature importance
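
A hedged sketch of the model choices named above, using scikit-learn on synthetic data. The feature names, the regularisation strength `C=0.1` and the tree depth are illustrative assumptions, and "LASSO logistic regression" is interpreted here as L1-penalised logistic regression.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical features; the label depends on the first two, not on "noise".
feature_names = ["n_contribs_user", "n_ips_user", "noise"]
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# L1-penalised logistic regression: the penalty shrinks weak coefficients.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
lasso.fit(X, y)
coefs = dict(zip(feature_names, lasso[-1].coef_[0]))

# A shallow decision tree yields human-readable threshold rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Random forest importances rank the features by their contribution.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = dict(zip(feature_names, forest.feature_importances_))
```

Comparing `coefs` with `importances` gives two independent views of which features matter, which can guide the next round of feature engineering.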