# Mini Hackathon for Flagging Gitcoin Attackers

## Input

- Training dataset
  - Initially small, but to be bootstrapped through iterations
  - Has the same features as the contributions dataset, plus a 'suspected_attacker' boolean column
- Contributions dataset (target dataset)
  - The same structure as in the repo

## Output

A map that associates each user with a number between 0 and 1.

## Pipeline

(training_user_data) -> (training_contributions_data)

## Features (contrib-level)

Dimensions (non-orthogonal): Contrib, User, Wallet, Grant

- Label (user)
- Success (contrib)
- Median USDT amount (user)
- Median USDT amount (wallet)
- Number of contributions (user)
- Number of contributions (wallet)
- Associated IP address count with user (user)
- Associated IP address count with wallet (wallet)
- Count of users with the same wallet (user vs wallet)

## Features (user-level)

## Methodology

The main strategy is to iterate quickly by combining context-specific feature engineering, exploratory data analysis (EDA) and ensemble Machine Learning methods for testing threshold-based flagging algorithms.

The approach is as follows:

1. We generate a training dataset by taking a subset of the full data and labelling the grants / contributions manually using context knowledge.
2. New features are added / transformed.
3. An ensemble ML model is trained.
   - If the ML model has good enough performance metrics, we proceed; otherwise, we go back to step 2 or 1.
4. The ML model makes predictions on a larger dataset that is still a subset of the full data.
5. EDA is performed on the predictions, and new knowledge is incorporated.
6. We go back to step 1 with a larger amount of data.

- Enriching the contributions through grant properties & graph properties
- Work through subsets of grants, with an increasingly larger count of grants each time
- Decision trees and LASSO logistic regressions
- RF (random forest) for measuring feature importance
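
As a rough illustration of the contribution-level features listed earlier, the sketch below attaches per-user and per-wallet aggregates (median USDT amount, contribution count, distinct IP count, and users sharing a wallet) to each contribution. It uses only the Python standard library. The record layout and the field names (`user`, `wallet`, `ip`, `amount_usdt`), the helper names, and the sample rows are all assumptions made for the example; they are not taken from the actual contributions dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical contribution records; field names and values are assumptions.
contributions = [
    {"user": "alice", "wallet": "0xA", "ip": "1.1.1.1", "amount_usdt": 10.0},
    {"user": "alice", "wallet": "0xA", "ip": "1.1.1.2", "amount_usdt": 30.0},
    {"user": "bob", "wallet": "0xB", "ip": "2.2.2.2", "amount_usdt": 1.0},
    {"user": "carol", "wallet": "0xB", "ip": "2.2.2.2", "amount_usdt": 1.0},
    {"user": "carol", "wallet": "0xB", "ip": "2.2.2.3", "amount_usdt": 3.0},
]


def group_stats(rows, key):
    """Median amount, contribution count and distinct IP count per value of `key`."""
    amounts = defaultdict(list)
    ips = defaultdict(set)
    for row in rows:
        amounts[row[key]].append(row["amount_usdt"])
        ips[row[key]].add(row["ip"])
    return {
        k: {"median_usdt": median(v), "n_contribs": len(v), "n_ips": len(ips[k])}
        for k, v in amounts.items()
    }


def users_per_wallet(rows):
    """Number of distinct users that contributed from each wallet."""
    users = defaultdict(set)
    for row in rows:
        users[row["wallet"]].add(row["user"])
    return {wallet: len(u) for wallet, u in users.items()}


def add_features(rows):
    """Return a copy of each contribution with user- and wallet-level aggregates."""
    by_user = group_stats(rows, "user")
    by_wallet = group_stats(rows, "wallet")
    shared = users_per_wallet(rows)
    enriched = []
    for row in rows:
        u = by_user[row["user"]]
        w = by_wallet[row["wallet"]]
        enriched.append({
            **row,
            "user_median_usdt": u["median_usdt"],
            "wallet_median_usdt": w["median_usdt"],
            "user_n_contribs": u["n_contribs"],
            "wallet_n_contribs": w["n_contribs"],
            "user_n_ips": u["n_ips"],
            "wallet_n_ips": w["n_ips"],
            "users_on_wallet": shared[row["wallet"]],
        })
    return enriched


features = add_features(contributions)
```

Each row of `features` could then be joined with the 'suspected_attacker' label and passed to whichever model is used in step 3. Computing the aggregates separately per user and per wallet keeps the two dimensions distinct, which matters when several users share one wallet.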