---
tags: Book
---
# Anti-Sybil Microservices Description
## Functional Description

## Scoped Execution Flows
- Near Real Time Flagging (Data Retrieval + Base ML Process)
    - Expected interval: every 6h
    - Expected required roles: none (automated)
- Semi-Supervised Process (Human Evaluation + Base ML Process)
    - Expected interval: every week
    - Expected required roles: Operator + Human Evaluators + Coordinator
- SME Intervention (Prepare Features + Generate Heuristic Labels + Fit Predict)
    - Expected interval: on-demand
    - Expected required roles: SME + Operator
- Sybil Reporting (Data Retrieval + Base ML Process + Data Analytics)
    - Expected interval: end of round + on-demand
    - Expected required roles: SME + Operator + Coordinator
## Datasets (Cylinders) Descriptions
Generally, these datasets should be CSV or JSON files stored in a Data Lake, such as a Google Cloud Storage Bucket.
Consuming them is the action of retrieving the latest version (by timestamp) of the existing files.
Updating them is the action of writing a new version, using the execution timestamp as a prefix.
Filenames and folder structures should be standardized. As a suggestion, we would have the following folders on the Bucket root:
- 1_retrieved_contribs
- 2_retrieved_github_profiles
- 3_features
- 4_labels
- 5_predict_results
- 6a_evaluator_flags
- 6b_sanction_results
- 7_reports
Files could use the following pattern:
`{TIMESTAMP}_{filename}.csv`, exemplified by `2021-08-01T09:04:10_github_profiles.csv`
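
As a minimal sketch of this consume/update convention, assuming the `google-cloud-storage` client (the bucket name and the `consume_latest`/`update_dataset` helper names are illustrative, not part of the spec):

```python
from datetime import datetime, timezone

from google.cloud import storage  # pip install google-cloud-storage

BUCKET_NAME = "anti-sybil-data-lake"  # hypothetical bucket name


def consume_latest(folder: str, filename: str) -> bytes:
    """Retrieve the latest version (by timestamp prefix) of a dataset file."""
    bucket = storage.Client().bucket(BUCKET_NAME)
    blobs = [b for b in bucket.list_blobs(prefix=f"{folder}/")
             if b.name.endswith(f"_{filename}")]
    # ISO-8601 timestamps sort lexicographically, so max() finds the newest version
    latest = max(blobs, key=lambda b: b.name)
    return latest.download_as_bytes()


def update_dataset(folder: str, filename: str, payload: bytes) -> None:
    """Write a new version of a dataset file, prefixed with the execution timestamp."""
    bucket = storage.Client().bucket(BUCKET_NAME)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")
    bucket.blob(f"{folder}/{stamp}_{filename}").upload_from_string(payload)
```

For instance, `consume_latest("2_retrieved_github_profiles", "github_profiles.csv")` would return the `2021-08-01T09:04:10_github_profiles.csv` file shown above, if that were the newest version.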
## Microservices (Blue Box) Descriptions
### Retrieve Github Profiles
- Description: Scrape GitHub profile stats from the GitHub REST API
- Input: a set of GitHub profile names
- Output: a CSV / DataFrame containing the profile stats as columns, and profiles as rows
- How: by using the GitHub API and iterating over a list of existing users based on the Contribs dataset.
Desirable features:
- (Only scrape new users) Instead of scraping all GitHub profiles on each run, the microservice could check which users already exist in the old GitHub Info tabular dataset, compare them against the ones present in the Contribs dataset, and scrape only the difference. The GitHub Info dataset would then be appended with the new rows.
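
A minimal sketch of the scraping step, using the public `users` endpoint of the GitHub REST API (the selected stat columns and the `fetch_profiles` helper name are illustrative):

```python
import requests  # pip install requests

def fetch_profiles(usernames: list[str], token: str | None = None) -> list[dict]:
    """Fetch public profile stats for each username from the GitHub REST API."""
    headers = {"Authorization": f"token {token}"} if token else {}
    rows = []
    for name in usernames:
        resp = requests.get(f"https://api.github.com/users/{name}", headers=headers)
        if resp.status_code != 200:  # deleted account, rate limiting, etc.
            continue
        data = resp.json()
        rows.append({key: data[key] for key in
                     ("login", "created_at", "updated_at",
                      "public_repos", "followers", "following")})
    return rows
```

With this in place, the "only scrape new users" feature reduces to calling the helper on the set difference between the users in the Contribs dataset and the logins already present in the GitHub Info dataset.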
### Retrieve Contributions
- Description: Download Grant Contributions rows from Metabase
- Input: A range of dates
- Output: A clean CSV / DataFrame containing properties for each contribution during the range of dates
- How: By using the Metabase REST API together with pre-existing questions. This pulls a JSON payload that must be parsed and prepared into a dataset suitable for analysis (e.g., parsing the `normalized_data` column)
Desirable features:
- Using an API or Data View provided by Gitcoin Ops rather than scraping through Metabase
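
A minimal sketch of the Metabase pull, assuming the pre-existing question is exposed as a saved card; the host, card id, and the `created_on` date column are assumptions:

```python
import json

import pandas as pd
import requests

METABASE_URL = "https://metabase.example.com"  # hypothetical host
CARD_ID = 123                                  # hypothetical pre-existing question

def retrieve_contributions(username: str, password: str,
                           start: str, end: str) -> pd.DataFrame:
    """Run a saved Metabase question and prepare the contributions dataset."""
    # Authenticate and obtain a session token
    session = requests.post(f"{METABASE_URL}/api/session",
                            json={"username": username, "password": password}).json()["id"]
    # Execute the saved question ("card") and pull the rows as JSON
    rows = requests.post(f"{METABASE_URL}/api/card/{CARD_ID}/query/json",
                         headers={"X-Metabase-Session": session}).json()
    df = pd.DataFrame(rows)
    # Expand the JSON-encoded `normalized_data` column into proper columns
    normalized = df["normalized_data"].apply(json.loads).apply(pd.Series)
    df = df.drop(columns=["normalized_data"]).join(normalized)
    # Date filtering is done client-side in this sketch
    df["created_on"] = pd.to_datetime(df["created_on"])
    return df[df["created_on"].between(start, end)]
```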
### Prepare Features
- Description: Perform any required feature engineering operations and join the contributions data onto the user profile data through aggregations
- Input: CSV for user Github info and another CSV for the contributions
- Output: a CSV with features per user
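
For instance, with pandas the join-through-aggregation could look like the following (column names such as `github_username`, `amount`, and `grant_id` are assumptions about the Contribs schema):

```python
import pandas as pd

def prepare_features(profiles: pd.DataFrame, contribs: pd.DataFrame) -> pd.DataFrame:
    """Aggregate contributions per user and join them onto the GitHub profile stats."""
    agg = contribs.groupby("github_username").agg(
        n_contributions=("amount", "size"),
        total_amount=("amount", "sum"),
        n_distinct_grants=("grant_id", "nunique"),
    )
    features = profiles.set_index("login").join(agg, how="left")
    features[agg.columns] = features[agg.columns].fillna(0)
    # Example engineered feature: account age in days at execution time
    features["account_age_days"] = (
        pd.Timestamp.now(tz="UTC") - pd.to_datetime(features["created_at"], utc=True)
    ).dt.days
    return features.reset_index()
```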
### Generate Heuristic Labels
- Description: Assign labels based on "obvious" criteria as defined by SMEs, such as associations with duplicate IPs. The heuristic logic looks at the creation and update dates of GitHub profiles, clustering of IP addresses, and the number of public repos on the GitHub profile, among others. The heuristic code can be found in the `prepare_features.py` script.
- Input: User Feature Dataset
- Output: Labels Dataset is updated
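
A sketch of what such rules might look like in pandas; the thresholds and feature names below (e.g. `n_users_on_same_ip`) are purely illustrative and not the actual SME criteria:

```python
import pandas as pd

def heuristic_labels(features: pd.DataFrame) -> pd.DataFrame:
    """Label only the users the SME rules decide on; everyone else stays unlabeled."""
    labels = pd.Series(pd.NA, index=features.index, name="is_sybil")
    # Illustrative rules only -- not the actual SME thresholds:
    fresh_account = features["account_age_days"] < 7
    empty_profile = features["public_repos"] == 0
    shared_ip = features["n_users_on_same_ip"] > 5  # assumes an IP-clustering feature
    labels[fresh_account & empty_profile & shared_ip] = 1          # obvious sybil
    labels[~fresh_account & (features["public_repos"] >= 10)] = 0  # obvious non-sybil
    return features[["login"]].assign(is_sybil=labels)
```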
### Fit Predict
- Description: Perform the usual steps associated with an ML flagging process (e.g., splitting data, cross-validation, training metrics) and generate a set of predictions given a feature dataset and a labels dataset.
- Input: User Feature Dataset and Labels Dataset
- Output: A serialized object containing information such as a tabular dataset of prediction results, performance metrics, and any metadata
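
A minimal sketch of this step with scikit-learn; the model choice, the `sybil_score` column, and pickling as the serialization format are assumptions, not prescribed by this document:

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def fit_predict(features: pd.DataFrame, labels: pd.DataFrame) -> bytes:
    """Train on the labeled subset, score every user, and serialize the results."""
    data = features.merge(labels, on="login", how="left")
    labeled = data[data["is_sybil"].notna()]
    X = labeled.drop(columns=["login", "is_sybil"])
    y = labeled["is_sybil"].astype(int)

    model = RandomForestClassifier(n_estimators=200, random_state=0)
    cv_f1 = cross_val_score(model, X, y, cv=5, scoring="f1")  # cross-validation metrics
    model.fit(X, y)

    # Score every user, labeled or not
    predictions = data[["login"]].assign(
        sybil_score=model.predict_proba(data.drop(columns=["login", "is_sybil"]))[:, 1]
    )
    return pickle.dumps({
        "predictions": predictions,
        "metrics": {"cv_f1_mean": float(cv_f1.mean())},
        "meta": {"model": "RandomForestClassifier", "n_labeled": int(len(labeled))},
    })
```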
### Prepare Sanction
- Description: Generates a list of users to be flagged according to a threshold parameter
- Input: Prediction Results and a numerical Threshold value
- Output: A list of users
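
In its simplest form this is a filter over the prediction results (assuming the `sybil_score` column from the Fit Predict sketch above):

```python
import pandas as pd

def prepare_sanction(predictions: pd.DataFrame, threshold: float) -> list[str]:
    """Flag every user whose model score meets or exceeds the threshold."""
    return predictions.loc[predictions["sybil_score"] >= threshold, "login"].tolist()
```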
### Prepare Evaluation
- Description: Generates a randomly-sampled list of users to be manually labelled. Each human labeller has their own distinct list.
- Input: Prediction Results
- Output: A list of lists of users, based on the number of evaluators and the number of samples
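
A minimal sketch, assuming each evaluator receives the same number of samples and the lists are disjoint:

```python
import pandas as pd

def prepare_evaluation(predictions: pd.DataFrame, n_evaluators: int,
                       n_samples: int, seed: int = 0) -> list[list[str]]:
    """Randomly sample users and split them into one distinct list per evaluator."""
    sampled = predictions.sample(n=n_evaluators * n_samples, random_state=seed)
    users = sampled["login"].tolist()
    return [users[i * n_samples:(i + 1) * n_samples] for i in range(n_evaluators)]
```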
### Prepare Human Evaluation Sheet
- Description: Prepares a new Google Sheets Spreadsheet in which each sheet holds one of the randomly-sampled lists, with one user per row.
- Input: The list of lists of users to be evaluated
- Output: A Google Sheets Spreadsheet where each sheet contains a list of users to be evaluated for a given evaluator.
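
A sketch using the `gspread` client; the service-account credential setup and the `is_sybil`/`sybil_certainty` answer columns (taken from the Retrieve Human Labels step below) are assumptions:

```python
import gspread  # pip install gspread

def prepare_evaluation_sheet(user_lists: list[list[str]], title: str) -> str:
    """Create a spreadsheet with one worksheet per evaluator, one user per row."""
    gc = gspread.service_account()  # assumes configured service-account credentials
    spreadsheet = gc.create(title)
    for i, users in enumerate(user_lists, start=1):
        ws = spreadsheet.add_worksheet(title=f"evaluator_{i}",
                                       rows=len(users) + 1, cols=3)
        ws.append_rows([["login", "is_sybil", "sybil_certainty"]]
                       + [[user, "", ""] for user in users])
    spreadsheet.del_worksheet(spreadsheet.sheet1)  # drop the default empty sheet
    return spreadsheet.url
```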
### Retrieve Human Labels
- Description: Consumes the Spreadsheet and parses each sheet so that a map from evaluated users to labels is generated and appended to the existing Labels dataset.
- Input: A Google Sheets Spreadsheet where each sheet contains the list of users evaluated by a given evaluator.
- Output: Labels dataset is updated
- Note: in the event that multiple human reviewers look at the same user, the algorithm will need to see certainty (1) in the `is_sybil` and `sybil_certainty` columns for all reviewers to mark a user as sybil.
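
A sketch of the parsing and of the unanimity rule described in the note (again using `gspread`; the `login` column name is an assumption carried over from the sheet-creation sketch):

```python
import gspread
import pandas as pd

def retrieve_human_labels(spreadsheet_url: str) -> pd.DataFrame:
    """Collect every evaluator sheet and apply the all-reviewers-certain rule."""
    gc = gspread.service_account()
    spreadsheet = gc.open_by_url(spreadsheet_url)
    reviews = pd.concat(
        [pd.DataFrame(ws.get_all_records()) for ws in spreadsheet.worksheets()],
        ignore_index=True,
    )
    # A user is marked sybil only if every reviewer set both columns to 1
    consensus = reviews.groupby("login").apply(
        lambda g: int(((g["is_sybil"] == 1) & (g["sybil_certainty"] == 1)).all())
    )
    return consensus.rename("is_sybil").reset_index()  # appended to the Labels dataset
```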
### Prepare Report
- Description: Generates a Public Report and a Private Report
- Input: Prediction Results object
- Output:
    - Public Report: a "standard" notebook containing key summary metrics about the automated flagging
    - Private Report: a CSV containing the data from the "Features", "Labels", and "List of Flagged Users" datasets
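
One possible realization is sketched below; using papermill to execute the "standard" notebook is a suggestion rather than part of the spec, and the template path, parameter names, and `flagged` column are illustrative:

```python
import pandas as pd
import papermill as pm  # pip install papermill

def prepare_reports(features: pd.DataFrame, labels: pd.DataFrame,
                    flagged_users: list[str], run_id: str) -> None:
    # Public report: execute a "standard" notebook template with run parameters
    pm.execute_notebook(
        "templates/public_report.ipynb",            # hypothetical template
        f"7_reports/{run_id}_public_report.ipynb",
        parameters={"run_id": run_id},
    )
    # Private report: join Features, Labels, and the flagged-user list into one CSV
    private = features.merge(labels, on="login", how="left")
    private["flagged"] = private["login"].isin(flagged_users)
    private.to_csv(f"7_reports/{run_id}_private_report.csv", index=False)
```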