# Notes on Anti-Sybil Microservices pre Round 11

###### tags: `gitcoin` `Notes`

:::info
Updated by August 2021
:::

*Authors: Charlie Rice*

_Processed notes from a meeting on 27 August 2021_

## DELIVERABLES:

* Microservices (Emanuel)
    * First dry-run/walkthrough during R&D on 1 September
    * Second dry-run on Friday 3 September
    * Launch with Gitcoin Round 11 on 8 September
    * Notebook on how to use the microservices
* Analytics (Jesse)
    * One notebook presenting summary statistics from the 12-hour ML processes
    * One notebook for ad-hoc analysis at the end of the round to inform future recommendations

## Notes

The Gitcoin (GC) team has requested a report on anti-Sybil activity that they can publish on their website every 12 hours. Because the ML pipeline includes personally identifiable information (PII) such as GitHub usernames, publishing the full output would be unreasonable and unethical. However, it is possible to provide a report of summary statistics that can inform the community about anti-Sybil activity.

There will be two classes of reports: public and private. Both reports will take the form of 'raw' CSV or Jupyter notebook files containing the relevant data.

The public report will remove personal or sensitive information and present summary statistics intended for public consumption, not decision-making. It will include, for example, the percentage of users the model predicts to be Sybil, the number of identified or predicted attacks, and the distribution of users and prediction confidences (a sketch of such a summary notebook is included at the end of these notes).

The private report will include data from the `Features`, `Labels`, and `List of Flagged Users` data sources. This report will be used by Gitcoin in their decision-making around removing users from the round. It is expected that the Gitcoin team will do most, if not all, of the in-depth analytics and business intelligence work with this data.

One of the biggest challenges in this project is the multiplicity of timescales we are working with. The microservices themselves run very quickly. The data collection and base ML pipeline are supposed to run every 12 hours. Half a day is not really enough time to do data analysis well, given the expected volume of data, so the human-in-the-loop semi-supervised learning processes are expected to run roughly weekly. Additional generation of labels by human subject-matter experts can be done on an as-needed basis.

There is still an ongoing issue around the Metabase, which was scrapped by the GC team. It is understood that a replacement will be available by Monday 30 August 2021.

Emanuel will do a dry-run of the microservices using data from a previous round on Wednesday 1 September during the R&D call. This should allow bugs to be identified before a second run on Friday 3 September, prior to Round 11 beginning on Wednesday 8 September.

Although the ML microservices are expected to run every twelve hours, it will probably be useful to manually run the process within the first three to six hours of the round. This is when a lot of activity happens in Gitcoin rounds, and it will probably provide useful data for future labeling.
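
As a rough illustration of the public summary-statistics report described above, the sketch below shows how such a notebook might strip PII from the 12-hour pipeline output before computing headline numbers. The file path and column names (`github_username`, `sybil_prediction`, `confidence`, etc.) are assumptions for illustration only and do not reflect the actual pipeline schema.

```python
# Minimal sketch of the public summary-statistics report.
# Column names (`github_username`, `sybil_prediction`, `confidence`)
# are hypothetical placeholders, not the real pipeline schema.
import pandas as pd

# Load the private 12-hour pipeline output (path is illustrative).
df = pd.read_csv("ml_pipeline_output.csv")

# Drop PII columns first so they never reach the public notebook.
PII_COLUMNS = ["github_username", "handle", "address"]
public_df = df.drop(columns=[c for c in PII_COLUMNS if c in df.columns])

# Summary statistics intended for public consumption, not decision-making.
summary = {
    "total_users_evaluated": len(public_df),
    "pct_predicted_sybil": 100 * public_df["sybil_prediction"].mean(),
    "mean_confidence": public_df["confidence"].mean(),
}
print(summary)

# Distribution of prediction confidences (deciles) for the public report.
print(public_df["confidence"].quantile([i / 10 for i in range(11)]))
```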