
# Sybil Users CSV Spec

## Columns

- `handle` (id, non-nullable)
    - GitHub / Gitcoin user handle. Not exhaustive.
- `aggregate_score` (float, non-nullable)
    - Suggestion of which users to mark as sybil, based on a prioritization logic.
    - Either 0.0 or 1.0. See the description below.
- `prediction_score` (float, non-nullable)
    - Represents the ML confidence that a given user is sybil.
    - Real number between 0 and 1.
- `evaluation_score` (float, **nullable**)
    - Represents the normalized sybilness score of the human evaluations.
    - Real number between 0 and 1. Empty if no data is available.
- `heuristic_score` (float, **nullable**)
    - Represents how much a user is sybil according to SME heuristics.
    - Either 0.0 or 1.0. Empty if no data is available.
- `feature_1` (float, non-nullable)
- `feature_2` (float, non-nullable)
- ...
- `feature_n` (float, non-nullable)

## Metrics

### `evaluator_score`

| Score | Is Sybil | Confidence |
| -------- | -------- | -------- |
| 0.0 | F | high |
| 0.333 | F | low |
| 1.0 | T | high |

### `aggregate_score`

Assuming that a user should be classified as sybil based on specific thresholds, and that there is an order of importance (evaluation_score > heuristic_score > prediction_score), the following algorithm is proposed for computing an "aggregate score".

Pseudo-code:

- If evaluation_score is null
    - if heuristic_score is not null then aggregate_score = heuristic_score
    - else
        - if prediction_score >= ML_THRESHOLD then 1.0
        - else 0.0
- Else:
    - if evaluation_score >= EVAL_THRESHOLD then 1.0
    - else 0.0

Currently, ML_THRESHOLD = 0.5 (accuracy maximization) and EVAL_THRESHOLD = 0.8 (is_sybil=True & confidence >= so-so).
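
The pseudo-code above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names `aggregate_score` and `parse_nullable` are hypothetical, and it assumes that an empty CSV cell is how a null `evaluation_score` or `heuristic_score` is encoded. The threshold values are the ones stated in this spec.

```python
from typing import Optional

ML_THRESHOLD = 0.5    # value stated in this spec
EVAL_THRESHOLD = 0.8  # value stated in this spec


def parse_nullable(cell: str) -> Optional[float]:
    """Convert a CSV cell to a float, treating an empty cell as null (assumption)."""
    cell = cell.strip()
    return float(cell) if cell else None


def aggregate_score(
    prediction_score: float,
    evaluation_score: Optional[float] = None,
    heuristic_score: Optional[float] = None,
) -> float:
    """Apply the priority order evaluation > heuristic > prediction."""
    # Human evaluations take precedence whenever they exist.
    if evaluation_score is not None:
        return 1.0 if evaluation_score >= EVAL_THRESHOLD else 0.0
    # Next, fall back to the SME heuristic, which is already 0.0 or 1.0.
    if heuristic_score is not None:
        return heuristic_score
    # Finally, threshold the ML prediction.
    return 1.0 if prediction_score >= ML_THRESHOLD else 0.0
```

For example, `aggregate_score(0.9, evaluation_score=0.333)` returns 0.0, because a human evaluation below `EVAL_THRESHOLD` overrides a high ML confidence, while `aggregate_score(0.1, heuristic_score=1.0)` returns 1.0 because the heuristic outranks the prediction when no evaluation is available.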