# Study on GR14 Flagging Performance

###### tags: `GR14`

## Context

In the [Sybil Report on Gitcoin Grants Round 14](/UOlkSVK0QMSEsKEB3fUFjQ), the FE was estimated at 160%, significantly higher than the estimated incidence. This means that the SAD is over-flagging, and the main source seems to be the ML algorithm itself. What can be done about it?

## Recommendation

- Increase the `ml_threshold` to 99% if we want to minimize false positives.

## Conclusion

- The over-flagging was caused by the ML model being biased towards the heuristics, due to a weighted combination of:
    1. a drastic reduction in the number of human evaluations during GR14
    2. an increase in sybil incidence
    3. an increase in the relative count of users being flagged due to heuristics

## Notes

- The Sybil Incidence went up by 1.6x and the Heuristics Incidence by 2.1x. Both numbers are of the magnitude we would expect from inspecting the [Sybil Report on Gitcoin Grants Round 13](/d43PnkvxRHqnqduxAUgl4A), the [Sybil Report on Gitcoin Grants Round 12](/HXQ3yWfBQN6QGQX4cmOAxQ), etc.
- The main source of over-flagging seems to be the ML algorithm, which usually tends to be less aggressive than the human evaluators.
- Hypothesis: the reduced number of Human Evaluations, coupled with more Heuristics flags, biased the algorithm towards being more aggressive.
    - Evidence 1: the relative number of human-evaluated users went down from 37% to 6% (16% of the previous share), while the relative number of heuristic flags is now 210% of the previous count.
    - Evidence 2: if we take the metric (Human Flags) / (Human Flags + Heuristic Flags), we have:
        - GR14: 30%
        - GR13: 83%
        - GR12: 63%
        - GR11: 88%
        - Conclusion: in past rounds, the label dataset consisted mainly of human flags. This is not the case in GR14, where the Heuristics provide most of it.
    - Evidence 3: if we take the metric (Positive Heuristic Flags / Heuristic Flags) / (Positive Human Flags / Human Flags) as an indicator of the relative aggressiveness of the Heuristics (both metrics are sketched in code at the end of this note), we have:
        - GR14: 4.0x
        - GR13: 6.2x
        - GR12: 2.9x
        - Conclusion: the Heuristics are systematically more aggressive than the Human Evaluations.
    - Overall conclusion: there's a significant risk that the ML algorithm became more aggressive as a result of the reduced human evaluations during GR14.
- Recommendation: decrease the algorithm's aggressiveness.
    - By how much? Based on an EDA, the ML predictions generate three modes in `prediction_score`: $[0, 40\%]$, $[75\%, 90\%]$ and $[90\%, 100\%]$.
    - A working hypothesis is:
        - the first mode is generated by the combined negative flags of both evaluators and heuristics
        - the second mode is generated by the relatively uncertain evaluations coming from the evaluators
        - the third mode is generated by the consensus of both evaluators and heuristics
- Recommendation: increase `ml_threshold` (a threshold sweep is sketched at the end of this note).
    - Option 1: from 80% to 90%
        - Impact: users flagged by the algorithm will decrease from 27% to 18%. This will reduce the FE from 160% to 123%, which is still not compatible with the incidence, but a great part of the gap will be closed.
    - Option 2: from 80% to 99%
        - Impact: users flagged by the algorithm will decrease from 27% to 13%. This will reduce the FE from 160% to 99%. This represents the best course of action if we want to minimize False Positives, given the bias in the current ML model.

![](https://i.imgur.com/lzllxGH.png)
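
## Appendix: Code Sketches

To make Evidence 2 and 3 concrete, here is a minimal Python sketch of both metrics. Only the formulas come from the note above; the counts in the example are placeholders chosen to land on GR14-like values, not the actual round data.

```python
def label_composition(human_flags: int, heuristic_flags: int) -> float:
    """Evidence 2: (Human Flags) / (Human Flags + Heuristic Flags),
    i.e. the share of the label dataset coming from human evaluations."""
    return human_flags / (human_flags + heuristic_flags)


def relative_aggressiveness(positive_heuristic: int, heuristic: int,
                            positive_human: int, human: int) -> float:
    """Evidence 3: (Positive Heuristic Flags / Heuristic Flags) divided by
    (Positive Human Flags / Human Flags); values above 1.0 mean the
    heuristics flag sybils at a higher rate than the human evaluators."""
    return (positive_heuristic / heuristic) / (positive_human / human)


# Placeholder counts, chosen only to reproduce GR14-like values.
print(f"{label_composition(300, 700):.0%}")                  # -> 30%
print(f"{relative_aggressiveness(560, 700, 60, 300):.1f}x")  # -> 4.0x
```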
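
The `ml_threshold` options can be compared with a simple threshold sweep over the model's `prediction_score` output. The sample below is synthetic, shaped like the three modes found in the EDA; it will not reproduce the exact 27% / 18% / 13% figures, which come from the real GR14 scores.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical `prediction_score` sample mimicking the trimodal shape
# observed in the EDA: [0, 40%], [75%, 90%], [90%, 100%].
prediction_score = np.concatenate([
    rng.uniform(0.00, 0.40, 7000),  # mode 1: negative consensus
    rng.uniform(0.75, 0.90, 1200),  # mode 2: uncertain human evaluations
    rng.uniform(0.90, 1.00, 1800),  # mode 3: positive consensus
])

# Share of users each candidate threshold would flag as sybil.
for ml_threshold in (0.80, 0.90, 0.99):
    flagged = prediction_score >= ml_threshold
    print(f"ml_threshold={ml_threshold:.0%} -> flags {flagged.mean():.1%} of users")
```

The choice between Option 1 and Option 2 then reduces to how much of the third mode we are willing to keep: a 99% threshold flags only the strongest consensus cases, which is why it minimizes false positives at the cost of some recall.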