BlockChain: Detecting services through community connection abnormalities

# BlockChain: Detecting services through community connection abnormalities ## Introduction In this report, we tackle the problem of detecting services in the blockchain network relying on the structural features of the blockchain graph as well as features related to temporal behaviour of addresses and transactions. One service type is mixing services that aim in part at increasing anonymity of transactions, which can create a bed for fraudulent transactions to rise. Being able to identify these services can make it easier to track accounts that interact with them and potentially check if they are involved in criminal activities. One interesting element to these services is that select people tend to interact with them which makes their behaviour verge on the abnormal when compared with the rest of the accounts on the network. Looking into anomaly detection solutions to tackle this issue is our approach to identifying these services. The challenges that come with detecting such services stem from the lack of labeled addresses that makes it difficult to rely on supervised approaches, as well as the dynamic behaviour in which they complete transactions and the different transaction patterns. Our intuition is that people, similar to behaviours observed in the real world, tend to interact and make transactions with the people they are used to do business with and we believe similar behaviours would be observed when exchanging crypto-money. Moreover, this behaviour makes it that services will connect accounts that would usually have nothing to do with eachother. ## Methodology In here, we rely on this idea to identify services by firstly detecting communities in the blockchain graph and then identifying outliers as those that tend to not belong to any particular community, but rather connect said communities. >*Definition: Blockchain User Graph*        We construct a temporal Graph $G(V,E)$ with sets of nodes $V$ and edges $E$. The graph is directed and weighted where an edge $e_{ij}^t \in E$ represents a transaction in which an amount($w$) of money is transferred from account address $v_i$ to $v_j$ at time $t$. We denote $\tilde{G}(V, \tilde{E})$ as the undirected unweighted version of $G$. The approach is two folds: (1) detecting community structures and then (2) identifying outliers based on an abnormality score. ### 1. Community detection For the first part, we turn to Label Propagation Algorithm (LPA) to identify graph communities. In particular, we rely on a version of LPA with synchronous updates; an approach that better fits distributed platforms and is easily parallelised. Algorithm 1 describes the Synchronous LPA. --- #### Algorithm 1: Synchronous Label Propagation Algorithm __Inputs__ : Graph $\tilde{G}(V,\tilde{E})$ __Outputs__: Community labels $\mathcal{L}$. Initialize every $v \in V$ with a unique label $l_v$. While ($vote < |V|$){        For $v \in V$ {               Get the neighborhood $\mathcal{N}_v$ of $v$.              Get labels $\mathcal{L}_u$ of nodes $u \in \mathcal{N}^+_v$              Set new label $\hat{l}_v$ as label $l \in \mathcal{L}_u$ with highest occurrence.              If ($l_v = \hat{l}_v$) then                    $vote \mathrel{+}=1$              else                    Update $v$ label with $\hat{l}_v$.       } } --- ### 2. Outlier identification With the convergence of the first step, every node in the graph is assigned a label that defines its belonging to a community. The second step involves assigning an outlierness score to every node given the community structuring of the graph. Sticking to a vertex-centric approach, we define a local outlier score that reflects the strength of a node's connection to its community. >*Definition: Local Outlier Score*        $\forall{v} \in V$, an outlier score $\rho_{v}$ is defined around the neighborhood $\mathcal{N}_v$ and their labels as follows; $$ \rho_{v} = \frac{\sum_{k \in \mathcal{N}^+_v} \delta(l_v, l_k)}{|\mathcal{N}^+_v|} $$ where $\mathcal{N}^+_v = \mathcal{N}_v \cup \{v\}$ ; $l_v$ is the label of node $v$; and $\delta$ is defined as; $$ \delta(l_i, l_j) = \begin{cases} 1, & \text{if $l_i \neq l_j$},\\ 0, & \text{otherwise}. \end{cases} $$ So far, the approach is based on the idea that these services connect multiple communities without necessarily having a strong connection to either one of them. Given the dynamism of these systems, we take advantage of the temporal aspect of these graphs and hypothesise that this inherent behaviour should persist as we grow the graph snapshots around a particular time. In other words, as we increase the number of events we're observing to construct the graph $\tilde{G}$, the communities grow in size and become denser, but services should still connect to these communities without being swallowed by any. Subsequently, a last step to the approach looks at snapshots of the graph around time $t$ by varying windows $\Delta$ and flags those nodes that persistently have a high outlier score as services. Formally; $$ \psi_v = \frac{\sum_{\tau \in \Delta} \sigma(\rho_v^\tau) }{|\Delta|} $$ where; $$ \sigma(\rho) = \begin{cases} 1, & \text{if $\rho \gt \theta$},\\ 0, & \text{otherwise}. \end{cases} $$ $\theta$ is a user-defined threshold that, if set high, will return a tight set of outlier accounts flagged as services. ## Initial results True positive rate, Recall and G-mean are some good metrics to test the validity of the approach, to mention a few. However, with a lack of ground truth in the form of tagged accounts, it is difficult to do so. The following plots show outlier score for a set of entities with varying snapshot sizes $\Delta = \{day, week, month\}$. All of these accounts have had an outlier score $\rho > 0.5$ at some point in the study period. [Note: $\rho^t = 0$ on the plot means the score was smaller than 0.5 at time $t$.] ###### mixing services ![](https://i.imgur.com/Ua8EL3x.png) ![](https://i.imgur.com/uykTy0F.png) ###### market account ![](https://i.imgur.com/GIlgbC7.png) ###### unknown services ![](https://i.imgur.com/4e9LZIo.png) ![](https://i.imgur.com/t9zzCFN.png) ![](https://i.imgur.com/c3WeBMM.png) ## What's next? So far, the approach relies solely on the network structure to identify services and barely takes advantage of the temporal characteristic of the system. We plan to build on this by incorporating temporal motifs specific to these services. The authors of \[1\] laid the groundwork by identifying certain motifs both temporal and attribute-based that are very common in mixing services patterns of transactions. |                  ![image](https://i.imgur.com/rqznLaH.png)                      |                ![Fig.2](https://i.imgur.com/p8g207c.png)                    | |:------------------------------------------:|:-----------------------------------------:| | **Fig.1. Temporal motifs** | __Fig.2. Attribute-based motifs__ | For instance, the temporal motifs of Fig.1 .$\alpha$ shows that mixing services are more likely to receive money before relocating it. Also, from Fig.2, mixing accounts tend to disperse the money to multiple addresses. Fig.2.$\beta_2$ in particular shows that the average money coming into these accounts tend to be larger than the outgoing resulting in a positive balance. Next for this work is to incorporate these motif measures, amongst others, in the outlier score formulation to improve the detection of services. ###### References `[1] Wu, Jiajing, et al. "Detecting Mixing Services via Mining Bitcoin Transaction Network with Hybrid Motifs." _arXiv preprint arXiv:2001.05233_ (2020).` ###### tags: `research-drafts` `cbod` `blockchain`