# Node2Vec
## Introduction
Based on a social graph whose nodes represent people (users of the online marketplace) and whose directed edges represent trust relationships between them, the goal of this assignment is to predict whom each user is most likely to trust next.
To estimate the probability that a given user will trust each other user, we use a Node2Vec model to compute the 10 most probable people for each user to trust. To evaluate the quality of the results we use the Mean Average Precision (MAP) method.
## The Team
Links to the [Trello](https://trello.com/invite/b/qIJDyz3F/aed0a7b51d3200698e80c8bfeac92501/assignment) board and the [git](https://github.com/artemkurylev/Node2vec) repository.
Sobir Bobiev
* @sobir_bobiev
* git - sobir-git
Artem Kurylev
* @artikot
* git - artemkurylev
Marina Leushina
* @marina_leu
* git - Leushina
## The Model
We use a simplified Node2Vec model, which is a simple neural network with one hidden layer, together with a simplified version of negative sampling. To train the model we use stochastic gradient descent to minimize the objective function, which is binary cross-entropy.

Let $V = In$ be the input embedding matrix and $W=Out$ be the output embedding matrix. We denote by $X_i$ the $i^{th}$ column of the matrix $X$. The loss function for a batch $B$ is defined as
$$
J_B(W,V) = \frac{1}{|B|}\sum_{(u, n)\in B} J_{u, n}(W, V)
$$
where $J_{u, n}$ is defined as the loss contributed by the sample edge $(u \to n)$:
$$
J_{u, n}(W, V) = -\log(\sigma(W_n^T V_u))
-\sum_{k\sim G(u)} \log(1-\sigma(W_k^T V_u))
$$
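For concreteness, here is a minimal Scala sketch of this per-sample loss, assuming Breeze `DenseVector` embeddings for $V_u$, $W_n$ and the negative samples $W_k$; the function name and types are illustrative rather than the exact project code:
```scala
import breeze.linalg.DenseVector
import breeze.numerics.sigmoid

// J_{u,n}: loss of one positive edge (u -> n) plus its negative samples k ~ G(u).
def sampleLoss(vU: DenseVector[Double],
               wN: DenseVector[Double],
               wNeg: Seq[DenseVector[Double]]): Double = {
  val positive = -math.log(sigmoid(wN dot vU))                          // -log(sigma(W_n^T V_u))
  val negative = wNeg.map(wK => -math.log(1.0 - sigmoid(wK dot vU))).sum
  positive + negative
}
```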
We will compute the partial derivatives of $J_{u, n}(W, V)$. Since the expression is a bit complex, we compute the partial derivatives of each individual term and then sum the results. The partial derivatives for the first term (without the negative sign) are
$$
\begin{aligned}
\frac{\partial}{\partial V_u}\log(\sigma(W_n^T V_u))
&= \frac{\sigma(W_n^T V_u)\,(1 - \sigma(W_n^T V_u))\, W_n}{\sigma(W_n^T V_u)}\\
&= (1 - \sigma(W_n^T V_u))\, W_n
\end{aligned}
$$
and
$$
\begin{aligned}
\frac{\partial}{\partial W_n}\log(\sigma(W_n^T V_u))
&= \frac{\sigma(W_n^T V_u)\,(1 - \sigma(W_n^T V_u))\, V_u}{\sigma(W_n^T V_u)}\\
&= (1 - \sigma(W_n^T V_u))\, V_u
\end{aligned}
$$
The partial derivatives for a single summand of the second term $\sum_{k\sim G(u)}$ are
$$
\begin{aligned}
\frac{\partial}{\partial V_u}\log(1 - \sigma(W_k^T V_u))
&= \frac{-\sigma(W_k^T V_u)\,(1 - \sigma(W_k^T V_u))\, W_k}{1-\sigma(W_k^T V_u)}\\
&= -\sigma(W_k^T V_u)\, W_k
\end{aligned}
$$
and
$$
\begin{aligned}
\frac{\partial}{\partial W_k}\log(1 - \sigma(W_k^T V_u))
&= \frac{-\sigma(W_k^T V_u)\,(1 - \sigma(W_k^T V_u))\, V_u}{1-\sigma(W_k^T V_u)}\\
&= -\sigma(W_k^T V_u)\, V_u
\end{aligned}
$$
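Putting the four formulas together, a per-sample gradient computation could look roughly as follows (again a sketch with Breeze vectors and illustrative names, not the exact project code):
```scala
import breeze.linalg.DenseVector
import breeze.numerics.sigmoid

// Gradients of J_{u,n} with respect to V_u, W_n and each negative W_k
// (the leading minus of the loss is already folded into the signs).
def sampleGradients(vU: DenseVector[Double],
                    wN: DenseVector[Double],
                    wNeg: Seq[DenseVector[Double]])
    : (DenseVector[Double], DenseVector[Double], Seq[DenseVector[Double]]) = {
  val sPos  = sigmoid(wN dot vU)
  val sNegs = wNeg.map(wK => sigmoid(wK dot vU))
  // dJ/dV_u = (sigma(W_n^T V_u) - 1) W_n + sum_k sigma(W_k^T V_u) W_k
  val gradVu = sNegs.zip(wNeg)
    .map { case (s, wK) => wK * s }
    .foldLeft(wN * (sPos - 1.0))(_ + _)
  // dJ/dW_n = (sigma(W_n^T V_u) - 1) V_u
  val gradWn = vU * (sPos - 1.0)
  // dJ/dW_k = sigma(W_k^T V_u) V_u
  val gradWk = sNegs.map(s => vU * s)
  (gradVu, gradWn, gradWk)
}
```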
Note that the gradient of the batch loss is the average of the gradients of the individual sample losses:
$$
\frac{\partial J_B(W, V)}{\partial W} = \frac{1}{|B|}\sum_{(u, n)\in B} \frac{\partial J_{u, n}(W, V)}{\partial W}
$$
and
$$
\frac{\partial J_B(W, V)}{\partial V} = \frac{1}{|B|}\sum_{(u, n)\in B} \frac{\partial J_{u, n}(W, V)}{\partial V}
$$
Since we have computed the gradients of the individual sample losses, taking their average gives us the gradients of the batch loss, as desired.
## Loss
Loss at the start of training:

Loss at the end of training:

## Training
### Algorithm

### Parameters
- Total number of nodes: `totalNumNodes = 40333`
- Size of embeddings: `embDim = 50`
- Number of epochs: `epochNum = 20`
- Learning rate: `learningRate = 1.0`
- Batch size: `batchSize = 1000`
- Number of negative samples per positive edge: `numNeg = 20`
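For reference, a minimal Scala snippet collecting these hyperparameters (values copied from the list above):
```scala
// Hyperparameters used for training (values from the Parameters list).
val totalNumNodes = 40333   // total number of nodes in the graph
val embDim        = 50      // size of the embeddings
val epochNum      = 20      // number of training epochs
val learningRate  = 1.0     // SGD learning rate
val batchSize     = 1000    // edges per batch
val numNeg        = 20      // negative samples per positive edge
```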
### Training
- Read the data; for training we use only the train split.
- Batch the data; for batching we use a random split function.
- Weights: we compute results only for the current edge and its negative samples, since the other values are 0 and there is no point in multiplying by them.
- Forward pass: we use dot products for the calculation.
- The cost function takes the probability for the expected output node `n` and the probabilities for the negative nodes.
- The loss function takes a batch, does a feed-forward pass and computes the average loss (or cost) over this batch:
```scala
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseMatrix   // embedding matrices are assumed to be Breeze matrices

// Average the per-edge cost over the whole batch.
def computeLoss(batch: RDD[(Int, Int)],
                embIn: DenseMatrix[Double],
                embOut: DenseMatrix[Double]): Float = {
  val costs = batch                       // RDD[(Int, Int)], i.e. an RDD of edges
    .map { case (u, n) =>                 // per-edge cost
      val (probN, probsNeg) = feedForward((u, n), embIn, embOut)
      computeCost(probN, probsNeg)
    }
  costs.mean().toFloat                    // average cost (loss) over the batch
}
```
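The helpers `feedForward` and `computeCost` used above are not shown in the report; a possible sketch consistent with the loss $J_{u,n}$, assuming Breeze embedding matrices with one column per node and uniform negative sampling (both are assumptions), is:
```scala
import breeze.linalg.DenseMatrix
import breeze.numerics.sigmoid
import scala.util.Random

// Sketch only: probability of the positive node n and of `numNeg` uniformly
// sampled negative nodes (uniform sampling is an assumption here).
// `numNeg` and `totalNumNodes` come from the Parameters section.
def feedForward(edge: (Int, Int),
                embIn: DenseMatrix[Double],
                embOut: DenseMatrix[Double]): (Double, Seq[Double]) = {
  val (u, n) = edge
  val vU    = embIn(::, u)                       // input embedding V_u
  val probN = sigmoid(embOut(::, n) dot vU)      // sigma(W_n^T V_u)
  val probsNeg = Seq.fill(numNeg) {
    val k = Random.nextInt(totalNumNodes)
    sigmoid(embOut(::, k) dot vU)                // sigma(W_k^T V_u)
  }
  (probN, probsNeg)
}

// Binary cross-entropy of one sample, matching J_{u,n} above.
def computeCost(probN: Double, probsNeg: Seq[Double]): Double =
  -math.log(probN) - probsNeg.map(p => math.log(1.0 - p)).sum
```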
- Compute the gradients using the formulas for the partial derivatives above.
- Update the weights with the specified learning rate (a small sketch is given after the update equations):
$$
\begin{aligned}
In &= In - \alpha \frac{\partial J}{\partial In}\\
Out &= Out - \alpha \frac{\partial J}{\partial Out}
\end{aligned}
$$
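Below is a minimal sketch of this update step, assuming the embeddings and the batch gradients are Breeze `DenseMatrix[Double]` values; the function name is illustrative:
```scala
import breeze.linalg.DenseMatrix

// One SGD step: move an embedding matrix against its batch gradient.
def sgdStep(emb: DenseMatrix[Double],
            grad: DenseMatrix[Double],
            learningRate: Double): DenseMatrix[Double] =
  emb - (grad * learningRate)
```
The same step is applied once per batch to both `In` and `Out`, as in the equations above.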
## Prediction
For prediction we calculate, for each node, a ranking score for its connection with every other node. To do this we used the Breeze library.
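For illustration, the scores for a single user can be computed with one Breeze matrix-vector product, assuming the embeddings are stored with one column per node (a sketch, not necessarily the exact project code):
```scala
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// Scores of user u trusting every node: sigma(Out^T V_u), one entry per candidate node.
def scoresFor(u: Int,
              embIn: DenseMatrix[Double],
              embOut: DenseMatrix[Double]): DenseVector[Double] =
  sigmoid(embOut.t * embIn(::, u))
```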
## Evaluation
To evaluate the results obtained by the model, we need to:
- Filter out existing edges, since we cannot count them as a success: the model has already seen them during training. To do this we set the probability of trusting these nodes to 0, so they will not appear in the final predictions.
- Sort the nodes by probability and take the top 10 (see the sketch after this list). At this stage, for each node the top 10 nodes are printed together with their corresponding probabilities. After this we drop the probabilities, since we only need the rank for the MAP method.
- Write this to a file for further evaluation by the MAP method; first we convert it to a string.
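A sketch of the filtering and top-10 selection for one user, assuming a Breeze score vector and the set of that user's existing out-edges (illustrative names):
```scala
import breeze.linalg.DenseVector

// Top-10 recommendations for one user: zero out edges already seen in
// training (mutates the score vector), then keep the 10 best-scoring nodes;
// the probabilities themselves are dropped, only the rank matters for MAP.
def topTen(scores: DenseVector[Double], existing: Set[Int]): Seq[Int] = {
  existing.foreach(j => scores(j) = 0.0)
  scores.toArray.zipWithIndex
    .sortBy { case (score, _) => -score }
    .take(10)
    .map { case (_, node) => node }
    .toSeq
}
```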
The **MAP function** is implemented in Python.
Since for the problem we solve we cannot directly compute precision, recall or accuracy, we use the MAP method. To compute the mean of the average precisions, which tells us how good the obtained recommendations are, we need to calculate the average precision for each node.
We compute the average precision over the 10 most probable users that a given user will trust. This means that the total number of recommendations $R$ is equal to 10.
$$AP = \frac{\sum_{k=1}^{R}Precision(k) \cdot rel(k)}{GTP}$$
For each position $k$ of this recommendation list we check the relevance $rel(k)$, which depends on whether the corresponding edge exists in the test data. If such an edge exists, then our model provided a good recommendation and the relevance is 1. Such edges are the ground truth positives (their number is $GTP$).
$Precision(k)$ here is the precision at cut-off $k$: the fraction of the top $k$ recommendations (positions 1 to 10) whose edges exist in the test set.
When computing this we use the `groupby` function of a Pandas DataFrame so that we do not iterate over the data many times.
After this we compute the mean of average precisions:
$$MAP = \frac{1}{N}\sum_{i=1}^{N}AP(i)$$
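As an illustration with made-up numbers: suppose that out of a user's 10 recommendations only the edges at positions 2 and 5 exist in the test set, so $GTP = 2$; then
$$
AP = \frac{Precision(2)\cdot 1 + Precision(5)\cdot 1}{GTP} = \frac{\tfrac{1}{2} + \tfrac{2}{5}}{2} = 0.45
$$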
## Testing with different parameters
| Emb size | Number of negative samples | Number of epochs | MAP  | Final Loss |
| -------- | -------------------------- | ---------------- | ---- | ---------- |
| 50       | 20                         | 25               | 2.19 | 2.81       |
| 75       | 30                         | 25               | 2.20 | 2.78       |
| 10       | 20                         | 25               | 2.19 | 2.74       |
| 100      | 100                        | 25               | 2.18 | 4.72       |
| 29       | 79                         | 15               | 2.19 | 4.51       |
As we can see from the table, changing the values of the parameters does not affect precision significantly. Based on that we can assume that either the model we use is not very good (because we do not actually use random walks within the model, as described in the original Node2Vec paper) or the test data does not correspond well to the train data.