# PageRank Algorithm ###### tags: `Algorithm` `Multithread` ## Usage: The WWW hyperlink structure form a huge directed graph where the nodes represent web pages and directed edges are the hyperlinks ```graphviz digraph hierarchy { nodesep=1.0 // increases the separation between nodes node [color=Black,fontname=Times,shape=ellipse] //All nodes will this shape and colour edge [color=Blue, style=dashed] //All the lines look like this Page1->{Page2} Page2->{Page1} Page4->{Page2} Page3->{Page2 Page4} {rank=same;Page1 Page2} {rank=same;Page4 Page3} // Put them on the same level } ``` * Inbound link: these are links into the given site from outside so from other pages * Outbound link: these are links from the given page to pages in the same site or other sites * Dangling links: these are links that point to any page with no outgoing links **Breadth-first search** : usually used for web crawling ## The oringinal formula $$ PR(A)=(1-d)+d({PR(T_1) \over C(T_1)} + {PR(T_2) \over C(T_2)} +...+ {PR(T_n) \over C(T_n)}) $$ * PR(A) page rank of A which depends on other pages page rank * $$PR(T_i)$ page rank of pages Ti which link to page A * $$C(T_i)$ number of outbounds links on a given Ti page * d damping factor in range 0 to 1 * **Page ranks have to be initialized to 1/n at the beginning.** ## The iterative formula $$ PR_{t+1}(P_i)=\sum\limits_{P_j}{PR_t(P_j) \over C(P_j)} $$ * Pj are the node that are pointing to Pi * $$C(P_j)$ number of outbounds links on a given Pj page ## Example for caculating PageRank algorithm ```graphviz digraph hierarchy { nodesep=1.0 // increases the separation between nodes node [color=Black,fontname=Times,shape=ellipse] //All nodes will this shape and colour edge [color=Blue, style=solid] //All the lines look like this A->{B C} B->{D} C->{A B D} D->{} {rank=same;A B} } ``` | | Iteration 0 | Iteration 1 | Iteration 2 | PageRank | | ---- | ----------- | ----------- | ------------ | ----------- | | A | 1/4 | 1/12 | 3/24 | 1 | | B | 1/4 |1/8+1/12=5/24| 1/6 | 2 | | C | 1/4 |1/8+1/4=3/8 | 3/8 | 4 | | D | 1/4 |1/4+1/12=1/3 | 1/3 | 3 | Steps: 1. Initialize all node to 1/n which n equals 4 2. Use PR in iteration 0 to caculate iteration 1 by iterative formula 3. and caculate iteration2 and iteration3 using the same method ## Why can't we hack google by a trivial website with numerous links to a specific topic? By the iterative formula, the rank is summing up ranks of nodes that point to the subject. Nodes that you point will hardly influence your page rank. Node D does not point to any other node(CD and BD are dangling links), but it's page rank is the second high because the most important node('C') points to it. ### Matrix Representation(maybe suitable for parallelization) ```graphviz digraph hierarchy { nodesep=1.0 // increases the separation between nodes node [color=Black,fontname=Times,shape=ellipse] //All nodes will this shape and colour edge [color=Blue, style=solid] //All the lines look like this A->{B C} B->{D} C->{A B D} D->{} {rank=same;A B} } ``` ### adj Matrix $$H=\begin{pmatrix} 0 & 1 & 1 & 0\\\ 0 & 0 & 0 & 1\\\ 1 & 1 & 0 & 1\\\ 0 & 0 & 0 & 0\end{pmatrix}$$ ### weighted adj Matrix $$ \text{replace 1 with }{1 \over C(P_j)} $$ ### adj Matrix $$H=\begin{pmatrix} 0 & 1/2 & 1/2 & 0\\\ 0 & 0 & 0 & 1\\\ 1/3 & 1/3 & 0 & 1/3\\\ 0 & 0 & 0 & 0\end{pmatrix}$$ ### Matrix Representation of iterative module: $$v_i\text{ means the ith iteration for rank vector}$$ $$v_{i+1} = H^T_{adj}v_i$$ $$v_{i+n} = {H^T_{adj}}^nv_i$$ ### Steady State Approach to solve Page Rank Problem $$\text{v is any of the eigenvector of H}$$ $$v = H^Tv$$ **v is to be viewed as the final page rank vector** Another approach can come to the same result: **random surfer approach** ### Random Surfer Approach ![](https://i.imgur.com/ZyhOkzN.png) ### Problems of this iterative formula for PageRank 1. Dangling points: 2. Disconnected components: #### Dangling points ```graphviz digraph hierarchy { nodesep=1.0 // increases the separation between nodes node [color=Black,fontname=Times,shape=ellipse] //All nodes will this shape and colour edge [color=Blue, style=solid] //All the lines look like this A->{C} B->{C} {rank=same;A B} } ``` Weighted adj Matrix H and initial PageRank vector: $$H=\begin{pmatrix} 0 & 0 & 1\\\ 0 & 0 & 1\\\ 0 & 0 & 0\end{pmatrix} , v_0=\begin{pmatrix} 1/3\\\ 1/3\\\ 1/3\end{pmatrix}$$ We can try iterative modula for PageRank algorithm for two times: $$v_2 = (H^T)^2v_0=v_0=\begin{pmatrix} 0\\\ 0\\\ 0\end{pmatrix}$$ Intuitively, we can come to the conclusion that C is more importantly than A and B because more relative pages points to C than A and B. Apparently, the original formula does't work for this scenario. #### Disconnected components ```graphviz digraph hierarchy { nodesep=1.0 // increases the separation between nodes node [color=Black,fontname=Times,shape=ellipse] //All nodes will this shape and colour edge [color=Blue, style=solid] //All the lines look like this A->{B C} B->{A C} D->{E} E->{D} {rank=same;A B} } ``` The original iterative formula can't handle this situation which seperated part exists. #### And then? -> Modify the original formula by observing web user's behavior: When We choose to continue surfing the net, we often try links on the current page but sometimes we research the topic and choose the page that's not pointed by the current page. ## Final Formula: Formula with damping factor ### Use damping factor in random surfer approach $$d:\text{damping factor in range 0 to 1, usually 0.15}$$ $$M:\text{the page rank matrix}$$ $$A:\text{transition matrix}$$ $$B:\text{identity matrix, which has all components with value 1}$$ $$M=(1-d)A+dB$$ * Most of the time, with (1-d) probability, the surfer will follow links in the given page. * Sometimes, the surfer will leave the actual page and nevigate to another one.