# Chapter 14. Link Analysis and Web Search

[toc]

## 14.1 Searching the Web: The Problem of Ranking

- Synonymy: many different ways to say the same thing
- Polysemy: one term with multiple meanings
- The diversity of authoring styles
- The dynamic and constantly changing nature of Web content
- How to filter the important pages out of an **enormous** number of relevant documents

## 14.2 Link Analysis Using Hubs and Authorities

- Key idea: if page P is the best answer to the query, then many of the relevant pages X are likely to include links to P.

### Voting by In-Links

Find the page that receives the greatest number of in-links from relevant pages.

### A List-Finding Technique

![](https://i.imgur.com/8nxfGtw.png)

- Some of the strongest votes come from pages that compile lists of resources relevant to the topic.

![](https://i.imgur.com/tPkRPmF.png)

### The Principle of Repeated Improvement

![](https://i.imgur.com/Z781LAu.png)

- Use the quality of the answers to re-weight the votes of the lists, then use the improved lists to re-score the answers; iterating this back and forth is the principle of repeated improvement.

### Hubs and Authorities

- Hubs: high-value lists
- Authorities: highly endorsed answers

```python
def hits(out_links, in_links, k):
    """k rounds of the hub-authority computation.

    out_links maps each page to the pages it points to;
    in_links maps each page to the pages that point to it.
    """
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority Update Rule: auth(p) = sum of the hub scores of all pages that point to p
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # Hub Update Rule: hub(p) = sum of the authority scores of all pages that p points to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize: we only care about the relative sizes of the scores
        total_auth, total_hub = sum(auth.values()), sum(hub.values())
        auth = {p: auth[p] / total_auth for p in pages}
        hub = {p: hub[p] / total_hub for p in pages}
    return hub, auth
```

![](https://i.imgur.com/t0Mql6o.png)
![](https://i.imgur.com/qafz8As.png)

## 14.3 PageRank

- Idea: nodes that are currently viewed as more important get to make stronger endorsements.

### The Basic Definition of PageRank

- $n$: number of nodes
- Assign every node the initial PageRank $1/n$.
- Choose a number of steps $k$.
- Perform $k$ updates using the Basic PageRank Update Rule: each page divides its current PageRank equally over its out-going links and passes these shares to the pages it points to (a page with no out-going links passes all of its PageRank to itself); each page's new PageRank is the sum of the shares it receives.

![](https://i.imgur.com/mHBCaTC.png)

- $n = 8$, so the initial PageRank of every node is $\frac18$.
- PageRank(A) = PR from D + PR from E + PR from F + PR from G + PR from H $= \frac12 \cdot \frac18+\frac12 \cdot \frac18+ \frac18+ \frac18+ \frac18= \frac12$
- A is now an important page, so we weigh its endorsements more highly in the next update.

### Equilibrium Values of PageRank

- The PageRank values converge to limiting values as $k \rightarrow \infty$ (except in certain degenerate special cases).
- If the network is strongly connected, then there is a unique set of equilibrium values.

![](https://i.imgur.com/Y8mtxc3.png)

### Scaling the Definition of PageRank

- Problem ("slow leak"): in the example below, F and G link only to each other, so repeated updates drain all of the PageRank toward F and G.

![](https://i.imgur.com/63ZishQ.png)

:::success
**Scaled PageRank Update Rule:**
1. Apply the Basic PageRank Update Rule.
2. Scale down all PageRank values by a factor of $s$, where $0\lt s \lt 1$. (The total PageRank has shrunk from $1$ to $s$.)
3. Divide the remaining $1 - s$ units of PageRank over all nodes, giving $(1 - s)/n$ to each.
:::

- For any network, the Scaled PageRank Update Rule has a unique set of equilibrium values; a short code sketch illustrating both update rules follows at the end of this section.

### Random Walks: An Equivalent Definition of PageRank

- Consider a walk that starts at a uniformly random page and, at each step, follows a uniformly random out-going link of the current page.
- The probability of being at a page X after $k$ steps of this random walk equals the PageRank of X after $k$ applications of the Basic PageRank Update Rule.
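To make the two rules above concrete, here is a minimal Python sketch of the Basic and Scaled PageRank updates. The small example graph and the value $s = 0.8$ are illustrative assumptions, not examples from the chapter.

```python
# Minimal sketch of the Basic / Scaled PageRank Update Rules from section 14.3.
# The example graph and s = 0.8 are assumptions chosen for illustration.

def pagerank(out_links, k, s=1.0):
    """Run k updates; s = 1.0 gives the Basic rule, 0 < s < 1 the Scaled rule."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # every node starts with PageRank 1/n
    for _ in range(k):
        new = {v: 0.0 for v in nodes}
        for v in nodes:
            targets = out_links[v] or [v]        # a page with no out-links keeps its PageRank
            share = rank[v] / len(targets)       # divide current PageRank over out-links
            for w in targets:
                new[w] += share                  # each page sums the shares it receives
        if s < 1.0:
            # Scaled rule: shrink total PageRank to s, then hand (1 - s)/n back to every node.
            new = {v: s * new[v] + (1.0 - s) / n for v in nodes}
        rank = new
    return rank

# Hypothetical 4-page example graph, given as adjacency lists of out-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
print(pagerank(graph, k=50))           # Basic PageRank Update Rule
print(pagerank(graph, k=50, s=0.8))    # Scaled PageRank Update Rule
```

With $s = 1$ the function reduces to the Basic rule; choosing $0 \lt s \lt 1$ implements the Scaled rule and avoids the slow-leak behaviour described above.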
## 14.4 Applying Link Analysis in Modern Web Search

### Combining link, text, and usage data

* In addition to links, modern search engines use many other features, such as
    * **text**: the content of the page itself
    * **anchor text**: the clickable text of a hyperlink that leads to another page, e.g. [NYCU Timetable](https://timetable.nycu.edu.tw/)
    * **click-through rate**: how often users click on a result when it is shown

### A moving target

* Search engine results change in response to users' actions.
* The results matter to many people and companies, for example:
    * Companies whose business models depend on Google's results (e.g. the tourism industry)
    * Website content writers
    * A large industry known as ***search engine optimization (SEO)***, consisting of search experts who advise companies on how to create pages and sites that rank highly
* The **"perfect"** ranking function will therefore always be a moving target.

## 14.5 Applications beyond the Web

### Citation Analysis

* ***Impact factor*** of a scientific journal: the average number of citations received by a paper in that journal over the past two years.
* ***Influence weights*** for journals: a notion similar to PageRank for Web pages.

## 14.6 Advanced Material: Spectral Analysis, Random Walks, and Web Search

### A. Spectral Analysis of Hubs and Authorities

Our main goal is to show the convergence of the hub and authority scores.

:::success
**Notation.** We now view a set of $n$ pages as the nodes of a **directed graph**, so we can build the adjacency matrix of the graph.
* Denote the **adjacency matrix** of the graph by $M$, where $M_{ij} = 1$ if node $i$ links to node $j$ and $M_{ij} = 0$ otherwise.
* Denote the vector of **hub scores** by $h$, and the hub score of node $i$ by $h_i$.
* Denote the vector of **authority scores** by $a$, and the authority score of node $i$ by $a_i$.

![](https://i.imgur.com/6LGcgsg.png)
:::

#### Hub and Authority Update Rules as Matrix-Vector Multiplication

* **Recall:**
    * **Hub Update Rule**: update *hub($p$)* to be the sum of the authority scores of all pages that $p$ points to.
    * **Authority Update Rule**: update *auth($p$)* to be the sum of the hub scores of all pages that point to $p$.
* We can therefore write the update rules as
    * **Hub Update Rule**: $h_i \leftarrow M_{i1}a_1 + M_{i2}a_2 + ... + M_{in}a_n \Rightarrow h \leftarrow Ma$
    * **Authority Update Rule**: $a_i \leftarrow M_{1i}h_1 + M_{2i}h_2 + ... + M_{ni}h_n \Rightarrow a \leftarrow M^Th$

#### Unwinding the k-step hub-authority computation

:::success
**Notation.**
* Denote by $a^{\langle k \rangle}$ and $h^{\langle k \rangle}$ the vectors of authority and hub scores after $k$ updates.
:::

After the first update we find
$$a^{\langle 1 \rangle} = M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 1 \rangle} = Ma^{\langle 1 \rangle} = (MM^T)^1h^{\langle 0 \rangle}$$
After the second,
$$a^{\langle 2 \rangle} = M^Th^{\langle 1 \rangle} = (M^TM)^1M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 2 \rangle} = Ma^{\langle 2 \rangle} = MM^TMM^Th^{\langle 0 \rangle} = (MM^T)^2h^{\langle 0 \rangle}$$
One more step makes the pattern clear:
$$a^{\langle 3 \rangle} = M^Th^{\langle 2 \rangle} = M^TMM^TMM^Th^{\langle 0 \rangle} = (M^TM)^2M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 3 \rangle} = Ma^{\langle 3 \rangle} = MM^TMM^TMM^Th^{\langle 0 \rangle} = (MM^T)^3h^{\langle 0 \rangle}$$
Having found the pattern, we can write
$$a^{\langle k \rangle} = (M^TM)^{k-1}M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle k \rangle} = (MM^T)^kh^{\langle 0 \rangle}$$
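The matrix-vector form of the update rules is easy to check numerically. The following NumPy sketch (the 4-node adjacency matrix is an assumption made for illustration) runs $k$ rounds of the two rules and confirms that the results match the closed-form expressions $a^{\langle k \rangle} = (M^TM)^{k-1}M^Th^{\langle 0 \rangle}$ and $h^{\langle k \rangle} = (MM^T)^kh^{\langle 0 \rangle}$ derived above.

```python
# Numerical check of the unwound k-step hub-authority computation (section 14.6.A).
# The adjacency matrix below is an assumed toy example.
import numpy as np

# M[i, j] = 1 if page i links to page j.
M = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = M.shape[0]
k = 5

h = np.ones(n)                                   # h^<0>: all hub scores start at 1
for _ in range(k):
    a = M.T @ h                                  # Authority Update Rule: a <- M^T h
    h = M @ a                                    # Hub Update Rule:       h <- M a

h0 = np.ones(n)
h_closed = np.linalg.matrix_power(M @ M.T, k) @ h0
a_closed = np.linalg.matrix_power(M.T @ M, k - 1) @ (M.T @ h0)

print(np.allclose(h, h_closed))   # True: h^<k> = (M M^T)^k h^<0>
print(np.allclose(a, a_closed))   # True: a^<k> = (M^T M)^(k-1) M^T h^<0>
```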
#### Thinking about multiplication in terms of eigenvectors

* Hub and authority values tend to grow with each update, so they only converge once we take **normalization** into account.
* We now show that there are constants $c$ and $d$ such that ${h^{\langle k \rangle} \over c^k}$ and ${a^{\langle k \rangle} \over d^k}$ converge as $k \rightarrow \infty$.

:::info
**Proof.**
Suppose $${h^{\langle k \rangle} \over c^k} = {(MM^T)^kh^{\langle 0 \rangle} \over c^k}$$ converges to some limit $h^{\langle * \rangle}$. Applying one more update multiplies the numerator by $MM^T$ and the denominator by one more factor of $c$, without changing the limit, so $h^{\langle * \rangle}$ must satisfy
$$(MM^T)h^{\langle * \rangle}=ch^{\langle * \rangle}$$
That is, $c$ and $h^{\langle * \rangle}$ are an **eigenvalue** and a corresponding **eigenvector** of $MM^T$, respectively.
:::

:::warning
Recall from linear algebra: *any symmetric matrix $A$ with $n$ rows and $n$ columns has a set of $n$ eigenvectors that are all unit vectors and all mutually orthogonal; that is, they form a basis for the space $\mathbb{R}^n$.*
:::

:::info
**Cont.**
$MM^T$ is symmetric, so $MM^T$ has $n$ mutually orthogonal eigenvectors $z_1, z_2, ..., z_n$ with corresponding eigenvalues $c_1, c_2, ..., c_n$, ordered so that $|c_1| \geq |c_2| \geq ... \geq |c_n|$.

Given any vector $x = p_1z_1 + p_2z_2 + ... + p_nz_n$ (such an expansion exists because $z_1, z_2, ..., z_n$ form a basis of $\mathbb{R}^n$), we have
$$
\begin{align}
(MM^T)x &= (MM^T)(p_1z_1 + p_2z_2 + ... +p_nz_n) \\
&= p_1MM^Tz_1 + p_2MM^Tz_2 +...+p_nMM^Tz_n \\
&= p_1c_1z_1 + p_2c_2z_2 + ... + p_nc_nz_n \\
\Rightarrow (MM^T)^kx &= c_1^kp_1z_1 + c_2^kp_2z_2+...+c_n^kp_nz_n
\end{align}
$$
Now consider $h^{\langle k \rangle}=(MM^T)^kh^{\langle 0 \rangle}$ and write $h^{\langle 0 \rangle} = q_1z_1+q_2z_2+...+q_nz_n$. Then
$$h^{\langle k \rangle}=(MM^T)^kh^{\langle 0 \rangle}=c_1^kq_1z_1 + c_2^kq_2z_2+...+c_n^kq_nz_n$$
Dividing both sides by $c_1^k$,
$${h^{\langle k \rangle}\over c_1^k}= q_1z_1 + \left({c_2\over c_1}\right)^kq_2z_2+...+\left({c_n\over c_1}\right)^kq_nz_n$$
Assuming $|c_1| \gt |c_2|$, every ratio $|c_i / c_1|$ with $i \geq 2$ is less than $1$, so the trailing terms vanish and ${h^{\langle k \rangle}\over c_1^k} \rightarrow q_1z_1$ as $k \rightarrow \infty$.

Therefore, if we pick the largest eigenvalue $c_1$ of $MM^T$ as the constant $c$, then ${h^{\langle k \rangle}\over c^k}$ converges to $q_1z_1$.
:::

### B. Spectral Analysis of PageRank

:::success
**Notation.**
* Denote by $N_{ij}$ the share of $i$'s PageRank that $j$ should get in one update step, i.e. $N_{ij} = {1 \over l_i}$ if $i$ links to $j$ (where $l_i$ is the number of links out of $i$) and $N_{ij} = 0$ otherwise.
* Define $\tilde{N}_{ij} = sN_{ij} + {(1-s)\over n}$.
* Denote the vector of **PageRank** values by $r$, and the PageRank of node $i$ by $r_i$.

![](https://i.imgur.com/Pc0CV6r.png)
![](https://i.imgur.com/Fc9hrbI.png)
:::

* We can write the **Basic PageRank Update Rule** as $r_i \leftarrow N_{1i}r_1 + N_{2i}r_2 + ... + N_{ni}r_n \Rightarrow r \leftarrow N^Tr$
* Likewise, the **Scaled PageRank Update Rule** becomes $r_i \leftarrow \tilde{N}_{1i}r_1 + \tilde{N}_{2i}r_2 + ... + \tilde{N}_{ni}r_n \Rightarrow r \leftarrow \tilde{N}^Tr$

#### Convergence of the Scaled PageRank Update Rule

* Unwinding the updates as before gives
$$r^{\langle k \rangle} = (\tilde{N}^T)^kr^{\langle 0 \rangle}$$
* We now show convergence.

:::info
**Proof.**
***Perron's Theorem.*** Any **positive** matrix $P$ has the following properties:
(i) $P$ has a real eigenvalue $c > 0$ such that $c > |c'|$ for all other eigenvalues $c'$.
(ii) There is an eigenvector $y$ associated with $c$ whose entries are all positive and real, and $y$ is unique up to multiplication by a constant.
(iii) If $c = 1$, then for any starting non-negative vector $x \neq 0$, the sequence of vectors $P^kx$ converges to a vector in the direction of $y$ as $k \rightarrow \infty$.

The matrix $\tilde{N}$ is positive and each of its rows sums to $1$ (it is a **Markov matrix**), so the largest eigenvalue of $\tilde{N}$, and hence of $\tilde{N}^T$, is $1$. By ***Perron's Theorem***, repeated application of the **Scaled PageRank Update Rule** therefore converges to the eigenvector $y$, i.e. to the unique vector of equilibrium PageRank values.
:::
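To make Part B concrete, here is a small NumPy sketch. The 4-node graph and the choice $s = 0.8$ are assumptions for illustration. It builds $N$ and $\tilde{N}$, repeatedly applies the scaled update $r \leftarrow \tilde{N}^Tr$, and checks the two claims above: the dominant eigenvalue of $\tilde{N}^T$ is $1$, and the iteration converges to the corresponding Perron eigenvector.

```python
# Numerical illustration of the convergence argument in section 14.6.B.
# The adjacency matrix and s = 0.8 are assumed for the example.
import numpy as np

s = 0.8
# A[i, j] = 1 if page i links to page j (every row has at least one out-link).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0]], dtype=float)
n = A.shape[0]

N = A / A.sum(axis=1, keepdims=True)      # N_ij = 1/l_i if i links to j, else 0
N_tilde = s * N + (1 - s) / n             # scaled matrix: strictly positive, rows sum to 1

r = np.full(n, 1.0 / n)                   # initial PageRank 1/n
for _ in range(100):
    r = N_tilde.T @ r                     # Scaled PageRank Update Rule: r <- N~^T r

eigvals, eigvecs = np.linalg.eig(N_tilde.T)
i = np.argmax(eigvals.real)               # dominant (Perron) eigenvalue
perron = eigvecs[:, i].real
perron = perron / perron.sum()            # normalize so the entries sum to 1

print(round(eigvals[i].real, 6))          # 1.0, as expected for a Markov matrix
print(np.allclose(r, perron))             # True: the update converges to the Perron eigenvector
```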
### C. Formulation of PageRank Using Random Walks

Consider a walk that starts at a random node and, at each step, moves to a node chosen uniformly at random among the ones the current node links to. Now ask the following question: if $b_1, b_2, ..., b_n$ denote the probabilities of the walk being at nodes $1, 2, ..., n$ in the current step, what is the probability that it is at node $i$ in the next step?

1. For each node $j$ that links to $i$, the chance that the walk moves from $j$ to $i$ is ${1 \over l_j}$, where $l_j$ is the number of links out of $j$; so node $j$ contributes $b_j \cdot {1 \over l_j}$ to the probability of being at $i$ in the next step.
2. The answer is therefore the sum of ${b_j \over l_j}$ over all nodes $j$ that link to $i$.

We can write this update to the probability $b_i$ as
$$b_i \leftarrow N_{1i}b_1 + N_{2i}b_2+...+N_{ni}b_n \Rightarrow b \leftarrow N^Tb$$
This is exactly the **Basic PageRank Update Rule**! Hence, if the walk starts from the uniform distribution $b^{\langle 0 \rangle} = ({1 \over n}, ..., {1 \over n})$, the probability of being at page X after $k$ steps equals the PageRank of X after $k$ applications of the Basic PageRank Update Rule.
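The equivalence can also be checked empirically. The sketch below (the example graph, the number of steps, and the number of trials are assumptions) simulates the random walk many times and compares the observed visit frequencies after $k$ steps with $k$ applications of the Basic PageRank Update Rule; the two columns of output should agree up to sampling noise.

```python
# Empirical check of the random-walk formulation of PageRank (section 14.6.C).
# The example graph, k, and the number of trials are assumptions for illustration.
import random
from collections import Counter

out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
nodes = list(out_links)
k, trials = 10, 200_000

# (1) Monte Carlo estimate of the probability of being at each node after k steps.
counts = Counter()
for _ in range(trials):
    v = random.choice(nodes)                    # start at a uniformly random node
    for _ in range(k):
        v = random.choice(out_links[v])         # follow a uniformly random out-link
    counts[v] += 1
walk_prob = {v: counts[v] / trials for v in nodes}

# (2) k applications of the Basic PageRank Update Rule, starting from 1/n.
rank = {v: 1.0 / len(nodes) for v in nodes}
for _ in range(k):
    new = {v: 0.0 for v in nodes}
    for v in nodes:
        for w in out_links[v]:
            new[w] += rank[v] / len(out_links[v])
    rank = new

for v in nodes:
    print(v, round(walk_prob[v], 3), round(rank[v], 3))   # the two values should roughly match
```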