# Chapter 14. Link Analysis and Web Search
[toc]
## 14.1 Searching the Web: The Problem of Ranking
- Synonymy
- Polysemy
- The diversity in authoring styles
- Dynamic and constantly changing nature of Web content
- How to filter important information from an **enormous** number of relevant documents
## 14.2 Link Analysis Using Hubs and Authorities
- Intuition: if page P is the best answer to the query, then P is likely to appear among the out-links of many relevant pages X.
### Voting by In-Links
Find the page receiving the greatest number of in-links from relevant pages.
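This voting step can be sketched in a few lines of Python; the page names (`X1`…`X4`, `P`, `Q`, `R`) and their out-links are hypothetical:

```python
from collections import Counter

# Hypothetical out-link lists for a set of relevant pages.
out_links = {
    "X1": ["P", "Q"],
    "X2": ["P"],
    "X3": ["P", "R"],
    "X4": ["Q"],
}

# Count in-links: each out-link is one "vote" for its target page.
in_link_votes = Counter(target for links in out_links.values() for target in links)

# The page receiving the most votes is the top candidate answer.
best_page, votes = in_link_votes.most_common(1)[0]
print(best_page, votes)  # P receives 3 votes
```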
### A List-Finding Technique

- Look for pages that compile lists of resources relevant to the topic; good lists point to many good answers.

### The Principle of Repeated Improvement

- Apply voting and list-finding to each other repeatedly: better answers identify better lists, which in turn identify better answers.
### Hubs and Authorities
- Hubs: high-value lists
- Authorities: highly endorsed answers
```python
# The k-step hub-authority computation, made runnable.
# graph maps each page to the list of pages it points to (example values).
graph = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1"]}

auth = {p: 1.0 for p in graph}
hub = {p: 1.0 for p in graph}

k = 20  # number of rounds
for _ in range(k):
    # Authority Update Rule: sum of the hub scores of all pages that point to p
    auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
    # Hub Update Rule: sum of the authority scores of all pages that p points to
    hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
    # Normalize: we only care about the relative sizes
    sum_auth = sum(auth.values())
    sum_hub = sum(hub.values())
    auth = {p: v / sum_auth for p, v in auth.items()}
    hub = {p: v / sum_hub for p, v in hub.items()}
```


## 14.3 PageRank
- Idea: nodes currently viewed as more important get to make stronger endorsements.
### The Basic Definition of PageRank
- n: # of nodes
- assign all nodes the initial PageRank 1/n
- choose a number of steps k
- k updates using Basic PageRank Update Rule

- $n=8$, init. PageRank $=\frac18$
- PageRank(A)= PR from D + PR from E + PR from F + PR from G + PR from H = $\frac12 \cdot \frac18+\frac12 \cdot \frac18+ \frac18+ \frac18+ \frac18= \frac12$
- A is an important page, we weigh its endorsement more highly in the next update
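The update steps above can be sketched in plain Python; the three-node graph here is hypothetical, not the book's 8-node example:

```python
# Basic PageRank Update Rule: each node passes its current PageRank
# in equal shares along its out-links.
graph = {  # node -> list of out-links (hypothetical graph)
    "X": ["Y", "Z"],
    "Y": ["Z"],
    "Z": ["X"],
}

n = len(graph)
rank = {node: 1.0 / n for node in graph}  # initial PageRank 1/n

k = 50  # number of update steps
for _ in range(k):
    new_rank = {node: 0.0 for node in graph}
    for node, links in graph.items():
        share = rank[node] / len(links)  # equal share to each out-link
        for target in links:
            new_rank[target] += share
    rank = new_rank

print(rank)  # converges toward the equilibrium values
```

Since every node has at least one out-link, the total PageRank stays at 1 throughout.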
### Equilibrium Values of PageRank
- PageRank values converge as k → $\infty$ (except in certain degenerate special cases)
- If the network is strongly connected, then there is a unique set of equilibrium values.

### Scaling the Definition of PageRank
- Problem with the basic rule: if F and G link only to each other, they act as a slow leak, eventually draining all the PageRank toward themselves.

::: success
**Scaled PageRank Update Rule:**
1. Apply the Basic PageRank Update Rule.
2. Scale down all PageRank values by a factor of $s$, where $0\lt s \lt 1$. (Total PageRank has shrunk from $1$ to $s$.)
3. Divide the $1 − s$ units of PageRank over all nodes, giving $(1 − s)/n$ to each.
:::
- Unique equilibrium
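A sketch of the scaled rule, assuming a hypothetical three-node graph and a scaling factor $s = 0.8$ (any $0 < s < 1$ fits the definition):

```python
# Scaled PageRank Update Rule: apply the basic update, scale all values
# down by s, then redistribute the remaining (1 - s) units uniformly.
graph = {  # hypothetical graph; note Z receives no in-links
    "X": ["Y"],
    "Y": ["X"],
    "Z": ["X"],
}
n = len(graph)
s = 0.8  # scaling factor, 0 < s < 1
rank = {node: 1.0 / n for node in graph}

for _ in range(100):
    new_rank = {node: 0.0 for node in graph}
    for node, links in graph.items():
        for target in links:
            new_rank[target] += rank[node] / len(links)
    # scale down by s and hand (1 - s)/n back to every node
    rank = {node: s * val + (1 - s) / n for node, val in new_rank.items()}

print(rank)
```

Even Z, with no in-links, keeps the floor value $(1-s)/n$ instead of draining to zero.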
### Random Walks: An Equivalent Definition of PageRank
- The probability of being at page X after k steps of this random walk equals the PageRank of X after k applications of the Basic PageRank Update Rule.
## 14.4 Applying Link Analysis in Modern Web Search
### Combining link, text, and usage data
* In addition to links, search engines use many other features, such as
    * **text** on the page
    * **anchor text**: the clickable text that activates a hyperlink leading to another page,
    e.g. [NYCU Timetable](https://timetable.nycu.edu.tw/)
    * **click-through rate**
### A moving target
* Search engine results change in response to users' actions
* The results matter to many people and companies, such as
    * companies whose business models depend on Google's results (e.g. the tourism industry)
    * website content writers
    * a large industry known as ***search engine optimization***,
    consisting of search experts who advise companies on how to create pages and sites that rank highly
* The **"perfect"** ranking function will always be a moving target
## 14.5 Applications beyond the Web
### Citation Analysis
* ***impact factor*** of a scientific journal:
the average number of citations received by a paper in that journal over the past two years
* ***influence weights*** for journals:
similar in spirit to PageRank for Web pages
## 14.6 Advanced Material: Spectral Analysis, Random Walks, and Web Search
### A. Spectral Analysis of Hubs and Authorities
Our main goal will be to show the convergence of hub and authority scores.
:::success
**Notations**.
We now view a set of $n$ pages as a set of nodes in a **directed graph**,
and thus we can build the adjacency matrix of the graph.
* Denote **adjacency matrix** of graph as $M$
* Denote the vector of **hub score** as $h$, and the hub score of node $i$ as $h_i$
* Denote the vector of **authority score** as $a$, and the authority score of node $i$ as $a_i$

:::
#### Hub and Authority Update Rules as Matrix-Vector Multiplication.
* **Recall** :
* **Hub Update Rule** : Update *hub($p$)* to be the sum of the authority scores of all pages that $p$ points to.
* **Authority Update Rule** : Update *auth($p$)* to be the sum of the hub scores that point to it.
* We could write update rule as
* **Hub Update Rule** :
$h_i \leftarrow M_{i1}a_1 + M_{i2}a_2 + ... + M_{in}a_n\Rightarrow h \leftarrow Ma$
* **Authority Update Rule** :
$a_i \leftarrow M_{1i}h_1 + M_{2i}h_2 + ... + M_{ni}h_n \Rightarrow a \leftarrow M^Th$
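Both matrix forms can be checked with NumPy on a hypothetical $3 \times 3$ adjacency matrix:

```python
import numpy as np

# Hypothetical adjacency matrix: M[i, j] = 1 iff page i links to page j.
M = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)

h = np.ones(3)  # initial hub scores
a = np.ones(3)  # initial authority scores

# One round of updates in matrix-vector form.
a = M.T @ h  # Authority Update Rule: sum hub scores over in-links
h = M @ a    # Hub Update Rule: sum authority scores over out-links

print(a, h)
```

With all-ones starting scores, one authority update simply counts in-links, as in the voting step.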
#### Unwinding the k-step hub-authority computation
:::success
**Notations**.
* Denote $a^{\langle k \rangle}$ and $h^{\langle k \rangle}$ as the vectors of authority and hub scores after $k$ updates.
:::
We first find that
$$a^{\langle 1 \rangle} = M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 1 \rangle} = Ma^{\langle 1 \rangle} = (MM^T)^1h^{\langle 0 \rangle}$$
second
$$a^{\langle 2 \rangle} = M^Th^{\langle 1 \rangle} = (M^TM)^1M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 2 \rangle} = Ma^{\langle 2 \rangle} = MM^TMM^Th^{\langle 0 \rangle} = (MM^T)^2h^{\langle 0 \rangle}$$
One more step makes the pattern clear:
$$a^{\langle 3 \rangle} = M^Th^{\langle 2 \rangle} = M^TMM^TMM^Th^{\langle 0 \rangle} = (M^TM)^2M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 3 \rangle} = Ma^{\langle 3 \rangle} = MM^TMM^TMM^Th^{\langle 0 \rangle} = (MM^T)^3h^{\langle 0 \rangle}$$
We've found the pattern, and can write
$$a^{\langle k \rangle} = (M^TM)^{k-1}M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle k \rangle} = (MM^T)^kh^{\langle 0 \rangle}$$
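These closed forms can be verified numerically: iterating the two update rules $k$ times should match them exactly (hypothetical adjacency matrix; normalization omitted since it does not affect the identity):

```python
import numpy as np

# Hypothetical adjacency matrix.
M = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
h0 = np.ones(3)  # initial hub scores

k = 4
# Iterate the update rules k times (no normalization).
h = h0.copy()
for _ in range(k):
    a = M.T @ h  # authority update
    h = M @ a    # hub update

# Closed forms from the unwinding.
a_closed = np.linalg.matrix_power(M.T @ M, k - 1) @ M.T @ h0
h_closed = np.linalg.matrix_power(M @ M.T, k) @ h0

print(np.allclose(a, a_closed), np.allclose(h, h_closed))
```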
#### Thinking about multiplication in terms of eigenvectors
* Hub and authority values tend to grow with each update,
they will only converge when we take **normalization** into account.
* We now show that there are constants $c$ and $d$ so that $h^{\langle k \rangle} \over c^k$ and $a^{\langle k \rangle} \over d^k$ converge as $k \rightarrow \infty$
:::info
**Proof**.
If $${h^{\langle k \rangle} \over c^k} = {(MM^T)^kh^{\langle 0 \rangle} \over c^k}$$ converges to a nonzero limit $h^{\langle * \rangle}$, then applying one more update and passing to the limit gives $$(MM^T)h^{\langle * \rangle}=ch^{\langle * \rangle}$$
so $c$ must be an **eigenvalue** of $MM^T$ and $h^{\langle * \rangle}$ a corresponding **eigenvector**
:::
:::warning
Recall from Linear Algebra:
*Any symmetric matrix $A$ with $n$ rows and $n$ columns has a set of n eigenvectors that are all unit vectors and all mutually orthogonal — that is, they form a basis for the space $R^n$*
:::
:::info
**Cont.**
$MM^T$ is symmetric, so $MM^T$ has $n$ mutually orthogonal eigenvectors $z_1, z_2, ..., z_n$ <br> with corresponding eigenvalues $c_1, c_2, ..., c_n$ (ordered so that $|c_1| \geq |c_2| \geq ... \geq |c_n|$)
<br>
Now take any vector $x = p_1z_1 + p_2z_2 + ... +p_nz_n$ (possible since $z_1, z_2,...,z_n$ form a basis of $R^n$)
we have
$$
\begin{align}
(MM^T)x &= (MM^T)(p_1z_1 + p_2z_2 + ... +p_nz_n) \\ &= p_1MM^Tz_1 + p_2MM^Tz_2 +...+p_nMM^Tz_n \\ &= p_1c_1z_1 + p_2c_2z_2 + ... + p_nc_nz_n
\\ \Rightarrow (MM^T)^kx &= c_1^kp_1z_1 + c_2^kp_2z_2+...+c_n^kp_nz_n
\end{align}
$$
Now consider $h^{\langle k \rangle}=(MM^T)^kh^{\langle 0 \rangle}$ and let $h^{\langle 0 \rangle} = q_1z_1+q_2z_2+...+q_nz_n$
we have that
$$h^{\langle k \rangle}=(MM^T)^kh^{\langle 0 \rangle}=c_1^kq_1z_1 + c_2^kq_2z_2+...+c_n^kq_nz_n$$
divide both sides by $c_1^k$
$${h^{\langle k \rangle}\over c_1^k}= q_1z_1 + ({c_2\over c_1})^kq_2z_2+...+({c_n\over c_1})^kq_nz_n$$
thus ${h^{\langle k \rangle}\over c_1^k} \rightarrow q_1z_1$ as $k \rightarrow \infty$, provided $|c_1| > |c_2|$ so that the remaining terms vanish.
Therefore, if we pick the largest eigenvalue $c_1$ of $MM^T$ as the constant $c$,
then ${h^{\langle k \rangle}\over c^k}$ converges to $q_1z_1$
:::
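The convergence can be observed numerically: since $MM^T$ is symmetric, `numpy.linalg.eigh` gives its eigenvectors, and $h^{\langle k \rangle}/c_1^k$ approaches $q_1z_1$ (hypothetical matrix):

```python
import numpy as np

# Hypothetical adjacency matrix.
M = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
A = M @ M.T  # symmetric, so eigh applies
eigvals, eigvecs = np.linalg.eigh(A)  # eigenvalues in ascending order
c1 = eigvals[-1]     # largest eigenvalue
z1 = eigvecs[:, -1]  # corresponding unit eigenvector

h0 = np.ones(3)
q1 = z1 @ h0  # coefficient of h0 along z1

k = 50
hk = np.linalg.matrix_power(A, k) @ h0
print(hk / c1**k)  # approximately q1 * z1
```

Note that `eigh` fixes the sign of $z_1$ arbitrarily, but the product $q_1z_1$ is sign-invariant.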
### B. Spectral Analysis of PageRank
:::success
**Notations**.
* Denote $N_{ij}$ as the share of $i$'s PageRank that $j$ should get in one update step.
* Define $\tilde{N}_{ij}$ to be $sN_{ij} + {(1-s)\over n}$
* Denote the vector of **PageRank** as $r$, and the PageRank of node $i$ as $r_i$


:::
* We could write the **Basic PageRank Update Rule** as:
$r_i \leftarrow N_{1i}r_1 + N_{2i}r_2 + ... + N_{ni}r_n \Rightarrow r \leftarrow N^Tr$
* Likewise, write the **Scaled PageRank Update Rule** as:
$r_i \leftarrow \tilde{N}_{1i}r_1 + \tilde{N}_{2i}r_2 + ... + \tilde{N}_{ni}r_n \Rightarrow r \leftarrow \tilde{N}^Tr$
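A quick check of the matrix form: building $\tilde{N}$ from a hypothetical row-stochastic $N$ (with $s = 0.8$ assumed) and iterating $r \leftarrow \tilde{N}^Tr$ reaches a fixed point whose entries sum to 1:

```python
import numpy as np

# Hypothetical N: N[i, j] is the share of i's PageRank given to j.
# Rows sum to 1, since every node here has at least one out-link.
N = np.array([[0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
n = N.shape[0]
s = 0.8  # assumed scaling factor
N_tilde = s * N + (1 - s) / n  # broadcasts the uniform share to every entry

r = np.full(n, 1.0 / n)  # initial PageRank 1/n
for _ in range(100):
    r = N_tilde.T @ r  # Scaled PageRank Update Rule in matrix form

print(r, r.sum())
```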
#### Convergence of the Scaled PageRank Update Rule
* Iterating the Scaled PageRank Update Rule $k$ times gives
$$r^{\langle k \rangle} = (\tilde{N}^T)^kr^{\langle 0 \rangle}$$
* We now show the convergence
:::info
**Proof**.
***Perron's Theorem***
Any **positive** matrix $P$ has the following properties:
(i) $P$ has a real eigenvalue $c > 0$ such that $c > |c'|$ for all other eigenvalues $c'$
(ii) The eigenvector $y$ corresponding to $c$ is real, positive, and unique up to scaling.
(iii) If $c = 1$, then for any starting non-negative vector $x \neq 0$, the sequence $P^kx$ converges to a vector in the direction of $y$ as $k \rightarrow \infty$
<br>
Since the positive matrix $\tilde{N}$ has rows summing to $1$ (it is a **Markov matrix**), its largest eigenvalue is $1$;
thus by ***Perron's Theorem***, the **Scaled PageRank Update Rule** converges to the eigenvector $y$
:::
### C. Formulation of PageRank Using Random Walks
Consider the following question: if $b_1, b_2, ..., b_n$ denote the probabilities of the walk being at nodes $1,2,...,n$, what is the probability that it will be at node $i$ in the next step?
1. For each node $j$ that links to $i$, the chance the walk moves from $j$ to $i$ is ${1 \over l_j}$, where $l_j$ is the number of links out of $j$; so node $j$ contributes $b_j \cdot {1 \over l_j}$ to the probability of being at $i$ in the next step.
2. Sum ${b_j \over l_j}$ over all nodes $j$ that link to $i$.
We can write the update to the probability $b_i$ as:
$b_i \leftarrow N_{1i}b_1 + N_{2i}b_2+...+N_{ni}b_n \Rightarrow b \leftarrow N^Tb$
Exactly the same as the **Basic PageRank Update Rule**!
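This equivalence can be checked empirically: iterating $b \leftarrow N^Tb$ and simulating a long random walk on the same (hypothetical) graph give matching node probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)

# Out-links of a small hypothetical strongly connected graph.
graph = {0: [1, 2], 1: [2], 2: [0]}
n = len(graph)

# N[i, j] = 1/l_i for each out-link i -> j, as in the derivation above.
N = np.zeros((n, n))
for i, links in graph.items():
    for j in links:
        N[i, j] = 1.0 / len(links)

# Probability-vector form: b <- N^T b, identical to the Basic PageRank Update.
b = np.full(n, 1.0 / n)
for _ in range(50):
    b = N.T @ b

# Empirical check: simulate one long random walk and count visit frequencies.
steps = 50_000
visits = np.zeros(n)
node = 0
for _ in range(steps):
    node = rng.choice(graph[node])  # step to a uniformly random out-link
    visits[node] += 1

print(b, visits / steps)  # the two distributions should roughly agree
```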