# Chapter 14. Link Analysis and Web Search

[toc]

## 14.1 Searching the Web: The Problem of Ranking

- Synonymy: many different ways to say the same thing
- Polysemy: one term with multiple meanings
- The diversity of authoring styles
- The dynamic and constantly changing nature of Web content
- How to filter the important pages out of an **enormous** number of relevant documents

## 14.2 Link Analysis Using Hubs and Authorities

- Key idea: if page P is the best answer to the query, then many of the relevant pages X are likely to include links to P.

### Voting by In-Links

Find the page that receives the greatest number of in-links from relevant pages.

### A List-Finding Technique

![](https://i.imgur.com/8nxfGtw.png)

- Some of the strongest votes come from pages that compile lists of resources relevant to the topic.

![](https://i.imgur.com/tPkRPmF.png)

### The Principle of Repeated Improvement

![](https://i.imgur.com/Z781LAu.png)

- Use the quality of the answers to re-weight the votes of the lists, then use the improved lists to re-score the answers; iterating this back and forth is the principle of repeated improvement.

### Hubs and Authorities

- Hubs: high-value lists
- Authorities: highly endorsed answers

```python
def hits(out_links, in_links, k):
    """k rounds of the hub-authority computation.

    out_links maps each page to the pages it points to;
    in_links maps each page to the pages that point to it.
    """
    pages = list(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):
        # Authority Update Rule: auth(p) = sum of the hub scores of all pages that point to p
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        # Hub Update Rule: hub(p) = sum of the authority scores of all pages that p points to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in pages}
        # Normalize: we only care about the relative sizes of the scores
        total_auth, total_hub = sum(auth.values()), sum(hub.values())
        auth = {p: auth[p] / total_auth for p in pages}
        hub = {p: hub[p] / total_hub for p in pages}
    return hub, auth
```

![](https://i.imgur.com/t0Mql6o.png)
![](https://i.imgur.com/qafz8As.png)

## 14.3 PageRank

- Idea: nodes that are currently viewed as more important get to make stronger endorsements.

### The Basic Definition of PageRank

- $n$: number of nodes
- Assign every node the initial PageRank $1/n$.
- Choose a number of steps $k$.
- Perform $k$ updates using the Basic PageRank Update Rule: each page divides its current PageRank equally over its out-going links and passes these shares to the pages it points to (a page with no out-going links passes all of its PageRank to itself); each page's new PageRank is the sum of the shares it receives.

![](https://i.imgur.com/mHBCaTC.png)

- $n = 8$, so the initial PageRank of every node is $\frac18$.
- PageRank(A) = PR from D + PR from E + PR from F + PR from G + PR from H $= \frac12 \cdot \frac18+\frac12 \cdot \frac18+ \frac18+ \frac18+ \frac18= \frac12$
- A is now an important page, so we weigh its endorsements more highly in the next update.

### Equilibrium Values of PageRank

- The PageRank values converge to limiting values as $k \rightarrow \infty$ (except in certain degenerate special cases).
- If the network is strongly connected, then there is a unique set of equilibrium values.

![](https://i.imgur.com/Y8mtxc3.png)

### Scaling the Definition of PageRank

- Problem ("slow leak"): in the example below, F and G link only to each other, so repeated updates drain all of the PageRank toward F and G.

![](https://i.imgur.com/63ZishQ.png)

:::success
**Scaled PageRank Update Rule:**
1. Apply the Basic PageRank Update Rule.
2. Scale down all PageRank values by a factor of $s$, where $0\lt s \lt 1$. (The total PageRank has shrunk from $1$ to $s$.)
3. Divide the remaining $1 - s$ units of PageRank over all nodes, giving $(1 - s)/n$ to each.
:::

- For any network, the Scaled PageRank Update Rule has a unique set of equilibrium values; a short code sketch illustrating both update rules follows at the end of this section.

### Random Walks: An Equivalent Definition of PageRank

- Consider a walk that starts at a uniformly random page and, at each step, follows a uniformly random out-going link of the current page.
- The probability of being at a page X after $k$ steps of this random walk equals the PageRank of X after $k$ applications of the Basic PageRank Update Rule.
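To make the two rules above concrete, here is a minimal Python sketch of the Basic and Scaled PageRank updates. The small example graph and the value $s = 0.8$ are illustrative assumptions, not examples from the chapter.

```python
# Minimal sketch of the Basic / Scaled PageRank Update Rules from section 14.3.
# The example graph and s = 0.8 are assumptions chosen for illustration.

def pagerank(out_links, k, s=1.0):
    """Run k updates; s = 1.0 gives the Basic rule, 0 < s < 1 the Scaled rule."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}           # every node starts with PageRank 1/n
    for _ in range(k):
        new = {v: 0.0 for v in nodes}
        for v in nodes:
            targets = out_links[v] or [v]        # a page with no out-links keeps its PageRank
            share = rank[v] / len(targets)       # divide current PageRank over out-links
            for w in targets:
                new[w] += share                  # each page sums the shares it receives
        if s < 1.0:
            # Scaled rule: shrink total PageRank to s, then hand (1 - s)/n back to every node.
            new = {v: s * new[v] + (1.0 - s) / n for v in nodes}
        rank = new
    return rank

# Hypothetical 4-page example graph, given as adjacency lists of out-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
print(pagerank(graph, k=50))           # Basic PageRank Update Rule
print(pagerank(graph, k=50, s=0.8))    # Scaled PageRank Update Rule
```

With $s = 1$ the function reduces to the Basic rule; choosing $0 \lt s \lt 1$ implements the Scaled rule and avoids the slow-leak behaviour described above.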
## 14.4 Applying Link Analysis in Modern Web Search

### Combining link, text, and usage data

* In addition to links, modern search engines use many other features, such as
    * **text**: the content of the page itself
    * **anchor text**: the clickable text of a hyperlink that leads to another page, e.g. [NYCU Timetable](https://timetable.nycu.edu.tw/)
    * **click-through rate**: how often users click on a result when it is shown

### A moving target

* Search engine results change in response to users' actions.
* The results matter to many people and companies, for example:
    * Companies whose business models depend on Google's results (e.g. the tourism industry)
    * Website content writers
    * A large industry known as ***search engine optimization (SEO)***, consisting of search experts who advise companies on how to create pages and sites that rank highly
* The **"perfect"** ranking function will therefore always be a moving target.

## 14.5 Applications beyond the Web

### Citation Analysis

* ***Impact factor*** of a scientific journal: the average number of citations received by a paper in that journal over the past two years.
* ***Influence weights*** for journals: a notion similar to PageRank for Web pages.

## 14.6 Advanced Material: Spectral Analysis, Random Walks, and Web Search

### A. Spectral Analysis of Hubs and Authorities

Our main goal is to show the convergence of the hub and authority scores.

:::success
**Notation.** We now view a set of $n$ pages as the nodes of a **directed graph**, so we can build the adjacency matrix of the graph.
* Denote the **adjacency matrix** of the graph by $M$, where $M_{ij} = 1$ if node $i$ links to node $j$ and $M_{ij} = 0$ otherwise.
* Denote the vector of **hub scores** by $h$, and the hub score of node $i$ by $h_i$.
* Denote the vector of **authority scores** by $a$, and the authority score of node $i$ by $a_i$.

![](https://i.imgur.com/6LGcgsg.png)
:::

#### Hub and Authority Update Rules as Matrix-Vector Multiplication

* **Recall:**
    * **Hub Update Rule**: update *hub($p$)* to be the sum of the authority scores of all pages that $p$ points to.
    * **Authority Update Rule**: update *auth($p$)* to be the sum of the hub scores of all pages that point to $p$.
* We can therefore write the update rules as
    * **Hub Update Rule**: $h_i \leftarrow M_{i1}a_1 + M_{i2}a_2 + ... + M_{in}a_n \Rightarrow h \leftarrow Ma$
    * **Authority Update Rule**: $a_i \leftarrow M_{1i}h_1 + M_{2i}h_2 + ... + M_{ni}h_n \Rightarrow a \leftarrow M^Th$

#### Unwinding the k-step hub-authority computation

:::success
**Notation.**
* Denote by $a^{\langle k \rangle}$ and $h^{\langle k \rangle}$ the vectors of authority and hub scores after $k$ updates.
:::

After the first update we find
$$a^{\langle 1 \rangle} = M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 1 \rangle} = Ma^{\langle 1 \rangle} = (MM^T)^1h^{\langle 0 \rangle}$$
After the second,
$$a^{\langle 2 \rangle} = M^Th^{\langle 1 \rangle} = (M^TM)^1M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 2 \rangle} = Ma^{\langle 2 \rangle} = MM^TMM^Th^{\langle 0 \rangle} = (MM^T)^2h^{\langle 0 \rangle}$$
One more step makes the pattern clear:
$$a^{\langle 3 \rangle} = M^Th^{\langle 2 \rangle} = M^TMM^TMM^Th^{\langle 0 \rangle} = (M^TM)^2M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle 3 \rangle} = Ma^{\langle 3 \rangle} = MM^TMM^TMM^Th^{\langle 0 \rangle} = (MM^T)^3h^{\langle 0 \rangle}$$
Having found the pattern, we can write
$$a^{\langle k \rangle} = (M^TM)^{k-1}M^Th^{\langle 0 \rangle}$$
and
$$h^{\langle k \rangle} = (MM^T)^kh^{\langle 0 \rangle}$$
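The matrix-vector form of the update rules is easy to check numerically. The following NumPy sketch (the 4-node adjacency matrix is an assumption made for illustration) runs $k$ rounds of the two rules and confirms that the results match the closed-form expressions $a^{\langle k \rangle} = (M^TM)^{k-1}M^Th^{\langle 0 \rangle}$ and $h^{\langle k \rangle} = (MM^T)^kh^{\langle 0 \rangle}$ derived above.

```python
# Numerical check of the unwound k-step hub-authority computation (section 14.6.A).
# The adjacency matrix below is an assumed toy example.
import numpy as np

# M[i, j] = 1 if page i links to page j.
M = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = M.shape[0]
k = 5

h = np.ones(n)                                   # h^<0>: all hub scores start at 1
for _ in range(k):
    a = M.T @ h                                  # Authority Update Rule: a <- M^T h
    h = M @ a                                    # Hub Update Rule:       h <- M a

h0 = np.ones(n)
h_closed = np.linalg.matrix_power(M @ M.T, k) @ h0
a_closed = np.linalg.matrix_power(M.T @ M, k - 1) @ (M.T @ h0)

print(np.allclose(h, h_closed))   # True: h^<k> = (M M^T)^k h^<0>
print(np.allclose(a, a_closed))   # True: a^<k> = (M^T M)^(k-1) M^T h^<0>
```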
#### Thinking about multiplication in terms of eigenvectors

* Hub and authority values tend to grow with each update, so they only converge once we take **normalization** into account.
* We now show that there are constants $c$ and $d$ such that ${h^{\langle k \rangle} \over c^k}$ and ${a^{\langle k \rangle} \over d^k}$ converge as $k \rightarrow \infty$.

:::info
**Proof.**
Suppose $${h^{\langle k \rangle} \over c^k} = {(MM^T)^kh^{\langle 0 \rangle} \over c^k}$$ converges to some limit $h^{\langle * \rangle}$. Applying one more update multiplies the numerator by $MM^T$ and the denominator by one more factor of $c$, without changing the limit, so $h^{\langle * \rangle}$ must satisfy
$$(MM^T)h^{\langle * \rangle}=ch^{\langle * \rangle}$$
That is, $c$ and $h^{\langle * \rangle}$ are an **eigenvalue** and a corresponding **eigenvector** of $MM^T$, respectively.
:::

:::warning
Recall from linear algebra: *any symmetric matrix $A$ with $n$ rows and $n$ columns has a set of $n$ eigenvectors that are all unit vectors and all mutually orthogonal; that is, they form a basis for the space $\mathbb{R}^n$.*
:::

:::info
**Cont.**
$MM^T$ is symmetric, so $MM^T$ has $n$ mutually orthogonal eigenvectors $z_1, z_2, ..., z_n$ with corresponding eigenvalues $c_1, c_2, ..., c_n$, ordered so that $|c_1| \geq |c_2| \geq ... \geq |c_n|$.

Given any vector $x = p_1z_1 + p_2z_2 + ... + p_nz_n$ (such an expansion exists because $z_1, z_2, ..., z_n$ form a basis of $\mathbb{R}^n$), we have
$$
\begin{align}
(MM^T)x &= (MM^T)(p_1z_1 + p_2z_2 + ... +p_nz_n) \\
&= p_1MM^Tz_1 + p_2MM^Tz_2 +...+p_nMM^Tz_n \\
&= p_1c_1z_1 + p_2c_2z_2 + ... + p_nc_nz_n \\
\Rightarrow (MM^T)^kx &= c_1^kp_1z_1 + c_2^kp_2z_2+...+c_n^kp_nz_n
\end{align}
$$
Now consider $h^{\langle k \rangle}=(MM^T)^kh^{\langle 0 \rangle}$ and write $h^{\langle 0 \rangle} = q_1z_1+q_2z_2+...+q_nz_n$. Then
$$h^{\langle k \rangle}=(MM^T)^kh^{\langle 0 \rangle}=c_1^kq_1z_1 + c_2^kq_2z_2+...+c_n^kq_nz_n$$
Dividing both sides by $c_1^k$,
$${h^{\langle k \rangle}\over c_1^k}= q_1z_1 + \left({c_2\over c_1}\right)^kq_2z_2+...+\left({c_n\over c_1}\right)^kq_nz_n$$
Assuming $|c_1| \gt |c_2|$, every ratio $|c_i / c_1|$ with $i \geq 2$ is less than $1$, so the trailing terms vanish and ${h^{\langle k \rangle}\over c_1^k} \rightarrow q_1z_1$ as $k \rightarrow \infty$.

Therefore, if we pick the largest eigenvalue $c_1$ of $MM^T$ as the constant $c$, then ${h^{\langle k \rangle}\over c^k}$ converges to $q_1z_1$.
:::

### B. Spectral Analysis of PageRank

:::success
**Notation.**
* Denote by $N_{ij}$ the share of $i$'s PageRank that $j$ should get in one update step, i.e. $N_{ij} = {1 \over l_i}$ if $i$ links to $j$ (where $l_i$ is the number of links out of $i$) and $N_{ij} = 0$ otherwise.
* Define $\tilde{N}_{ij} = sN_{ij} + {(1-s)\over n}$.
* Denote the vector of **PageRank** values by $r$, and the PageRank of node $i$ by $r_i$.

![](https://i.imgur.com/Pc0CV6r.png)
![](https://i.imgur.com/Fc9hrbI.png)
:::

* We can write the **Basic PageRank Update Rule** as $r_i \leftarrow N_{1i}r_1 + N_{2i}r_2 + ... + N_{ni}r_n \Rightarrow r \leftarrow N^Tr$
* Likewise, the **Scaled PageRank Update Rule** becomes $r_i \leftarrow \tilde{N}_{1i}r_1 + \tilde{N}_{2i}r_2 + ... + \tilde{N}_{ni}r_n \Rightarrow r \leftarrow \tilde{N}^Tr$

#### Convergence of the Scaled PageRank Update Rule

* Unwinding the updates as before gives
$$r^{\langle k \rangle} = (\tilde{N}^T)^kr^{\langle 0 \rangle}$$
* We now show convergence.

:::info
**Proof.**
***Perron's Theorem.*** Any **positive** matrix $P$ has the following properties:
(i) $P$ has a real eigenvalue $c > 0$ such that $c > |c'|$ for all other eigenvalues $c'$.
(ii) There is an eigenvector $y$ associated with $c$ whose entries are all positive and real, and $y$ is unique up to multiplication by a constant.
(iii) If $c = 1$, then for any starting non-negative vector $x \neq 0$, the sequence of vectors $P^kx$ converges to a vector in the direction of $y$ as $k \rightarrow \infty$.

The matrix $\tilde{N}$ is positive and each of its rows sums to $1$ (it is a **Markov matrix**), so the largest eigenvalue of $\tilde{N}$, and hence of $\tilde{N}^T$, is $1$. By ***Perron's Theorem***, repeated application of the **Scaled PageRank Update Rule** therefore converges to the eigenvector $y$, i.e. to the unique vector of equilibrium PageRank values.
:::
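To make Part B concrete, here is a small NumPy sketch. The 4-node graph and the choice $s = 0.8$ are assumptions for illustration. It builds $N$ and $\tilde{N}$, repeatedly applies the scaled update $r \leftarrow \tilde{N}^Tr$, and checks the two claims above: the dominant eigenvalue of $\tilde{N}^T$ is $1$, and the iteration converges to the corresponding Perron eigenvector.

```python
# Numerical illustration of the convergence argument in section 14.6.B.
# The adjacency matrix and s = 0.8 are assumed for the example.
import numpy as np

s = 0.8
# A[i, j] = 1 if page i links to page j (every row has at least one out-link).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0]], dtype=float)
n = A.shape[0]

N = A / A.sum(axis=1, keepdims=True)      # N_ij = 1/l_i if i links to j, else 0
N_tilde = s * N + (1 - s) / n             # scaled matrix: strictly positive, rows sum to 1

r = np.full(n, 1.0 / n)                   # initial PageRank 1/n
for _ in range(100):
    r = N_tilde.T @ r                     # Scaled PageRank Update Rule: r <- N~^T r

eigvals, eigvecs = np.linalg.eig(N_tilde.T)
i = np.argmax(eigvals.real)               # dominant (Perron) eigenvalue
perron = eigvecs[:, i].real
perron = perron / perron.sum()            # normalize so the entries sum to 1

print(round(eigvals[i].real, 6))          # 1.0, as expected for a Markov matrix
print(np.allclose(r, perron))             # True: the update converges to the Perron eigenvector
```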
### C. Formulation of PageRank Using Random Walks

Consider a walk that starts at a random node and, at each step, moves to a node chosen uniformly at random among the ones the current node links to. Now ask the following question: if $b_1, b_2, ..., b_n$ denote the probabilities of the walk being at nodes $1, 2, ..., n$ in the current step, what is the probability that it is at node $i$ in the next step?

1. For each node $j$ that links to $i$, the chance that the walk moves from $j$ to $i$ is ${1 \over l_j}$, where $l_j$ is the number of links out of $j$; so node $j$ contributes $b_j \cdot {1 \over l_j}$ to the probability of being at $i$ in the next step.
2. The answer is therefore the sum of ${b_j \over l_j}$ over all nodes $j$ that link to $i$.

We can write this update to the probability $b_i$ as
$$b_i \leftarrow N_{1i}b_1 + N_{2i}b_2+...+N_{ni}b_n \Rightarrow b \leftarrow N^Tb$$
This is exactly the **Basic PageRank Update Rule**! Hence, if the walk starts from the uniform distribution $b^{\langle 0 \rangle} = ({1 \over n}, ..., {1 \over n})$, the probability of being at page X after $k$ steps equals the PageRank of X after $k$ applications of the Basic PageRank Update Rule.
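The equivalence can also be checked empirically. The sketch below (the example graph, the number of steps, and the number of trials are assumptions) simulates the random walk many times and compares the observed visit frequencies after $k$ steps with $k$ applications of the Basic PageRank Update Rule; the two columns of output should agree up to sampling noise.

```python
# Empirical check of the random-walk formulation of PageRank (section 14.6.C).
# The example graph, k, and the number of trials are assumptions for illustration.
import random
from collections import Counter

out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "C"]}
nodes = list(out_links)
k, trials = 10, 200_000

# (1) Monte Carlo estimate of the probability of being at each node after k steps.
counts = Counter()
for _ in range(trials):
    v = random.choice(nodes)                    # start at a uniformly random node
    for _ in range(k):
        v = random.choice(out_links[v])         # follow a uniformly random out-link
    counts[v] += 1
walk_prob = {v: counts[v] / trials for v in nodes}

# (2) k applications of the Basic PageRank Update Rule, starting from 1/n.
rank = {v: 1.0 / len(nodes) for v in nodes}
for _ in range(k):
    new = {v: 0.0 for v in nodes}
    for v in nodes:
        for w in out_links[v]:
            new[w] += rank[v] / len(out_links[v])
    rank = new

for v in nodes:
    print(v, round(walk_prob[v], 3), round(rank[v], 3))   # the two values should roughly match
```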