Try   HackMD

Chapter 14. Link Analysis and Web Search

14.1 Searching the Web: The Problem of Ranking

  • Synonymy
  • Polysemy
  • The diversity in authoring styles
  • Dynamic and constantly changing nature of Web content
  • How to filter important information from an enormous number of relevant documents
  • Assume page P is the best result, P may be in the links of relevant pages X.

Find the page receiving the greatest number of in-links from relevant pages.

A List-Finding Technique

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • pages that compile lists of resources relevant to the topic.
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

The Principle of Repeated Improvement

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • repeated improvement

Hubs and Authorities

  • Hubs: high-value lists
  • Authorities: highly endorsed answers
for each page p:
    auth(p) = 1
    hub(p) = 1
    
k = TIMES
while k--:
    // Authority Update Rule
    for each page p:
        auth(p) = sum of the hub scores of all pages that point to p
    // Hub Update Rule
    for each page p:
        hub(p) = sum of the authority scores of all pages that p points to.

// normalize: we only care about their relative sizes
sum_auth = sum of all authority scores
sum_hub = sum of all hub scores
for each page p:
    auth(p) /= sum_auth
    hub(p) /= sum_hub

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

14.3 PageRank

  • Idea: nodes currently viewed as more important get to make stronger endorsements.

The Basic Definition of PageRank

  • n: # of nodes
  • assign all nodes the initial PageRank 1/n
  • choose a number of steps k
  • k updates using Basic PageRank Update Rule
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →
  • n=8
    , init. PageRank
    =18
  • PageRank(A)= PR from D + PR from E + PR from F + PR from G + PR from H =
    1218+1218+18+18+18=12
  • A is an important page, we weigh its endorsement more highly in the next update

Equilibrium Values of PageRank

  • PageRank values converge as k →
    (except in certain degenerate special cases)
  • If the network is strongly connected, then there is a unique set of equilibrium values.
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

Scaling the Definition of PageRank

  • slow leak: F, G
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

Scaled PageRank Update Rule:

  1. Apply the Basic PageRank Update Rule.
  2. Scale down all PageRank values by a factor of
    s
    , where
    0<s<1
    . (Total PageRank has shrunk from
    1
    to
    s
    .)
  3. Divide the
    1s
    units of PageRank over all nodes, giving
    (1s)/n
    to each.
  • Unique equilibrium

Random Walks: An Equivalent Definition of PageRank.

  • The probability of being at a page X after k steps of this random walk = the PageRank of X after k applications of the Basic PageRank Update
    Rule.
  • In addition to links, there are many other features as well, such as
    • text
    • anchor text : clickable text that activate a hyperlink leading to another page,
      eg. NYCU Timetable
    • click-through rate

A moving target

  • The search engine results would change in response to users' actions
  • The results mattered to a lot of people and companies, such as
    • Companies had business models depend on Google's result (tourism industry)
    • Website content writer
  • A large industry known as search engine optimization,
    consist of search experts advise companies on how to create pages and sites that rank highly
  • The "perfect" ranking function will always be a moving target

14.5 Applications beyond the Web

Citation Analysis

  • impact factor for scientific journal :
    Average number of citations received by a paper in the given journal over the past two years
  • influence weights for journals :
    Similar to the notion of PageRank for Web pages

A. Spectral Analysis of Hubs and Authorities

Our main goal will be to show the convergence of hub and authority scores.

Notations.
We now view a set of

n pages as a set of nodes in a directed graph,
and thus we can build the adjacency matrix of the graph.

  • Denote adjacency matrix of graph as
    M
  • Denote the vector of hub score as
    h
    , and the hub score of node
    i
    as
    hi
  • Denote the vector of authority score as
    a
    , and the authority score of node
    i
    as
    ai

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Hub and Authority Update Rules as Matrix-Vector Multiplication.

  • Recall :

    • Hub Update Rule : Update hub(
      p
      )
      to be the sum of the authority scores that it points to.
    • Authority Update Rule : Update auth(
      p
      )
      to be the sum of the hub scores that point to it.
  • We could write update rule as

    • Hub Update Rule :
      hiMi1a1+Mi2a1+...+MinanhMa
    • Authority Update Rule :
      aiM1ih1+M2ih2+...+MnihnaMTh

Unwinding the k-step hub-authority computation

Notations.

  • Denote
    ak
    and
    hk
    as the vectors of authority and hub scores after
    k
    updates.

We first find that

a1=MTh0
and
h1=Ma1=(MMT)1h0

second
a2=MTh1=(MTM)1MTh0

and
h2=Ma2=MMTMMTh0=(MMT)2h0

One more step makes the pattern clear:
a3=MTh2=MTMMTMMTh0=(MTM)2MTh0

and
h3=Ma3=MMTMMTMMTh0=(MMT)3h0

We've found the pattern, and can write
ak=(MTM)k1MTh0

and
hk=(MMT)kh0

Thinking about multiplication in terms of eigenvectors

  • Hub and authority values tend to grow with each update,
    they will only converge when we take normalization into account.
  • We now show that there are constants
    c
    and
    d
    so that
    hkck
    and
    akdk
    converge as
    k

Proof.
If

hkck=(MMT)kh0ck converge to
h
, that is,
h
satisfy the equation
(MMT)h=ch

We found that
c
and
h
are eigenvalue and eigenvector of
MMT
respectively

Recall from Linear Algebra:
Any symmetric matrix

A with
n
rows and
n
columns has a set of n eigenvectors that are all unit vectors and all mutually orthogonal — that is, they form a basis for the space
Rn

Cont.

MMT is symmetric, so
MMT
has
n
mutually orthogonal eigenvectors
z1,z2,...,zn

with corresponding eigenvalues
c1,c2,...,cn
(Let
|c1||c2|...|cn|
)


Now, given any vector
x=p1z1+p2z2+...+pnzn
(since
z1,z2,...,zn
are basis of
Rn
)
we have
(MMT)x=(MMT)(p1z1+p2z2+...+pnzn)=p1MMTz1+p2MMTz2+...+pnMMTzn=p1c1z1+p2c2z2+...+pncnzn(MMT)kx=c1kp1z1+c2kp2z2+...+cnkpnzn

Now consider

hk=(MMT)kh0 and let
h0=q1z1+q2z2+...+qnzn

we have that
hk=(MMT)kh0=c1kq1z1+c2kq2z2+...+cnkqnzn

divide both sides by
c1k

hkc1k=q1z1+(c2c1)kq2z2+...+(cnc1)kqnzn

thus,

hkc1k=q1z1 as
k

Therefore, if we pick the largest eigenvalue

c1 from
MMT
as constant
c
,
then
hkck
will converge to
q1z1

B. Spectral Analysis of PageRank

Notations.

  • Denote
    Nij
    as the share of
    i
    's PageRank that
    j
    should get in one update step.
  • Define
    N~ij
    to be
    sNij+(1s)n
  • Denote the vector of PageRank as
    r
    , and the PageRank of node
    i
    as
    ri

  • We could write the Basic PageRank Update Rule as:
    riN1ir1+N2ir2+...+NnirnrNTr
  • Likewise, write the Scaled PageRank Update Rule as:
    riN~1ir1+N~2ir2+...+N~nirnrN~Tr

Convergence of the Scaled PageRank Update Rule

  • It's easy to have that
    rk=(N~T)kr0
  • We now show the convergence

Proof.
Perron's Theorem
For any positive matrix

P has the following properties.
(i)
P
has a real eigenvalue
c>0
such that
c>|c|
for all other eigenvalues
c

(ii) The corresponding eigenvector
y
of
c
is positive, real and unique.
(iii) If
c=1
then for any starting non-negative vector
x0
, the sequence of vectors
Pkx
converges to
y
as
k


Since positive matrix

N~ is a Markov matrix, we have that the largest eigenvalue of
N~
is 1,
thus by Perron's Theorem, the scaled PageRank update rule will converge to
y

C. Formulation of PageRank Using Random Walks

Consider following question: if

b1,b2,...,bn denote the probabilities of the walk being at nodes
1,2,...,n
, what is the probability it will be at node
i
in next step ?

  1. For each node
    j
    link to
    i
    , the chance it moves from
    j
    to
    i
    is
    1lj
    , where
    lj
    is the number of links out of
    j
    , so node
    j
    contributes
    bj(1lj)
    to the probability of being at
    i
    in next step.
  2. Summing
    bjlj
    over all nodes
    j
    that links to
    i

We could write the update to probability

bi as :
biN1ib1+N2ib2+...+NnibnbNTb

Exactly the same as the Basic PageRank Update!!