Chapter 14. Link Analysis and Web Search
14.1 Searching the Web: The Problem of Ranking
- Synonymy
- Polysemy
- The diversity in authoring styles
- Dynamic and constantly changing nature of Web content
- How to filter important information from an enormous number of relevant documents
14.2 Link Analysis Using Hubs and Authorities
- Assume page P is the best result, P may be in the links of relevant pages X.
Voting by In-Links
Find the page receiving the greatest number of in-links from relevant pages.
A List-Finding Technique
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- pages that compile lists of resources relevant to the topic.
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
The Principle of Repeated Improvement
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Hubs and Authorities
- Hubs: high-value lists
- Authorities: highly endorsed answers
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- Idea: nodes currently viewed as more important get to make stronger endorsements.
- n: # of nodes
- assign all nodes the initial PageRank 1/n
- choose a number of steps k
- k updates using Basic PageRank Update Rule
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- , init. PageRank
- PageRank(A)= PR from D + PR from E + PR from F + PR from G + PR from H =
- A is an important page, we weigh its endorsement more highly in the next update
- PageRank values converge as k → (except in certain degenerate special cases)
- If the network is strongly connected, then there is a unique set of equilibrium values.
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
- slow leak: F, G
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Scaled PageRank Update Rule:
- Apply the Basic PageRank Update Rule.
- Scale down all PageRank values by a factor of , where . (Total PageRank has shrunk from to .)
- Divide the units of PageRank over all nodes, giving to each.
Random Walks: An Equivalent Definition of PageRank.
- The probability of being at a page X after k steps of this random walk = the PageRank of X after k applications of the Basic PageRank Update
Rule.
14.4 Applying Link Analysis in Modern Web Search
Combining link, text, and usage data
- In addition to links, there are many other features as well, such as
- text
- anchor text : clickable text that activate a hyperlink leading to another page,
eg. NYCU Timetable
- click-through rate
A moving target
- The search engine results would change in response to users' actions
- The results mattered to a lot of people and companies, such as
- Companies had business models depend on Google's result (tourism industry)
- Website content writer
- A large industry known as search engine optimization,
consist of search experts advise companies on how to create pages and sites that rank highly
- The "perfect" ranking function will always be a moving target
14.5 Applications beyond the Web
Citation Analysis
- impact factor for scientific journal :
Average number of citations received by a paper in the given journal over the past two years
- influence weights for journals :
Similar to the notion of PageRank for Web pages
14.6 Advanced Material: Spectral Analysis, Random Walks, and Web Search
A. Spectral Analysis of Hubs and Authorities
Our main goal will be to show the convergence of hub and authority scores.
Notations.
We now view a set of pages as a set of nodes in a directed graph,
and thus we can build the adjacency matrix of the graph.
- Denote adjacency matrix of graph as
- Denote the vector of hub score as , and the hub score of node as
- Denote the vector of authority score as , and the authority score of node as
Image Not Showing
Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →
Hub and Authority Update Rules as Matrix-Vector Multiplication.
Unwinding the k-step hub-authority computation
Notations.
- Denote and as the vectors of authority and hub scores after updates.
We first find that
and
second
and
One more step makes the pattern clear:
and
We've found the pattern, and can write
and
Thinking about multiplication in terms of eigenvectors
- Hub and authority values tend to grow with each update,
they will only converge when we take normalization into account.
- We now show that there are constants and so that and converge as
Proof.
If converge to , that is, satisfy the equation
We found that and are eigenvalue and eigenvector of respectively
Recall from Linear Algebra:
Any symmetric matrix with rows and columns has a set of n eigenvectors that are all unit vectors and all mutually orthogonal — that is, they form a basis for the space
Cont.
is symmetric, so has mutually orthogonal eigenvectors
with corresponding eigenvalues (Let )
Now, given any vector (since are basis of )
we have
Now consider and let
we have that
divide both sides by
thus, as
Therefore, if we pick the largest eigenvalue from as constant ,
then will converge to
Notations.
- Denote as the share of 's PageRank that should get in one update step.
- Define to be
- Denote the vector of PageRank as , and the PageRank of node as


- We could write the Basic PageRank Update Rule as:
- Likewise, write the Scaled PageRank Update Rule as:
- It's easy to have that
- We now show the convergence
Proof.
Perron's Theorem
For any positive matrix has the following properties.
(i) has a real eigenvalue such that for all other eigenvalues
(ii) The corresponding eigenvector of is positive, real and unique.
(iii) If then for any starting non-negative vector , the sequence of vectors converges to as
Since positive matrix is a Markov matrix, we have that the largest eigenvalue of is 1,
thus by Perron's Theorem, the scaled PageRank update rule will converge to
C. Formulation of PageRank Using Random Walks
Consider following question: if denote the probabilities of the walk being at nodes , what is the probability it will be at node in next step ?
- For each node link to , the chance it moves from to is , where is the number of links out of , so node contributes to the probability of being at in next step.
- Summing over all nodes that links to
We could write the update to probability as :
Exactly the same as the Basic PageRank Update!!