# PageRank

### Big Idea

The *PageRank vector* is the dominant eigenvector (the steady state) of the *Google matrix*, a specially constructed stochastic matrix derived from a directed graph of webpages. The components of this eigenvector provide a quantitative measure of the *importance* of each corresponding vertex (webpage) in the graph.

## Directed Graph and Stochastic Matrix

A collection of webpages and the links between them can be modeled as a *directed graph (DG)* with $N$ vertices (webpages) and edges (links).

* **Adjacency Matrix $A$:** An $N \times N$ matrix where $a_{i,j}=1$ if there is a link from webpage $j$ to webpage $i$, and $a_{i,j}=0$ otherwise.
* **Stochastic Matrix $P$:** Represents the probability of a random click moving from one page to another. It is constructed from the adjacency matrix by normalizing each column:
  $$p_{i,j} = \frac{a_{i,j}}{\text{total \# of links from webpage } j}$$
  The entry $p_{i,j}$ is the probability of clicking to webpage $i$ from webpage $j$.

:::warning
**Example**

Consider the directed graph

![image](https://hackmd.io/_uploads/S1O2WzT1Wl.png)

For the given graph, the corresponding adjacency matrix $A$ and the stochastic matrix $P$ are:

$$
A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix} \hspace{10mm} P = \begin{bmatrix} 0 & 1/2 & 1/3 & 0 \\ 1 & 0 & 1/3 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1/2 & 1/3 & 0 \end{bmatrix}
$$
:::

### Steady State

The fundamental question is whether the iterative process of random clicks converges to a stable probability distribution, $\mathbf{x} = \lim_{n \to \infty} P^n \mathbf{x}_0$.

* If the limit $\mathbf{x}$ exists, it satisfies the equation $P\mathbf{x} = \mathbf{x}$.
* This means $\mathbf{x}$ is an eigenvector of $P$ corresponding to the eigenvalue $\lambda=1$, called the **steady state** vector.

#### Properties of Stochastic Matrices

1. **Eigenvalue $\lambda=1$:** Every stochastic matrix $P$ has an eigenvalue $\lambda=1$: since each column of $P$ sums to $1$, its transpose satisfies $P^T\mathbf{1} = \mathbf{1}$ (where $\mathbf{1}$ is the vector of all ones), and because $P$ and $P^T$ have the same eigenvalues, $\lambda=1$ is also an eigenvalue of $P$.
2. **Bound on Eigenvalues:** All other eigenvalues $\lambda$ of $P$ satisfy $|\lambda| \leq 1$.
3. **Dominant Eigenvalue:** Convergence to a unique steady state requires $\lambda=1$ to be the dominant eigenvalue ($|\lambda_1| > |\lambda_j|$ for all $j \neq 1$). The *Perron–Frobenius Theorem* guarantees that if $P^k$ has all positive entries for some $k \ge 1$, then $\lambda_1 = 1$ is the single dominant eigenvalue, ensuring a unique, positive steady state.
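The construction of $P$ and the eigenvalue properties above can be checked numerically. The following is a minimal NumPy sketch for the 4-page example graph (the variable names are illustrative, and a page with no outgoing links would produce a zero column that needs separate handling):

```python
import numpy as np

# Adjacency matrix of the 4-page example graph above:
# A[i, j] = 1 if there is a link from page j to page i.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
], dtype=float)

# Normalize each column by the number of outgoing links of that page.
out_links = A.sum(axis=0)   # links leaving pages 1..4: [1, 2, 3, 1]
P = A / out_links           # divides column j of A by out_links[j]

print(P.sum(axis=0))        # each column sums to 1, so P is column-stochastic

# lambda = 1 is an eigenvalue, and every eigenvalue satisfies |lambda| <= 1.
eigenvalues = np.linalg.eigvals(P)
print(np.round(np.abs(eigenvalues), 4))
```

For this particular $P$, the output also shows a second eigenvalue of magnitude $1$, so $\lambda = 1$ is not dominant and the power iteration need not converge to a unique steady state; the Google matrix introduced later in this section repairs exactly this.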
:::warning
**Example (Convergence to Steady State)**

1. **Non-Dominant Eigenvalue Example (Failure to Converge)**

   Consider the stochastic matrix $P = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$.
   * The characteristic polynomial is $C_P(\lambda) = \lambda^2 - 1$, yielding eigenvalues $\lambda_1 = 1$ and $\lambda_2 = -1$.
   * Since $|\lambda_1| = |\lambda_2|$, there is *no dominant eigenvalue*.
   * Applying the power iteration shows non-convergence to a single vector:
     $$P^n \begin{bmatrix} a \\ b \end{bmatrix} = \begin{cases} \begin{bmatrix} a \\ b \end{bmatrix}, & n \text{ even} \\ \begin{bmatrix} b \\ a \end{bmatrix}, & n \text{ odd} \end{cases}$$
   * The limit $\displaystyle \lim_{n \to \infty} P^n \mathbf{x}_0$ does not exist unless the starting vector is already the steady state ($a=b$).

2. **Dominant Eigenvalue Example (Guaranteed Convergence)**

   Consider the stochastic matrix $P = \begin{bmatrix} \frac{3}{4} & 0 & 0 \\ \frac{1}{8} & \frac{3}{4} & \frac{1}{4} \\ \frac{1}{8} & \frac{1}{4} & \frac{3}{4} \end{bmatrix}$.
   * The eigenvalues are $\lambda_1 = 1$, $\lambda_2 = \frac{3}{4}$, and $\lambda_3 = \frac{1}{2}$.
   * Since $|\lambda_1| = 1$ and all other eigenvalues satisfy $|\lambda_j| < 1$, $\lambda_1$ is the *dominant eigenvalue*.
   * The corresponding eigenvector for $\lambda_1=1$ is $\mathbf{v}_1 = \begin{bmatrix} 0 \\ 1/2 \\ 1/2 \end{bmatrix}$.
   * Any initial probability vector $\mathbf{x}_0$ can be written as a linear combination of eigenvectors, $\mathbf{x}_0 = \mathbf{v}_1 + c_2 \mathbf{v}_2 + c_3 \mathbf{v}_3$. (The coefficient of $\mathbf{v}_1$ is $1$ because the entries of $\mathbf{v}_2$ and $\mathbf{v}_3$ sum to zero, while the entries of $\mathbf{x}_0$ and $\mathbf{v}_1$ each sum to one.)
   * Applying the power method, the terms associated with the non-dominant eigenvalues decay to zero:
     $$\lim_{n \to \infty} P^n \mathbf{x}_0 = \lim_{n \to \infty} \left( \mathbf{v}_1 + c_2 \lambda_2^n \mathbf{v}_2 + c_3 \lambda_3^n \mathbf{v}_3 \right) = \mathbf{v}_1$$
   * The limit is the unique steady state $\mathbf{v}_1$.
:::

## PageRank Algorithm

The stochastic matrix $P$ for a real-world web graph may not guarantee a unique steady state (e.g., if there are dead ends or cyclical links). Google's solution is **teleportation**.

### The Google Matrix

The Google matrix $G$ is constructed by blending the original stochastic transition matrix $P$ with a teleportation matrix:

$$
G = \alpha P + (1 - \alpha) \mathbf{v} \mathbf{e}^T
$$

* **Stochastic Matrix $P$:** Represents clicking a random link (with probability $\alpha$).
* **Teleportation Parameter $\alpha$:** A value between $0$ and $1$ (usually $\alpha = 0.85$).
* **Teleportation Matrix $\mathbf{v}\mathbf{e}^T$:** Represents jumping to a random page (with probability $1-\alpha$) according to the distribution $\mathbf{v}$ (often uniform, $\mathbf{v} = \frac{1}{N}\mathbf{e}$).

**Process:** The Google matrix models a user's browsing behavior: with probability $\alpha$, the user follows a link on the current page; with probability $1-\alpha$, the user jumps to a random page.

### The PageRank Vector

:::danger
**Theorem**

Because the Google matrix $G$ has all positive entries (for $0 < \alpha < 1$ and a positive teleportation vector $\mathbf{v}$), the Perron–Frobenius theorem guarantees that $\lambda=1$ is the unique dominant eigenvalue. This ensures the existence of a unique steady state vector $\mathbf{x}$, known as the PageRank vector:
$$G \mathbf{x} = \mathbf{x}$$
The entry $x_i$ is the *PageRank* of the webpage at vertex $i$. The PageRank vector can be computed efficiently using the *power method* because $\lambda=1$ is guaranteed to be dominant.
:::

:::warning
**Example**

For the graph in the previous example, using $\alpha=0.85$ and a uniform teleportation vector, the PageRank vector is approximated as:
$$\mathbf{x} \approx \begin{bmatrix} 0.2472 \\ 0.4681 \\ 0.0375 \\ 0.2472 \end{bmatrix}$$
Webpage 2 has the highest PageRank, indicating it is the most important. (A power-method computation of this vector is sketched below.)
:::
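The approximation above can be reproduced with a short power-method computation. The following is a minimal NumPy sketch (variable names and the iteration count are illustrative) that builds the Google matrix for the example graph with $\alpha = 0.85$ and a uniform teleportation vector, then repeatedly applies $G$ to a starting probability vector:

```python
import numpy as np

# Column-stochastic matrix P of the 4-page example graph.
P = np.array([
    [0, 1/2, 1/3, 0],
    [1, 0,   1/3, 1],
    [0, 0,   0,   0],
    [0, 1/2, 1/3, 0],
])

N = P.shape[0]
alpha = 0.85                  # teleportation parameter
v = np.full(N, 1 / N)         # uniform teleportation distribution
e = np.ones(N)

# Google matrix: G = alpha * P + (1 - alpha) * v e^T
G = alpha * P + (1 - alpha) * np.outer(v, e)

# Power method: apply G repeatedly to an initial probability vector.
x = np.full(N, 1 / N)
for _ in range(100):          # iteration count chosen for illustration
    x = G @ x

print(np.round(x, 4))         # close to the approximate PageRank vector above
```

Since every entry of $G$ is positive, the iteration converges to the same PageRank vector regardless of the starting probability vector.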
:::success
**Exercise**

1. Determine whether the statement is *True* or *False*: It is necessary to compute all the eigenvectors of the Google matrix to find the PageRank vector of a directed graph.
2. Find the Google matrix $\alpha P + (1 - \alpha)\mathbf{v}\mathbf{e}^T$ for the directed graph below using teleportation parameter $\alpha = 0.5$ and uniform distribution vector $\mathbf{v}$. Let $\mathbf{x}_0 = [1 \; 0 \; 0 \; 0]^T$ and use Python to compute 50 iterations of the power method to approximate the PageRank vector.

   ![Screenshot 2025-11-22 at 6.02.27 PM](https://hackmd.io/_uploads/SyZLiJeZWx.png)

3. Consider the same directed graph as in the example in this section on PageRank. As $\alpha \to 1$, describe what happens to the PageRank $x_3$ of vertex 3.

   ![image](https://hackmd.io/_uploads/B1fuwGTJ-e.png)

4. Let $G$ be the complete directed graph with $N$ vertices. In other words, there is an edge from each vertex to every other vertex in $G$ (excluding edges from a vertex to itself). Describe the Google matrix and the PageRank vector for the complete directed graph.
:::