# Random Walks and Markov Chains ## Introduction Random walks and Markov chains are fundamental concepts with profound implications across computer science, statistics, and numerous other fields. This note explores their mathematical foundations, intricate connections, and powerful applications in solving real-world problems. ### Key Applications * **PageRank:** PageRank uses random walks on web graphs to determine page relevance. * **Markov Chain Monte Carlo (MCMC):** Markov Chain Monte Carlo (MCMC) enables efficient sampling from complex probability distributions. * **Statistical Mechanics:** Models molecular behavior and material properties by analyzing possible configurations. * **Finance:** Models stock price movements and option pricing. * **Epidemiology:** Simulates disease spread to predict and understand outbreaks. * **Natural Language Processing:** Underpins tasks from word prediction to text generation. ## Random Walks on Directed Graphs A random walk on a directed graph is a dynamic process that involves: 1. **Starting Point:** Selecting an initial vertex as the starting position. 2. **Edge Selection:** At each step, randomly choosing an outgoing edge from the current vertex based on predefined probabilities. 3. **Traversal:** Moving to the new vertex connected by the chosen edge. 4. **Iteration:** Repeating the process from the new vertex, creating a sequence of visited vertices. ### Key Concepts * **Strongly Connected Graphs:** For most analyses, the graph is assumed to be strongly connected, meaning a directed path exists between any two vertices. * **Stationary Distribution:** In strongly connected graphs, the fraction of time the random walk spends at each vertex converges to a stationary probability distribution, regardless of the starting point. ![Screenshot 2024-09-03 at 10.50.12 AM](https://hackmd.io/_uploads/SkgP26420.png) ### Mathematical Representation * **Initial State:** The walk commences at vertex $x$ with probability distribution $\mathbf{p}$, represented as a row vector where $p_x$ is the probability of starting at vertex $x$. * **State Update:** The probability of being at vertex $x$ at time $t+1$ is calculated by summing the probabilities of two events: * Being at an adjacent vertex $y$ at time $t$. * Transitioning from vertex $y$ to vertex $x$. * **Notation:** * $\mathbf{p}(t)$: Row vector encapsulating the probabilities at time $t$. * $P$: Transition probability matrix, where entry $ij$ signifies the probability of moving from vertex $i$ to vertex $j$. * **Matrix Update:** The state update can be concisely expressed as $\mathbf{p}(t) P = \mathbf{p}(t + 1)$. ### Fundamental Property * **Long-term Average:** The long-term average probability of being at a vertex is independent of the initial vertex/distribution, provided the graph is strongly connected. * **Stationary Probabilities:** These limiting probabilities, representing the equilibrium state of the random walk, are called stationary probabilities. ### Special Case: Undirected Graphs * **Connection to Electrical Networks:** Random walks on undirected graphs are linked to electrical networks. * **Metrics:** * **Hitting Time:** The expected time it takes for the random walk to reach a specific vertex from the starting point. * **Cover Time:** The expected time it takes for the random walk to visit every vertex in the graph at least once. 
* **Polynomial Bound:** For undirected graphs, these metrics are bounded by polynomials in the number of vertices, offering insights into the efficiency of random walk-based algorithms. ## Markov Chains Markov chains offer a formal, statistical framework for studying random walks. Conceptually, they are equivalent to random walks on directed graphs, but with a distinct development history within the field of statistics. ### Components * **States:** A finite collection of states, analogous to the vertices in a graph. * **Transition Probabilities:** Represented by $p_{xy}$, these quantify the probability of moving from state $x$ to state $y$, akin to the weights assigned to edges in a random walk. ### Key Terms * **Persistent State:** A state that, once entered, is guaranteed to be revisited in the future with probability 1. This corresponds to a strongly connected component in the graph representation. * **Aperiodic Chain:** The greatest common divisor of the lengths of all directed cycles in the underlying graph is 1. Intuitively, this prevents the random walk from getting trapped in periodic patterns. * **Ergodic Chain:** A Markov chain that is both connected (all states can be reached from any other state) and aperiodic. Ergodic chains exhibit a unique stationary distribution. * **Stationary Distribution:** The limiting distribution of the random walk on an ergodic Markov chain. This distribution represents the long-term probabilities of being in each state, regardless of the initial state. * **Mixing Time:** The time it takes for the random walk to get sufficiently close to the stationary distribution, quantifying how quickly the chain "forgets" its initial state. #### Additional Notes * **Convergence:** Even for connected chains that are not aperiodic, the average probability distribution converges to a limiting distribution. However, this limiting distribution may not be unique. * **Predictability:** Markov chains model systems where the future state depends solely on the current state, embodying the "memoryless" property. #### Correspondence between random walks and Markov chains | **Random walk** | **Markov chain** | |-----------------|-------------------| | graph | stochastic process | | vertex | state | | strongly connected | persistent | | aperiodic | aperiodic | | strongly connected and aperiodic | ergodic | | edge weighted undirected graph | time reversible | ### Stationary Distribution Let $\mathbf{p}(t)$ be the probability distribution after $t$ steps of a random walk. Define the long-term average probability distribution $\mathbf{a}(t)$ by $$ \mathbf{a}(t) = \frac{1}{t} [\mathbf{p}(0) + \mathbf{p}(1) + \dots + \mathbf{p}(t-1)] $$ #### Fundamental Theorem of Markov Chains **Theorem (Fundamental Theorem of Markov Chains)** For a connected Markov chain, there is a unique probability vector $\pi$ satisfying $\pi P = \pi$. Moreover, for any starting distribution, $\lim_{t \to \infty} \mathbf{a}(t)$ exists and equals $\pi$. **Lemma** Let $P$ be the transition probability matrix for a connected Markov chain. The $n \times (n + 1)$ matrix $A = [P - I, \mathbf{1}]$ obtained by augmenting the matrix $P - I$ with an additional column of ones has rank $n$. **Proof (by contradiction):** * **Assume:** $\text{rank}(A) < n$. This means the null space of $A$ has dimension at least 2. * **One solution:** Since each row of $P$ sums to 1, each row of $P-I$ sums to 0. Thus, $\mathbf{y} = (1, 1, \dots , 1, 0)$ is in the null space of $A$. 
* **Another solution?:** Assume there is another solution $(\mathbf{x}, \alpha)$ in the null space of $A$, orthogonal to $\mathbf{y}$.
    * Orthogonality: $\mathbf{x} \cdot \mathbf{1} + \alpha \cdot 0 = 0$, so the sum of the elements of $\mathbf{x}$ is 0.
    * In the null space: $(P - I) \mathbf{x} + \alpha \mathbf{1} = 0$. Component-wise, this reads $x_i = \sum_j p_{ij} x_j + \alpha$.
* **Contradiction:**
    * Note that $\mathbf{x} \neq 0$: if $\mathbf{x} = 0$, the null-space equation forces $\alpha = 0$, making the solution trivial. Since the entries of $\mathbf{x}$ sum to 0 and are not all zero, $\mathbf{x}$ is not constant, so its maximum entry is strictly larger than its minimum entry.
    * Let $S$ be the set of indices $i$ at which $x_i$ attains its maximum value; $S$ is a non-empty proper subset of the states.
    * Since the chain is connected, there is an edge (a transition with $p_{ij} > 0$) from some $i \in S$ to some $j \notin S$.
    * At this $i$: $x_i = \sum_j p_{ij} x_j + \alpha$. The sum is a weighted average of the $x_j$ that puts positive weight on some $x_j < x_i$, so $\sum_j p_{ij} x_j < x_i$ and hence $\alpha > 0$.
    * Applying the same argument to the set of indices where $x_i$ attains its minimum value gives $\alpha < 0$.
    * This is a contradiction.
* **Conclusion:** The null space of $A$ has dimension 1, so $\text{rank}(A) = n$. $\Box$

**Proof of theorem.**
* It is clear that $\mathbf{a}(t)$ is itself a probability vector, since its components are non-negative and sum to 1.
* **Analyzing the change in probabilities after one step**
    * If we start the Markov chain with distribution $\mathbf{a}(t)$, after one step the distribution becomes $\mathbf{a}(t) P$.
    * Let's examine the change in probabilities due to this step:
    \begin{split} \mathbf{a}(t)P - \mathbf{a}(t) &= \frac1t [\mathbf{p}(0)P + \mathbf{p}(1)P + \dots+ \mathbf{p}(t-1)P] - \frac1t [\mathbf{p}(0) + \mathbf{p}(1) + \dots+ \mathbf{p}(t-1)] \\ &= \frac1t [\mathbf{p}(1) + \mathbf{p}(2) + \dots + \mathbf{p}(t)] - \frac1t [\mathbf{p}(0) + \mathbf{p}(1) + \dots + \mathbf{p}(t-1)] \\ &= \frac1t (\mathbf{p}(t) - \mathbf{p}(0)) \end{split}
    * Define $\mathbf{b}(t) = \mathbf{a}(t)P - \mathbf{a}(t)$. From the above, we see that:
    $$|\mathbf{b}(t)| \leq \frac2t \rightarrow 0 \text{ as } t \rightarrow \infty$$
* **Connecting to the matrix $A$ and finding the stationary distribution**
    * From the Lemma, we know that the matrix $A = [P - I, \mathbf{1}]$ has rank $n$.
    * Let $B$ be the $n \times n$ submatrix of $A$ obtained by removing the first column. $B$ is invertible.
    * Let $\mathbf{c}(t)$ be $\mathbf{b}(t)$ with its first entry removed.
    * We have the following relationships:
        * $\mathbf{a}(t)P - \mathbf{a}(t) = \mathbf{b}(t)$
        * $\mathbf{a}(t)B = [\mathbf{c}(t), 1]$ (this follows from how $B$ is constructed from $P - I$)
        * Therefore $\mathbf{a}(t) = [\mathbf{c}(t), 1]B^{-1}$
    * As $t$ tends to infinity, $\mathbf{c}(t)$ tends to $\mathbf{0}$ (since $\mathbf{b}(t)$ does). So:
    $$\mathbf{a}(t) \rightarrow [\mathbf{0},1]B^{-1} \text{ as } t \rightarrow \infty$$
    * This establishes the theorem, and we can identify the stationary distribution as:
    $$\pi = [\mathbf{0},1]B^{-1}$$
    $\Box$

#### Identifying Stationary Distribution
The following lemma provides a powerful tool for identifying the stationary distribution in the context of random walks on strongly connected graphs:

**Lemma** For a random walk on a strongly connected graph with probabilities on the edges, if the vector $\pi$ satisfies
* **Detailed balance equation:** $\pi_x p_{xy} = \pi_y p_{yx}$ for all $x$ and $y$
* **Probability distribution:** $\sum_x \pi_x = 1$

then $\pi$ is the stationary distribution of the walk.

**Proof.**
1. **Detailed Balance Implies Global Balance:**
    * Sum both sides of the detailed balance equation over all vertices $y$:
    $$\pi_x = \sum_y \pi_x p_{xy} = \sum_y \pi_y p_{yx}$$
    * This equation holds for all vertices $x$.
    * In matrix form, this can be written as $\pi = \pi P$.
2. **Uniqueness of the Stationary Distribution**
    * By the Fundamental Theorem of Markov Chains, $\pi$ is the unique stationary distribution. $\Box$

## Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) methods are powerful tools in computational statistics and machine learning, enabling us to sample from and analyze complex, high-dimensional probability distributions.

MCMC addresses the challenge of estimating the expected value of a function $f(\mathbf{x})$ under a given probability distribution $p(\mathbf{x})$, where $\mathbf{x} = (x_1, \dots, x_d)$. This expected value is formally expressed as:
$$ \mathbb{E}(f) = \sum_{\mathbf{x}} f(\mathbf{x}) p(\mathbf{x}) $$
Direct computation of this sum can be computationally expensive, especially when the space of possible $\mathbf{x}$ values is vast. MCMC offers an elegant solution by constructing a Markov chain whose stationary distribution mirrors the target distribution $p(\mathbf{x})$. This allows us to generate samples from this distribution and estimate the expected value by averaging the function $f$ over these samples.

### Designing the Markov Chain
The heart of MCMC lies in designing this Markov chain. It requires balancing two crucial aspects:
1. **Representativeness:** The chain's states should adequately represent the possible values of $\mathbf{x}$.
2. **Convergence:** The chain's long-term behavior should converge to the desired stationary distribution $p(\mathbf{x})$.

Popular techniques for achieving this balance include the *Metropolis-Hastings algorithm* and *Gibbs sampling*.

#### Key Idea and Implementation
MCMC leverages the ergodic theorem, which states that the time average of a function over the states visited by an ergodic Markov chain converges to the expected value of that function under the stationary distribution.

#### Estimating the Expected Value
* Let $f_i$ be the value of the function $f$ at state $i$. The expected value of $f$ under the target distribution $p$ is:
$$\mathbb{E}(f) = \sum_i f_i p_i$$
* MCMC approximates this expected value by averaging the function's values over the states encountered during a $t$-step random walk on the Markov chain. Let's denote this MCMC estimate as $\gamma$.
* The expected value of this estimate $\gamma$ is:
$$\mathbb{E}(\gamma) = \sum_i f_i \left(\frac1t \sum_{j=0}^{t-1} \text{Prob(walk is in state } i \text{ at time } j) \right) = \sum_i f_i a_i(t)$$
where $a_i(t)$ is the $t$-step average probability of being in state $i$.
* The accuracy of this estimate is captured by the following error bound:
$$\left|\sum_i f_i p_i - \mathbb{E}(\gamma) \right| \leq f_{max} \cdot ||\mathbf{p} - \mathbf{a}(t)||_1$$
where:
    * $f_{max}$ is the maximum absolute value of $f$.
    * $||\mathbf{p} - \mathbf{a}(t)||_1$ is the $\ell_1$ distance (twice the total variation distance) between the target distribution $\mathbf{p}$ and the $t$-step average distribution $\mathbf{a}(t)$.

#### Goal & Convergence
The central goal of MCMC is to ensure that the Markov chain converges rapidly to its stationary distribution, minimizing the distance $||\mathbf{p} - \mathbf{a}(t)||_1$ and thus the error in our estimate. The rate of this convergence dictates the number of steps $t$ required for the MCMC estimate to be sufficiently accurate.
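The time-average estimator above can be made concrete in a few lines of code. The following is a minimal sketch, not taken from the source text: the three-state transition matrix `P` and the function values `f` are assumed toy data, and the exact value $\sum_i f_i \pi_i$ is computed from the stationary distribution for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed 3-state toy example: an ergodic transition matrix P and a function f.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
f = np.array([1.0, 4.0, 9.0])

# Exact stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()
exact = f @ pi                      # sum_i f_i * pi_i

# MCMC estimate: average f over the states visited by a t-step walk.
t, state, total = 100_000, 0, 0.0
for _ in range(t):
    total += f[state]
    state = rng.choice(3, p=P[state])
gamma = total / t

print(f"exact E(f) = {exact:.4f}, MCMC estimate = {gamma:.4f}")
```

For a chain that mixes well, the two printed values agree closely; the gap is controlled by the bound $f_{max} \cdot ||\mathbf{p} - \mathbf{a}(t)||_1$ above.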
#### Measuring Convergence
To quantify the distance between probability distributions, we use the $\ell_1$ distance $||p - q||_1 = \sum_i |p_i - q_i|$, which is twice the total variation distance. The following identity is useful:

**Proposition** For two probability distributions $p$ and $q$:
$$ ||p - q||_1 = 2 \sum_i (p_i - q_i)_+ = 2 \sum_i (q_i - p_i)_+ $$
where $x_+ = \max\{x,0\}$.

<font color=red>Bonus proof</font>

### Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm is a cornerstone of MCMC methods, offering a flexible approach to constructing Markov chains with a desired stationary distribution. It operates on a connected, undirected graph $G$, where each vertex represents a possible state of the system being modeled. The algorithm guides a random walk on this graph, carefully crafting transition probabilities to ensure convergence to the target distribution $\mathbf{p}$.

#### The Process
The transition probabilities are defined as follows:
1. **Proposal:** At the current state $i$, each neighboring state $j$ is proposed with probability $1/r$, where $r$ is the maximum degree of any vertex in the graph. If no neighbor is selected, the walk remains at state $i$.
2. **Acceptance or Rejection:** If a neighbor $j$ is proposed, the transition to $j$ is either accepted or rejected based on the following criteria:
    * **Acceptance:** If the proposed state $j$ has a higher or equal probability under the target distribution ($p_j \geq p_i$), the transition is accepted, and the walk moves to state $j$.
    * **Rejection with a Chance:** If the proposed state $j$ has a lower probability ($p_j < p_i$), the transition is accepted with probability $p_j/p_i$; otherwise it is rejected and the walk remains at state $i$.

#### Transition Probabilities
Mathematically, the transition probabilities can be expressed as:
* $p_{ij} = \frac1r \min(1, \frac{p_j}{p_i})$ (probability of transitioning from state $i$ to a neighboring state $j$)
* $p_{ii} = 1-\sum_{j\neq i}p_{ij}$ (probability of staying at state $i$)

#### Stationary Distribution
The Metropolis-Hastings algorithm guarantees that the constructed Markov chain has the desired stationary distribution $\mathbf{p}$. This can be verified by showing that the detailed balance equations are satisfied:
$$ p_i p_{ij} = \frac{p_i}{r} \min(1, \frac{p_j}{p_i}) = \frac1r \min(p_i, p_j) = \frac{p_j}{r} \min(1, \frac{p_i}{p_j}) = p_j p_{ji} $$

#### Example
Consider a graph with four states $\{a, b, c, d \}$ and desired stationary probabilities:
$$ p(a) = 1/2,\ p(b) = 1/4,\ p(c) = 1/8,\ p(d) = 1/8 $$
![Screenshot 2024-09-03 at 12.12.36 PM](https://hackmd.io/_uploads/ryl2kkH3A.png)
* The maximum degree is 3.
* Transition probabilities are assigned using the Metropolis-Hastings algorithm to achieve the desired stationary distribution.
* The example demonstrates how the algorithm ensures that the detailed balance equations are satisfied, leading to the desired stationary distribution.

<font color=red>Bonus example</font>

#### Advantages and Considerations
* **Flexibility:** The Metropolis-Hastings algorithm is highly flexible, as it allows for a wide range of proposal distributions.
* **Applicability:** It can be applied to various problems where the target distribution is known up to a constant of proportionality.
* **Convergence:** The rate of convergence to the stationary distribution depends on the choice of the proposal distribution. A good proposal distribution should lead to efficient exploration of the state space.
* **Implementation:** While conceptually simple, implementing the Metropolis-Hastings algorithm requires careful consideration of the proposal distribution and the acceptance/rejection criteria.

### Gibbs Sampling
Gibbs sampling is another prominent MCMC algorithm, particularly well-suited for scenarios where the target distribution is multivariate and the conditional distributions of individual variables are easy to sample from.

It operates on a graph where each vertex represents a possible configuration of the variables in the multivariate distribution $p(\mathbf{x})$. An edge connects two vertices if and only if their configurations differ in exactly one variable. This structure results in a graph resembling a high-dimensional lattice, with connections forming along the axes representing each variable.

#### The Process
1. **Define the graph:**
    * Each vertex corresponds to a unique configuration of the variables $\mathbf{x} = (x_1, \dots, x_d)$.
    * An edge exists between two vertices if and only if their configurations differ in exactly one coordinate.
    * The resulting graph resembles a $d$-dimensional lattice.
2. **Generate samples:**
    * Initialize the process with an arbitrary starting configuration.
    * Repeatedly choose one variable $x_i$ to update.
    * The new value for $x_i$ is sampled from its conditional distribution, conditioned on the current values of all other variables.
    * Two common schemes for choosing the variable to update:
        * **Random Selection:** Select a variable $x_i$ uniformly at random from the set of all variables.
        * **Systematic Scan:** Update the variables sequentially, cycling through them in a fixed order (e.g., from $x_1$ to $x_d$).

#### Transition Probabilities
* The probability of transitioning from configuration $\mathbf{x}$ to configuration $\mathbf{y}$, denoted as $p_{\mathbf{x}\mathbf{y}}$, is determined by the conditional distribution of the updated variable.
* If $\mathbf{x}$ and $\mathbf{y}$ differ only in the $i$-th coordinate, then:
$$p_{\mathbf{x}\mathbf{y}} = \frac1d p(y_i | x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)$$
where $d$ is the total number of variables.
* Similarly, the probability of transitioning back from $\mathbf{y}$ to $\mathbf{x}$ is:
$$p_{\mathbf{y}\mathbf{x}} = \frac1d p(x_i | y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_d) = \frac1d p(x_i | x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)$$
since $\mathbf{x}$ and $\mathbf{y}$ agree in every coordinate except the $i$-th.

#### Stationary Distribution
* A key property of Gibbs sampling is that the Markov chain it constructs has the target distribution $p(\mathbf{x})$ as its stationary distribution.
* This can be proven by showing that the detailed balance equations are satisfied:
\begin{split} p_{\mathbf{x}\mathbf{y}} &= \frac1d \frac{p(y_i | x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d) p(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)}{p(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)} \\ &= \frac1d \frac{p(x_1, \dots, x_{i-1}, y_i, x_{i+1}, \dots, x_d)}{p(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)} = \frac1d \frac{p(\mathbf{y})}{p(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)}\\ p_{\mathbf{y}\mathbf{x}} &= \frac1d \frac{p(\mathbf{x})}{p(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_d)} \end{split}
* Multiplying the first expression by $p(\mathbf{x})$ and the second by $p(\mathbf{y})$ gives the same quantity, so $p(\mathbf{x})\, p_{\mathbf{x}\mathbf{y}} = p(\mathbf{y})\, p_{\mathbf{y}\mathbf{x}}$, i.e., detailed balance holds.

#### Example
![Screenshot 2024-09-03 at 12.29.48 PM](https://hackmd.io/_uploads/rkFnXyB30.png)
* The provided image illustrates the Gibbs sampling algorithm on a simple graph.
* It demonstrates how to calculate edge probabilities and verify that the stationary distribution is achieved.
* For a deeper understanding, consider exploring additional examples or delving into the theoretical foundations of Gibbs sampling.
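The random-selection update rule can also be sketched in a few lines of code. The example below is an assumed toy case, not from the source: a joint table $p(x_1, x_2)$ over $\{0,1\} \times \{0,1,2\}$, with each step choosing one of the $d = 2$ coordinates uniformly at random and resampling it from its conditional distribution, so the empirical visit frequencies should approach the target table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy target: a joint distribution p(x1, x2) over {0,1} x {0,1,2},
# given as a table (rows index x1, columns index x2); entries sum to 1.
p = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.10, 0.30]])

d = 2                               # number of variables
x = [0, 0]                          # arbitrary starting configuration
counts = np.zeros_like(p)

for _ in range(200_000):
    i = rng.integers(d)             # choose a coordinate uniformly (prob 1/d)
    if i == 0:                      # resample x1 from p(x1 | x2)
        cond = p[:, x[1]] / p[:, x[1]].sum()
        x[0] = rng.choice(p.shape[0], p=cond)
    else:                           # resample x2 from p(x2 | x1)
        cond = p[x[0], :] / p[x[0], :].sum()
        x[1] = rng.choice(p.shape[1], p=cond)
    counts[x[0], x[1]] += 1

print(np.round(counts / counts.sum(), 3))   # should approach the table p
```

The same loop with a systematic scan (alternating $i = 0, 1$) would implement the other update scheme mentioned above.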
<font color=red>Bonus example</font> #### Advantages and Considerations * **Efficiency:** Gibbs sampling can be very efficient when conditional distributions are easy to sample from. * **Applicability:** It is particularly useful for problems with high-dimensional distributions where direct sampling is challenging. * **Convergence:** The rate of convergence depends on the correlation between variables. Highly correlated variables can lead to slower convergence. * **Implementation:** Implementing Gibbs sampling requires knowledge of the conditional distributions and efficient sampling techniques. ## Estimating Areas and Volumes The classical problem of computing areas and volumes has well-established solutions for many regular shapes. However, when dealing with general convex sets in high-dimensional space, such formulas become elusive. Markov Chain Monte Carlo (MCMC) methods emerge as a powerful tool in this scenario, offering a way to estimate volumes even in high dimensions with polynomial-time complexity. ### Challenges in high dimensions Traditional methods, like enclosing the region within a simple shape (e.g., a hypercube) and sampling points within it, falter in high dimensions. The volume of the enclosing space grows exponentially with the number of dimensions, leading to inefficient sampling and inaccurate estimates. ### MCMC to the Rescue The core idea behind the MCMC approach is to ingeniously reduce the problem of estimating volumes to the task of drawing uniform random samples from sets. ### The Process 1. **Concentric Spheres:** * Construct a series of concentric spheres, $S_1, S_2, \dots, S_k$, such that: * The smallest sphere, $S_1$, lies entirely within the convex set $R$. * The largest sphere, $S_k$, completely encompasses the convex set $R$. 2. **Volume Decomposition:** * The volume of the convex set $R$ is expressed as a product of ratios of volumes of intersections with these spheres: \begin{split} \text{Vol}(R) &= \text{Vol}(S_k \cap R) \\ &= \frac{\text{Vol}(S_k \cap R)}{\text{Vol}(S_{k-1} \cap R)} \frac{\text{Vol}(S_{k-1} \cap R)}{\text{Vol}(S_{k-2} \cap R)} \cdots \frac{\text{Vol}(S_2 \cap R)}{\text{Vol}(S_1 \cap R)} \text{Vol}(S_1) \end{split} 3. **Rejection Sampling for Ratios:** * The radii of the spheres are chosen such that the radius of $S_i$ is slightly larger (by a factor of $1+1/d$) than the radius of $S_{i−1}$. This ensures that each ratio in the product is bounded: $$1 \leq \frac{\text{Vol}(S_i \cap R)}{\text{Vol}(S_{i-1} \cap R)} < e$$ * Each ratio can then be estimated using rejection sampling: * Uniformly sample points from $S_i \cap R$. * Calculate the fraction of these points that also lie within $S_{i-1} \cap R$. 4. **Bounding the Number of Spheres:** * The number of spheres needed is determined by the ratio of the largest sphere's radius to the smallest sphere's radius, denoted by $r$. * The number of spheres, $k$, is: $$k = O(\log_{1 + 1/d} r) = O(d \cdot \ln r)$$ * To achieve an overall volume estimate with an error of $1 \pm \epsilon$, each ratio needs to be estimated to within a factor of $1 \pm \frac{\epsilon}{ed \ln r}$. ### Uniform Sampling: The Key The success of this MCMC approach hinges on the ability to efficiently generate uniform random samples from the intersections of the convex set with the concentric spheres. This is achieved by superimposing a grid on the region of interest and performing a random walk on the grid points. * At each step: * One of the $2d$ coordinate neighbors is chosen with probability $1/2d$. 
* The walk moves to the chosen neighbor if it lies within the set; otherwise, it stays at the current point.

The Markov chain associated with this random walk exhibits rapid mixing, leading to a polynomial-time algorithm for volume estimation.

![Screenshot 2024-09-03 at 12.59.56 PM](https://hackmd.io/_uploads/Hyp65kHnR.png)

### Key points
* MCMC provides an efficient way to estimate volumes of high-dimensional convex sets.
* The method hinges on reducing volume estimation to the task of uniform sampling.
* Rejection sampling is employed to estimate volume ratios.
* The number of spheres required is logarithmic in the ratio of the largest to smallest sphere's radius.
* The random walk on a grid enables efficient uniform sampling, crucial for the success of the algorithm.

## Convergence of Random Walks on Undirected Graphs
In the context of random walks on undirected graphs where the detailed balance equation holds ($\pi_x p_{xy} = \pi_y p_{yx}$), we can assign weights $w_{xy}$ to edges such that the transition probability from vertex $x$ to vertex $y$ is:
$$ p_{xy} = \frac{w_{xy}}{\sum_{y'} w_{xy'}} $$

### Stationary Distribution
* The stationary distribution $\pi$ is directly proportional to the vertex weights $w_x = \sum_{y} w_{xy}$:
$$\pi_x = \frac{w_x}{w_{total}}$$
where $w_{total} = \sum_{x'} w_{x'}$.
* This relationship stems from the fact that
$$w_x p_{xy} = w_x \frac{w_{xy}}{w_x} = w_{xy} = w_y p_{yx}$$
satisfying the detailed balance equation, which confirms $\pi$ as the stationary distribution.

### Convergence Time (Mixing Time)
* The speed at which the random walk approaches its stationary distribution is crucial for the efficiency of algorithms like Metropolis-Hastings and Gibbs sampling.
* **$\epsilon$-mixing time:** The minimum time $t$ required for the $\ell_1$ distance between the $t$-step running average distribution and the stationary distribution to be at most $\epsilon$, regardless of the starting distribution.
* **Constrictions (Bottlenecks):** Narrow passages in the graph can impede convergence, slowing down the mixing time.

### Normalized Conductance $\Phi$
* A quantitative measure of how constricted a graph is.
* For a subset of vertices $S$, let $\pi(S)$ denote the total stationary probability of vertices in $S$, i.e., $\sum_{x \in S} \pi_x$. The normalized conductance $\Phi(S)$ of $S$ is:
$$\Phi(S) = \frac{\sum_{(x,y) \in (S,\bar{S})} \pi_x p_{xy}}{\min(\pi(S), \pi(\bar{S}))}$$
* The normalized conductance of the entire Markov chain, denoted by $\Phi$, is the minimum normalized conductance over all non-empty proper subsets of vertices:
$$\Phi = \min_{\emptyset \neq S \subsetneq V} \Phi(S)$$
* A high $\Phi$ indicates good connectivity and is both necessary and sufficient for rapid mixing.
* Mixing time can be significantly shorter than the cover time (time to visit all states).
* For expanders (graphs with $\Phi$ bounded below by a constant), the mixing time is logarithmic in the number of states, implying exceptionally fast convergence.

**Theorem** The $\epsilon$-mixing time of a random walk on an undirected graph is bounded by:
$$ O\left( \frac{\ln( \frac{1}{\pi_{min}})}{\Phi^2 \epsilon^3} \right) $$
where:
* $\pi_{min}$ is the minimum stationary probability of any state.
* $\Phi$ is the normalized conductance of the Markov chain.

<font color=red>Bonus proof</font>

#### Applications of Normalized Conductance
The theorem above provides a powerful tool for analyzing the convergence behavior of random walks on various graph structures. By examining the normalized conductance, we can gain insights into the mixing time and efficiency of algorithms relying on these walks.
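As a concrete reading of the definition, the following brute-force sketch (an assumed example, not from the source) enumerates every non-empty proper subset $S$, evaluates $\Phi(S)$, and takes the minimum. It is exponential in the number of states and is meant only to illustrate the formula; the chain used is the $n = 6$ path with self-loops at the ends from the line example below, for which the minimum is $1/n$.

```python
import itertools
import numpy as np

def normalized_conductance(P, pi):
    """Brute-force Phi = min over non-empty proper subsets S of Phi(S), where
    Phi(S) = (sum over x in S, y not in S of pi_x * P[x, y])
             / min(pi(S), pi(complement of S)).
    Exponential in the number of states; for illustration only."""
    n = len(pi)
    best = np.inf
    for r in range(1, n):
        for S in itertools.combinations(range(n), r):
            S = list(S)
            Sbar = [x for x in range(n) if x not in S]
            flow = sum(pi[x] * P[x, y] for x in S for y in Sbar)
            best = min(best, flow / min(pi[S].sum(), pi[Sbar].sum()))
    return best

# Assumed example: random walk on a path with 6 vertices and self-loops at the
# two ends (as in the line example below); its stationary distribution is uniform.
n = 6
P = np.zeros((n, n))
for v in range(n):
    P[v, max(v - 1, 0)] += 0.5
    P[v, min(v + 1, n - 1)] += 0.5
pi = np.full(n, 1 / n)

print(normalized_conductance(P, pi))   # equals 1/n here, matching the example
```

For large chains this enumeration is infeasible, which is why normalized conductance is usually bounded analytically, as in the examples that follow.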
### Examples

#### Random Walk on a Line
Consider a random walk on a linear graph (essentially a path) with $n$ vertices. Self-loops with probability $1/2$ are added at the two endpoints, so that every edge is traversed with probability $1/2$; this also makes the chain aperiodic.

* **Transition Probabilities & Stationary Distribution:** Due to the self-loops, every edge has a transition probability of $1/2$. This symmetry leads to a uniform stationary distribution, where each vertex has a probability of $1/n$.
* **Minimum Normalized Conductance:**
    * We want to identify the subset of vertices $S$ with $\pi(S) \leq 1/2$ that minimizes the normalized conductance $\Phi(S)$.
    * The set $S$ that achieves this is the first half of the vertices (the first $\lfloor n/2 \rfloor$ vertices).
    * For this set:
        * The probability mass exiting $S$ is:
        $$\sum_{(x,y)\in(S,\bar{S})} \pi_x p_{xy} = \frac1{2n}$$
        (only one edge connects $S$ to its complement, and it carries probability mass $\pi_x p_{xy} = \frac1n \cdot \frac12$)
        * The probability mass inside $S$ is: $\pi(S) = 1/2$
        * Therefore: $\Phi(S) = (1/2n) / (1/2) = 1/n$
* **Convergence Time (Mixing Time):**
    * Applying the theorem with $\epsilon>0$, $\Phi = 1/n$, and $\pi_{min} = 1/n$, after
    $$O \left(\frac{\ln n}{(1/n)^2 \epsilon^3} \right) = O\left(\frac{n^2 \ln n}{\epsilon^3} \right)$$
    steps, $||\mathbf{a}(t) - \pi||_1 \leq \epsilon$.
    * This quadratic dependence on $n$ indicates slow convergence, implying that the random walk on a line does not exhibit rapid mixing.
    * It's worth noting that both the hitting time (expected time to reach a specific vertex) and the cover time (expected time to visit all vertices) for this graph are also $O(n^2)$.

#### Random Walk on a 2D Grid
Now, consider a random walk on an $n \times n$ grid (or lattice) in the plane.

* **Transitions and Stationary Distribution:**
    * At each step, the walk moves to one of its four neighboring points with equal probability $1/4$.
    * Boundary points, which have fewer than four neighbors, keep the remaining probability as a self-loop; this also makes the chain aperiodic.
    * Due to the symmetry of the transition probabilities ($p_{ij} = p_{ji}$) and the lemma discussed earlier, the stationary distribution is uniform, with each vertex having a probability of $1/n^2$.
* **Analyzing Normalized Conductance:**
    * We aim to find the minimum normalized conductance $\Phi$ by considering subsets $S$ with at most half the total probability mass: $\pi(S) \leq 1/2$.
    * **Case 1: Large Subsets $(|S| \geq n^2/4)$**
        * The subset $S$ with the fewest outgoing edges is likely to be a collection of complete columns (and possibly a partial column).
        * The number of edges leaving $S$ is at least $n$.
        * The total probability mass flowing out of $S$ is:
        $$\sum_{i\in S} \sum_{j \in \bar{S}} \pi_i p_{ij} \geq \Omega(n \cdot (1/n^2)) = \Omega(1/n)$$
        * The normalized conductance for this $S$ is:
        $$\Phi(S) \geq \frac{\Omega(1/n)}{\min(|S|/n^2, |\bar{S}|/n^2)} \geq \Omega\left(\frac1n \right)$$
    * **Case 2: Small Subsets $(|S| < n^2/4)$**
        * The subset $S$ that minimizes the number of outgoing edges is likely to be a square region at one of the corners of the grid.
        * The boundary of $S$ (points in $S$ adjacent to points in $\bar{S}$) has at least $2\sqrt{|S|}$ points.
        * The total probability mass flowing out of $S$ is:
        $$\sum_{i\in S} \sum_{j \in \bar{S}} \pi_i p_{ij} \geq \frac{c \sqrt{|S|}}{n^2}$$
        (for some constant $c$)
        * The normalized conductance for this $S$ is:
        $$\Phi(S) \geq \frac{c \sqrt{|S|}/n^2}{|S|/n^2} = \frac{c}{\sqrt{|S|}} \geq \Omega\left(\frac1n \right)$$
        (since $|S| < n^2/4$ implies $\sqrt{|S|} < n/2$)
* **Convergence Time (Mixing Time):**
    * In both cases, the normalized conductance $\Phi$ is $\Omega(1/n)$.
    * Applying the mixing time theorem with $\pi_{min} = 1/n^2$, the mixing time for a 2D grid is:
    $$O\left(\frac{\ln(n^2)}{(1/n)^2 \epsilon^3} \right) = O\left(\frac{n^2 \ln n}{\epsilon^3}\right)$$

<font color=red>Bonus examples:
* A Lattice in d Dimensions
* A Clique
* A Connected Undirected Graph
* The Gaussian Distribution on the Interval $[−1,1]$
</font>

## Electrical Networks and Random Walks
This section unveils a captivating connection between the seemingly disparate worlds of electrical networks and random walks on undirected graphs. We will explore how concepts from circuit theory, such as voltage, current, and resistance, can be elegantly translated into the language of probabilities and random walks, offering a fresh perspective on both domains.

### Key Concepts
* **Electrical Network:** A connected, undirected graph where each edge $(x, y)$ is assigned a resistance $r_{xy} > 0$. The conductance of an edge, denoted by $c_{xy}$, is simply the reciprocal of its resistance: $c_{xy} = 1/r_{xy}$.
* **Random Walk on Electrical Network:** Imagine a random walk on the underlying graph, where the transition probabilities are governed by the conductances. The probability of moving from vertex $x$ to vertex $y$ is given by $p_{xy} = c_{xy} / c_x$, where $c_x = \sum_y c_{xy}$ represents the total conductance at vertex $x$.
* **Stationary Distribution:** This random walk possesses a unique stationary probability distribution, where the probability of being at vertex $x$ is $\pi_x = \frac{c_x}{c_0}$ with $c_0 = \sum_x c_x$ being the total conductance of the network.
* **Harmonic Function:** A function $g$ defined on the vertices of the graph is called harmonic if, at any interior vertex $x$, its value is a weighted average of its values at the neighboring vertices $y$, with the transition probabilities $p_{xy}$ (which satisfy $\sum_y p_{xy} = 1$) as the weights; that is, $g(x) = \sum_y p_{xy}\, g(y)$.

#### Example: Harmonic Function
![Screenshot 2024-09-15 at 2.56.57 PM](https://hackmd.io/_uploads/HJSSSZN6A.png)

<font color=red>Bonus example</font>

### Bridging the Gap
The relationship between electrical networks and random walks is rich and multifaceted. Let's explore some of the key connections:
* **Voltages as Harmonic Functions:** When an electrical network is subjected to a voltage source between two vertices, say $a$ and $b$, with $v_a = 1$ and $v_b = 0$, the voltages at all other vertices in the network form a harmonic function.
* **Probabilistic Interpretation of Voltages:** The voltage at a vertex $x$ can be interpreted as the probability that a random walk starting at $x$ will reach vertex $a$ before reaching vertex $b$.
* **Probabilistic Interpretation of Current:** The current flowing through an edge $xy$, denoted by $i_{xy}$, corresponds to the net frequency with which a random walk from $a$ to $b$ traverses the edge $xy$ before reaching $b$.
* **Effective Resistance and Escape Probability:** * The effective resistance between $a$ and $b$, denoted by $r_{eff}$, is given by $r_{eff} = \frac{v_a}{i_a}$, where $i_a$ is the current flowing into the network at $a$ and out at $b$. * The effective conductance is simply the reciprocal: $c_{eff} = \frac1{r_{eff}}$. * The escape probability, $p_{escape}$, is the probability that a random walk starting at $a$ reaches $b$ before returning to $a$. * These quantities are related by: $p_{escape} = \frac{c_{eff}}{c_a}$. ### Key Takeaways * The interplay between electrical networks and random walks offers a powerful lens through which we can analyze and understand both phenomena. * Harmonic functions play a central role in bridging the gap between these two seemingly distinct domains. * Concepts like effective resistance and escape probability provide valuable insights into the behavior of random walks, particularly on infinite graphs. ## Random Walks on Undirected Graphs with Unit Edge Weights This section focuses on the specific scenario of random walks on undirected graphs where all edges carry equal weight. This implies a uniform probability of traversing any edge from a given vertex, akin to an electrical network where all edges have the same resistance. We will explore key questions about the expected time it takes for a random walk to reach specific vertices or cover all vertices within the graph. ### Hitting Time * **Definition:** The hitting time, denoted by $h_{xy}$, is the expected number of steps a random walk starting at vertex $x$ takes to reach vertex $y$. * **Key Observations:** * Counterintuitively, adding edges to a graph can either increase or decrease the hitting time between two vertices. * Hitting time is not symmetric; that is, $h_{xy}$ may not equal $h_{yx}$, even in undirected graphs. **Lemma (Path Traversal Time)** The expected time for a random walk to traverse a path consisting of $n$ vertices is $\Theta(n^2)$. <font color=red>Bonus proof</font> **Lemma (Time Spent at Intermediate Vertices)** In a random walk from vertex $1$ to vertex $n$ in a chain of $n$ vertices, the expected time spent at an intermediate vertex $i$ (where $2 \leq i \leq n-1$) is $2(n-i)$. <font color=red>Bonus proof</font> #### Illustrative Examples ![Screenshot 2024-09-15 at 1.56.02 PM](https://hackmd.io/_uploads/Sk0xPlNaA.png) <font color=red>Bonus example</font> ### Commute Time * **Definition:** The commute time, denoted by $\text{commute}(x, y)$, is the expected time for a random walk starting at vertex $x$ to reach vertex $y$ and then return to $x$. * **Relationship to Hitting Time:** $\text{commute}(x, y) = h_{xy} + h_{yx}$ * **Symmetry:** Commute time is symmetric; that is, $\text{commute}(x, y) = \text{commute}(y, x)$. **Theorem** In a connected, undirected graph where each edge is equivalent to a 1-ohm resistor, the commute time between vertices $x$ and $y$ is given by: $$ \text{commute}(x, y)= 2m r_{xy} $$ where * $r_{xy}$ is the effective resistance between $x$ and $y$. * $m$ is the total number of edges in the graph. <font color=red>Bonus proof</font> This theorem establishes a profound connection between commute time in random walks and effective resistance in electrical networks **Corollary** * If $x$ and $y$ are directly connected by an edge, then $h_{xy} + h_{yx} \leq 2m$. * For any two vertices $x$ and $y$ in an $n$-vertex graph, $\text{commute}(x, y) \leq n^3$. 
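The commute-time theorem can be checked numerically. Below is a minimal simulation sketch (an assumed example, not from the source): on a path with $n$ vertices and unit-resistance edges, the effective resistance between the two endpoints is $r_{xy} = n-1$ and $m = n-1$, so the theorem predicts $\text{commute}(x, y) = 2(n-1)^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def hitting_time(n, start, target, trials=2000):
    """Empirical expected number of steps for a random walk on a path of n
    vertices (unit edge weights) to go from `start` to `target`."""
    total = 0
    for _ in range(trials):
        v, steps = start, 0
        while v != target:
            if v == 0:                 # an endpoint has a single neighbour
                v = 1
            elif v == n - 1:
                v = n - 2
            else:                      # interior vertex: step left/right with prob 1/2
                v = v + 1 if rng.random() < 0.5 else v - 1
            steps += 1
        total += steps
    return total / trials

n = 8                       # path with 8 vertices, m = n - 1 = 7 unit-resistor edges
h_xy = hitting_time(n, 0, n - 1)
h_yx = hitting_time(n, n - 1, 0)
m, r_xy = n - 1, n - 1      # effective resistance between the endpoints is n - 1 ohms
print(h_xy + h_yx, "vs 2 m r_xy =", 2 * m * r_xy)   # both close to 2(n-1)^2 = 98
```

With `n = 8` the simulated commute time typically lands within a few percent of $2m\,r_{xy} = 98$.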
<font color=red>Bonus proof</font>

### Cover Time
* **Definition:** The cover time, denoted by $\text{cover}(x, G)$, is the expected time for a random walk starting at vertex $x$ to visit every vertex in graph $G$ at least once.
* **Cover Time of $G$:** The cover time of the graph $G$, denoted by $\text{cover}(G)$, is the maximum cover time over all possible starting vertices in $G$. It represents the worst-case scenario for covering all vertices.
* **Impact of Adding Edges:** Similar to hitting time, adding edges to a graph can either increase or decrease its cover time. This non-monotonicity underscores the intricate relationship between graph structure and random walk behavior.

**Theorem (Upper Bound on Cover Time)** The cover time of a connected graph $G$ with $n$ vertices and $m$ edges is at most $4m(n-1)$.

<font color=red>Bonus proof</font>

**Theorem (Cover Time and Effective Resistance)** Let $G$ be an undirected graph with $m$ edges. The cover time for $G$ is bounded by:
$$ m \cdot r_{eff}(G) \leq \text{cover}(G) \leq 6e \cdot m \cdot r_{eff}(G) \cdot \ln(n) + n $$
where:
* $e \approx 2.718$ is the base of the natural logarithm.
* $r_{eff}(G)$ is the maximum effective resistance between any two vertices in $G$.

<font color=red>Bonus proof</font>

This theorem establishes a crucial link between cover time and effective resistance, offering a way to estimate cover time based on electrical network properties.

## Random Walks in Euclidean Space
This section extends our exploration of random walks from the discrete realm of graphs to the continuous domain of Euclidean space. We will establish connections between random walks on lattices (discrete grids) and their continuous counterparts, and investigate the intriguing phenomenon of escape probabilities: the chance of a random walk venturing to infinity without returning to its origin.

### Random Walks on Lattices
* **Central Question:** In a random walk on a finite segment of a lattice, does the walk inevitably return to the origin, or is there a chance of it escaping to the boundary?
* **Escape Probability:** This is the probability that a random walk, starting at the origin, reaches the boundary before returning to its starting point.
* **Electrical Network Analogy:** We can gain insights into this problem by drawing a parallel to electrical networks. Replace each edge in the lattice with a 1-ohm resistor, transforming the lattice into an electrical circuit.
* **Calculating Escape Probability:** The escape probability can be computed using the following formula:
$$p_{escape} = \frac{c_{eff}}{c_a}$$
where:
    * $c_{eff}$: Effective conductance between the origin and the boundary points
    * $c_a$: Sum of conductances at the origin (equal to $2d$ in a $d$-dimensional lattice)

### One Dimension
* **The Setup:** The electrical network equivalent to a 1D lattice is two series connections of $n$ 1-ohm resistors arranged in parallel.
* **Behavior as $n$ Grows:** As $n$ (the size of the lattice segment) approaches infinity, the effective resistance $r_{eff}$ between the origin and the boundary also tends to infinity. Consequently, the escape probability $p_{escape}$ approaches zero.
* **Interesting Note:** While the walk is guaranteed to return, the expected time for this return (the commute time between the origin and a point one step away) is infinite!

### Two Dimensions
* **The Setup:** Imagine a square grid in the plane, with the origin at its center. We consider increasingly larger squares as boundaries.
* **Simplifying the Network:** To analyze the effective resistance, we can "short" (connect directly) the resistors along the boundary of each square. This simplification results in a linear network with an increasing number of resistors connected in parallel.
* **Resistance Calculation:** The resistance between shorted square $i$ and square $i+1$ in this linear network is $\frac{1}{4(2i + 1)}$.
* **Effective Resistance:** The effective resistance $r_{eff}$ between the origin and the boundary is at least $\Theta(\ln n)$:
$$r_{eff} \geq \sum_{i = 0}^{n-1} \frac{1}{4(2i + 1)} = \frac14 \sum_{i = 0}^{n-1} \frac1{2i+1} = \Theta(\ln n)$$
where $n$ is the number of concentric squares between the origin and the boundary.
* **Conclusion:** As the square boundary grows larger ($n \to \infty$), the escape probability $p_{escape}$ approaches zero. Thus, even in an unbounded two-dimensional lattice, a random walk will eventually return to the origin with probability 1.

![Screenshot 2024-09-15 at 2.25.37 PM](https://hackmd.io/_uploads/S16yRlN6R.png)

### Three Dimensions
* **The Dynamics:** In a three-dimensional lattice, the resistance along any path to infinity increases. However, the number of possible paths to infinity also grows.
* **The Outcome:** In this case, there are enough parallel paths such that the effective resistance $r_{eff}$ remains finite. This leads to a non-zero escape probability, meaning there's a chance the random walk will never return to its origin.

<font color=red>Bonus explanation</font>

## The Web as a Markov Chain
The World Wide Web, with its vast network of interconnected pages, can be elegantly modeled as a Markov chain. This perspective allows us to apply the powerful tools and insights from random walk theory to understand and analyze the web's structure and dynamics. One particularly impactful application lies in the realm of search engine ranking, where algorithms like PageRank leverage the principles of random walks to determine the relative importance of web pages.

### Search Engine Ranking: The Challenge
Search engines face the daunting task of presenting users with an ordered list of web pages that are most relevant to their queries. This involves several key steps:
1. **Identifying Relevant Pages:** A "reverse index" is used to quickly pinpoint pages containing the query terms.
2. **Ranking Pages:** All web pages are ranked offline, ensuring swift responses to user queries without the need for real-time computations.
3. **Presenting Results:** The query-relevant pages are displayed in the pre-computed ranked order, providing users with the most pertinent information first.

### Random Walks to the Rescue
One approach to ranking web pages is to model the web as a directed graph, often referred to as the web graph. In this graph:
* Web pages represent vertices.
* Hyperlinks between pages act as directed edges.

A random walk on this graph can then be used to rank pages based on their stationary probability, which reflects the long-term frequency with which a random surfer would visit each page.

However, challenges arise when applying random walks directly to the web graph:
* **Disconnected Components:** The web graph may not be strongly connected, meaning that some pages may be unreachable from others.
* **Dangling Nodes:** Pages with no outgoing links (dangling nodes) can cause the random walk to terminate prematurely.
* **Spider Traps:** Groups of pages with only internal links can trap the random walk, preventing it from exploring the rest of the web.
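The remedy developed in the next subsections, a random-surfer walk with uniform restarts computed by power iteration, is compact enough to sketch here. The five-page link structure below is an assumed toy example (not from the source), and the update matches the PageRank equation $\mathbf{p} = \frac{\alpha}{n}(1,\dots,1) + (1-\alpha)\mathbf{p}A$ given later.

```python
import numpy as np

# Assumed toy web graph on 5 pages: each entry lists a page's out-links.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2, 4], 4: []}   # page 4 is a dangling node
n, alpha = 5, 0.15                                       # alpha = teleport probability

# Row-normalized link matrix A; a dangling node gets a uniform row (1/n each).
A = np.zeros((n, n))
for page, outs in links.items():
    if outs:
        A[page, outs] = 1 / len(outs)
    else:
        A[page, :] = 1 / n

# Power iteration on  p = (alpha/n) * 1  +  (1 - alpha) * p A
p = np.full(n, 1 / n)
for _ in range(100):
    p = alpha / n + (1 - alpha) * p @ A

print(np.round(p, 3), p.sum())   # PageRank vector; the entries sum to 1
```

Giving the dangling page a uniform row is one way to realize the "teleport with probability 1 from dangling nodes" rule described in the next subsection.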
### The Solution: Random Teleports To address these challenges, we introduce the concept of random teleports (also known as random restarts or jumps). At each step of the random walk: * With probability $r$, the surfer jumps to a randomly chosen page on the web. * With the remaining probability $1-r$, the surfer follows a random outgoing link from the current page (if one exists). * For dangling nodes, the teleport probability is set to $r = 1$. This modification ensures that the Markov chain representing the web graph is ergodic (connected and aperiodic), guaranteeing the existence of a unique stationary distribution. ### PageRank: The Essence of Web Importance PageRank is defined as the stationary probability of a vertex in the web graph with random teleports. Typically, the teleport probability $r$ is set to around 0.15. * **Interpretation:** PageRank measures the long-term frequency with which a random surfer would visit a page, providing a quantitative measure of its importance or authority within the web. * **Return Time:** The return time of a page, which is the expected time for a random walk starting at that page to return to it, is inversely proportional to its PageRank: $$\text{return time} = 1 / \text{PageRank}$$ * **Impact of Short Cycles:** Short cycles in the web graph, such as self-loops or reciprocal links, can artificially inflate a page's PageRank by reducing its return time. However, the random teleport mechanism mitigates this effect. ### PageRank, Hitting Time, and Spam * **Hitting Time:** The hitting time $h_y$ is the average time it takes for a random walk starting from a random node to reach a specific vertex $y$. * **Connecting PageRank and Hitting Time:** The return time of a page is related to its hitting time and the teleport probability: $$2(1-r)^2 + (2-r) \cdot r \cdot \text{hitting time} \leq \ \text{return time} \leq \ 1/r + \text{hitting time}$$ * **Computing Hitting Time:** While hitting time can be computed for a single vertex by removing its out-edges and running PageRank, this approach is inefficient for computing hitting times for all vertices. Random walks can be used to estimate hitting times for vertices with relatively low hitting times. * **Spam and Countermeasures:** Some web pages may attempt to artificially inflate their PageRank through techniques like creating short cycles or employing a "star" structure with many incoming links. Search engines combat such spam by: * Adjusting the random teleport frequency * Identifying pages with high PageRank but suspiciously low hitting time ### Personalized PageRank * **Normal PageRank:** In the standard PageRank algorithm, random restarts occur uniformly at random, meaning any page on the web has an equal chance of being chosen as the restart point. * **Personalized PageRank:** This variation introduces a personalized probability distribution for random restarts, often favoring a specific vertex or a neighborhood of vertices. This allows for tailoring search results to individual users or specific topics. * **Challenge:** The introduction of personalized teleports can potentially disconnect the web graph, requiring careful handling to ensure the existence of a stationary distribution. ### Computing PageRank * **Mathematical Formulation:** * Let $\alpha$ be the teleport probability. * Let $A$ be the adjacency matrix of the web graph, with rows normalized to sum to one. * Let $\mathbf{p}$ be the PageRank vector. 
* The PageRank equation is: * $\mathbf{p} = \frac{\alpha}{n} (1, 1, \dots, 1) + (1-\alpha) \mathbf{p} A$ * **Computation Methods:** * In principle, the PageRank vector can be computed by solving the following equation using matrix inversion: $$\mathbf{p} = \frac{\alpha}{n} (1, 1, \dots, 1) [I - (1-\alpha) A]^{-1}$$ * However, for the massive web graph, iterative methods like the power method (repeated matrix multiplication) are more practical and efficient. * **Personalized PageRank Computation:** * Replace the uniform restart vector with the personalized distribution $s$. * Computation can still be performed using a random walk, but faster specialized algorithms exist. <font color=red>Bonus implementation</font> ## Reference * Chapter 4 of Blum A, Hopcroft J, Kannan R. Foundations of Data Science. Cambridge University Press; 2020. ## Further Reading ### Markov Chain * Lawler, G. F., & Limic, V. (2010). Random Walk: A Modern Introduction. Cambridge: Cambridge University Press. * David A. Levin, Yuval Peres. Markov Chains and Mixing Times: Second Edition. American Mathematical Society, Providence, RI, 2017. [link](https://pages.uoregon.edu/dlevin/MARKOV/markovmixing.pdf) * Brémaud, P. (2020). Markov Chains: Gibbs Fields, Monte Carlo Simulation and Queues [Texts in Applied Mathematics (TAM, volume 31)]. Springer Cham. https://doi.org/10.1007/978-3-030-45982-6  ### MCMC * Ravenzwaaij, D., Cassey, P., & Brown, S.D. (2018). A simple introduction to Markov Chain Monte–Carlo sampling. Psychon Bull Rev, 25, 143–154. https://doi.org/10.3758/s13423-016-1015-8   * Andrieu, C., de Freitas, N., Doucet, A., & Jordan, M. I. (2003). An Introduction to MCMC for Machine Learning. Machine Learning, 50, 5–43. https://doi.org/10.1023/A:1020281327116   ### PageRank * Austin, D. (2006, December). How Google Finds Your Needle in the Web's Haystack. American Mathematical Society. [link](https://www.ams.org/publicoutreach/feature-column/fcarc-pagerank) * Kamvar, S., Haveliwala, T., Manning, C., & Golub, G. (2003, January). Extrapolation methods for accelerating PageRank computations. In Proceedings of the 12th international conference on World Wide Web (pp. 261-270). * Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.