# Kaporin notes
### Essentials from [Kap94]
Let $P$ be a preconditioner for $Ax=b$, and $H=P^{-1}$. Define
the quantity:
$$
K(HA) = \frac1n \text{trace}(HA) \det(HA)^{-1/n} = \frac{\frac1n \sum_{i=1}^n \lambda_i}{\left(\prod_{i=1}^n \lambda_i\right)^{1/n}},
$$
where $\lambda_1, \ldots, \lambda_n$ denote the eigenvalues of $HA$.
In [Kap94], it is shown that
$$
\| x - x_k\|_{A^T HA} \leq (K(HA)^{n/k} - 1)^{k/2} \| x - x_0\|_{A^T HA},
$$
or equivalently:
$$
\| r_k\|_{H} \leq (K(HA)^{n/k} - 1)^{k/2} \| r_0\|_{H},
$$
where
$$
r_k = b - A x_k = A(x - x_k), \quad k=1,\ldots,n-1.
$$
They also estimate the number $k_H$ of PCG iterations needed to reach
a residual $\| r_k\|_{H}$ less than $\epsilon>0$:
$$
k_H = \left\lceil n \log_2 K(HA) + \log_2 \tfrac1\epsilon \right\rceil.
$$
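As a sanity check, $K(HA)$ and the iteration estimate $k_H$ are easy to compute numerically. The sketch below uses a random SPD matrix and a Jacobi preconditioner $P = \text{diag}(A)$, both chosen purely for illustration; it confirms $K(HA) \ge 1$, as forced by the AM-GM inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random SPD test matrix A and a Jacobi preconditioner P = diag(A),
# so that H = P^{-1} is diagonal. Both choices are purely illustrative.
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
H = np.diag(1.0 / np.diag(A))

# Eigenvalues of HA are real and positive (HA is similar to the SPD
# matrix H^{1/2} A H^{1/2}).
lam = np.linalg.eigvals(H @ A).real

# Kaporin condition number: arithmetic mean over geometric mean of the
# eigenvalues, hence K >= 1 by the AM-GM inequality.
K = lam.mean() / np.exp(np.log(lam).mean())

# Iteration estimate k_H = roundUp(n log2 K + log2(1/eps)).
eps = 1e-8
k_H = int(np.ceil(n * np.log2(K) + np.log2(1.0 / eps)))

print(K, k_H)
```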
### Connection with the Bregman divergence
When $\text{trace}(HA)=n$ we have
$$
\ln(K(HA)^n) = D(A, P) = D(A, H^{-1}),
$$
where $D$ denotes the log-det Bregman divergence, $D(X, Y) = \text{trace}(XY^{-1}) - \ln\det(XY^{-1}) - n$. Note that this functional is convex in $H$.
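This identity can be tested numerically. The sketch below assumes $D$ is the log-det Bregman divergence $D(X,Y)=\text{trace}(XY^{-1})-\ln\det(XY^{-1})-n$ (the form consistent with the identities later in these notes), rescales $H$ so that $\text{trace}(HA)=n$, and uses random SPD matrices chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)   # random SPD "system" matrix
C = rng.standard_normal((n, n))
P = C @ C.T + n * np.eye(n)   # random SPD "preconditioner"
H = np.linalg.inv(P)

# Rescale H so that trace(HA) = n, the normalisation assumed above.
H *= n / np.trace(H @ A)
P = np.linalg.inv(H)

lam = np.linalg.eigvals(H @ A).real
K = lam.mean() / np.exp(np.log(lam).mean())

def D(X, Y):
    # Log-det Bregman divergence D(X, Y) (assumed form).
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

lhs = n * np.log(K)   # ln(K(HA)^n)
rhs = D(A, P)         # D(A, P) = D(A, H^{-1})
print(lhs, rhs)
```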
> Question: $K(HA)^n$ vs $\ln(K(HA)^n)$: are we minimising "the wrong objective"?
In our recent work, we optimise $D(P, A)$. What is the connection with
Kaporin's functional?
First, observe the following identity
$$
D(A,P) = \text{trace}(PA^{-1} + AP^{-1}) - D(P,A) - 2n.
$$
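A quick numerical check of this identity (again assuming the log-det Bregman divergence $D(X,Y)=\text{trace}(XY^{-1})-\ln\det(XY^{-1})-n$; the two log-det terms cancel when $D(A,P)$ and $D(P,A)$ are added, which is exactly what the identity expresses):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 15
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
C = rng.standard_normal((n, n))
P = C @ C.T + n * np.eye(n)

def D(X, Y):
    # Log-det Bregman divergence D(X, Y) (assumed form).
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

lhs = D(A, P)
rhs = np.trace(P @ np.linalg.inv(A) + A @ np.linalg.inv(P)) - D(P, A) - 2 * n
print(lhs, rhs)
```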
> NB: I think it is hard to say anything meaningful about $\text{trace}(PA^{-1} + AP^{-1})$. Note in particular that $\text{trace}(HA)=n$ does not tell us much, if anything, about $\text{trace}(PA^{-1})$.
Recalling the condition $\text{trace}(HA)=n$,
$$
\ln(K(P^{-1}A)^n) = D(A, P) = \text{trace}(PA^{-1}) - D(P,A) - n.
$$
In our work, we minimise $D(P,A)$ (or, equivalently, maximise $-D(P,A)$). It is not clear to me what is going on here, since our
approach seems quite disconnected from minimisation of $\ln(K(P^{-1}A)^n)$. In other words:
**What is the connection between minimisation of $D(P,A)$ and minimisation of $\text{trace}(PA^{-1}) - D(P,A) - n$?**
> Question: can we use $D(P,A)$ in our "own version" of PCG convergence theory?
### Information geometry perspective vs Dhillon/Tropp approach
In information geometry, if $C$ is a submanifold of $M$, $P\in M$, and
$P^\star \in C$ minimises
$$
Q \mapsto D(P, Q)
$$
over $C$, then $P^\star$ is the geodesic projection of $P$ onto $C$. There is no formalism I know of that treats $D(P^\star, P)$ (the dual formalism still keeps $P$ in the first argument of the divergence). This appears odd, since $D$ is convex in its first argument!
Bregman projections, as seen in the papers of Kulis, Dhillon and Tropp usually look at minimisation of $X \mapsto D(X, Y)$ for some fixed $Y$. What are we missing here?
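For concreteness, here is a minimal sketch of a Dhillon/Tropp-style Bregman projection $X \mapsto D(X, Y)$ in the log-det case, with the constraint set taken to be positive diagonal matrices. Setting the gradient $Y^{-1} - X^{-1}$ to zero on the diagonal gives the closed form $X^\star = \text{diag}(\text{diag}(Y^{-1}))^{-1}$. Everything here (the assumed divergence, the constraint set, the random test matrix) is an illustrative assumption, not a claim about the cited papers.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
C = rng.standard_normal((n, n))
Y = C @ C.T + n * np.eye(n)   # random SPD matrix, for illustration

def D(X, Y):
    # Log-det Bregman divergence D(X, Y) (assumed form).
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

# Over positive diagonal X with entries d_i, the objective is
# sum_i d_i (Y^{-1})_{ii} - sum_i ln d_i + const, minimised at
# d_i = 1/(Y^{-1})_{ii}.
Xstar = np.diag(1.0 / np.diag(np.linalg.inv(Y)))

# Spot-check optimality against random diagonal perturbations; the
# objective is convex over positive diagonals, so Xstar is a global min.
base = D(Xstar, Y)
worse = all(
    D(Xstar + np.diag(0.1 * rng.standard_normal(n)), Y) >= base
    for _ in range(50)
)
print(base, worse)
```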
### Note on truncations
We have shown that truncation based on $D(P, A)$ leads to a "sorting rule" according to
$$
\frac{1}{1+x} - \ln\left(\frac{1}{1+x}\right) - 1.
$$
If we were to solve the same problem with the arguments of the divergence switched (i.e. $D(A, P)$), we would instead obtain the sorting rule
$$
1 + x - \ln(1 + x) - 1 = x - \ln(1 + x).
$$
Numerical experiments with truncations indicate that PCG converges a bit faster with the first rule (our current notion of a Bregman truncation) than with the second. What is going on here? I have plotted both curves below:
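A small numerical sketch of the two sorting rules (the function names are mine): both are nonnegative, vanish only at $x = 0$, and are images of one another under the substitution $x \mapsto \frac{1}{1+x} - 1$, so they penalise positive and negative $x$ in opposite proportions.

```python
import numpy as np

def rule_D_PA(x):
    # Sorting rule from truncation based on D(P, A).
    return 1.0 / (1.0 + x) - np.log(1.0 / (1.0 + x)) - 1.0

def rule_D_AP(x):
    # Sorting rule from truncation based on D(A, P).
    return x - np.log(1.0 + x)

xs = np.linspace(-0.9, 3.0, 500)
f1, f2 = rule_D_PA(xs), rule_D_AP(xs)

# rule_D_PA(x) = rule_D_AP(1/(1+x) - 1): the rules are reflections of
# each other, so rule_D_AP penalises x > 0 more heavily, while
# rule_D_PA penalises x < 0 more heavily.
print(rule_D_AP(1.0) > rule_D_PA(1.0), rule_D_PA(-0.5) > rule_D_AP(-0.5))
```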

### Possible explanations...
We find a solution to
$$
Ax = b
$$
by transforming it via a preconditioner $M = P^{-1}$ into
$$
MAx = Mb.
$$
We can rewrite this system by setting $y\gets Ax$, $z\gets Mb$:
$$
My = z.
$$
Let us recall that
$$
\kappa_2(MA) = \kappa_2(PA^{-1}).
$$
Based on this result, and classic PCG convergence analysis, we can precondition $My=z$ by $A$:
$$
AMy = Az.
$$
However, the fact that $P$ is a good preconditioner for $A$ does not imply that $A$ is a good preconditioner for $P^{-1}$. **This is very easy to verify numerically.** I believe the essence is captured by the following identity:
$$
D(A,P) = \text{trace}(PA^{-1} + AP^{-1}) - D(P,A) - 2n,
$$
which implies that there is never symmetry in preconditioning (in the sense given above), except when $P = A$, in which case the identity above reads:
$$
D(A,P) = -D(P,A) = 0.
$$
Let us assume without loss of generality that
$$
\text{trace}(P^{-1}A) = n.
$$
Then,
$$
D(P,A) + D(A,P) = \text{trace}(PA^{-1}) - n.
$$
> Question: if minimisation of $D(A,P)$ is consistent with Kaporin's convergence theory, why is it that we see faster PCG convergence for the truncation based on minimisation of $D(P, A)$?
## Notes
* Write about Fletcher's BFGS paper from 1991 and $D(A,P)$ vs $D(P,A)$.
* Can we solve the low-rank problem using a symmetrised Jensen-Shannon divergence, or an $\alpha$-$\beta$ divergence?
* Look at the Newton decrement and the connection between $\|\cdot\|_{A^{-1}}$ and $\|\cdot\|_A$.
* Write a section about the geometry of PCG