# Kaporin notes
### Essentials from [Kap94]
Let $P$ be a preconditioner for $Ax=b$, and $H=P^{-1}$. Define
the quantity:
$$
K(HA) = \frac1n \text{trace}(HA) \det(HA)^{-1/n} = \frac{\frac1n \sum_{i=1}^n \lambda_i}{\left(\prod_{i=1}^n \lambda_i\right)^{1/n}},
$$
where $\lambda_1, \ldots, \lambda_n$ denote the eigenvalues of $HA$.
In [Kap94], it is shown that
$$
\| x - x_k\|_{A^T HA} \leq (K(HA)^{n/k} - 1)^{k/2} \| x - x_0\|_{A^T HA},
$$
or equivalently:
$$
\| r_k\|_{H} \leq (K(HA)^{n/k} - 1)^{k/2} \| r_0\|_{H},
$$
where
$$
r_k = b - A x_k = A(x - x_k), \quad k=1,\ldots,n-1.
$$
They also estimate the number $k_H$ of PCG iterations needed to reach
a residual $\| r_k\|_{H}$ less than $\epsilon>0$:
$$
k_H = \left\lceil n \log_2 K(HA) + \log_2 \tfrac1\epsilon \right\rceil.
$$
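As a sanity check, $K(HA)$ and the iteration estimate $k_H$ are easy to compute numerically. The sketch below uses a random SPD matrix and a Jacobi preconditioner $P = \text{diag}(A)$, both chosen purely for illustration; it confirms $K(HA) \ge 1$, as forced by the AM-GM inequality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random SPD test matrix A and a Jacobi preconditioner P = diag(A),
# so that H = P^{-1} is diagonal. Both choices are purely illustrative.
n = 50
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
H = np.diag(1.0 / np.diag(A))

# Eigenvalues of HA are real and positive (HA is similar to the SPD
# matrix H^{1/2} A H^{1/2}).
lam = np.linalg.eigvals(H @ A).real

# Kaporin condition number: arithmetic mean over geometric mean of the
# eigenvalues, hence K >= 1 by the AM-GM inequality.
K = lam.mean() / np.exp(np.log(lam).mean())

# Iteration estimate k_H = roundUp(n log2 K + log2(1/eps)).
eps = 1e-8
k_H = int(np.ceil(n * np.log2(K) + np.log2(1.0 / eps)))

print(K, k_H)
```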
### Connection with the Bregman divergence
When $\text{trace}(HA)=n$ we have
$$
\ln(K(HA)^n) = D(A, P) = D(A, H^{-1}),
$$
where $D$ denotes the log-det Bregman divergence, $D(X, Y) = \text{trace}(XY^{-1}) - \ln\det(XY^{-1}) - n$. Note that this functional is convex in $H$.
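This identity can be tested numerically. The sketch below assumes $D$ is the log-det Bregman divergence $D(X,Y)=\text{trace}(XY^{-1})-\ln\det(XY^{-1})-n$ (the form consistent with the identities later in these notes), rescales $H$ so that $\text{trace}(HA)=n$, and uses random SPD matrices chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)   # random SPD "system" matrix
C = rng.standard_normal((n, n))
P = C @ C.T + n * np.eye(n)   # random SPD "preconditioner"
H = np.linalg.inv(P)

# Rescale H so that trace(HA) = n, the normalisation assumed above.
H *= n / np.trace(H @ A)
P = np.linalg.inv(H)

lam = np.linalg.eigvals(H @ A).real
K = lam.mean() / np.exp(np.log(lam).mean())

def D(X, Y):
    # Log-det Bregman divergence D(X, Y) (assumed form).
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

lhs = n * np.log(K)   # ln(K(HA)^n)
rhs = D(A, P)         # D(A, P) = D(A, H^{-1})
print(lhs, rhs)
```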
> Question: $K(HA)^n$ vs $\ln(K(HA)^n)$: are we minimising "the wrong objective"?
In our recent work, we optimise $D(P, A)$. What is the connection with
Kaporin's functional?
First, observe the following identity
$$
D(A,P) = \text{trace}(PA^{-1} + AP^{-1}) - D(P,A) - 2n.
$$
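A quick numerical check of this identity (again assuming the log-det Bregman divergence $D(X,Y)=\text{trace}(XY^{-1})-\ln\det(XY^{-1})-n$; the two log-det terms cancel when $D(A,P)$ and $D(P,A)$ are added, which is exactly what the identity expresses):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 15
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)
C = rng.standard_normal((n, n))
P = C @ C.T + n * np.eye(n)

def D(X, Y):
    # Log-det Bregman divergence D(X, Y) (assumed form).
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

lhs = D(A, P)
rhs = np.trace(P @ np.linalg.inv(A) + A @ np.linalg.inv(P)) - D(P, A) - 2 * n
print(lhs, rhs)
```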
> NB: I think it is hard to say anything meaningful about $\text{trace}(PA^{-1} + AP^{-1})$. Note in particular that $\text{trace}(HA)=n$ does not tell us much, if anything, about $\text{trace}(PA^{-1})$.
Recalling the condition $\text{trace}(HA)=n$,
$$
\ln(K(P^{-1}A)^n) = D(A, P) = \text{trace}(PA^{-1}) - D(P,A) - n.
$$
In our work, we minimise $D(P,A)$ (or, equivalently, maximise $-D(P,A)$). It is not clear to me what is going on here, since our
approach seems quite disconnected from minimisation of $\ln(K(P^{-1}A)^n)$. In other words:
**What is the connection between minimisation of $D(P,A)$ and minimisation of $\text{trace}(PA^{-1}) - D(P,A) - n$?**
> Question: can we use $D(P,A)$ in our "own version" of PCG convergence theory?
### Information geometry perspective vs Dhillon/Tropp approach
In information geometry, if $C$ is a submanifold of $M$, $P\in M$, and
$P^\star \in C$ minimises
$$
Q \mapsto D(P, Q)
$$
over $C$, then $P^\star$ is the geodesic projection of $P$ onto $C$. There is no formalism I know of that treats $D(P^\star, P)$ (the dual formalism still keeps $P$ in the first argument of the divergence). This appears odd, since $D$ is convex in its first argument!
Bregman projections, as seen in the papers of Kulis, Dhillon and Tropp usually look at minimisation of $X \mapsto D(X, Y)$ for some fixed $Y$. What are we missing here?
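For concreteness, here is a minimal sketch of a Dhillon/Tropp-style Bregman projection $X \mapsto D(X, Y)$ in the log-det case, with the constraint set taken to be positive diagonal matrices. Setting the gradient $Y^{-1} - X^{-1}$ to zero on the diagonal gives the closed form $X^\star = \text{diag}(\text{diag}(Y^{-1}))^{-1}$. Everything here (the assumed divergence, the constraint set, the random test matrix) is an illustrative assumption, not a claim about the cited papers.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 8
C = rng.standard_normal((n, n))
Y = C @ C.T + n * np.eye(n)   # random SPD matrix, for illustration

def D(X, Y):
    # Log-det Bregman divergence D(X, Y) (assumed form).
    XYinv = X @ np.linalg.inv(Y)
    _, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

# Over positive diagonal X with entries d_i, the objective is
# sum_i d_i (Y^{-1})_{ii} - sum_i ln d_i + const, minimised at
# d_i = 1/(Y^{-1})_{ii}.
Xstar = np.diag(1.0 / np.diag(np.linalg.inv(Y)))

# Spot-check optimality against random diagonal perturbations; the
# objective is convex over positive diagonals, so Xstar is a global min.
base = D(Xstar, Y)
worse = all(
    D(Xstar + np.diag(0.1 * rng.standard_normal(n)), Y) >= base
    for _ in range(50)
)
print(base, worse)
```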
### Note on truncations
We have shown that truncation based on $D(P, A)$ leads to a "sorting rule" according to
$$
\frac{1}{1+x} - \ln\left(\frac{1}{1+x}\right) - 1.
$$
If we were to solve the same problem with the arguments of the divergence switched (i.e. $D(A, P)$), we would instead obtain the sorting rule
$$
1 + x - \ln(1 + x) - 1 = x - \ln(1 + x).
$$
Numerical experiments with truncations indicate that PCG converges a bit faster with the first rule (our current notion of a Bregman truncation) than with the second. What is going on here? I have plotted both curves below:
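A small numerical sketch of the two sorting rules (the function names are mine): both are nonnegative, vanish only at $x = 0$, and are images of one another under the substitution $x \mapsto \frac{1}{1+x} - 1$, so they penalise positive and negative $x$ in opposite proportions.

```python
import numpy as np

def rule_D_PA(x):
    # Sorting rule from truncation based on D(P, A).
    return 1.0 / (1.0 + x) - np.log(1.0 / (1.0 + x)) - 1.0

def rule_D_AP(x):
    # Sorting rule from truncation based on D(A, P).
    return x - np.log(1.0 + x)

xs = np.linspace(-0.9, 3.0, 500)
f1, f2 = rule_D_PA(xs), rule_D_AP(xs)

# rule_D_PA(x) = rule_D_AP(1/(1+x) - 1): the rules are reflections of
# each other, so rule_D_AP penalises x > 0 more heavily, while
# rule_D_PA penalises x < 0 more heavily.
print(rule_D_AP(1.0) > rule_D_PA(1.0), rule_D_PA(-0.5) > rule_D_AP(-0.5))
```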

### Possible explanations...
We find a solution to
$$
Ax = b
$$
by transforming it via a preconditioner $M = P^{-1}$ into
$$
MAx = Mb.
$$
We can rewrite this system by setting $y\gets Ax$, $z\gets Mb$:
$$
My = z.
$$
Let us recall that
$$
\kappa_2(MA) = \kappa_2(PA^{-1}).
$$
Based on this result, and classic PCG convergence analysis, we can precondition $My=z$ by $A$:
$$
AMy = Az.
$$
However, the fact that $P$ is a good preconditioner for $A$ does not imply that $A$ is a good preconditioner for $P^{-1}$. **This is very easy to verify numerically.** I believe the essence is captured by the following identity:
$$
D(A,P) = \text{trace}(PA^{-1} + AP^{-1}) - D(P,A) - 2n,
$$
which implies that there is never symmetry in preconditioning (in the sense given above), except when $P = A$, in which case the identity above reads:
$$
D(A,P) = -D(P,A) = 0.
$$
Let us assume without loss of generality that
$$
\text{trace}(P^{-1}A) = n.
$$
Then,
$$
D(P,A) + D(A,P) = \text{trace}(PA^{-1}) - n.
$$
> Question: if minimisation of $D(A,P)$ is consistent with Kaporin's convergence theory, why is it that we see faster PCG convergence for the truncation based on minimisation of $D(P, A)$?
## Notes
* Write about Fletcher's BFGS paper from 1991 and $D(A,P)$ vs $D(P,A)$.
* Can we solve the low-rank problem using a symmetrised Jensen-Shannon divergence, or an $\alpha$-$\beta$ divergence?
* Look at the Newton decrement and the connection between $\|\cdot\|_{A^{-1}}$ and $\|\cdot\|_A$.
* Write a section about the geometry of PCG