# What are divergences - appendix
## Appendix: How divergences arise
(Code for the examples is at https://gist.github.com/martinmodrak/7f2111e8ed4eeac12f87f9bb3dc947a9)
TODO: this part still requires additional work, so will maybe not be part of a first "release"
TODO: maybe display velocity as primary quantity (because velocity is where the error is)
In each iteration of a Hamiltonian Monte Carlo algorithm, we simulate the trajectory of a frictionless particle over a surface defined by the negative logarithm of our target density. Since we cannot do this exactly, we approximate the trajectory by taking multiple small discrete steps. Here's what this can look like for the log density of a single univariate normal distribution (which is just a quadratic curve):

The points represent the location of the particle at each step and the horizontal blue line represents the instantaneous momentum of the particle. We see that the particle gains momentum as it moves towards the region of high density, then decelerates as it moves away from it and finally starts moving back. This makes intuitive sense: we want the sampler to be attracted towards regions with high density.
Each step is computed by a [leapfrog integrator](https://en.wikipedia.org/wiki/Leapfrog_integration), which assumes that at the scale of the timestep, the log density is reasonably well approximated by a straight line.
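To make this linearization concrete, here is a minimal sketch in Python of a single leapfrog step (the standard normal target and the starting values are illustrative assumptions, not necessarily what the figures use). The gradient of the negative log density is evaluated only at the endpoints of the step, so within the step the log density is effectively treated as linear:

```python
def grad_neg_log_density(q):
    # Gradient of the potential U(q) = -log p(q), dropping constants.
    # For a standard normal target, U(q) = q**2 / 2, so U'(q) = q.
    return q

def leapfrog_step(q, p, eps):
    """One leapfrog step of size eps: half-step the momentum, full-step
    the position, half-step the momentum again. The gradient is only
    evaluated at the endpoints of the step."""
    p = p - 0.5 * eps * grad_neg_log_density(q)  # half-step for momentum
    q = q + eps * p                              # full step for position
    p = p - 0.5 * eps * grad_neg_log_density(q)  # second momentum half-step
    return q, p

q, p = -1.5, 1.0                     # illustrative starting point
q, p = leapfrog_step(q, p, eps=0.1)
print(q, p)
```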
The initial momentum of the particle is sampled randomly for each iteration, and there are some other important details about when the trajectory is terminated and how the next sample is chosen from the points on the trajectory, but those are not relevant for understanding divergences - see [A Conceptual Introduction to Hamiltonian Monte Carlo](https://arxiv.org/abs/1701.02434) for a more complete treatment.
A big question is: how big an error do we introduce by treating the log density as roughly linear within each step? Here, Hamiltonian Monte Carlo gives us a great diagnostic. Since the simulation mimics a physical system, we expect the total energy of the system to be conserved. We can therefore compute the difference between the energy at the start of the trajectory and the current energy, and use it to show how "high" (in terms of log density) the integrator would expect the point to be if energy were conserved. We mark those "expected" points by green "X" shapes:

There are some minor discrepancies, but that should not be too worrisome. In fact, the leapfrog integrator is _symplectic_, which means that those small errors are guaranteed not to accumulate, but to roughly cancel out. HMC includes a correction for those errors and, as long as they are not substantial, handles them seamlessly.
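A sketch of this energy bookkeeping, again assuming a standard normal target and illustrative initial conditions: the Hamiltonian (potential plus kinetic energy) is constant along the exact trajectory, so its deviation from the initial value measures the integration error.

```python
import numpy as np

def U(q):
    # Potential: negative log density of a standard normal (constants dropped).
    return 0.5 * q**2

def grad_U(q):
    return q

def hamiltonian(q, p):
    # Total energy: potential plus kinetic term (unit mass).
    return U(q) + 0.5 * p**2

def energy_errors(q, p, eps, n_steps):
    """Simulate a leapfrog trajectory and return the energy error at
    each step; for the exact trajectory this would be identically zero."""
    H0 = hamiltonian(q, p)
    errors = []
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(q)
        q += eps * p
        p -= 0.5 * eps * grad_U(q)
        errors.append(hamiltonian(q, p) - H0)
    return np.array(errors)

errors = energy_errors(q=-1.5, p=1.0, eps=0.1, n_steps=40)
# The error oscillates around zero instead of accumulating - the
# hallmark of a symplectic integrator.
print(np.abs(errors).max())
```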
It is probably not surprising that if we make the step size too large, the assumption of approximate linearity will break down and we will see large discrepancies - in the plot below, we increased the step size roughly six times:

With the previous step size, our particle never moved much past `x = -2`, but now it moves way beyond, which would lead to non-representative samples. But the sampler can detect this, because the integrator now finds itself at a very different energy level than it "expects" - the green crosses are far from the simulated points. **Big discrepancies in energy of this type are signalled as divergences**.
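A sketch of how this detection can work, assuming the same standard normal target: a trajectory is flagged once the energy error exceeds a fixed threshold (a simplified version of the check Stan performs, where the default threshold is 1000). Note that for this quadratic potential the leapfrog integrator only becomes unstable once the step size exceeds 2, so the step sizes below are chosen for illustration rather than to match the figures.

```python
import numpy as np

def grad_U(q):          # standard normal target: U(q) = q**2 / 2
    return q

def hamiltonian(q, p):
    return 0.5 * q**2 + 0.5 * p**2

def is_divergent(q, p, eps, n_steps, max_delta_H=1000.0):
    """Flag the trajectory as divergent if the energy ever rises more
    than max_delta_H above its starting value."""
    H0 = hamiltonian(q, p)
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(q)
        q += eps * p
        p -= 0.5 * eps * grad_U(q)
        delta = hamiltonian(q, p) - H0
        if not np.isfinite(delta) or delta > max_delta_H:
            return True
    return False

# For this quadratic potential the integrator is stable up to eps = 2,
# so a small step passes while a larger one blows up within a few steps.
print(is_divergent(q=-1.5, p=1.0, eps=0.1, n_steps=40))  # False
print(is_divergent(q=-1.5, p=1.0, eps=2.2, n_steps=40))  # True
```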
But we cannot just make the step size very small - with too small a step size, the sampler moves slowly and requires a lot of computation to provide representative samples of the posterior. During warmup, the sampler tries to find a step size that is large enough to let it efficiently simulate trajectories that move far away from the initial point, while being small enough to avoid large inaccuracies. Such a step size will exist if the curvature of the log density does not differ much from point to point.
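The tradeoff can be seen numerically. The sketch below integrates the same assumed standard normal target for a fixed trajectory length with different step sizes: halving the step size doubles the number of gradient evaluations but cuts the worst energy error roughly fourfold.

```python
def grad_U(q):          # standard normal target again
    return q

def max_energy_error(q, p, eps, total_time):
    H0 = 0.5 * q**2 + 0.5 * p**2
    worst = 0.0
    for _ in range(round(total_time / eps)):
        p -= 0.5 * eps * grad_U(q)
        q += eps * p
        p -= 0.5 * eps * grad_U(q)
        worst = max(worst, abs(0.5 * q**2 + 0.5 * p**2 - H0))
    return worst

# Same trajectory length, different step sizes: smaller steps cost more
# gradient evaluations but the leapfrog error shrinks with eps**2.
for eps in [0.4, 0.2, 0.1, 0.05]:
    print(f"eps={eps:<5} steps={round(5.0 / eps):<4} "
          f"max|dH|={max_energy_error(-1.5, 1.0, eps, 5.0):.2e}")
```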
The log of the normal density has perfectly even curvature (its second derivative is constant) and is thus the most well-behaved of densities. It is therefore easy to find a step size that works well for a normal density. But what happens if we make the curvature very uneven, e.g. by replacing the left part with a smooth but very rapid downward turn? Here, we are keeping the original (smaller) step size:

We see that the turn is sharp enough not to appear approximately linear at this step size, and a discrepancy in energy is detected. There are however a few things to note:
a) the highest discrepancy does not necessarily occur close to the problematic point in the geometry
b) when the trajectory diverges, the sampler will pick the next sample from the portion of the trajectory that still looked OK.
So although the locations of samples from divergent trajectories often cluster somewhat close to the problematic region, we cannot rely on the location too much.
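The exact density behind these figures is in the code linked above; as a stand-in, the sketch below bolts a steep but smooth exponential wall onto the normal log density near `x = -2` (a hypothetical choice of ours, not necessarily the one in the figures) and records where along the trajectory the energy error peaks:

```python
import numpy as np

def U(q):
    # Standard normal potential plus a steep but smooth exponential wall
    # near q = -2 - an assumed stand-in for the density in the figures.
    return 0.5 * q**2 + np.exp(-10.0 * (q + 2.0))

def grad_U(q):
    return q - 10.0 * np.exp(-10.0 * (q + 2.0))

q, p, eps = 0.0, -3.0, 0.1   # heading left, fast enough to reach the wall
H0 = U(q) + 0.5 * p**2
positions, errors = [q], [0.0]
for _ in range(60):
    p -= 0.5 * eps * grad_U(q)
    q += eps * p
    p -= 0.5 * eps * grad_U(q)
    positions.append(q)
    errors.append(U(q) + 0.5 * p**2 - H0)

# The wall sits near q = -2, but the energy error typically peaks only
# after the particle has been kicked away from it - the location of the
# worst discrepancy need not coincide with the problematic geometry.
worst = int(np.argmax(np.abs(errors)))
print(f"worst energy error {errors[worst]:.2f} at q = {positions[worst]:.2f}")
```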
What happens when we instead add a ridge that is small compared to the chosen step size? One possibility is that in some iterations, the sampler will simply miss it:

But when the trajectory does not step over the feature, the result is a discrepancy in energy (and thus likely a divergent transition).

"Divergence" is a property of the whole trajectory - the discrepancy in energy can behave quite wildly across the trajectory and similarly to the previous case, the biggest discrepancies in energy can be in a different location than where the problematic geometry lies:

Choosing a sufficiently small step size (which can be forced by increasing `adapt_delta`) would let the sampler explore the ridge safely; however, it might be too small to also explore the rest of the posterior efficiently. If that happens, Stan will let you know, because an overly small step size manifests as reaching the maximum allowed treedepth. During warmup, Stan tries to find a step size that is efficient while avoiding problems. You will encounter divergent transitions after warmup only if such a step size cannot be found - e.g. because different parts of the target require different step sizes (as in this case), because there is a discontinuity or numerical inaccuracy in the log density which manifests at all step sizes, or for a myriad of other reasons. In any case, unlike many other samplers, HMC will not hide these problems from you.
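In practice you do not tune the integrator by hand - you raise the adaptation target so that warmup settles on a smaller step size. A sketch using CmdStanPy (the model and data file names are placeholders; rstan exposes the same options via its `control` list):

```python
from cmdstanpy import CmdStanModel

# "my_model.stan" and "my_data.json" are placeholders for your own files.
model = CmdStanModel(stan_file="my_model.stan")

# Raising adapt_delta above its 0.8 default makes warmup target a higher
# acceptance rate, which forces a smaller step size; raising max_treedepth
# above its default of 10 allows the longer trajectories that a smaller
# step size may then require.
fit = model.sample(
    data="my_data.json",
    adapt_delta=0.99,
    max_treedepth=12,
)
print(fit.diagnose())  # summarises divergences and treedepth saturation
```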