# safe ML
Research directions regarding safety and verification of machine learning methods.
1. Properties of Posteriors in Bayesian Modeling.
[Summary: when we have to use approximate inference methods on a fixed computational budget, how can we allocate our resources to best preserve desired properties of the exact solution?]
Bayesian modeling relates observed data $x$ to unknown variables $\theta$ through a joint distribution $p(x,\theta) = p(x|\theta)p(\theta)$.
- $p(x|\theta)$ is the likelihood (or observation model)
- $p(\theta)$ is a prior that may be used to communicate knowledge about feasible values of the unknown $\theta$.
By conditioning on the observations $x$ we obtain the posterior $p(\theta|x)$ via Bayes' rule $p(a|b)=p(b|a)p(a)/p(b)$. The posterior is high for a realization $\theta=\theta_0$ of the unknown if $\theta_0$ has high density under the prior and if it explains the data well through the observation model. The process of conditioning on observed variables is called inference.
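For concreteness, here is a minimal sketch of Bayes' rule in a conjugate Beta-Bernoulli model, where the posterior has a closed form. The coin-flip data and the $\mathrm{Beta}(2,2)$ prior are illustrative choices, not taken from anything above.

```python
import numpy as np
from scipy import stats

# Illustrative Beta-Bernoulli model: theta is a coin's heads probability.
# Prior p(theta) = Beta(2, 2); likelihood p(x|theta) = Bernoulli(theta).
a, b = 2.0, 2.0
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # made-up observed flips

# Bayes' rule in closed form: the Beta prior is conjugate to the Bernoulli
# likelihood, so the posterior is Beta(a + #heads, b + #tails).
heads = x.sum()
tails = len(x) - heads
posterior = stats.beta(a + heads, b + tails)

# The posterior is high where the prior is high AND the data are well explained.
theta0 = 0.7
print("prior density at theta0    :", stats.beta(a, b).pdf(theta0))
print("posterior density at theta0:", posterior.pdf(theta0))
```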
Once a posterior is obtained, we can use it to predict new data $x^\star$. The predictive distribution has the same form as the likelihood, but integrates over all possible values of $\theta$, weighting each by its posterior density:
$$p(x^\star|x) = \int_\theta p(x^\star|\theta)\,p(\theta|x)\,d\theta$$
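Continuing the assumed Beta-Bernoulli example, here is a sketch of approximating this predictive integral by Monte Carlo: draw $\theta$ from the posterior and average the likelihood of $x^\star$ over the draws.

```python
import numpy as np
from scipy import stats

# Posterior from the assumed Beta-Bernoulli example above: Beta(8, 4).
posterior = stats.beta(8, 4)

# p(x*|x) = integral of p(x*|theta) p(theta|x) dtheta, approximated by
# averaging the likelihood of x* = 1 over posterior samples of theta.
theta_samples = posterior.rvs(size=100_000, random_state=0)
p_next_heads = np.mean(theta_samples)  # p(x*=1|theta) = theta, so average the samples
print("Monte Carlo p(x*=1|x):", p_next_heads)
print("Exact posterior mean :", posterior.mean())  # 8 / (8 + 4)
```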
Posteriors are intractable. To see why, expand the posterior:
$$p(\theta|x)=\frac{p(x|\theta)p(\theta)}{p(x)}=\frac{p(x|\theta)p(\theta)}{\int_\theta p(x,\theta)\,d\theta}$$
Usually $\theta$ is high-dimensional, and the integral in the $p(x)$ term makes computing the posterior intractable in general, so approximate inference methods are necessary.
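A small sketch of why the $p(x)$ term is the bottleneck: approximating the evidence integral on a grid needs $K$ points per dimension of $\theta$, so $K^D$ joint-density evaluations for a $D$-dimensional $\theta$. The 1-D Gaussian model below is made up purely for illustration.

```python
import numpy as np
from scipy import stats

# Toy 1-D model (illustrative): theta ~ N(0, 1), x_i ~ N(theta, 1).
x = np.array([0.8, 1.1, 0.4])

def joint(theta):
    # p(x, theta) = p(x | theta) p(theta)
    return stats.norm(theta, 1.0).pdf(x).prod() * stats.norm(0.0, 1.0).pdf(theta)

# Evidence p(x) by brute-force Riemann sum over a grid of theta values.
grid = np.linspace(-10.0, 10.0, 2001)
dtheta = grid[1] - grid[0]
evidence = sum(joint(t) for t in grid) * dtheta
print("p(x) ~", evidence)

# With a D-dimensional theta the same grid would need 2001**D evaluations,
# which is why exact posteriors are generally out of reach.
```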
Posterior Properties.
Priors and likelihoods impose particular properties on posteriors that hold when conditioning on any data $x$. It is not clear, without considering the details of a particular approximation, whether these properties continue to hold for approximate posteriors. Two examples of properties of true posteriors:
- If the prior places $0$ density on some $\theta_0$, so does the posterior.
- A Gaussian Process model used for regression from $x$ to $y$ has a parameter called the mean function $m(x)$. The posterior of this model has the property that when a new point $x^\star$ is far from every $x_i$ in the dataset, the prediction $y^\star$ is a random variable whose mean is close to $m(x^\star)$. For $x^\star$ close to $x_i, x_j$ in the dataset, $y^\star$ will have mean close to $y_i, y_j$. The model therefore has well-behaved predictions on unseen ranges of input data, a property that is far from guaranteed in other models such as general neural networks (see the sketch after this list).
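A minimal from-scratch Gaussian Process regression sketch of this reversion to the mean function; the RBF kernel, the zero mean function $m(x)=0$, and the training points are arbitrary illustrative choices.

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    # Squared-exponential kernel k(a, b) for 1-D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)

def m(x):
    # Mean function m(x) = 0, chosen for illustration.
    return np.zeros_like(x)

# Illustrative training data, deliberately centred away from m(x) = 0.
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train) + 2.0

# Standard GP posterior mean: m(x*) + K(x*, X) K(X, X)^{-1} (y - m(X)).
K = rbf(x_train, x_train) + 1e-6 * np.eye(len(x_train))
x_star = np.array([0.5, 50.0])  # one test point near the data, one far away
K_star = rbf(x_star, x_train)
mean_star = m(x_star) + K_star @ np.linalg.solve(K, y_train - m(x_train))

print("prediction near the data (close to nearby y_i):", mean_star[0])
print("prediction far from the data (reverts to m(x*) = 0):", mean_star[1])
```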
An example approximation method:
Variational inference (VI) approximates the posterior $p(\theta|x)$ with a parameterized distribution $q_\lambda(\theta)$. The parameters $\lambda$ can be optimized without computing $p(x)$ so that $q_\lambda(\theta) \to p(\theta|x)$. $\lambda$ is usually optimized by gradient descent (an approximate optimization algorithm) on an objective that measures the difference between $q$ and $p$: the Kullback-Leibler (KL) divergence. $KL(a(z)||b(z))$ is defined as:
$$ KL(a||b) = \int_z a(z) \log \frac{a(z)}{b(z)}\,dz $$
One can minimize $KL(q_\lambda||p)$ or $KL(p||q_\lambda)$ with respect to $\lambda$, but the two objectives are not equal and lead to different approximations.
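A quick numerical check of this asymmetry on a discretized grid; the bimodal $p$ and the single Gaussian $q$ below are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Discretize both densities on a grid and compare the two KL directions.
z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]

# Arbitrary bimodal "posterior" p and a single Gaussian "approximation" q.
p = 0.5 * stats.norm(-2, 0.5).pdf(z) + 0.5 * stats.norm(2, 0.5).pdf(z)
q = stats.norm(0, 1).pdf(z)

def kl(a, b):
    # KL(a || b) = integral of a(z) log(a(z)/b(z)) dz, approximated on the grid.
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dz

print("reverse KL(q || p):", kl(q, p))
print("forward KL(p || q):", kl(p, q))  # differs from the reverse direction
```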
An example of a posterior property that VI breaks:
Notice the denominator inside the logarithm in the expression for the KL. The forward KL $KL(p||q)$ has the property that $q$ will be overly diffuse: $q$ cannot place $\approx 0$ density anywhere $p$ has $>0$ density, otherwise the KL becomes very large (infinite if the density is exactly $0$). When $p$ has a small region of $0$ density between two non-zero regions and $q$ is somewhat smooth, $q$ will usually place a single non-zero region over that whole area. The resulting approximate posterior $q$ may therefore have non-zero density where $p$ has $0$ density. This violates the conclusion above that the prior determines the $0$-density regions of the posterior, and it further allows the predictive distribution to make predictions using values $\theta_0$ that have $0$ density under the prior, breaking domain-specific constraints that the modeler may have put in place.
The more common reverse KL $KL(q||p)$ tends to produce an approximation with lower variance than the true posterior: it may capture only one of two modes of the true posterior and underrepresent the true range of good solutions for $\theta$ (see the sketch below).
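A sketch contrasting the two directions on the same kind of bimodal target: fit a single Gaussian $q$ by brute-force grid search under each KL and inspect where its mass ends up. The target density and the search ranges are arbitrary illustrative choices; under these, the forward-KL fit spreads over the near-zero gap (the violation described above), while the reverse-KL fit latches onto one mode.

```python
import numpy as np
from scipy import stats
from itertools import product

z = np.linspace(-10, 10, 4001)
dz = z[1] - z[0]

# Bimodal "true posterior" p with a near-zero-density gap between the modes.
p = 0.5 * stats.norm(-2, 0.5).pdf(z) + 0.5 * stats.norm(2, 0.5).pdf(z)

def kl(a, b):
    # Grid approximation of KL(a || b); the tiny floor on b avoids log(0).
    mask = a > 0
    b = np.maximum(b, 1e-300)
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dz

def best_gaussian(direction):
    # Brute-force search over (mean, std) of q under the chosen KL direction.
    best_val, best_params = np.inf, None
    for mu, sigma in product(np.linspace(-3, 3, 61), np.linspace(0.2, 4, 39)):
        q = stats.norm(mu, sigma).pdf(z)
        val = kl(p, q) if direction == "forward" else kl(q, p)
        if val < best_val:
            best_val, best_params = val, (mu, sigma)
    return best_params

# Forward KL: q is pulled wide, placing mass over the gap where p is ~0.
print("forward KL(p||q) fit (mu, sigma):", best_gaussian("forward"))
# Reverse KL: q collapses onto a single mode and ignores the other one.
print("reverse KL(q||p) fit (mu, sigma):", best_gaussian("reverse"))
```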