###### tags: `workfield`

# Workflow Can Often Only Be Determined in the Context of Data Encode

## Abstract

Given the initial data encode and its value $t_{obs}$, two repeated steps, (a) the modeler specifies an increasing model class and (b) the best model is found within the given class, are guaranteed to converge to the best representation system (by [martingale](https://hackmd.io/WTTU6YiTQ3SKm4qbRMQugA?both#4-Martingale)) with the help of four types of iterative update: parameter, measure of parameter, parameter-data, and replicated data. Based on the feedback from the best representation system, modelers calibrate their intention and update the data encode, which is a form closer to decision.

### 1, 2. $MCMC_{\theta}$, $MCMC_{\mu_{\theta}}$: self-consistent set for calibrated inference

$MCMC_{\theta}$ returns the sample representing $p(\theta|t)$, while $MCMC_{\mu_{\theta}}$ returns the sample from the calibrated parameter measure given a model pre-specification $\mu_{\theta|M}$. Both samples have the form of a $\theta$ set, but the former is the output for a specific $t$ while the latter is the calibrated inference for model+algorithm over a wide variety of possible future datasets, namely the self-consistent set.

### 3. $MCMC_{\theta-t}$: measure and sampler dual update for the replication system

Given a certain model specification, $\mu_{\theta|M}$ and $DS$ are a dual pair as a measure and a function; they are updated alternately to construct the best replication system that represents the data given a class of models. The simulated data measure $\tilde{t}$, generated from the self-consistent set by detailed balance between data and parameter, serves as a diagnostic tool for this convergence in the form of a predictive check. Information bottleneck theory and the balance equation that revolve around the Markov chain $\theta_{n}-t_{n}-\theta_{n+1}-t_{n+1}$, along with the SBC assumption $\mu_{\theta'}=\mu_{\theta}$, guide the convergence of $MCMC_{\mu_{\theta}}$ to its stationary measure.

### 4. Martingale: $\text{MCMC-Martingale}_{t}$ workflow Markov chain from $\mu_M$ calibration

When the replicated dataset cannot recover the given $t_{obs}$ from the given $\mu_M$, the modeler widens the support of $\mu_M$. This proceeds until $t_i := E[t|M_i]$ becomes close enough to $t$; this convergence can be shown using both martingale and Markov chain theory.

Fig. 1 explains how each MCMC and the martingale take part in the overall update that reaches the best model specification and computation tool replicating the given $t_{obs}$. Fig. 2 describes the former process as data being projected onto a given model specification; $MCMC_{\mu_{\theta}}$ and $MCMC_t$ are used as projection tools. This formulation teaches us to encode $t$ as close as possible to our final interest, as the workflow aims at the best replication of the given format of information summary.

![](https://i.imgur.com/WK6jzFH.png)
*Fig 1. Bottom-up view: evolving parameter, measure, and replicated data with MCMC.*

![](https://i.imgur.com/rE1Mag0.png)
*Fig 2. Top-down view: how $t$ is selected and how replication systems are constructed.*
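As a caricature of this outer loop, the following Python sketch assumes a deliberately minimal toy: the model class $M_i$ is a location parameter restricted to the support $[-i, i]$, and $E[t|M_i]$ is simply the projection of $t_{obs}$ onto that support. It only illustrates how widening $\mu_M$ drives $t_i := E[t|M_i]$ toward $t_{obs}$; the value of `t_obs`, the tolerance, and the projection rule are all hypothetical, not part of the paper's method.

```python
import numpy as np

# Toy caricature of the outer workflow loop: the "model class" M_i only allows a
# location parameter with support [-i, i]; E[t | M_i] is then the projection
# (clipping) of t_obs onto that support. Widening the support plays the role of
# the modeler expanding the model class until the replicated summary recovers t_obs.
t_obs = 3.7          # observed data encode (hypothetical value)
tolerance = 1e-3

t_path = []
for i in range(1, 20):                      # (a) modeler widens the model class
    t_i = np.clip(t_obs, -i, i)             # (b) best replication within M_i
    t_path.append(float(t_i))
    if abs(t_i - t_obs) < tolerance:        # predictive check against t_obs
        break

print(t_path)                               # increasing sequence converging to t_obs
```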
### Terminology (related to the generative model)

- $t$: information included after encode
- $t_{n} := \forall \text{info} / \text{encode}_n \rightarrow t_{opt} := \forall \text{info} / \text{interest}$ ($X/M$ as a quotient group)
- $MCMC_{\mu_{\theta}}$: iterating generate-inference multiple times until $\mu_{\theta'}=\mu_{\theta}$
- $MCMC_{t}$: adjusting the support of $\mu_M$ until $\mu_{t'}=\mu_{t}$
- $p(\tilde{t}|t)$: predictive check
- $\mu_M$: model specification, the simplest version of which model to use, without any quantification
- self-consistent set: samples from the stationary measure of $MCMC_{\mu_{\theta}}$; calibrated inference for model specification+computation tool over a wide variety of possible future datasets
- $\mu_{\theta|M}$: parameter measure given model specification
  - similar to a prior, which can be written as $p(\theta)$ and is determined by the included predictors, the densities for prior (gamma, inverse gamma) and likelihood (Poisson, negative binomial), prior parameter values, etc.
- function $DS|M$: $\mu_\theta \rightarrow \{\theta_1,..,\theta_a, t_1,..,t_b\}$ // $DS$: $\mu_M \rightarrow \{\theta_1,..,\theta_a, t_1,..,t_b\}$
  - aka model specification+computation tool, model+algorithm; updates of either the model or the computation algorithm are all included in this step
  - output $\tilde{t}:=\{t_1,..,t_b\}$ is used for predictive checks and determines further model development
  - decomposed as: $(\mu(\Theta), t) -Density\rightarrow p(\theta, t) -Sampler\rightarrow \theta_{1..a},t_{1..b}$
- $p(\theta|t)$: model // $p'(t|\theta)$: algorithm ($p'$ for the inclusion of an approximate computation tool)

## 1. $MCMC_\theta$: parameter sampler given $\mu_{\theta|M}, \mu_M, t$

HMC, Gibbs, etc., serve as components of $DS$. $\mu_{\theta|M}$ is given because $MCMC_\theta$ requires a posterior space, which is built upon a specific parameter measure.

## 2. $MCMC_{\mu_{\theta}}$: measure sampler given $DS, \mu_M, t$

### 1. Markov chain $\theta-\theta'$ for the self-consistent set

The stationary measure $\mu$, defined by $\mu(y) = \sum_{x} \mu(x) p(x, y)$, is another reading of the self-consistency condition in simulation-based calibration (SBC): $p(\theta')= \int d\theta \, dy \, p(\theta'|y)\, p(y|\theta)\, p(\theta)$. In fact $\{\theta\}_{1..N}$, constructed from the description in Fig. 4, is a Markov chain with transition probability $P(\theta, \theta') := \int dt \, p'(\theta'|t)p(t|\theta)$. A sketch of the proof of its convergence to the stationary measure is given in [appendix A](https://hackmd.io/EBKHRcNPSBuzljKtUkYSjg#A-Second-MCMC-Convergence). The **self-consistent set**, defined as samples from the stationary measure $\mu_{\theta}$, is the outcome of this first $\theta - \theta'$ Markov chain.

![](https://i.imgur.com/FvMg0IN.png)
*Fig.3 Simulation-based calibration constructs a Markov chain and returns $\tilde{\theta}, \tilde{t}$ upon convergence.*

![](https://i.imgur.com/w2o4nbc.png)
*Fig.4 Detailed balance between data, parameter, and hyperparameter.*

Only after the $\theta-\theta'$ chain has converged can a further Markov chain between data and parameter be constructed, which satisfies the detailed balance $p(\theta)p(t|\theta) = p(\theta|t)p(t)$. From this second Markov chain, data samples that satisfy detailed balance can be generated from the self-consistent set (Fig. 3, step 7). The detailed balance with a hyperparameter $\phi$, $p(\theta)p(\phi|\theta) = p(\phi)p(\theta|\phi)$, can be used to extend the Markov chain further (Fig. 4). Note that the recommended order of determination is from right to left, i.e. $p(\theta)$ determined from $p(t), p(t|\theta), p(\theta|t)$, not $p(t)$ from $p(\theta), p(t|\theta), p(\theta|t)$. See [Discussion](#6-Discussion) for details.
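A minimal sketch of this generate-inference chain, assuming a conjugate normal-normal model so that the inference step $p'(\theta'|t)$ can be drawn exactly: when inference is exact, the marginal of the chain stays at the prior (the SBC self-consistency) and the resulting draws form the self-consistent set. The settings `tau`, `sigma`, and the iteration count are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Generate-inference Markov chain theta -> t -> theta', i.e. the kernel
# P(theta, theta') = integral of p'(theta'|t) p(t|theta) dt, for a conjugate model:
#   prior  theta ~ N(0, tau^2),   data  t | theta ~ N(theta, sigma^2)  (one observation).
tau, sigma = 1.0, 0.5
n_iter = 10_000

theta = rng.normal(0.0, tau)            # initial draw from mu_theta
scs = []                                # self-consistent set: draws from the stationary measure
for _ in range(n_iter):
    t = rng.normal(theta, sigma)        # generate: t ~ p(t | theta)
    # infer: exact conjugate posterior p(theta' | t) = N(post_mean, post_var)
    post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
    post_mean = post_var * t / sigma**2
    theta = rng.normal(post_mean, np.sqrt(post_var))
    scs.append(theta)

# With exact inference the stationary measure equals the prior (SBC assumption
# mu_theta' = mu_theta); an approximate or biased inference step would shift it.
print(np.mean(scs), np.std(scs))        # approximately (0, tau)
```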
### 2. Convergence diagnostics

![](https://i.imgur.com/CnpINvd.png)
*Fig.5 Comparison of a posterior distribution to the assumed true parameter value, and the expected symmetry of the self-consistent set.*

Preservation of inflow and outflow at the stationary distribution, $\mu_\theta = \mu_{\theta'}$, known as the SBC assumption, is a necessary condition for convergence. The SBC rank histogram and ECDF ([Säilynoja21](https://arxiv.org/abs/2103.10522)) compare the two distributions on this basis. Viewing $\theta'$ and $\theta$ on the same joint $(\theta, y, \theta')$ space, the new random variable $\theta'- \theta$ can be inspected based on the definition of a symmetric random variable: $X\stackrel{D}{=}-X$. The distributions of $\theta'_{nm} -\theta_{n}$ and $\theta_{n} - \theta'_{nm}$ can be compared as an extension of the SBC rank statistic $\sum_{m=1}^{M} I(\theta'_{nm} < \theta_n)$; a small numerical sketch of this rank check follows the MCMC comparison below. Fig. 5 shows the problems of checking a model pointwise against a few true prior samples ([Gelman20](https://arxiv.org/abs/2011.01808)). The expected symmetry at the stationary measure, $p(\theta,\theta') = p(\theta', \theta)$ (proof in Appendix B), could provide insight for further diagnostic design.

The speed of convergence could be quantified by counting how often two independent chains $X_n, Y_n$ hit the same binned region, based on the following bound from the MCMC convergence proof using coupling theory:

$$\sum_{y}\left|P\left(X_{n}=y\right)-P\left(Y_{n}=y\right)\right| \leq 2 P(T>n)$$

### Comparing two MCMC

$MCMC_{\mu_{\theta}}$ vs $MCMC_\theta$

- space: $\theta$ vs $\theta$
- target: self-consistent set vs typical set
  - calibrated inference for model+algorithm over a wide variety of possible future datasets vs output for a specific $t$

Multiple methods, listed below, have been developed to accelerate and diagnose $MCMC_\theta$ as well as to transform it for favorable estimation. These could be equally applied to $MCMC_{\mu_{\theta}}$ and are left for further study. This is possible by designing a well-behaving kernel $p(\theta, \theta')$; for instance, bootstrap along with the infinitesimal jackknife could resolve the high computational cost of SBC.

- accelerate: parallel, coupling
- diagnose: Rhat
- variance reduction
- bias correction: coupled kernel [Xu2021](http://proceedings.mlr.press/v130/xu21i/xu21i.pdf)
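Here is a small sketch of the rank check $\sum_{m=1}^{M} I(\theta'_{nm} < \theta_n)$ mentioned in the diagnostics above, again assuming the conjugate normal-normal toy: for each prior draw $\theta_n$ we draw $M$ posterior samples $\theta'_{nm}$ and record the rank; under self-consistency the ranks are uniform on $\{0,\dots,M\}$. The crude comparison to the uniform CDF stands in for, and is not, the ECDF method of Säilynoja21.

```python
import numpy as np

rng = np.random.default_rng(2)

# SBC rank statistic sum_m I(theta'_{nm} < theta_n) under a conjugate normal-normal model.
tau, sigma = 1.0, 0.5
N, M = 2000, 50                           # N prior draws, M posterior draws each

post_var = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
ranks = np.empty(N, dtype=int)
for n in range(N):
    theta = rng.normal(0.0, tau)                       # theta_n ~ mu_theta
    t = rng.normal(theta, sigma)                       # t_n ~ p(t | theta_n)
    theta_prime = rng.normal(post_var * t / sigma**2,  # theta'_{nm} ~ p(theta' | t_n)
                             np.sqrt(post_var), size=M)
    ranks[n] = np.sum(theta_prime < theta)

# Under mu_theta' = mu_theta the ranks are uniform on {0, ..., M}; a large deviation
# of the empirical CDF from the uniform CDF flags a miscalibrated model+algorithm.
ecdf = np.array([np.mean(ranks <= r) for r in range(M + 1)])
uniform_cdf = (np.arange(M + 1) + 1) / (M + 1)
print(np.max(np.abs(ecdf - uniform_cdf)))              # small for a well-calibrated pair
```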
## 3. Finding the best projection onto a model class specified by the modeler

$MCMC_{\theta-t}$: dual update of $\mu_{\theta|M}$ and $DS|M$ given $\mu_M, t$ for the replication system.

Let's assume we have a model space wide enough to represent $t$, i.e. no problem with $\mu_M$. Then $t_{obs} \sim E[t|M]$ will hold, and the aim now is to find the best replicating system, $\mu_{\theta|M}$ and $DS|M$, that simulates $\tilde{t}$ similar to $E[t|M] = t_{obs}$. In sample-based analysis, replication is crucial: even if the support of $\mu_M$ is wide enough, if our computational tools are inaccurate it is impossible to replicate $t$. In an actual workflow, the density, $\mu_{\theta|M}$, and the approximate computation are tested, compared, and updated, which is an attempt to find the best possible explanation of $t$ within the range of the given model specification hyperplane. This manual update could ultimately reach the optimal replication system, $Proj_M(t) = E[t|M]$, but it would not hurt to make the update more systematic.

Fig. 6 shows the iterative update of the two components, $\mu_{\theta|M}$ and $DS|M$, and its convergence to $E[t|M] = t_{obs}$. To recap, the measure $\mu_{\theta|M}$ is the parameter measure given the model specification, and the function $DS|M$ is $\mu_\theta \rightarrow \{\theta_1,..,\theta_a, t_1,..,t_b\}$. For simplicity, dependence on the model specification and the data encode is dropped from the notation. I view this as the sequential evolution $...\theta_n - t_n - \theta_{n+1}...$, where the balance of the previous stationary measure, $\mu_{\theta_n}$, breaks as $DS$ ($\theta_n-t_n-\theta_{n+1}$) evolves. Then the new balance $\mu_{\theta_{n+1}}$ is achieved on the new field, $\theta_n-\theta_{n+1}$, with $MCMC_\mu$.

![](https://i.imgur.com/5LnNh1O.png)
*Fig.6 Dual update of $\mu_\theta, DS$ resulting in $t_n$ as their operation.*

Understanding their respective roles is important for the further explanation of the algorithm. The measure $\mu$ is updated only with $MCMC_{\mu_{\theta}}$, while the function $DS$ can be affected by a parameter update from $MCMC_\theta$, a sampling update (adding a predictor, Poisson to negative binomial likelihood), a posterior density update (MCMC to ADVI), or a hyperparameter update (adapting the metric or the leapfrog step). Measure and function are dual in the sense that there exists a unique corresponding manipulation in the dual pair that returns the same operational output. What I am proposing is to alternate the update of each in order to drive the output, $t_n$, to its optimum, which seems to have a connection with the EM algorithm. In simple terms, with another known dual pair, a function $f$ and a functional $\langle g,\cdot\rangle$, I am trying to achieve the convergence to $t$ via the iteration of i) and ii):

i) $\langle g , f\rangle_{n} = t_n \rightarrow t$: update $\langle\cdot\rangle_n$ with fixed $f$
ii) $\langle g , f_{n}\rangle = t_n \rightarrow t$: update $f_n$ with fixed $\langle\cdot\rangle$

In the following self-consistent equality, $[p(t|\theta), p(\theta|t), p(t)]$ and $[p(\theta), p(\theta'|\theta)]$ are dual:

![](https://i.imgur.com/xujBtNN.png)

The convergence proof for the following update is suggested in [Tishby99](https://arxiv.org/abs/physics/0004057) under the name of the information bottleneck iterative algorithm, which finds the optimal encoding $T$ of $X$ that balances the efficiency of the encoding and its representability of the output $Y$. Here $\theta, \theta', t$ play the roles of $X, Y, T$.

$$
p_{n}(t \mid \theta)=\frac{p_{n}(t)}{Z_{n}(\theta, \beta)} \exp (-\beta d(\theta, t)) \\
p_{n+1}(t)= \Sigma_{\theta} p(\theta) p_{n}(t|\theta) \\
p_{n+1}(\theta'|t)= \Sigma_{\theta}p(\theta'|\theta) p_{n}(\theta|t)
$$

#### $\mu_{\theta|M}$ update

Updating the measure corresponds to the fourth and fifth steps of $MCMC_{\mu_{\theta}}$ in Fig. 3 and could be formulated as:

$$
p(t)= \Sigma_{\theta} p_n(\theta) p(t|\theta) \\
p(\theta'|t)= \Sigma_{\theta}p_n(\theta'|\theta) p(\theta|t)
$$

#### $DS$ update

Updating the function could be written as follows, by plugging in the measure update output, $p(\theta'|\theta) = 1$ and $p(\theta) = \mu_\theta$:

$$
p_{n}(t \mid \theta)=\frac{p_{n}(t)}{Z_{n}(\theta, \beta)} \exp (-\beta d(\theta, t)) \\
p_{n+1}(t)= \Sigma_{\theta} \mu_\theta p_{n}(t|\theta) \\
p_{n+1}(\theta'|t)= \Sigma_{\theta} p_{n}(\theta|t)
$$

Unlike the original proof, $p(\theta)$ is not fixed but is readjusted to $\mu_{\theta_n}$ at every step, as the $DS$ update leads to a different transition probability, $\int dt \, p_n'(\theta'|t)p_n(t|\theta) \neq \int dt \, p_{n+1}'(\theta'|t)p_{n+1}(t|\theta)$, and therefore a different stationary measure. Hopefully this minor deviation from the original does not affect the overall convergence greatly.
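To make the alternating update concrete, here is a minimal discrete implementation of the generic information bottleneck iteration of [Tishby99], with $\theta$, $\theta'$, $t$ in the roles of $X$, $Y$, $T$. The joint $p(\theta, \theta')$ is a random toy table, `beta` is an arbitrary trade-off value, and the distortion is $d(\theta, t) = KL\big(p(\theta'|\theta)\,\|\,p(\theta'|t)\big)$; this is a sketch of the textbook algorithm, not of the $\mu_{\theta|M}$/$DS$ update itself.

```python
import numpy as np

rng = np.random.default_rng(3)

# Discrete information-bottleneck iteration (Tishby et al. 1999) with theta, theta', t
# playing X, Y, T.  p(theta, theta') is a random toy joint distribution.
n_theta, n_thetap, n_t = 6, 6, 3
p_joint = rng.dirichlet(np.ones(n_theta * n_thetap)).reshape(n_theta, n_thetap)
p_theta = p_joint.sum(axis=1)                         # p(theta)
p_tp_given_theta = p_joint / p_theta[:, None]         # p(theta' | theta)

beta = 5.0
p_t_given_theta = rng.dirichlet(np.ones(n_t), size=n_theta)    # initial encoder p(t | theta)

for _ in range(200):
    p_t = p_theta @ p_t_given_theta                                      # p_{n+1}(t)
    p_theta_given_t = p_t_given_theta * p_theta[:, None] / p_t[None, :]  # Bayes: p(theta | t)
    p_tp_given_t = p_theta_given_t.T @ p_tp_given_theta                  # p_{n+1}(theta' | t)
    # distortion d(theta, t) = KL( p(theta'|theta) || p(theta'|t) )
    d = np.array([[np.sum(p_tp_given_theta[i] *
                          np.log(p_tp_given_theta[i] / p_tp_given_t[k]))
                   for k in range(n_t)] for i in range(n_theta)])
    logits = np.log(p_t)[None, :] - beta * d                   # log p_n(t) - beta * d(theta, t)
    p_t_given_theta = np.exp(logits - logits.max(axis=1, keepdims=True))
    p_t_given_theta /= p_t_given_theta.sum(axis=1, keepdims=True)        # normalise over t

print(p_t_given_theta.round(3))    # converged soft assignment of theta to the bottleneck t
```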
### 2. Convergence diagnostic

To test the replication quality, as in the 'Test' column of Fig. 6, $\tilde{t}_n$ is simulated by pushing the last self-consistent set forward along the second equation of the [DS update](#DS-update). Iterate until $\tilde{t}_n$ becomes close enough to $t_{obs}$. However, if $p(\tilde{t}|t_{obs})$ is not good enough even after convergence, i.e. $d(\tilde{t}, t_{obs}) >\epsilon$ and non-decreasing, this implies $\mu_M$ is problematic (recall the first assumption). In this case, $\mu_M$ should be updated.

## 4. Modeler specifies an increasing model class: $MCMC_t$ data sampler given $t$

### 1. Workflow Markov chain from $\mu_M$ calibration

As modelers increasingly expand their model, the sequence $\{t_i\}_{1..N}$ with $t_i := E[t|M_i]$ will ultimately converge to $t$ by the martingale convergence theorem under mild conditions. Moreover, under a reasonable kernel construction it also forms a Markov chain, which is useful for showing convergence.

### 2. Convergence diagnostics

Graphical and numerical predictive checks $p(t'|t)$, discussed in [Gabry19](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/rssa.12378) and Gelman13, would serve as the convergence diagnostics.

## 5. Experiment

[plot link](https://hackmd.io/tZPcC4MAR8Wr8Gd32cwc3Q)

### $MCMC_{\mu_{\theta}}$

- convergence measure: $\operatorname{Var}(\theta_{post})/\operatorname{Var}(\theta_{prior}) \in [0.8,1.2] \; \forall$ parameters (a small sketch of this check appears at the end of this section)
- linear regression $y = \alpha + \beta * X$
  - tighter initial priors $p_0(\theta)$ = faster convergence (17 and 12 iterations for $N(0,5^2)$ and $N(0,1^2)$, respectively)

![](https://i.imgur.com/1lSgOVi.png)
*SCS_eightschool*

- hierarchical model (eight schools) with centered/non-centered parameterization (cp results only, 0513)
  - much harder to converge with a greater number of parameters: took 222 iterations

![](https://i.imgur.com/d4LhBgK.png)
*SCS_eightschool*
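A sketch of the convergence measure used above: given draws from the prior and from the current self-consistent set (one column per parameter), declare convergence when every parameter's variance ratio falls in $[0.8, 1.2]$. The function name and the stand-in arrays are hypothetical, not the experiment's actual output.

```python
import numpy as np

def scs_converged(prior_draws: np.ndarray, scs_draws: np.ndarray,
                  lower: float = 0.8, upper: float = 1.2) -> bool:
    """Variance-ratio stopping rule: Var(theta_post) / Var(theta_prior) in [lower, upper]
    for every parameter (columns index parameters)."""
    ratio = scs_draws.var(axis=0) / prior_draws.var(axis=0)
    return bool(np.all((ratio >= lower) & (ratio <= upper)))

# Hypothetical stand-in draws for a model with two parameters (alpha, beta).
rng = np.random.default_rng(4)
prior = rng.normal(0.0, 5.0, size=(4000, 2))
scs = rng.normal(0.0, 5.2, size=(4000, 2))       # e.g. the current self-consistent set
print(scs_converged(prior, scs))
```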
## 6. Discussion

### Data encode as the center of EDA

This paper explains how the workflow aims to best replicate the given format of information summary. However, as modelers' interests are often inexact or the specific form of data might be unavailable, several trials and errors are needed to reach $t_{opt} :=\forall \text{info} / \text{interest}$, the quotient of all information with respect to our interest equivalence. Although $t$ was assumed to be given in this paper, its update could itself form a higher-level MCMC where modelers act as kernels. Data $t$ is much closer to users than the parameter measure $\mu_\theta$ or the parameter $\theta$, and therefore a much more intuitive and speedy interaction is expected when the workflow interface centers around it. [Afrabandpey20](https://arxiv.org/abs/1910.09358), for instance, relates interpretability to users' preferences and utility function rather than the prior. Therefore, from Fig. 1, the workflow should be designed from right to left (data -> parameter -> hyperparameter), not the other way around. The following section introduces one such example.

### Self-consistent set instead of prior

Pointwise model checking and pairwise model comparison can burden users. A modeler who knows the true parameter range and is searching for valid computation could instead check the overlap between that range and the self-consistent set determined by the SBC iteration. This simple change of view has an effect similar to recasting an integer optimization so that it can be solved with convex optimization tools. This is in line with [data-focused EDA](#Data-encode-as-the-center-of-EDA), where users start by applying model+algorithm to their _data_ instead of starting from a relatively unknown and latent parameter space. Moreover, the suggested approach could save the effort allocated to prior predictive checks. The prior is known to serve two purposes: additional information equivalent to data, or the posterior marginalized with respect to data. Since no outside data is required during SBC, the SCS addresses the second of these two roles.

### SBC would eventually fail: how much?

The current problems of SBC (without iteration) are that 1) nothing gives a perfect SBC test result (perfect prior and posterior distributional equivalence), so SBC has to take the form of a hypothesis test, and 2) there is no metric for how much the model+algorithm (= density+sampler) is off. With a self-consistent set (SCS), both problems could be addressed by calculating the overlap between the user's own theta set (whether it comes from domain knowledge or from inference on a dataset) and the SCS.

### SCS instead of model+algorithm choice

As long as the samples from the conditional space ($\mu_{\theta|M}$ from [terminology](#Terminology)) are the same, the underlying components (model + algorithm) might not be important. In other words, say combination1 is {$y = \alpha + \beta*x$ + HMC} and combination2 is {$y = \beta*x$ + Laplace approximation}. Using the [trans-dimensional MCMC notation](https://www.stat.ubc.ca/~bouchard/courses/stat520-sp2014-15/lecture/2015/03/08/notes-lecture5.html), which addresses differences in latent space dimension by a product space construction, SCS1 $= [-1,+1]\times[-1, +1]\times[0]$ and SCS2 $= [0]\times[0]\times[-1, +1]$; the check is whether the user's targeted inference region (from the regression example in the [experiment](#5-Experiment)) lies within the SCS. Likewise, SCS1, which is $(\alpha, \beta) \in [-1,+1]\times[-1, +1]$, is consistent with our domain knowledge of the model's intercept and slope. A minimal sketch of such an overlap check follows.
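A minimal sketch of the overlap check, assuming the user's targeted region for $(\alpha, \beta)$ is an axis-aligned box and the SCS is available as samples (both stand-ins for the regression example): the fraction of SCS draws falling inside the user's region is one crude overlap metric, not a proposal for the definitive statistic.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in self-consistent set for (alpha, beta) from the regression example.
scs_samples = rng.uniform(-1.0, 1.0, size=(5000, 2))

# User's targeted inference region (from domain knowledge or a previous fit),
# expressed as an axis-aligned box [low, high] per parameter.
user_region = np.array([[-0.5, 0.5],     # alpha
                        [ 0.0, 1.0]])    # beta

inside = np.all((scs_samples >= user_region[:, 0]) &
                (scs_samples <= user_region[:, 1]), axis=1)
overlap = inside.mean()                  # fraction of SCS draws inside the user's region
print(f"overlap = {overlap:.2f}")        # low overlap flags a model+algorithm mismatch
```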
## 7. Further development

- interpretation with online decision and information-bottleneck theory
- suggest using the self-consistent set as a prior and apply this to Stan and posteriordb
- [enhancing the $MCMC_{\mu_\theta}$ kernel for faster convergence or favorable estimation](#Comparing-two-MCMC)
- user-defined $t$
- specify $MCMC_t$ convergence further with $p(t'|t)$
- [understand Information Bottleneck Theory's application to deep learning](https://www.quantamagazine.org/new-theory-cracks-open-the-black-box-of-deep-learning-20170921/#comments)
- explore the connection with the Self-Consistent Field (SCF) method in quantum chemistry, which expresses solutions with fixed basis functions, $\psi = c_1\phi_1 + ... + c_n\phi_n$, then iterates the assign-update steps $f_1(c_1,..,c_n) = E_1[c_1]$, $f_2(c_1,..,c_n) = E_2[c_2]$

> The wavefunction is too complex to be found directly, but can be approximated by a simpler wavefunction. This then enables the electronic Schrödinger equation to be solved numerically. The self-consistent field method is an iterative method that involves selecting an approximate Hamiltonian, solving the Schrödinger equation to obtain a more accurate set of orbitals, and then solving the Schrödinger equation again until the results converge.

## Appendix

### A. $MCMC_{\mu_{\theta}}$ convergence

The update of $\theta$ can be described as a discrete-time Markov chain on a compact set $\Theta$ with its transition probability defined from the generate-inference process:

$$y_{n} \sim p\left(y \mid \theta_{n}\right), \quad \theta_{n+1} \sim \alpha\left(\theta \mid y_{n}\right)$$

The inference step with a Markov kernel, i.e. $p(\theta, \theta')=\int_{y} \alpha(\theta' \mid y)\, p(y \mid \theta)\, dy$, defines a probability density for all $\theta$, which leads to the existence of an invariant measure $\mu(\Theta)$ and a stationary distribution $p(\theta)$. The conditions for passing from the stationary measure to a distribution, and the level of requirement for convergence between stationary and reversible, remain to be decided.

$$
\sum_{\theta} \mu(\theta) p(\theta, \theta')=\mu(\theta') \quad \text{(stationary)} \\
\mu(\theta) p(\theta, \theta')=\mu(\theta') p(\theta', \theta) \quad \text{for all } \theta, \theta' \quad \text{(reversible)}
$$

### B. Symmetry of $p(\theta, \theta')$

Using the SBC assumption $p(\theta) = p(\theta')$, which states that perfect generate-inference steps lead to the same distribution for prior and posterior samples (so that $p(\theta'|y)$ is the exact posterior and Bayes' rule $p(y|\theta)p(\theta) = p(\theta|y)p(y)$ applies), the following holds:

$$
\begin{array}{l}
p\left(\theta^{\prime}, \theta\right) \\
=p(\theta) \int \mathrm{d} y \, p\left(\theta^{\prime} \mid y\right) p(y \mid \theta) \\
=\int \mathrm{d} y \, p\left(\theta^{\prime} \mid y\right) p(\theta \mid y) p(y) \\
=p\left(\theta, \theta^{\prime}\right)
\end{array}
$$

where the second equality uses Bayes' rule and the last follows because the integrand is symmetric in $\theta$ and $\theta'$.

### C. Generative model

- function $DS|M: \mu_{\theta|M} \rightarrow \{\theta_1,..,\theta_a, y_1,..,y_b\}$ (cf. $DS: \mu_M \rightarrow \{\theta_1,..,\theta_a, y_1,..,y_b\}$)
  - decomposed as: $\mu_{\theta|M}-Density\rightarrow p(\theta, y) -Sampler\rightarrow \theta_{1..a},y_{1..b}$
  - aka model specification+computation tool, model+algorithm; updates of either the model or the computation algorithm are included in this step
  - output $\tilde{y}:=\{y_1,..,y_b\}$ is used for predictive checks and determines further model development
  - the in/output format is user-centered, as most priors take a distributional form while posteriors are samples (some exceptions could be conjugate models and prior pointwise SBC): the best form between the given $DS|M: \theta_1,..,\theta_a \rightarrow \{\theta_1',..,\theta_b', y_1,..,y_c\}$ and $DS|M: \mu_{\theta|M} \rightarrow \{\mu_{\theta|M}, \mu_{y|M}\}$ needs to be decided
- $\mu_M$: model space measure that pre-specifies the model qualitatively (e.g. linear models)
- $\mu_{\theta|M}$: parameter measure given the pre-specified model
  - similar to a prior, which can be written as $p(\theta)$ and is determined by the included predictors, the densities for prior (gamma, inverse gamma) and likelihood (Poisson, negative binomial), prior parameter values, etc.
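A small numerical check of Appendix A and B on a discrete toy, assuming exact inference: build the transition kernel $p(\theta, \theta') = \sum_y p(\theta'|y)\,p(y|\theta)$ for a two-state parameter and two-state data, then verify that the prior is stationary and that the joint $p(\theta, \theta')$ is symmetric. The probability tables are arbitrary illustrative values.

```python
import numpy as np

# Two-state toy: theta in {0, 1}, y in {0, 1}.
p_theta = np.array([0.3, 0.7])                    # prior p(theta)
p_y_given_theta = np.array([[0.8, 0.2],           # p(y | theta = 0)
                            [0.4, 0.6]])          # p(y | theta = 1)

# Exact posterior p(theta | y) by Bayes' rule.
p_joint = p_theta[:, None] * p_y_given_theta      # p(theta, y)
p_y = p_joint.sum(axis=0)
p_theta_given_y = (p_joint / p_y[None, :]).T      # rows indexed by y

# Transition kernel p(theta, theta') = sum_y p(theta' | y) p(y | theta).
kernel = p_y_given_theta @ p_theta_given_y

print(np.allclose(p_theta @ kernel, p_theta))     # stationarity: mu P = mu   (Appendix A)
joint_chain = p_theta[:, None] * kernel           # joint p(theta, theta')
print(np.allclose(joint_chain, joint_chain.T))    # symmetry / reversibility  (Appendix B)
```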