Probability metrics based on comparing integrals of some function $f$ are of special interest. Consider the integral probability metric between two distributions $p$ and $q$,
$$
D(p,q):=\sup_{f\in\mathcal{F}} \Bigg|\int fdp-\int fdq\Bigg|
$$
where $\mathcal{F}$ is some function space. If we restrict ourselves to the function space $\mathcal{F}':=\{f:\|f\|_{\infty}\leq 1\}$, this boils down to the definition of the total variation distance between $p$ and $q$, given by
$$
TV(p,q):=\sup_{f\in\mathcal{F}'} \Bigg|\int fdp-\int fdq\Bigg|.
$$
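As a quick sanity check of this definition, here is a minimal numerical sketch (the discrete example and variable names are my own): for discrete $p$ and $q$, the supremum over $\{f:\|f\|_{\infty}\leq 1\}$ is attained at $f=\operatorname{sign}(p-q)$ and equals $\sum_i |p_i-q_i|$ under the convention above.

```python
import numpy as np

# Two illustrative discrete distributions on three points
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# The supremum over {f : ||f||_inf <= 1} is attained at f = sign(p - q),
# and equals the L1 distance sum |p_i - q_i| under this convention.
f_star = np.sign(p - q)
sup_value = abs(np.dot(f_star, p) - np.dot(f_star, q))
l1_value = np.abs(p - q).sum()
print(sup_value, l1_value)  # both 0.6 (up to floating-point rounding)
```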
Starting with the assumption that if I restrict my function class to an RKHS, i.e. $\mathcal{F}=\mathcal{H}$, then I should still be able to write the definition of the TV distance between $p$ and $q$ as
$$
TV(p,q):=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|,
$$
where we define $\mathcal{H}':=\{f\in\mathcal{H}:\|f\|_{\infty}\leq 1\}$. (**My only concern is whether this argument is correct or not.**) Let's say it's correct up to this point; then let's move forward.
Now, we know $\mathcal{H}'$ is a subset of the RKHS $\mathcal{H}$. We will try to upper bound the supremum over $\mathcal{H}'$ by the supremum over a more general class of functions, the Stein class, which we denote $\mathcal{H}_S$. (**Thing to double check here: is the function class $\mathcal{H}'$ a subset of $\mathcal{H}_S$ or not?**) If yes, then we can write
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|\leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdp-\int fdq\Bigg|.
$$
In terms of the Stein operator $A_p$ applied to the test functions, this can also be written as
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|\leq \sup_{A_p f\in\mathcal{H}_S} \Bigg|\int A_p fdp-\int A_p fdq\Bigg|.
$$
Now, for any $f$ in the Stein class $\mathcal{H}_S$, it holds that $\int fdp=0$. Therefore, we can write
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|\leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdp-\int fdq\Bigg|=\sup_{f\in\mathcal{H}_S} \Bigg|\int fdq\Bigg|.
$$
Hence, we can write
$$
TV(p,q)\leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdq\Bigg|=: \text{KSD}(q).
$$
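For concreteness, here is a minimal 1-D sketch of a standard V-statistic estimator of the squared KSD with a Gaussian kernel; the function name `ksd_vstat`, the bandwidth, and the sample sizes are my own choices, and the target is assumed to be $p=\mathcal{N}(0,1)$ with score $s_p(x)=-x$.

```python
import numpy as np

def ksd_vstat(xs, score_p, sigma=1.0):
    """V-statistic estimate of the squared KSD between samples xs ~ q and a
    target p given via its score s_p(x) = d/dx log p(x) (1-D Gaussian kernel)."""
    d = xs[:, None] - xs[None, :]
    k = np.exp(-d**2 / (2 * sigma**2))
    dk_dx = -d / sigma**2 * k             # derivative of k in its first argument
    dk_dy = d / sigma**2 * k              # derivative of k in its second argument
    d2k = (1 / sigma**2 - d**2 / sigma**4) * k
    s = score_p(xs)
    u = s[:, None] * s[None, :] * k + s[:, None] * dk_dy + s[None, :] * dk_dx + d2k
    return u.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                              # score of p = N(0, 1)
ksd_match = ksd_vstat(rng.normal(0, 1, 500), score)   # q = p: near zero
ksd_shift = ksd_vstat(rng.normal(2, 1, 500), score)   # q shifted: clearly positive
print(ksd_match, ksd_shift)
```

The estimate vanishes (up to sampling noise) when $q=p$ and grows as $q$ moves away from $p$, matching the role of $\text{KSD}(q)$ as a discrepancy.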
-----
> Concern on the TV upper bound with KSD:
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdq-\int fdp\Bigg|.
$$
Now, since any function that lies in $\mathcal{H}'$ also lies in $\mathcal{H}_S$, we can upper-bound the above expression by
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdq-\int fdp\Bigg| \leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdq-\int fdp\Bigg|
$$
Now, since $A_p$ is a linear operator, as in [cite], for any $f \in \mathcal{H}_S$ we also have $A_p f \in \mathcal{H}_S$: indeed, $E_p [A_p(A_p f(x))] = A_p E_p [A_p f(x)] = 0$. Hence, every $f$ in the Stein class yields a $g = A_p f$ that also lies in the Stein class, and the above TV inequality holds.
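A quick Monte-Carlo sanity check of the defining Stein-class property $E_p[A_p f]=0$, using the Langevin Stein operator $A_p f(x) = s_p(x)f(x) + f'(x)$ for $p=\mathcal{N}(0,1)$; the test function $f=\sin$ and the sample size are my own choices.

```python
import numpy as np

# Monte-Carlo check of the Stein identity E_p[A_p f] = 0 for p = N(0, 1),
# where A_p f(x) = s_p(x) f(x) + f'(x) and the score is s_p(x) = -x.
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)

f = np.sin(x)                 # a smooth bounded test function
fp = np.cos(x)                # its derivative
stein_values = -x * f + fp    # A_p f(x) evaluated at the samples
print(stein_values.mean())    # approximately 0
```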
-----
> To be specific?
I think, specifically, $\mathrm{argmin}_f \Bigg|\int fdq-\int fdp\Bigg| = \mathrm{argmin}_f \Bigg|\int A_p fdq-\int A_p fdp\Bigg|$ should result in the same $f$. However, since we are upper-bounding the supremum value, shouldn't it be
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdq-\int fdp\Bigg| \leq A_p \sup_{f\in\mathcal{H}_S} \Bigg|\int fdq-\int fdp\Bigg|
$$
<!-- Now, we need to analyze if the below holds
$$
\sup_{f\in\mathcal{H}_S} \Bigg|\int fdq-\int fdp\Bigg| = \sup_{f\in\mathcal{H}_S} \Bigg|\int A_p fdq-\int A_p fdp\Bigg|
$$
-->
----------
> Conditional Stein Estimation
\begin{align}\label{lips_KSD}
U_{i}^k({P}^k(h_i))-U_{i}^k(P^*(h_i))
&\leq HR_{\text{max}} \, \text{KSD}({P}^k(\cdot|h_i)),
\end{align}
This is the primary lemma in our analysis, where the inequality is with respect to the conditional distribution ${P}^k(\cdot|h_i)$ and $h_i = \langle s_i,a_i\rangle$.
Let's expand upon the RHS: ${P}^k(\cdot|h_i)$ can also be written as ${P}^k(s'|s,a)$, which is simply the distribution over $s'$ given $(s,a)$. We have
$${P}^k(s'|s,a) = \frac{{P}^k(s',s,a)}{\Phi(s,a)},$$ where $\Phi(s,a)$ is the probability of occurrence of the state-action pair $(s,a)$ and ${P}^k(s',s,a)$ denotes the probability of occurrence of the $(s,a,s')$ tuple.
Now, $\text{KSD}({P}^k(s'|s,a))$ is what we need to estimate on the RHS of our equation in Lemma 4.1. Let's denote the state-action pair $(s,a)$ by $h$ for simplicity; we then have to estimate $\text{KSD}({P}^k(s'|h))$.
We know the expression for the KSD depends on the gradient of the log-probability, so $\text{KSD}({P}^k(s'|h))$ will depend on $\nabla_{s'} \log {P}^k(s'|h)$. We have
\begin{align}
\nabla_{s'} \log {P}^k(s'|h) &= \nabla_{s'} \log \frac{{P}^k(s',h)}{\Phi(h)}\\
&= \nabla_{s'} \log {P}^k(s',h) - \nabla_{s'} \log \Phi(h)\\
&= \nabla_{s'} \log {P}^k(s',h),
\end{align}
where the last step holds because $\Phi(h)$ does not depend on $s'$, so $\nabla_{s'} \log \Phi(h)=0$.
Hence, estimating $\text{KSD}({P}^k(s'|h))$ can be replaced by estimating $\text{KSD}({P}^k(s',h))$, and the rest of the analysis follows.
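The step $\nabla_{s'}\log P^k(s'|h) = \nabla_{s'}\log P^k(s',h)$ can be checked numerically. Here is a minimal sketch with a bivariate Gaussian standing in for $P^k$ (the correlation $\rho$ and the evaluation point are my own choices), comparing finite-difference gradients of the conditional and joint log-densities in the $s'$ (here $y$) direction.

```python
import numpy as np

rho = 0.6  # correlation of an illustrative standard bivariate Gaussian

def log_joint(x, y):
    # log p(x, y) up to an additive constant (constants drop out of gradients)
    return -(x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))

def log_cond(x, y):
    # y | x ~ N(rho * x, 1 - rho^2), again up to an additive constant
    return -(y - rho * x)**2 / (2 * (1 - rho**2))

x0, y0, h = 0.7, -0.3, 1e-5
grad_joint = (log_joint(x0, y0 + h) - log_joint(x0, y0 - h)) / (2 * h)
grad_cond = (log_cond(x0, y0 + h) - log_cond(x0, y0 - h)) / (2 * h)
print(grad_joint, grad_cond)  # equal: the marginal of x does not depend on y
```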
Note: the logical explanation is missing and is still not clear; I am thinking about it. The question of thinning the conditional would require multiple samples of $s'$ per $(s,a)$ pair, but not for the joint. I am still thinking about and relating the two.
---
> New proof for Lemma 4.1
Starting with the assumption that if I restrict my function class to an RKHS, i.e. $\mathcal{F}=\mathcal{H}$, then I should still be able to write the definition of the TV distance between the conditionals $p(y|x)$ and $q(y|x)$ as
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\int f(y) q(y|x)dy-\int f(y)p(y|x)dy\Bigg|,
$$
where we define $\mathcal{H}':=\{f\in\mathcal{H}:\|f\|_{\infty}\leq 1\}$. We can also write this as
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\int T_pf(y) q(y|x)dy-\int T_pf(y)p(y|x)dy\Bigg|,
$$
where we define the operator $T_p$ as the Stein operator for the joint distribution $p(x,y)$. Utilizing the definition of the conditional distribution, we can write
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\int T_pf(y) \frac{q(x,y)}{Z_1}dy-\int T_pf(y)\frac{p(x,y)}{Z_2}dy\Bigg|.
$$
This is equivalent to
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\frac{1}{Z_1}\int T_pf(y) {q(x,y)}dy-\frac{1}{Z_2}\int T_pf(y){p(x,y)}dy\Bigg|.
$$
Now, the operator $T_p$ is designed such that the second term in the above expression is zero. Therefore, we are left with the first term, which we bound explicitly in Lemma 4.2.
**But a big concern here is: if I start with the joint from the beginning, then shouldn't my $f$ also be of the form $f(x,y)$, with the integration over $dx\,dy$? I am not sure.**
> Added analysis: Stein class subset proof

Now, let's assume that we design an operator such that $\int T_pf(y)\,{p(y|x)}\,dy = 0$, which means we have a Stein operator for the conditional distribution. Next, we evaluate the following integral for the joint $p(x,y)$:
\begin{align}
\int_x \int_y T_pf(x, y)\, p(x,y)\, dx\, dy &= \int_x \int_y T_pf(y)\, f(x)\, p(y|X=x)\, p(x)\, dy\, dx\\
&= \int_x \Bigg(\int_y T_pf(y)\, p(y|X=x)\, dy\Bigg) f(x)\, p(x)\, dx\\
&= 0,
\end{align}
where we used $p(x,y)=p(y|x)p(x)$ and the mild assumption that $f(x,y)=f(x)f(y)$, which holds, for example, for the (product) Gaussian kernel; since $T_p$ acts only on the $y$-component, it passes through $f(x)$.
**I AM NOT SURE IN WHAT SENSE IT IS USEFUL**
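To make the derivation above concrete, here is a Monte-Carlo sanity check (a sketch with my own choice of bivariate Gaussian $p(x,y)$ and test function $f=\sin$): if $T_p$ is the Stein operator of the conditional $p(y|x)$ and $f(x,y)=f(x)f(y)$, then the joint expectation $E_{p(x,y)}[f(x)\,T_pf(y)]$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.6

# Sample from the joint: x ~ N(0, 1), then y | x ~ N(rho*x, 1 - rho^2)
x = rng.normal(size=500_000)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=x.size)

# Stein operator of the conditional: T_p f(y) = (d/dy log p(y|x)) f(y) + f'(y)
score_cond = -(y - rho * x) / (1 - rho**2)
joint_mean = (np.sin(x) * (score_cond * np.sin(y) + np.cos(y))).mean()
print(joint_mean)  # approximately 0
```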
> Explaining why the Stein class of $P(s'|s,a)$ is a subset of the Stein class of $P(s',s,a)$
First, we define an operator $T_p$ which, when applied to a function $f(x)$, gives
$$T_p f(x) = s_p(x) f(x) + \nabla_x f(x),$$
where $s_p(x) = \nabla_x \log p(x)$ is the score function.
Here, when $x \in \mathbb{R}^d$, the output of this operator is in $\mathbb{R}^d$, and when $x \in \mathbb{R}^{d+d'}$, the output is in $\mathbb{R}^{d+d'}$, which follows directly from the above equation.
Our goal is to prove that the Stein class of $P(s'|s,a)$, denoted $S_c$, is a subset of the Stein class of $P(s',s,a)$, denoted $S_j$. For an operator $T_p$, if we can show that $E_{P(y|x)}[T_p f(y)] =0$ implies $E_{x,y \sim P(x,y)}[T_p g(x,y)] =0$, then we can claim that $S_c \subset S_j$.
Proof: We begin by assuming $E_{P(y|x)}[T_p f(y)] =0$, which implies $\int T_pf(y)\,{p(y|x)}\,dy = 0$. Now, we compute $E_{x,y \sim P(x,y)}[T_p g(x,y)]$:
\begin{align}
E_{x,y \sim P(x,y)}[T_p g(x,y)] &= \int_x \int_y T_p g(x, y)\, p(x,y)\, dy\, dx\\
&= \int_x \Bigg(\int_y T_p g(y)\, p(y|X=x)\, dy\Bigg) g(x)\, p(x)\, dx\\
&= 0,
\end{align}
where we again used $p(x,y)=p(y|x)p(x)$ and the factorization $g(x,y)=g(x)g(y)$.
Hence, we have shown that $E_{P(y|x)}[T_p f(y)] =0$ implies $E_{x,y \sim P(x,y)}[T_p g(x,y)] = 0$, so we can conclude that $S_c \subset S_j$, which tells us that $\text{KSD}(P^k (s'|h)) \leq \text{KSD}(P^k (s',h))$.