Probability metrics based on comparing integrals of some function $f$ are of special interest. Consider the integral probability metric between two distributions $p$ and $q$,
$$
D(p,q):=\sup_{f\in\mathcal{F}} \Bigg|\int fdp-\int fdq\Bigg|
$$
where $\mathcal{F}$ is some function space. If we restrict ourselves to the function space $\mathcal{F}':=\{f:\|f\|_{\infty}\leq 1\}$, this boils down to the definition of the total variation distance between $p$ and $q$, given by
$$
TV(p,q):=\sup_{f\in\mathcal{F}'} \Bigg|\int fdp-\int fdq\Bigg|.
$$
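As a quick sanity check of this definition, here is a minimal numerical sketch (the discrete example and variable names are my own): for discrete $p$ and $q$, the supremum over $\{f:\|f\|_{\infty}\leq 1\}$ is attained at $f=\operatorname{sign}(p-q)$ and equals $\sum_i |p_i-q_i|$ under the convention above.

```python
import numpy as np

# Two illustrative discrete distributions on three points
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# The supremum over {f : ||f||_inf <= 1} is attained at f = sign(p - q),
# and equals the L1 distance sum |p_i - q_i| under this convention.
f_star = np.sign(p - q)
sup_value = abs(np.dot(f_star, p) - np.dot(f_star, q))
l1_value = np.abs(p - q).sum()
print(sup_value, l1_value)  # both 0.6 (up to floating-point rounding)
```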
Starting with the assumption that if I restrict my function class to an RKHS, i.e. $\mathcal{F}=\mathcal{H}$, then I should still be able to write the definition of the TV distance between $p$ and $q$ as
$$
TV(p,q):=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|,
$$
where we define $\mathcal{H}':=\{f\in\mathcal{H}:\|f\|_{\infty}\leq 1\}$. (**My only concern is whether this argument is correct or not.**) Let's say it's correct up to this point; then let's move forward.
Now, we know $\mathcal{H}'$ is a subset of the RKHS $\mathcal{H}$. We will try to upper bound the supremum over $\mathcal{H}'$ by the supremum over a more general class of functions, the Stein class, which we denote $\mathcal{H}_S$. (**Thing to double check here: is the function class $\mathcal{H}'$ a subset of $\mathcal{H}_S$ or not?**) If yes, then we can write
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|\leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdp-\int fdq\Bigg|.
$$
In terms of the Stein operator $A_p$ applied to the test functions, this can also be written as
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|\leq \sup_{A_p f\in\mathcal{H}_S} \Bigg|\int A_p fdp-\int A_p fdq\Bigg|.
$$
Now, for any $f$ in the Stein class $\mathcal{H}_S$, it holds that $\int fdp=0$. Therefore, we can write
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdp-\int fdq\Bigg|\leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdp-\int fdq\Bigg|=\sup_{f\in\mathcal{H}_S} \Bigg|\int fdq\Bigg|.
$$
Hence, we can write
$$
TV(p,q)\leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdq\Bigg|=: \text{KSD}(q).
$$
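For concreteness, here is a minimal 1-D sketch of a standard V-statistic estimator of the squared KSD with a Gaussian kernel; the function name `ksd_vstat`, the bandwidth, and the sample sizes are my own choices, and the target is assumed to be $p=\mathcal{N}(0,1)$ with score $s_p(x)=-x$.

```python
import numpy as np

def ksd_vstat(xs, score_p, sigma=1.0):
    """V-statistic estimate of the squared KSD between samples xs ~ q and a
    target p given via its score s_p(x) = d/dx log p(x) (1-D Gaussian kernel)."""
    d = xs[:, None] - xs[None, :]
    k = np.exp(-d**2 / (2 * sigma**2))
    dk_dx = -d / sigma**2 * k             # derivative of k in its first argument
    dk_dy = d / sigma**2 * k              # derivative of k in its second argument
    d2k = (1 / sigma**2 - d**2 / sigma**4) * k
    s = score_p(xs)
    u = s[:, None] * s[None, :] * k + s[:, None] * dk_dy + s[None, :] * dk_dx + d2k
    return u.mean()

rng = np.random.default_rng(0)
score = lambda x: -x                              # score of p = N(0, 1)
ksd_match = ksd_vstat(rng.normal(0, 1, 500), score)   # q = p: near zero
ksd_shift = ksd_vstat(rng.normal(2, 1, 500), score)   # q shifted: clearly positive
print(ksd_match, ksd_shift)
```

The estimate vanishes (up to sampling noise) when $q=p$ and grows as $q$ moves away from $p$, matching the role of $\text{KSD}(q)$ as a discrepancy.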
-----
> Concern on the TV upper bound with KSD:
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdq-\int fdp\Bigg|.
$$
Now, since any function that lies in $\mathcal{H}'$ also lies in $\mathcal{H}_S$, we can upper-bound the above expression by
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdq-\int fdp\Bigg| \leq \sup_{f\in\mathcal{H}_S} \Bigg|\int fdq-\int fdp\Bigg|
$$
Now, since $A_p$ is a linear operator, as in [cite], for any $f \in \mathcal{H}_S$ we also have $A_p f \in \mathcal{H}_S$: indeed, $E_p [A_p(A_p f(x))] = A_p E_p [A_p f(x)] = 0$. Hence, every $f$ in the Stein class yields a $g = A_p f$ that also lies in the Stein class, and the above TV inequality holds.
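A quick Monte-Carlo sanity check of the defining Stein-class property $E_p[A_p f]=0$, using the Langevin Stein operator $A_p f(x) = s_p(x)f(x) + f'(x)$ for $p=\mathcal{N}(0,1)$; the test function $f=\sin$ and the sample size are my own choices.

```python
import numpy as np

# Monte-Carlo check of the Stein identity E_p[A_p f] = 0 for p = N(0, 1),
# where A_p f(x) = s_p(x) f(x) + f'(x) and the score is s_p(x) = -x.
rng = np.random.default_rng(1)
x = rng.normal(size=200_000)

f = np.sin(x)                 # a smooth bounded test function
fp = np.cos(x)                # its derivative
stein_values = -x * f + fp    # A_p f(x) evaluated at the samples
print(stein_values.mean())    # approximately 0
```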
-----
> To be specific?
I think, specifically, $\mathrm{argmin}_f \Bigg|\int fdq-\int fdp\Bigg| = \mathrm{argmin}_f \Bigg|\int A_p fdq-\int A_p fdp\Bigg|$ should result in the same $f$. However, since we are upper-bounding the supremum value, shouldn't it be
$$
TV(p,q)=\sup_{f\in\mathcal{H}'} \Bigg|\int fdq-\int fdp\Bigg| \leq A_p \sup_{f\in\mathcal{H}_S} \Bigg|\int fdq-\int fdp\Bigg|
$$
<!-- Now, we need to analyze if the below holds
$$
\sup_{f\in\mathcal{H}_S} \Bigg|\int fdq-\int fdp\Bigg| = \sup_{f\in\mathcal{H}_S} \Bigg|\int A_p fdq-\int A_p fdp\Bigg|
$$
-->
----------
> Conditional Stein Estimation
\begin{align}\label{lips_KSD}
U_{i}^k({P}^k(h_i))-U_{i}^k(P^*(h_i))
&\leq HR_{\text{max}} \, \text{KSD}({P}^k(\cdot|h_i)),
\end{align}
This is the primary lemma in our analysis, where the inequality is with respect to the conditional distribution ${P}^k(\cdot|h_i)$ and $h_i = \langle s_i,a_i\rangle$.
Let's expand upon the RHS: ${P}^k(\cdot|h_i)$ can also be written as ${P}^k(s'|s,a)$, which is simply the distribution over $s'$ given $(s,a)$. We have
$${P}^k(s'|s,a) = \frac{{P}^k(s',s,a)}{\Phi(s,a)},$$ where $\Phi(s,a)$ is the probability of occurrence of the state-action pair $(s,a)$ and ${P}^k(s',s,a)$ denotes the probability of occurrence of the $(s,a,s')$ tuple.
Now, $\text{KSD}({P}^k(s'|s,a))$ is what we need to estimate on the RHS of our equation in Lemma 4.1. Let's denote the state-action pair $(s,a)$ by $h$ for simplicity; we then have to estimate $\text{KSD}({P}^k(s'|h))$.
We know the expression for the KSD depends on the gradient of the log-probability, so $\text{KSD}({P}^k(s'|h))$ will depend on $\nabla_{s'} \log {P}^k(s'|h)$. We have
\begin{align}
\nabla_{s'} \log {P}^k(s'|h) &= \nabla_{s'} \log \frac{{P}^k(s',h)}{\Phi(h)}\\
&= \nabla_{s'} \log {P}^k(s',h) - \nabla_{s'} \log \Phi(h)\\
&= \nabla_{s'} \log {P}^k(s',h),
\end{align}
where the last step holds because $\Phi(h)$ does not depend on $s'$, so $\nabla_{s'} \log \Phi(h)=0$.
Hence, estimating $\text{KSD}({P}^k(s'|h))$ can be replaced by estimating $\text{KSD}({P}^k(s',h))$, and the rest of the analysis follows.
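The step $\nabla_{s'}\log P^k(s'|h) = \nabla_{s'}\log P^k(s',h)$ can be checked numerically. Here is a minimal sketch with a bivariate Gaussian standing in for $P^k$ (the correlation $\rho$ and the evaluation point are my own choices), comparing finite-difference gradients of the conditional and joint log-densities in the $s'$ (here $y$) direction.

```python
import numpy as np

rho = 0.6  # correlation of an illustrative standard bivariate Gaussian

def log_joint(x, y):
    # log p(x, y) up to an additive constant (constants drop out of gradients)
    return -(x**2 - 2 * rho * x * y + y**2) / (2 * (1 - rho**2))

def log_cond(x, y):
    # y | x ~ N(rho * x, 1 - rho^2), again up to an additive constant
    return -(y - rho * x)**2 / (2 * (1 - rho**2))

x0, y0, h = 0.7, -0.3, 1e-5
grad_joint = (log_joint(x0, y0 + h) - log_joint(x0, y0 - h)) / (2 * h)
grad_cond = (log_cond(x0, y0 + h) - log_cond(x0, y0 - h)) / (2 * h)
print(grad_joint, grad_cond)  # equal: the marginal of x does not depend on y
```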
Note: the logical explanation is missing and is still not clear; I am thinking about it. The question of thinning the conditional would require multiple samples of $s'$ per $(s,a)$ pair, but not for the joint. I am still thinking about and relating the two.
---
> New proof for Lemma 4.1
Starting with the assumption that if I restrict my function class to an RKHS, i.e. $\mathcal{F}=\mathcal{H}$, then I should still be able to write the definition of the TV distance between the conditionals $p(y|x)$ and $q(y|x)$ as
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\int f(y) q(y|x)dy-\int f(y)p(y|x)dy\Bigg|,
$$
where we define $\mathcal{H}':=\{f\in\mathcal{H}:\|f\|_{\infty}\leq 1\}$. We can also write this as
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\int T_pf(y) q(y|x)dy-\int T_pf(y)p(y|x)dy\Bigg|,
$$
where we define the operator $T_p$ as the Stein operator for the joint distribution $p(x,y)$. Utilizing the definition of the conditional distribution, we can write
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\int T_pf(y) \frac{q(x,y)}{Z_1}dy-\int T_pf(y)\frac{p(x,y)}{Z_2}dy\Bigg|.
$$
This is equivalent to
$$
TV(p(y|x),q(y|x)):=\sup_{f\in\mathcal{H}'} \Bigg|\frac{1}{Z_1}\int T_pf(y) {q(x,y)}dy-\frac{1}{Z_2}\int T_pf(y){p(x,y)}dy\Bigg|.
$$
Now, the operator $T_p$ is designed such that the second term in the above expression is zero. Therefore, we are left with the first term, which we bound explicitly in Lemma 4.2.
**But a big concern here is: if I start with the joint from the beginning, then shouldn't my $f$ also be of the form $f(x,y)$, with the integration over $dx\,dy$? I am not sure.**
> Added analysis: Stein class subset proof

Now, let's assume that we design an operator such that $\int T_pf(y)\,{p(y|x)}\,dy = 0$, which means we have a Stein operator for the conditional distribution. Next, we evaluate the following integral for the joint $p(x,y)$:
\begin{align}
\int_x \int_y T_pf(x, y)\, p(x,y)\, dx\, dy &= \int_x \int_y T_pf(y)\, f(x)\, p(y|X=x)\, p(x)\, dy\, dx\\
&= \int_x \Bigg(\int_y T_pf(y)\, p(y|X=x)\, dy\Bigg) f(x)\, p(x)\, dx\\
&= 0,
\end{align}
where we used $p(x,y)=p(y|x)p(x)$ and the mild assumption that $f(x,y)=f(x)f(y)$, which holds, for example, for the (product) Gaussian kernel; since $T_p$ acts only on the $y$-component, it passes through $f(x)$.
**I AM NOT SURE IN WHAT SENSE IT IS USEFUL**
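To make the derivation above concrete, here is a Monte-Carlo sanity check (a sketch with my own choice of bivariate Gaussian $p(x,y)$ and test function $f=\sin$): if $T_p$ is the Stein operator of the conditional $p(y|x)$ and $f(x,y)=f(x)f(y)$, then the joint expectation $E_{p(x,y)}[f(x)\,T_pf(y)]$ vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.6

# Sample from the joint: x ~ N(0, 1), then y | x ~ N(rho*x, 1 - rho^2)
x = rng.normal(size=500_000)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=x.size)

# Stein operator of the conditional: T_p f(y) = (d/dy log p(y|x)) f(y) + f'(y)
score_cond = -(y - rho * x) / (1 - rho**2)
joint_mean = (np.sin(x) * (score_cond * np.sin(y) + np.cos(y))).mean()
print(joint_mean)  # approximately 0
```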
> Explaining why the Stein class of $P(s'|s,a)$ is a subset of the Stein class of $P(s',s,a)$
First, we define an operator $T_p$ which, when applied to a function $f(x)$, gives
$$T_p f(x) = s_p(x) f(x) + \nabla_x f(x),$$
where $s_p(x) = \nabla_x \log p(x)$ is the score function.
Here, when $x \in \mathbb{R}^d$, the output of this operator is in $\mathbb{R}^d$, and when $x \in \mathbb{R}^{d+d'}$, the output is in $\mathbb{R}^{d+d'}$, which follows directly from the above equation.
Our goal is to prove that the Stein class of $P(s'|s,a)$, denoted $S_c$, is a subset of the Stein class of $P(s',s,a)$, denoted $S_j$. For an operator $T_p$, if we can show that $E_{P(y|x)}[T_p f(y)] =0$ implies $E_{x,y \sim P(x,y)}[T_p g(x,y)] =0$, then we can claim that $S_c \subset S_j$.
Proof: We begin by assuming $E_{P(y|x)}[T_p f(y)] =0$, which implies $\int T_pf(y)\,{p(y|x)}\,dy = 0$. Now, we compute $E_{x,y \sim P(x,y)}[T_p g(x,y)]$:
\begin{align}
E_{x,y \sim P(x,y)}[T_p g(x,y)] &= \int_x \int_y T_p g(x, y)\, p(x,y)\, dy\, dx\\
&= \int_x \Bigg(\int_y T_p g(y)\, p(y|X=x)\, dy\Bigg) g(x)\, p(x)\, dx\\
&= 0,
\end{align}
where we again used $p(x,y)=p(y|x)p(x)$ and the factorization $g(x,y)=g(x)g(y)$.
Hence, we have shown that $E_{P(y|x)}[T_p f(y)] =0$ implies $E_{x,y \sim P(x,y)}[T_p g(x,y)] = 0$, so we can conclude that $S_c \subset S_j$, which tells us that $\text{KSD}(P^k (s'|h)) \leq \text{KSD}(P^k (s',h))$.