[Thoughts: **delta-cycle**]
Let $U(\mathcal{X})$ and $U(\mathcal{Y})$ be the feature spaces of domains $\mathcal{X}$ and $\mathcal{Y}$, respectively. Assume for the time being that the mapping between $\mathcal{X}$ and $\mathcal{Y}$ is bijective. Further, we assume that the feature space $U(\mathcal{X})$ is smoother than $U(\mathcal{Y})$. Our context is that 1) there are many more samples in one domain than in the other, and 2) paired samples are scarce, with each sample in the smaller domain having a paired sample in the bigger domain, i.e. $|D_x| \gg |D_y| = |D_{x,y}|$. Apart from the fact that the mapping between the graph and text domains is surjective rather than bijective, this fits our current use case.
We want to generate more paired samples. We can perform the following steps of cycle-sampling:
* Take a sample $x$, perturb it in its feature space as $u(x) + \epsilon_x$, and generate $x'$ from the perturbed feature.
* Map $x'$ to $y'$.
* Map $y'$ back to $x''$.
* We should expect the following to hold: $u(x'') \approx u(x) + \epsilon_x$.
The last step gives us a distance measure to train the cycle end to end. When training converges, we expand the set, i.e. $\mathcal{Y} = \mathcal{Y} \cup \{y'\}$. This follows the cycle-consistency framework, except that we start and end in feature space.
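A minimal sketch of one delta-cycle step, assuming hypothetical modules `u`, `u_inv` (the feature map on $\mathcal{X}$ and its approximate inverse), `f` ($\mathcal{X} \to \mathcal{Y}$), and `g` ($\mathcal{Y} \to \mathcal{X}$); none of these are existing APIs, and only the feature-space cycle term described above is shown:

```python
import torch

def delta_cycle_step(x, u, u_inv, f, g, sigma=0.1):
    """One delta-cycle step: perturb x in feature space, map X -> Y -> X,
    and measure cycle consistency back in feature space.
    u / u_inv : feature map on X and its (approximate) inverse -- hypothetical
    f : X -> Y mapper, g : Y -> X mapper -- hypothetical learned modules
    """
    ux = u(x)
    eps = sigma * torch.randn_like(ux)     # epsilon_x, drawn in the feature space
    x_prime = u_inv(ux + eps)              # generate x' from the perturbed feature
    y_prime = f(x_prime)                   # map x' to y'
    x_dprime = g(y_prime)                  # map y' back to x''
    # distance measure for end-to-end training: u(x'') ~ u(x) + epsilon_x
    cycle_loss = torch.norm(u(x_dprime) - (ux + eps))
    return x_prime, y_prime, cycle_loss
```

Once training converges, the pair $(x', y')$ would be added to the paired set as in the expansion step above.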
[end of thoughts]
[thoughts on incremental graph extraction; this is a placeholder and should be moved elsewhere]
We have seen that a graph $g(t)$ of a sentence $t$ can get *very* detailed. The question then becomes: if we can only manage to extract a subgraph $g_0(t)$, what is the stopping criterion? A principled choice is to use mutual information, i.e. we want to ensure $I(t, g_0(t)) > I(t, g(t) - g_0(t)) + \tau$.
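A rough sketch of this stopping rule, assuming a hypothetical mutual-information estimator `mi_estimate(t, g)` and hypothetical graph helpers `grow_subgraph` and `subtract`; these are placeholders for whatever estimator and graph operations we end up using:

```python
def extract_incremental(t, g_full, mi_estimate, grow_subgraph, subtract, tau=0.0):
    """Grow a subgraph g0 of g_full until I(t, g0) > I(t, g_full - g0) + tau.
    mi_estimate(t, g)     -- hypothetical mutual-information estimator
    grow_subgraph(g0, g)  -- hypothetical: add the next most relevant edge/node to g0
    subtract(g, g0)       -- hypothetical graph difference g - g0
    """
    g0 = None  # start from an empty subgraph; grow_subgraph is assumed to handle None
    while True:
        g0 = grow_subgraph(g0, g_full)
        rest = subtract(g_full, g0)
        # stop once the extracted part carries more information about t
        # than the remainder, by a margin tau
        if mi_estimate(t, g0) > mi_estimate(t, rest) + tau:
            return g0
```

A real implementation would also need to stop once $g_0$ reaches the full graph, and the whole idea hinges on having a usable MI estimator between text and (sub)graphs.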
[end of thoughts]
Based on last Friday’s discussion, I prepared this draft of a method that analyzes data augmentation via sampling data points from another space. For example, we have an abundant text corpus but few graphs, and the goal is to sample new graphs by utilizing knowledge from the text. The method is to convert a graph into natural language, sample some similar texts around the corresponding text, and then convert them back to graph structure. This is a simple adaptation of data augmentation with back-translation. However, there is no theoretical analysis of why this strategy can obtain new high-quality samples, or of how to control and improve the augmented data.
The basic scenario is that we want to sample some new data points around a known data point. We propose a sampling method that draws points in another data space and then transfers them back to this space. For example, suppose we want some new data points $x_1, x_2, x_3$ close to a known point $x$: we first sample $y_1$, $y_2$, $y_3$ in the $Y$ space and then transfer them into the $X$ space, obtaining new data points $x_1 = f(y_1)$, $x_2 = f(y_2)$, $x_3 = f(y_3)$.
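As a toy illustration of this scheme (purely for concreteness, with hypothetical mappings `to_Y` and `from_Y`, and plain Gaussian noise standing in for whatever sampler we use in the $Y$ space):

```python
import numpy as np

def augment_via_other_space(x, to_Y, from_Y, n_samples=3, sigma=0.1):
    """Toy sketch: to get new points near x, hop into the Y space,
    sample around the image of x there, and map the samples back.
    to_Y   : hypothetical mapping X -> Y (output assumed to be a vector here)
    from_Y : hypothetical mapping Y -> X (the f in x_i = f(y_i) above)
    """
    y = np.asarray(to_Y(x))                    # image of x in the Y space
    ys = [y + sigma * np.random.randn(*y.shape) for _ in range(n_samples)]
    return [from_Y(y_i) for y_i in ys]         # x_1, ..., x_n close to x
```

Here the sampling in $Y$ is plain Gaussian perturbation for concreteness; the formal analysis below perturbs in a feature space instead.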
The motivation is that the $Y$ space may have some good properties, so we can obtain higher-quality data points than by sampling directly in the $X$ space.
Suppose we have two spaces $X$ and $Y$, functions $f^* : X \rightarrow Y$ and $g^*: Y \rightarrow X$ that satisfy $x = g^*(f^*(x))$ and $y = f^*(g^*(y))$, and two learned approximations $f, g$ with $f(x) = f^*(x) + e_f$ and $g(y) = g^*(y) + e_g$. The feature transformations are $u : X \rightarrow F_X$, $u^{-1}: F_X \rightarrow X$, $v: Y \rightarrow F_Y$, $v^{-1}: F_Y \rightarrow Y$, where $F_X$ and $F_Y$ are the feature spaces; $u^{-1}$ is the inverse function of $u$ and $v^{-1}$ is the inverse function of $v$. We further assume that $u$, $u^{-1}$, and $v^{-1}$ are Lipschitz continuous with Lipschitz constants $L_u$, $L_{u^{-1}}$, and $L_{v^{-1}}$, and that $f^*$ and $g^*$ have Lipschitz constants $L_f$ and $L_g$, respectively.
Therefore, we can compare two different sampling strategies: the first is $\hat{y} = v^{-1}(v(y) + \epsilon)$, and the second is $\tilde{y} = f(u^{-1}(u(g(y))+\epsilon))$. According to the above assumptions, their distances from $y$ are $$||\hat{y} - y|| = ||v^{-1}(v(y) + \epsilon) - v^{-1}(v(y))|| \leq L_{v^{-1}} || \epsilon ||$$ and
\begin{align}
||\tilde{y}-y|| &= ||f(u^{-1}(u(g(y))+\epsilon)) - f^*(u^{-1}(u(g^*(y)))) || \\
&\leq ||e_f|| + ||f^*(u^{-1}(u(g(y))+\epsilon)) - f^*(u^{-1}(u(g^*(y))))|| \\
&\leq ||e_f|| + L_f L_{u^{-1}}|| u(g(y))+\epsilon - u(g^*(y)) || \\
&\leq ||e_f|| + L_f L_{u^{-1}} || \epsilon || + L_f L_{u^{-1}} L_u || e_g || \\
&= L_f L_{u^{-1}} || \epsilon || + C,
\end{align}
where the first line uses $y = f^*(g^*(y)) = f^*(u^{-1}(u(g^*(y))))$, and $C = ||e_f|| + L_f L_{u^{-1}} L_u ||e_g||$ collects the noise-independent approximation errors.
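As a quick numerical sanity check of these two bounds (purely illustrative: invertible linear maps stand in for the learned models, so every Lipschitz constant is just a spectral norm, $e_f = e_g = 0$, and hence $C = 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Invertible linear stand-ins: f*(x) = A x, g*(y) = A^{-1} y, u(x) = U x, v(y) = V y.
A = rng.normal(size=(d, d))
U = rng.normal(size=(d, d))
V = rng.normal(size=(d, d))
A_inv, U_inv, V_inv = np.linalg.inv(A), np.linalg.inv(U), np.linalg.inv(V)

L_f = np.linalg.norm(A, 2)          # Lipschitz constants = spectral norms
L_u_inv = np.linalg.norm(U_inv, 2)
L_v_inv = np.linalg.norm(V_inv, 2)

y = rng.normal(size=d)
eps = 0.01 * rng.normal(size=d)

y_hat = V_inv @ (V @ y + eps)                       # strategy 1: perturb in F_Y
y_tilde = A @ (U_inv @ (U @ (A_inv @ y) + eps))     # strategy 2: via X, perturb in F_X

print(np.linalg.norm(y_hat - y), "<=", L_v_inv * np.linalg.norm(eps))
print(np.linalg.norm(y_tilde - y), "<=", L_f * L_u_inv * np.linalg.norm(eps))
```

With learned (inexact) $f$ and $g$, the same check would also show the additional offset $C$.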
A large coefficient suggests that new samples vary over a wide range, which makes it hard to control the quality of the augmented data and raises a high risk of invalid points. We cannot simply force $\epsilon$ to be very small, because that would essentially always reproduce the original point. So, if $L_f L_{u^{-1}}$ is much smaller than $L_{v^{-1}}$, this sampling strategy provides more controllable and stable samples.
> Qipeng: I am trying to find some results about the variance instead of the range of values, but I don't have a clear idea yet.
We can see that cycle-training lowers $||e_f||$ and $||e_g||$, and that pre-training on a large text corpus lowers $L_u$ and $L_{u^{-1}}$. Besides, supervised T2G training and cycle-training can lower $L_f$.
Although the bound on $||\tilde{y}-y||$ is more complicated than the one on $||\hat{y}-y||$, note that $v$ and $v^{-1}$ are hard to learn due to the scarcity of data points in the $Y$ space and the huge solution space (the space of graphs may be much larger than the text space; consider the difference between generic graphs and spanning trees).
Besides, there are some blocking issues to tackle in this sampling strategy:
- Can $y$ help the T2G process that maps the sampled text to the augmented graph $\tilde{y}$?
- Does the distance $||\tilde{y}-y||$ help us control the data quality?
- Can we use a latent variable to reduce the error when transferring between modalities?