# Learning in DAG
## General Design
We assume the following scenario: the states form a DAG $(V, E)$ with a single sink node at the end, and the goal is to maximize the return collected over the DAG (we assume there is no discount factor here). The edges $E$ of the DAG are considered fixed.

We lay out different settings below. The first setting is the following: imagine we are in a setting where:
1. in our modeling we have a reward for each node, and everything is explicitly observed.

For the above $(V, E)$, the cumulative return is then the sum of all the node rewards:
$$J^{\pi} = \mathbb{E}_{V \sim \pi}(\sum_{v \in V} r_{v}),$$
We can optimize this by policy gradient along the depth of the DAG. The depth is defined by the reversed (bottom-up) topological sort from the sink: in our running example with nodes $a, b, c, d, e$ and sink $e$, node $e$ has depth 3, $b$ and $d$ have depth 2, $c$ has depth 1, and $a$ has depth 0. (This is important because if you assign depths by a top-down topological sort instead of bottom-up, the reward-to-go is not friendly to define.)
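A minimal sketch of this depth assignment is below: it computes each node's longest distance to the sink and flips it so that the sink gets the maximum depth. The edge set used here is an assumption inferred from the factorization and the predecessor sets mentioned later ($a \to b$, $a \to c$, $c \to d$, $b \to e$, $d \to e$); substitute the real DAG as needed.

```python
from functools import lru_cache

# Hypothetical running-example DAG, inferred from the text: a -> b, a -> c, c -> d, b -> e, d -> e.
edges = {"a": ["b", "c"], "b": ["e"], "c": ["d"], "d": ["e"], "e": []}

@lru_cache(maxsize=None)
def height(v: str) -> int:
    """Longest number of edges from v down to the sink."""
    succs = edges[v]
    return 0 if not succs else 1 + max(height(s) for s in succs)

max_depth = max(height(v) for v in edges)          # 3 in this example
depth = {v: max_depth - height(v) for v in edges}  # bottom-up depth: the sink gets max_depth
print(depth)  # {'a': 0, 'b': 2, 'c': 1, 'd': 2, 'e': 3}
```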
The motivation is that we can apply the chain rule over the DAG. Given a DAG (the analogue of a trajectory in a linearized environment), the probability of generating it is:
\begin{align*}
P_\pi(V) &= P_\pi(e \mid b, d) \cdot P_\pi(d \mid c) \cdot P_\pi(b \mid a) \cdot P_\pi(c \mid a) \cdot P_\pi(a) \\
&= \prod_{t=0}^{3} P_\pi( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\})
\end{align*}
and we have the log-probability:
$$ \log P_\pi(V) = \sum_{t=0}^{3} \log P_{\pi}( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\})$$
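Concretely, for the running example this layered factorization reads (grouping $b$ and $d$ at depth 2):
$$
\log P_\pi(V) = \log P_\pi(a) + \log P_\pi(c \mid a) + \log P_\pi(\{b, d\} \mid \{a, c\}) + \log P_\pi(e \mid \{a, b, c, d\}),
$$
which matches the node-level factorization above when each node depends only on its predecessors.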
We can then use exactly the same math as in the linearized policy gradient. The gradient is therefore:
\begin{align*}
\nabla_{\pi} J^{\pi} &= \nabla_{\pi} \mathbb{E}_{V \sim \pi}\Big(\sum_{v \in V} r_{v}\Big) \\
&= \nabla_{\pi} \int P_\pi(V) \sum_{v \in V} r_{v} \, dV \\
&= \int P_\pi(V) \, \nabla_{\pi} \log P_{\pi}(V) \sum_{v \in V} r_{v} \, dV \\
&= \mathbb{E}_{V \sim \pi}\Big(\nabla_{\pi} \log P_{\pi}(V) \sum_{v \in V} r_{v}\Big)
\end{align*}
where the third line uses the log-derivative trick $\nabla_{\pi} P_\pi(V) = P_\pi(V) \nabla_{\pi} \log P_{\pi}(V)$.
The cumulative return can also be decomposed by depth:
$$
\sum_{v \in V} r_{v} = \sum_{t = 0}^{d} \sum_{v \in \{v \mid \text{depth}(v) = t\}} r_{v},
$$
where $d$ denotes the maximum depth (3 in the running example).
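For the running example this is just a regrouping of the node rewards by depth:
$$
\sum_{v \in V} r_{v} = \underbrace{r_a}_{t=0} + \underbrace{r_c}_{t=1} + \underbrace{r_b + r_d}_{t=2} + \underbrace{r_e}_{t=3}.
$$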
Therefore,
\begin{align*}
&\mathbb{E}_{V \sim \pi}\Big(\nabla_{\pi} \log P_{\pi}(V) \sum_{v \in V} r_{v}\Big) \\
&= \mathbb{E}_{V \sim \pi}\Big(\sum_{t=0}^d \nabla_{\pi}\log P_{\pi}( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\}) \cdot \sum_{k = 0}^{d} \sum_{v \in \{v \mid \text{depth}(v) = k\}} r_{v}\Big)\\
&= \mathbb{E}_{V \sim \pi}\Big(\sum_{t=0}^d \sum_{k=0}^d \nabla_{\pi}\log P_{\pi}( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\}) \cdot \sum_{v \in \{v \mid \text{depth}(v) = k\}} r_{v}\Big)\\
&= \mathbb{E}_{V \sim \pi}\Big(\sum_{t=0}^d \sum_{k=t}^d \nabla_{\pi}\log P_{\pi}( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\}) \cdot \sum_{v \in \{v \mid \text{depth}(v) = k\}} r_{v}\Big)\\
&= \mathbb{E}_{V \sim \pi}\Big(\sum_{t=0}^d \nabla_{\pi}\log P_{\pi}( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\}) \cdot \sum_{k=t}^d \sum_{v \in \{v \mid \text{depth}(v) = k\}} r_{v}\Big)
\end{align*}
The first transition distributes the product of the two sums into a double sum over $t$ and $k$. The second transition uses the fact that rewards in the past (at depths $k < t$) do not depend on the policy's decisions at depth $t$, so those terms vanish in expectation and the inner sum can start at $k = t$. The third transition repacks the sum over $k$.
Then we apply the famous trick of defining the reward-to-go:
$$
G_{\text{depth} = t} = \sum_{k=t}^d \sum_{v \in \{v \mid \text{depth}(v) = k\}} r_{v}.
$$
It accumulates reward backwards along the topological depth defined above, from the sink up to the current depth.
We have:
$$
\mathbb{E}_{V \sim \pi}\Big(\sum_{t=0}^d \nabla_{\pi}\log P_{\pi}( \{v \mid \text{depth}(v) = t\} \mid \{v \mid \text{depth}(v) < t\}) \cdot G_{\text{depth} = t}\Big)
$$
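As a minimal sketch of this estimator (assuming a policy that, for one sampled DAG, exposes a differentiable log-probability per depth layer; the tensor names and toy numbers below are hypothetical), the surrogate loss reproduces the gradient above when differentiated by autograd:

```python
import torch

def dag_pg_loss(layer_log_probs: torch.Tensor, layer_rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE surrogate for one sampled DAG.

    layer_log_probs[t] = log P_pi({v : depth(v) = t} | shallower nodes), differentiable in the policy params.
    layer_rewards[t]   = sum of rewards of the nodes at depth t (treated as a constant).
    """
    # Reward-to-go per depth: G_t = sum_{k >= t} layer_rewards[k] (reverse cumulative sum).
    reward_to_go = torch.flip(torch.cumsum(torch.flip(layer_rewards, dims=[0]), dim=0), dims=[0])
    # Negative sign so that minimizing the loss ascends the policy-gradient objective.
    return -(layer_log_probs * reward_to_go.detach()).sum()

# Toy usage with the running example's 4 depth layers.
log_probs = torch.tensor([-0.5, -1.2, -2.0, -0.7], requires_grad=True)
rewards = torch.tensor([0.0, 1.0, 0.5, 1.0])
dag_pg_loss(log_probs, rewards).backward()  # log_probs.grad[t] is now -G_t
```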
## Parsel-Specific Design
### Direct Policy Gradient
In some special settings (such as Parsel), the only reward we observe from the environment is the sink node's reward, i.e., $r_e$.
This corresponds to a sparse-reward design where every intermediate reward is 0 and only the sink node's reward is observed and defined. Policy gradient can be applied directly here; of course, additional reward shaping is always possible.
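In that case the reward-to-go is the same at every depth, $G_{\text{depth} = t} = r_e$, so the gradient estimator above collapses to
$$
\nabla_{\pi} J^{\pi} = \mathbb{E}_{V \sim \pi}\big(\nabla_{\pi} \log P_{\pi}(V) \cdot r_e\big).
$$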
The assumption behind PG is that every DAG is generated on-policy, so there is no product rule for assembling artificial DAGs. For example, if you do 2 rollouts from the model to generate 2 DAGs, you would like to be able to recombine their nodes into up to $2^{\#V}$ DAGs and still get an end reward for each, but on-policy PG does not allow this.
This throws away a lot of the advantage of Parsel.
### Additional Inductive Bias
In some special settings (such as Parsel), we want to do more modeling and make use of off-policy, reconstructed samples.
#### Design 1
The end reward $r_e$ (we write it $R_e$ to emphasize that it is the only observable from the environment) could in principle be obtained solely from the state of node $e$ (i.e., by gluing all the sub-functions into one big program, we should be able to get the end reward just by executing that program). However, it is strongly correlated with the predecessors' rewards (i.e., the correctness of all the sub-functions). We could therefore add a hardcoded rule such as $R_e = \min(Q(b), Q(d))$, or, in the case $r \in \{0, 1\}$, $R_e = \prod_{v \in V^-(e)} Q(v)$, where $V^{-}(e)$ denotes the set of all predecessors of $e$. In reality, only $r_e$ is observed.
We can use this inductive bias for training: the objective is to align the $R_e$ observed from the environment with the parametrized $\prod_{v \in V^-(e)} Q_{\theta}(v)$.
We are therefore in a Q-learning regime and can throw away the on-policy constraint. Our parametrization is $Q_\theta(b \mid V^-(b))$ and $Q_\theta(d \mid V^-(d))$ (written $Q_{\theta}(b)$ and $Q_{\theta}(d)$ from here on), estimating the individual value of each node. The loss, which operates on the end reward, could then be:
$$
\forall V, \quad \text{minimize} \ L\Big(R_e, \ \prod_{v \in V^-(e)} Q_\theta(v)\Big)
$$
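A minimal sketch of this objective, assuming binary end rewards, per-node scores squashed through a sigmoid to give $Q_\theta(v) \in (0, 1)$, and binary cross-entropy as the alignment loss $L$ (these concrete choices are assumptions, not fixed by the design above):

```python
import torch
import torch.nn.functional as F

def design1_loss(pred_node_scores: torch.Tensor, observed_end_reward: torch.Tensor) -> torch.Tensor:
    """Align the observed R_e with the product of per-node values over V^-(e).

    pred_node_scores:    raw scores, one per predecessor of the sink (e.g. b and d).
    observed_end_reward: scalar R_e in {0, 1} obtained by executing the glued program.
    """
    q_values = torch.sigmoid(pred_node_scores)  # Q_theta(v) in (0, 1) for each predecessor v
    predicted_end_reward = q_values.prod()      # product rule: prod_{v in V^-(e)} Q_theta(v)
    return F.binary_cross_entropy(predicted_end_reward, observed_end_reward)

# Toy usage: two predecessors (b and d in the running example), observed R_e = 1.
scores = torch.tensor([1.5, 0.3], requires_grad=True)
design1_loss(scores, torch.tensor(1.0)).backward()
```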
#### Design 2
Another design, which Fabian suggested to me, is the following. We use Monte-Carlo estimation from the environment to estimate a value per node, so we have estimates $\tilde{Q}(b)$ and $\tilde{Q}(d)$. I still need to figure out how to do this estimation. Maybe the estimate is $\tilde{Q}(b) = \mathbb{E}_{e \sim \pi}(R_e \mid b)$, i.e., the average end reward over rollouts conditioned on $b$?
Our loss, which operates at the per-node level, is therefore:
$$
\forall V, \forall v \in V, \quad \text{minimize} \ L(\tilde{Q}(v), Q_\theta(v))
$$
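A minimal sketch of this Monte-Carlo estimate and the per-node regression, assuming each rollout records the set of nodes in its DAG together with the observed end reward $R_e$, and assuming a squared error for $L$ (the data layout and loss choice are assumptions):

```python
from collections import defaultdict

import torch
import torch.nn.functional as F

def monte_carlo_node_values(rollouts):
    """tilde{Q}(v): average end reward R_e over all rollouts whose DAG contains node v.

    Each rollout is assumed to be a dict {"nodes": set_of_node_ids, "end_reward": float}.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for rollout in rollouts:
        for v in rollout["nodes"]:
            totals[v] += rollout["end_reward"]
            counts[v] += 1
    return {v: totals[v] / counts[v] for v in totals}

def design2_loss(q_theta: dict, q_tilde: dict) -> torch.Tensor:
    """Regress the parametrized Q_theta(v) onto the Monte-Carlo target tilde{Q}(v), per node."""
    preds = torch.stack([q_theta[v] for v in q_tilde])
    targets = torch.tensor([q_tilde[v] for v in q_tilde])
    return F.mse_loss(preds, targets)

# Toy usage; in practice a node id would identify the concrete sampled sub-function.
rollouts = [{"nodes": {"a", "b", "c", "d", "e"}, "end_reward": 1.0},
            {"nodes": {"a", "b", "c", "d", "e"}, "end_reward": 0.0}]
q_tilde = monte_carlo_node_values(rollouts)
q_theta = {v: torch.tensor(0.5, requires_grad=True) for v in q_tilde}
design2_loss(q_theta, q_tilde).backward()
```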