## The marginalization is incorrect
**Q1: According to your model decomposition, $p(G_{1:n},\sigma_{1:n})=p(G_{1:n-1}|\sigma_{1:n})p(\sigma_{1:n}|G_n)p(G_n)$. Since $p(G_{1:n-1}|\sigma_{1:n})$ is implemented by a GNN, which does not use node orders as the input, $p(G_{1:n-1}|\sigma_{1:n})$ is the same for any $\sigma_{1:n}$. Then by marginalizing out $\sigma_{1:n}$, we have $p(G_{1:n})=p(G_{1:n-1}|\sigma_{1:n})p(G_n)$ for an arbitrary $\sigma_{1:n}$. If you decide to use a uniform distribution $p(\sigma_{1:n}|G_n)=\frac{1}{n!}$, then equation 8 should be $p(G_{1:n})=(n!)p(G_{1:n},\sigma_{1:n})$.**
This is incorrect.
$p(G_{1:n-1}|\sigma_{1:n})$ has the same nonzero value only for $\sigma_{1:n} \in \Pi(G_{1:n})$, where $\Pi(G_{1:n})$ is the set of all node orderings that give the same graph sequence $G_{1:n}$.
However,
the support of the generation ordering $p(\sigma_{1:n}|G_n)$ is over all $\sigma_{1:n}$. When $\sigma_{1:n} \notin \Pi(G_{1:n})$, the model cannot generate the graph sequence $G_{1:n}$ with the given node ordering and thus $p(G_{1:n-1}|\sigma_{1:n})=0$. The correct way to marginalize over $\sigma_{1:n}$ is thus:
\begin{align}
&\sum_{\sigma_{1:n}}p(G_{1:n-1}|\sigma_{1:n})p(\sigma_{1:n}|G_n)p(G_n)\\
&=\sum_{\sigma_{1:n}\in \Pi(G_{1:n})} p(G_{1:n-1}|\sigma_{1:n})p(\sigma_{1:n}|G_n)p(G_n)+ \sum_{\sigma_{1:n}\notin \Pi(G_{1:n})} 0\cdot p(\sigma_{1:n}|G_n)p(G_n) \\
&= \sum_{\sigma_{1:n}\in \Pi(G_{1:n})} p(G_{1:n-1}|\sigma_{1:n})p(\sigma_{1:n}|G_n)p(G_n) \\
&=p(G_{1:n-1}|\sigma^{*}_{1:n})\,p(G_n)\sum_{\sigma_{1:n}\in \Pi(G_{1:n})} p(\sigma_{1:n}|G_n).
\end{align}
Here $\sigma^{*}_{1:n}$ is any fixed ordering in $\Pi(G_{1:n})$; pulling it out of the sum is valid because $p(G_{1:n-1}|\sigma_{1:n})$ takes the same value for every $\sigma_{1:n}\in \Pi(G_{1:n})$. Since the support of $p(\sigma_{1:n}|G_n)$ is over all possible $\sigma_{1:n}$, $\sum_{\sigma_{1:n}\in \Pi(G_{1:n})} p(\sigma_{1:n}|G_n)\not=1$. Therefore, you cannot marginalize out $\sigma_{1:n}$ from $p(G_{1:n},\sigma_{1:n})=p(G_{1:n-1}|\sigma_{1:n})p(\sigma_{1:n}|G_n)p(G_n)$ to obtain $p(G_{1:n})=p(G_{1:n-1}|\sigma_{1:n})p(G_n)$.
How can $\sum_{\sigma_{1:n}\in \Pi(G_{1:n})} p(\sigma_{1:n}|G_n)$ be computed? This is a non-trivial problem, since the generation ordering $p(\sigma_{1:n}|G_n)$ is not uniform and has no closed-form expression. Instead, $p(\sigma_{1:n}|G_n)$ is modeled separately and set to the distribution of the diffusion ordering $q_{\phi}(\sigma_{1:n}|G_0)$ during training to achieve the optimal point of the VLB (a full illustration can be found in our previous response https://openreview.net/forum?id=98J48HZXxd5&noteId=jdqpPlL6yG).
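To make this concrete, here is a minimal, self-contained sketch (our own toy example, not the paper's code). It assumes a toy 3-node path graph $G_0$ and a forward process that simply masks one node per step; it enumerates all $3!=6$ node orderings, identifies $\Pi(G_{1:n})$ for one target graph sequence (graphs compared up to isomorphism), and checks that $\sum_{\sigma_{1:n}\in\Pi(G_{1:n})} p(\sigma_{1:n}|G_n)\neq 1$ for an arbitrary ordering distribution. Even a uniform ordering distribution would only give $|\Pi(G_{1:n})|/n! = 4/6$ here.

```python
from itertools import permutations

# Toy graph G0: a path 1-2-3 (node 2 is the center).
nodes = (1, 2, 3)
edges = [frozenset({1, 2}), frozenset({2, 3})]

def canonical(nodes_left, edges_left):
    """Isomorphism-invariant form of a small graph: the lexicographically
    smallest relabelled edge set over all relabellings of its nodes."""
    nodes_left = sorted(nodes_left)
    best = None
    for perm in permutations(range(len(nodes_left))):
        relabel = dict(zip(nodes_left, perm))
        rel = tuple(sorted(tuple(sorted(relabel[v] for v in e)) for e in edges_left))
        if best is None or rel < best:
            best = rel
    return (len(nodes_left), best)

def graph_sequence(sigma):
    """G_{1:n}: the graphs obtained by masking the nodes of G0 in the order sigma."""
    seq, masked = [], set()
    for v in sigma:
        masked.add(v)
        left = [u for u in nodes if u not in masked]
        left_edges = [e for e in edges if not (e & masked)]
        seq.append(canonical(left, left_edges))
    return tuple(seq)

orderings = list(permutations(nodes))
weights = [1, 2, 3, 4, 5, 6]                   # an arbitrary non-uniform p(sigma | G_n)
p_sigma = {s: w / sum(weights) for s, w in zip(orderings, weights)}

target = graph_sequence((1, 3, 2))             # one particular graph sequence G_{1:n}
Pi = [s for s in orderings if graph_sequence(s) == target]

print(Pi)                                      # the 4 orderings that mask a leaf node first
print(sum(p_sigma[s] for s in Pi))             # 14/21 ~= 0.667, i.e. not equal to 1
```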
**Q2: However, there is one question not answered: why do Chen et al. also use the same equation as (8)? Is their equation wrong as well? After I studied their paper, I feel that your $p(G_{1:n},\sigma_{1:n})$ is different from theirs. If I assume your distribution $p(\sigma_{1:n}|G_n)$ is uniform, then $p(G_{1:n},\sigma_{1:n})$ is the same for all $\sigma_{1:n}$, but their $p(G_{1:n},\sigma_{1:n})$ does not have this behavior.**
Our $p(G_{1:n},\sigma_{1:n})$ is exactly the same as that of Chen et al. As we explained in Q1, the marginalization is incorrect and you cannot assume $p(\sigma_{1:n}|G_n)$ is uniform. Therefore, our model does not exhibit the behavior that "$p(G_{1:n},\sigma_{1:n})$ is the same for all $\sigma_{1:n}$" either.
**Q3: There might be a misconception that a different node order should lead to a different generation sequence and thus the probability should be different. ...In these two examples, we will have $p(G_{1:n},\sigma_{1:n})$ exactly the same with either $\sigma_{1:n}=(2,3,1,4)$ or $\sigma_{1:n}=(3,2,1,4)$.**
We did consider that different node orderings can lead to the same graph sequence and the same probability $p(G_{1:n},\sigma_{1:n})$. That is exactly why our derivation involves the graph automorphism $\Pi(G_{1:n})$, i.e., the set of all node orderings that give the same graph sequence $G_{1:n}$.
As we illustrated in our previous response (https://openreview.net/forum?id=98J48HZXxd5&noteId=-36LwSy98IA), the main difference between our method and OM is how $p(G_{1:n},\sigma_{1:n})$ is factorized. OM factorizes $p(G_{1:n},\sigma_{1:n})$ as
$$p(G_{1:n},\sigma_{1:n})=p(G_{1:n})p(\sigma_{1:n}|G_{1:n}).$$
As $p_\theta\left(\sigma_{1: n} \mid G_{1: n} \right)$ is a uniform distribution, OM inevitably needs to involve $|\Pi(G_{1:n})|$ in their VLB.
In contrast, we factorize $p_\theta(G_{1: n}, \sigma_{1:n})$ as
$$p_\theta\left(G_{1: n}, \sigma_{1: n}\right) = p_\theta\left(G_{1: n-1} \mid \sigma_{1: n}\right) p\left(\sigma_{1: n} \mid G_n\right) p\left(G_n\right),$$
which has two benefits:
(1) The challenge of modeling the node ordering variable is pushed into the term $p\left(\sigma_{1: n} \mid G_n\right)$. This is a key reason why our VLB does not contain the $\left|\Pi\left(G_{1: n}\right)\right|$ term; instead it contains a KL term (see the sketch after this list). Fortunately, we do not need to explicitly model this distribution: we simply set it to the diffusion ordering $q_{\phi}(\sigma_{1:n}|G_0)$ during training.
(2) Our generator network $p_{\theta}(G_{0:n-1}|G_n,\sigma_{1:n})$ models graph generation conditioned on a fixed node ordering. In contrast, OM models $p(G_{1:n})$ without any exact node ordering information.
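To make the first point concrete, below is a schematic application of the standard variational bound to our factorization, with the variational ordering distribution $q_{\phi}(\sigma_{1:n}|G_0)$; this only illustrates the structure of the bound (a KL term in place of $|\Pi(G_{1:n})|$), not the exact per-step VLB in the paper:
$$\log p_\theta(G_{1:n}) \ge \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\big[\log p_\theta(G_{1:n-1}\mid \sigma_{1:n})\big] + \log p(G_n) - \mathrm{KL}\big(q_{\phi}(\sigma_{1:n}\mid G_0)\,\big\|\,p(\sigma_{1:n}\mid G_n)\big).$$
The KL term is driven to zero by setting $p(\sigma_{1:n}\mid G_n)$ to $q_{\phi}(\sigma_{1:n}\mid G_0)$, which is exactly what we do during training, and $|\Pi(G_{1:n})|$ never appears.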
**I have done a lot of work trying to convert your notation to the adjacency matrix. Your generation model in Section 3.2 essentially generates an adjacency matrix with a particular row order. I think a good practice is to follow notations established by previous work (GraphRNN, OM, and others). If you do so, then your derivation error becomes obvious.**
Our generation model is not equivalent to generating an adjacency matrix with a particular row order, i.e., $p(G_{1:n}|\sigma_{1:n}) \not= p(A_{1:n})$. This is because the row order of an adjacency matrix $A_{1:n}$ does not contain the exact node ordering information: multiple node orderings $\sigma_{1:n}$ can lead to the same adjacency matrix $A_{1:n}$, whereas our model uses the exact node ordering information for generation. This is also why we did not follow the adjacency-matrix notation of previous works: it would lose the exact node ordering information.
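As a small illustration (our own toy sketch, not the paper's notation or implementation): for a 3-node path graph, two different node orderings produce the identical adjacency matrix, so the matrix alone cannot recover which ordering was used.

```python
# Path graph 1-2-3 (node 2 is the center).
edges = [(1, 2), (2, 3)]

def adjacency(order):
    """Adjacency matrix of the path graph with rows/columns in the given node order."""
    idx = {v: i for i, v in enumerate(order)}
    A = [[0] * len(order) for _ in order]
    for u, v in edges:
        A[idx[u]][idx[v]] = A[idx[v]][idx[u]] = 1
    return A

# Two different node orderings yield the identical adjacency matrix, so the
# row order alone cannot recover the exact node ordering.
print(adjacency((1, 2, 3)) == adjacency((3, 2, 1)))  # True
```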
Why can we use the exact node ordering information while previous works cannot? In our framework, the node ordering $\sigma_{1:n}$ is sampled in the forward diffusion process, which essentially assigns a unique ID to each node of $G_0$. In the generation procedure, our starting point is the masked graph $G_n$ with $n$ nodes; our generator network sequentially denoises the node $v_{\sigma_t}$ in $G_n$ in the reverse order of the diffusion ordering during training. In contrast, previous works can only sequentially grow the adjacency matrix, which loses the exact node ordering information, since they do not have the extra diffusion process of our framework.
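For readability, here is a minimal runnable sketch of the generation loop just described; `MaskedGraph`, `sample_ordering`, and `denoise_node` are hypothetical toy stand-ins (a uniformly random ordering and random edge predictions), not our actual networks, and the sketch only illustrates that generation conditions on an explicit node ordering $\sigma_{1:n}$.

```python
import random

class MaskedGraph:
    """Toy stand-in for G_n: n nodes, all masked, no edges."""
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.unmasked = set()
        self.edges = set()

def sample_ordering(graph):
    """Toy stand-in for the node ordering distribution: a uniformly random sigma_{1:n}."""
    sigma = list(range(graph.num_nodes))
    random.shuffle(sigma)
    return sigma

def denoise_node(graph, v, connect_prob=0.5):
    """Toy stand-in for one denoising step: unmask node v and connect it to
    already-unmasked nodes at random (a trained generator would predict this)."""
    for u in graph.unmasked:
        if random.random() < connect_prob:
            graph.edges.add(frozenset({u, v}))
    graph.unmasked.add(v)
    return graph

def generate(n):
    """Denoise one node per step, in the reverse order of sigma_{1:n}."""
    G = MaskedGraph(n)                       # start from the fully masked graph G_n
    sigma = sample_ordering(G)               # node ordering sigma_{1:n}
    for t in range(n, 0, -1):                # t = n, n-1, ..., 1 (reverse order)
        G = denoise_node(G, sigma[t - 1])    # denoise node v_{sigma_t}
    return G                                 # the fully denoised graph

print(generate(4).edges)
```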
Though both $p(G_{1:n}|\sigma_{1:n})$ and $p(A_{1:n})$ are implemented by a GNN, they have different physical meanings. The resulting training objectives are different, and thus the computed values are different.
**For example, you mention that $p(G_{1:n-1}|\sigma_{1:n})=0$ when $\sigma_{1:n} \notin \Pi(G_{1:n})$. However, with any node order $\sigma$, the generative model in section 3.2 can always generate a graph sequence $G_{1:n-1}$ but with nodes labeled according to $\sigma$.**
Yes, every node ordering $\sigma_{1:n}$ can lead to a corresponding graph sequence $G_{1:n}$.
However, during the marginalization, you need to integrate over all $\sigma_{1:n}$ for a particular graph sequence $G_{1:n}$. For $\sigma_{1:n}\notin \Pi(G_{1:n})$, $p(G_{1:n-1}|\sigma_{1:n})=0$, since that node ordering cannot lead to this particular graph sequence $G_{1:n}$.
**Furthermore, you may need to explain this observation: if we use $p(G_{1:n})=(n!)p(G_{1:n},\sigma_{1:n})$, then your variational lower bound matches exactly that of (Chen et al., 2021).**
First, as we explained above, your marginalization is incorrect, and your assumption about $p(\sigma_{1:n}|G_n)$ is also incorrect; therefore, this equation is incorrect, and our model does not have this behavior.
Second, even if this equation were correct, you still could not obtain the exact VLB of Chen et al.: your VLB would involve $n!$ rather than $|\Pi(G_{1:n})|$.