## To all reviewers:
Dear Reviewers,
Thank you all for your comments! We are grateful for the feedback and have improved the manuscript. We have posted a revision of the paper with changes highlighted in red. Our main changes are as follows:
1. We modify the description of the diffusion process as suggested by Reviewer dB1j.
2. We add a subsection in Section 3.5 discussing the differences between our method and OM [1].
3. We highlight in the main results section that our method consistently outperforms OM because it does not suffer from posterior collapse.
4. We quantitatively evaluate the node generation ordering in Table 4.
5. We add a detailed derivation of our variational lower bound (Eq. 3) in Appendix A.3.
6. We empirically verify in Appendix A.9 that the KL-divergence between the diffusion ordering and the generation ordering in Eq. 3 indeed approaches 0.
[1] Chen, Xiaohui, et al, Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation, *ICML 2021*.
## Reviewer dB1j:
**Q1. The variational lower bound is not exactly correct because this work neglects the difference between graph sequences and node orders: a graph sequence is not equivalent to a node ordering, which is analyzed by (Chen et al., 2021).**
Our variational lower bound (VLB) in Eq. 3 is **correct**. This is because for autoregressive diffusion, the latent variable is the graph sequence defined by the diffusion steps, which eliminates the need to model graph automorphisms in the corresponding VLB. In contrast, Chen et al. [1] treat the node ordering as the latent variable and thus must account for graph automorphisms in their VLB. We illustrate this in more detail below.
Following [1], let us define $\mathit{\Pi}(G_{1:n})$ as the set of all node orderings that give the same graph sequence $G_{1:n}$, i.e., $\mathit{\Pi}(G_{1:n})=\{\sigma_{1:n}: G(\sigma_{1:t})=G_t, \forall t=1,\cdots,n \}$.
Following [1], the conditional $p(\sigma_{1:n}|G_{1:n})$ is a uniform distribution, i.e., $p(\sigma_{1:n}|G_{1:n})=\frac{1}{|\mathit{\Pi}(G_{1:n})|}, ~~\forall \sigma_{1:n}\in\mathit{\Pi}(G_{1:n})$.
Then, we have the following two equations:
(1) $p_{\theta}(G_{1:n},\sigma_{1:n})=\frac{1}{|\mathit{\Pi}(G_{1:n})|}p_{\theta}(G_{1:n}), \quad (r.1)$
(2) $q_{\phi}(G_{1:n}|G_0)=|\mathit{\Pi}(G_{1:n})|q_{\phi}(\sigma_{1:n}|G_0) \quad (r.2)$.
To derive the VLB, we first prove that for any function $f(G_{0:n})$ of the graph sequence, its expectation over graph sequences equals its expectation over node orderings. That is:
$\mathbb{E}_{q_{\phi}(G_{1:n}|G_0)}f(G_{0:n})=\mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}f(G_{0:n}). \quad (r.3)$
Proof:
\begin{align}
\mathbb{E}_{q_{\phi}(G_{1:n}|G_0)}f(G_{0:n}) =& \sum_{G_{1:n}} f(G_{0:n})q_{\phi}(G_{1:n}|G_0)\nonumber \\
=&\sum_{G_{1:n}}\frac{1}{|\mathit{\Pi}(G_{1:n})|} \sum_{\sigma_{1:n}\in \mathit{\Pi}(G_{1:n})} f(G_{0:n})q_{\phi}(G_{1:n}|G_0)\nonumber \\
\overset{Eq.(r.2)}{=}&\sum_{G_{1:n}}\frac{1}{|\mathit{\Pi}(G_{1:n})|}\sum_{\sigma_{1:n}\in \mathit{\Pi}(G_{1:n})}f(G_{0:n})|\mathit{\Pi}(G_{1:n})|q_{\phi}(\sigma_{1:n}|G_0)\nonumber\\
=&\sum_{G_{1:n}}\sum_{\sigma_{1:n}\in \mathit{\Pi}(G_{1:n})}f(G_{0:n})q_{\phi}(\sigma_{1:n}|G_0)\nonumber\\
=&\sum_{\sigma_{1:n}} f(G_{0:n})q_{\phi}(\sigma_{1:n}|G_0)\nonumber\\
=&\mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}f(G_{0:n}).
\end{align}
The double sum over all graph sequences and, within each sequence, over its compatible node orderings collapses to a single sum over all node orderings, because every node ordering corresponds to exactly one graph sequence.
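To make this regrouping step concrete, the following toy sketch (ours, not from the paper) enumerates all node orderings of a small graph, groups them by the induced sequence of masked graphs, and checks that the two expectations in Eq. (r.3) coincide. It simplifies the masked graph $G_t$ to the subgraph induced on the not-yet-masked nodes, uses arbitrary placeholders for $q_{\phi}(\sigma_{1:n}|G_0)$ and $f$, and uses the Weisfeiler-Lehman graph hash as a cheap isomorphism proxy (sufficient for a graph this small).
```python
# Minimal numerical illustration of Eq. (r.3): regrouping the expectation over
# node orderings by the induced graph sequence leaves it unchanged whenever the
# integrand depends on the ordering only through that sequence.
# Simplifying assumptions (ours, for illustration only): G_t is represented by
# the subgraph on the not-yet-masked nodes, q(sigma|G_0) is an arbitrary random
# distribution, and f is an arbitrary function of the sequence.
import itertools
from collections import defaultdict

import networkx as nx
import numpy as np

G0 = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])   # toy graph: triangle + pendant node
nodes = list(G0.nodes())
n = len(nodes)

rng = np.random.default_rng(0)
orderings = list(itertools.permutations(nodes))
q_sigma = rng.random(len(orderings))
q_sigma /= q_sigma.sum()                          # placeholder q(sigma | G_0)

def sequence_key(order):
    # order[t-1] is the node absorbed at step t; G_n (all nodes masked) is the
    # same deterministic graph for every ordering, so it is omitted from the key.
    key = []
    for t in range(1, n):
        unmasked = [v for v in nodes if v not in order[:t]]
        Gt = G0.subgraph(unmasked)
        key.append(nx.weisfeiler_lehman_graph_hash(Gt))  # isomorphism proxy
    return tuple(key)

def f(key):
    # any function of the graph sequence works for this check
    return sum(int(h, 16) % 97 for h in key)

keys = [sequence_key(s) for s in orderings]

# Left-hand side: expectation over node orderings.
lhs = sum(q * f(k) for q, k in zip(q_sigma, keys))

# Right-hand side: regroup the orderings by graph sequence, i.e.,
# q(G_{1:n} | G_0) is the total mass of all compatible orderings.
q_seq = defaultdict(float)
for q, k in zip(q_sigma, keys):
    q_seq[k] += q
rhs = sum(q * f(k) for k, q in q_seq.items())

print(np.isclose(lhs, rhs))                       # True
```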
Then, our VLB can be written as:
\begin{align}
\log{p_{\theta}(G_{0})} =& \log{\left(\int p_{\theta}(G_{0:n})\frac{q_{\phi}(G_{1:n}|G_0)}{q_{\phi}(G_{1:n}|G_0)}dG_{1:n}\right)} \nonumber\\
\geq& \mathbb{E}_{q_{\phi}(G_{1:n}|G_0)}\left( \log{p_{\theta}(G_{1:n})} + \log{p_{\theta}(G_0|G_{1:n})} - \log{q_{\phi}(G_{1:n}|G_{0})} \right) \nonumber\\
\overset{Eq.(r.3)}{=}& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\left( \log{p_{\theta}(G_{1:n})} + \log{p_{\theta}(G_0|G_{1:n})} - \log{q_{\phi}(G_{1:n}|G_{0})} \right) \nonumber \\
\overset{Eq.(r.1)}{=}& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\left( \log{|\mathit{\Pi}(G_{1:n})|p_{\theta}(G_{1:n},\sigma_{1:n})} + \log{p_{\theta}(G_0|G_{1:n})} - \log{q_{\phi}(G_{1:n}|G_{0})} \right) \nonumber \\
\overset{Eq.(r.2)}{=}& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\left( \log{|\mathit{\Pi}(G_{1:n})|p_{\theta}(G_{1:n},\sigma_{1:n})} + \log{p_{\theta}(G_0|G_{1:n})} - \log{|\mathit{\Pi}(G_{1:n})|q_{\phi}(\sigma_{1:n}|G_{0})} \right) \nonumber \\
=& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\big( \log{|\mathit{\Pi}(G_{1:n})|p_{\theta}(G_{1:n-1}|\sigma_{1:n})p(\sigma_{1:n}|G_n)p(G_n)} + \log{p_{\theta}(G_0|G_{1:n})} - \log{|\mathit{\Pi}(G_{1:n})|q_{\phi}(\sigma_{1:n}|G_{0})} \big) \nonumber \\
=& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\big( \log{p_{\theta}(G_{1:n-1}|\sigma_{1:n})} + \log{p_{\theta}(G_0|G_{1:n})} + \log{p(\sigma_{1:n}|G_n)} + \log{p(G_n)} - \log{q_{\phi}(\sigma_{1:n}|G_{0})} + \log{|\mathit{\Pi}(G_{1:n})|} - \log{|\mathit{\Pi}(G_{1:n})|} \big) \nonumber \\
=& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\left( \log{p_{\theta}(G_{0:n-1}|\sigma_{1:n}, G_n)} + \log{p(\sigma_{1:n}|G_n)} - \log{q_{\phi}(\sigma_{1:n}|G_{0})} \right) \nonumber \\
=& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)}\left( \log{p_{\theta}(G_{0:n-1}(\sigma_{1:n})|G_n)} + \log{p(\sigma_{1:n}|G_n)} - \log{q_{\phi}(\sigma_{1:n}|G_{0})} \right) \nonumber \\
=& \mathbb{E}_{q_{\phi}(\sigma_{1:n}|G_0)} \sum_{t=0}^{n-1}\log{p_{\theta}(G_t(\sigma_{1:n})|G_{t+1}(\sigma_{1:n}))} - \text{KL}\left(q_{\phi}(\sigma_{1:n}|G_0)\,\|\,p(\sigma_{1:n}|G_n)\right).
\end{align}
We use $G_t(\sigma_{1:n})$ to denote the masked graph at time step $t$ under the node ordering $\sigma_{1:n}$. In the derivation, $\log p(G_n)=0$ because $G_n$ is a deterministic graph in which all nodes are masked. For brevity, we write $p_{\theta}(G_t|G_{t+1})$ for $p_{\theta}(G_t(\sigma_{1:n})|G_{t+1}(\sigma_{1:n}))$ in Eq. 3, since $\sigma_{1:n}$ already appears in the expectation over $q_{\phi}(\sigma_{1:n}|G_0)$.
We have also added this detailed derivation of the VLB in Appendix A.3.
In contrast, OM treats the node ordering as the latent variable and its VLB follows:
\begin{align}
\log p_{\theta}(G) =& \log \sum_{\sigma_{1:n}} p_{\theta}(G_{1:n},\sigma_{1:n}) \nonumber\\
=& \log \sum_{\sigma_{1:n}} \frac{p_{\theta}(G_{1:n},\sigma_{1:n})q(\sigma_{1:n}|G)}{q(\sigma_{1:n}|G)} \nonumber\\
\ge& \mathbb{E}_{q(\sigma_{1:n}|G)}\left[\log p_{\theta}(G_{1:n},\sigma_{1:n})-\log q(\sigma_{1:n}|G)\right] \nonumber\\
=& \mathbb{E}_{q(\sigma_{1:n}|G)}\left[\log \frac{1}{|\mathit{\Pi}(G_{1:n})|}p_{\theta}(G_{1:n})-\log q(\sigma_{1:n}|G)\right] \nonumber
\end{align}
As we can see, OM must compute $|\mathit{\Pi}(G_{1:n})|$ because it needs to model the joint probability $p(G,\sigma)$.
**Q2. Though the formulation is motivated differently, it has a large overlap with (Chen et al., 2021). The diffusion ordering network is roughly the same as the variational distribution of node orders in (Chen et al., 2021). The variational lower bound is also very similar to (Chen et al. 2021). The difference is actually because this work neglects the difference between graph sequences and node orders: a graph sequence is not equivalent to a node ordering, which is analyzed by (Chen et al., 2021).**
Our variational lower bound (VLB) in Eq. 3 differs from that of OM [1] in three aspects:
(1) The VLB in OM includes the entropy of the posterior node ordering, while ours does not. This is because OM treats the node ordering as the latent variable and infers its posterior distribution, similar to a variational autoencoder (VAE). Existing works have shown that this entropy term in VAEs can cause posterior collapse [2,3]. In contrast, our method is based on autoregressive diffusion and has a simpler training objective. We believe this is a key reason why our method consistently outperforms OM empirically.
(2) The VLB in OM needs to compute $|\mathit{\Pi}(G_{1:n})|$, which requires complicated approximation algorithms such as color refinement. Our method avoids this computation thanks to the diffusion-process formulation, as we have shown in Q1.
(3) Our VLB contains a KL-divergence term between the generation ordering and the diffusion ordering. Since graph generation is invariant to node permutations, the first term in Eq. 3, i.e., $\mathbb{E}_{q(\sigma_{1},\cdots,\sigma_{n}|G_0)}\sum_{t}\log{p_{\theta}(G_t|G_{t+1})}$, encourages the model to generate nodes in the exact reverse order of the diffusion process. This pushes the generation ordering close to the diffusion ordering, so the KL-divergence term converges to 0 and can be ignored during training.
As illustrated in Q1, these differences arise because we treat the graph sequences at the intermediate diffusion steps as the latent variable, whereas OM treats the node ordering as the latent variable.
**Q3. The generative procedure is also similar to the standard autoregressive generative procedure: every time adding a new node and the edges connecting to this new node. I don't see the benefit of describing this procedure using the diffusion process.**
As illustrated in Q1 and Q2, the VLB derived from our diffusion process is much simpler than that derived from the standard autoregressive model in [1]. This is because the diffusion process treats the graph sequences as the latent variable, while [1] treats the node ordering as the latent variable. The VLB derived from our diffusion process does not need to compute $|\mathit{\Pi}(G_{1:n})|$ and avoids the posterior collapse issue. Our experimental results also verify that our proposed GRAPHARM consistently outperforms existing autoregressive methods, including OM [1].
**Q4. The complexity of this model should still be O(n^2) (the analysis below equation 2). Anyhow the model needs to make a decision for each node pair. The running time is saved by reducing the number of calls of neural networks (from O(n^2) to O(n)).**
We have updated the description of the complexity in the manuscript to make it more precise. We meant that our model reduces the number of sequential generation steps, and hence the number of network calls, from $\mathcal{O}(n^2)$ to $\mathcal{O}(n)$, which improves generation efficiency; as the reviewer notes, a decision is still made for each node pair.
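To illustrate this point, here is a schematic sketch (ours, not the actual GRAPHARM implementation; `predict_step` is a hypothetical stand-in for one forward pass of the generation network): each of the $n$ sequential steps makes a single network call that returns the new node together with edge decisions for all previously generated nodes, so the $\mathcal{O}(n^2)$ pairwise decisions require only $\mathcal{O}(n)$ calls.
```python
# Schematic sketch of O(n) network calls despite O(n^2) node-pair decisions.
# `predict_step` is a hypothetical stand-in for one forward pass of the
# generation network: given the current partial graph, it returns the new
# node's type and one boolean edge decision per existing node.
from typing import Callable, List, Tuple

PredictFn = Callable[[List[int], List[Tuple[int, int]]], Tuple[int, List[bool]]]

def generate_graph(n: int, predict_step: PredictFn):
    nodes: List[int] = []               # node types generated so far
    edges: List[Tuple[int, int]] = []
    for _ in range(n):                  # n sequential steps => O(n) network calls
        node_type, edge_flags = predict_step(nodes, edges)   # ONE call per step
        new_idx = len(nodes)
        nodes.append(node_type)
        # up to len(nodes) edge decisions are consumed here, all from that single call
        edges.extend((i, new_idx) for i, keep in enumerate(edge_flags) if keep)
    return nodes, edges
```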
**Q5. The description of the diffusion process can be clearer. For example, the definition of "Absorbing Node State" seems to indicate that a node enters the absorbing state after/because it is masked and connected to other nodes with masked edges. But actually, the diffusion process puts the node to an "absorbing node state" -- as a consequence, it is masked and connected to other nodes with masked edges.**
Thanks for the nice suggestion. We have modified the description of the diffusion process in the updated manuscript.
### References
[1] Chen, Xiaohui, et al, Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation, *ICML 2021*.
[2] Lucas, James, et al, Understanding Posterior Collapse in Generative Latent Variable Models, *ICLR 2019 Workshop*.
[3] Lucas, James, et al, Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse, *NeurIPS 2019*.
## Reviewer EBGH:
**Q1. The paper does not clearly explain the "optimal node generation ordering" learned from the posterior. An example demonstrated in Figure 3 in the appendix shows that GraphARM tends to generate one community and then generate the other community. But more insight is required to check why that ordering is preferred in GraphARM.**
The optimal node generation ordering should reflect the graph topology/regularity.
We quantitatively evaluate the generation ordering of GRAPHARM on the community-small dataset and compare it with OA-ARM, which uses a random generation ordering. We first generate 100 graphs with each method and record the node generation orderings. Then, we use spectral graph clustering to partition the nodes of each generated graph into two clusters. Finally, we count the nodes that cross different clusters during the generation procedure (a code sketch of this evaluation is provided after Table 1). As shown in Table 1, the average number of cluster crossings for our method is much smaller than for OA-ARM. This demonstrates that our method tends to generate one cluster first and then add the other, while OA-ARM generates nodes in a random order. Therefore, the generation ordering of GRAPHARM reflects the graph topology/regularity better than that of OA-ARM.
|Method | Average Counts |
|----|---|
| GraphARM| 2.6|
|OA-ARM|6.9|
Table 1. Average number of nodes that cross different clusters on the community-small dataset.
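For completeness, the sketch below (ours, for illustration; the sampler that draws graphs and generation orderings from each model is omitted) shows one way to compute this count with spectral clustering, where a crossing is counted whenever the newly generated node lies in a different cluster than the previously generated one; this is our illustrative reading of the metric.
```python
# Illustrative sketch of the cross-cluster count reported in Table 1 (our
# reading of the metric): partition the generated graph into two clusters with
# spectral clustering, then count how often consecutive generated nodes switch
# clusters along the recorded generation ordering.
import networkx as nx
from sklearn.cluster import SpectralClustering

def cross_cluster_count(G: nx.Graph, generation_order: list) -> int:
    node_list = list(G.nodes())
    adjacency = nx.to_numpy_array(G, nodelist=node_list)
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(adjacency)
    cluster_of = dict(zip(node_list, labels))
    clusters = [cluster_of[v] for v in generation_order]
    return sum(a != b for a, b in zip(clusters, clusters[1:]))

# Usage with a hypothetical sampler returning (graph, generation_order) pairs:
# counts = [cross_cluster_count(G, order) for G, order in sample_graphs(100)]
# average = sum(counts) / len(counts)
```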
**Q2. My concern is its similarity with previous node ordering inference algorithms**
Our method is different from OM [1] in two aspects:
(1) The motivations are different. Our method is motivated by autoregressive diffusion and treats the graph sequences in the diffusion process as the latent variable, whereas OM treats the node ordering as the latent variable and infers its posterior distribution in a way similar to a variational autoencoder (VAE).
(2) Our training objective is much simpler than OM's. First, we do not need to compute the complicated graph automorphism quantity $|\mathit{\Pi}(G_{1:n})|$, which requires approximation algorithms. Second, our training objective does not involve the entropy of the node ordering distribution. Existing works [2,3] have shown that the entropy term in the VAE objective can easily cause posterior collapse. This is a key reason why our method consistently outperforms OM, as shown in our experiments.
We have also added a discussion about the differences between our method and OM in Section 3.5 and a detailed derivation of our variational lower bound in Appendix A.3.
### References
[1] Chen, Xiaohui, et al, Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation, *ICML 2021*.
[2] Lucas, James, et al, Understanding Posterior Collapse in Generative Latent Variable Models, *ICLR 2019 Workshop*.
[3] Lucas, James, et al, Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse, *NeurIPS 2019*.
## Reviewer KoQP:
**Q1. Section 2: The authors mentioned that the absorbing diffusion is the most promising generation method. Can you add some explanation on that?**
Absorbing diffusion is the most promising generation method because (1) it achieves the best performance on both character-level and token-level text generation compared with other discrete diffusion models [1], and (2) it is already widely used in natural language processing; [1] shows that existing masked language models, e.g., BERT, are equivalent to absorbing diffusion.
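To give a sense of what "absorbing" means here, the toy sketch below (ours; the per-step masking probability $1/(T-t+1)$ is one common schedule for absorbing diffusion, not necessarily the exact one used in [1]) corrupts a token sequence by progressively replacing tokens with a [MASK] state that is never reverted, which is why a single-step view of this process resembles BERT-style masked language modeling.
```python
# Toy sketch (ours) of the absorbing forward process on a token sequence:
# at each step, every still-unmasked token is absorbed into a [MASK] state
# with probability 1/(T - t + 1); absorbed tokens never change again.
import numpy as np

MASK = -1
rng = np.random.default_rng(0)

def absorb_step(tokens: np.ndarray, t: int, T: int) -> np.ndarray:
    out = tokens.copy()
    unmasked = np.flatnonzero(out != MASK)
    absorbed = unmasked[rng.random(unmasked.size) < 1.0 / (T - t + 1)]
    out[absorbed] = MASK
    return out

tokens = np.array([3, 7, 7, 1, 5, 2])       # arbitrary token ids
T = 6
for t in range(1, T + 1):
    tokens = absorb_step(tokens, t, T)
print(tokens)                               # every entry is MASK at t = T
```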
**Q2. Section 3.1: The diffusion ordering network produces the probability of the node at time t via equation (1). Can you explain why such an ordering would reflect the topology/regularities of the graph?**
The diffusion ordering network learns a node ordering by optimizing the variational lower bound in our training objective, and the learned ordering reflects the graph topology/regularity better than a random ordering. To verify this, we quantitatively evaluate the generation ordering of GRAPHARM on the community-small dataset and compare it with OA-ARM, which uses a random generation ordering. We first generate 100 graphs with each method and record the node generation orderings. Then, we use spectral graph clustering to partition the nodes of each generated graph into two clusters. Finally, we count the nodes that cross different clusters during the generation procedure (see also our response to Reviewer EBGH, Q1, for a code sketch of this evaluation). As shown in Table 1, the average number of cluster crossings for our method is much smaller than for OA-ARM. This demonstrates that our method tends to generate one cluster first and then add the other, while OA-ARM generates nodes in a random order. Therefore, the generation ordering of GRAPHARM reflects the graph topology/regularity better than that of OA-ARM.
|Method | Average Counts |
|----|---|
| GraphARM| 2.6|
|OA-ARM|6.9|
Table 1. Average number of nodes that cross different clusters on the community-small dataset.
**Q3. Section 3.3: The proposed training objective has ignored the KL-divergence term in equation (3). Can you evaluate such approximation error, ie. calculate the actual KL-divergence and check whether it indeed approaches zero?**
Thanks for the nice suggestion. We have added new results to evaluate the KL approximation error. To do so, we compute the KL-divergence between the diffusion ordering and the generation ordering at each time step and sum over time steps, i.e., $\text{Approximation error} = \sum_{t}\text{KL}\left(q(\sigma_t|G_0,\sigma_{<t})\,\|\,p(\sigma_t|G_{t+1})\right)$. Specifically, we first use GRAPHARM to generate a graph $G_0$ and record the corresponding node generation ordering $\sigma$. Since we do not explicitly model the distribution of the node index during generation, we determine the index of the node generated at each time step $t$ by its connectivity to the unmasked nodes in $G_{t+1}$. Finally, we run the generation network multiple times to obtain an empirical distribution for $p(\sigma_t|G_{t+1})$ (a code sketch of this computation is provided after Table 2). Table 2 reports the approximation error versus the number of training iterations. As training progresses, the approximation error indeed becomes smaller and approaches zero.
|Training iterations| 0|100 |200 |500 |2000 |3000 |
|----|---|---|---|---|---|---|
| Approximation error|0.343|0.262 | 0.131| 0.093 |0.061 |0.049 |
Table 2. Approximation error between the generation ordering and the diffusion ordering on the community-small dataset.
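For reference, a sketch of this computation follows (ours; `q_probs_fn` and `sample_generated_index` are hypothetical stand-ins for the diffusion ordering network's per-step categorical distribution and for one forward pass of the generation network with the node index resolved as described above).
```python
# Sketch (ours) of the approximation error described above: per-step KL between
# the diffusion ordering q(sigma_t | G_0, sigma_{<t}) and an empirical estimate
# of the generation ordering p(sigma_t | G_{t+1}), summed over time steps.
# `q_probs_fn(t)` and `sample_generated_index(t)` are hypothetical stand-ins.
import numpy as np

def kl_categorical(q: np.ndarray, p: np.ndarray, eps: float = 1e-8) -> float:
    q = q / q.sum()
    p = (p + eps) / (p + eps).sum()           # smooth the empirical estimate
    return float(np.sum(q * (np.log(q + eps) - np.log(p))))

def approximation_error(q_probs_fn, sample_generated_index, n_nodes: int,
                        n_samples: int = 200) -> float:
    total = 0.0
    for t in range(n_nodes):
        q_t = np.asarray(q_probs_fn(t))       # q(sigma_t | G_0, sigma_{<t})
        counts = np.zeros(n_nodes)
        for _ in range(n_samples):            # empirical p(sigma_t | G_{t+1})
            counts[sample_generated_index(t)] += 1
        total += kl_categorical(q_t, counts / n_samples)
    return total
```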
**Q4. Table 1: The performance of the proposed GRAPHARM on Cora is not competitive. Can you explain that?**
Following [3], the Cora dataset is constructed by sampling 200 subgraphs from the Cora network [2] using random walks. Due to the random-walk subsampling, the resulting graphs have less regularity in terms of structural patterns. That is likely why our method has less of an advantage on this dataset.
**Q5. Effect of Diffusion Ordering: Can you illustrate the proposed ordering visually to give a sense that it does reflect the topology/regularities of the graph when compared to the random ordering?**
We visualize the generation ordering in Figure 3 (Appendix A.1). As shown there, GRAPHARM first generates one community and then adds the other, which shows that GRAPHARM captures the graph's structural topology during generation. In contrast, OA-ARM generates the graph in a random order.
### References
[1] Austin, Jacob, et al, Structured Denoising Diffusion Models in Discrete State-Spaces, *NeurIPS 2021*.
[2] Sen, Prithviraj, et al, Collective Classification in Network Data, *AI Magazine, 2008*.
[3] Chen, Xiaohui, et al, Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation, *ICML 2021*.
## Reviewer uUZJ:
**Q1. It does not well explain what autoregressive diffusion is.**
We provide a detailed introduction to the autoregressive diffusion process in the Background section. We also describe the connection between autoregressive diffusion and absorbing diffusion in the continuous-time limit in Appendix A.2.
**Q2. Theoretical analysis is not sufficient**