# Response to Reviewer UvHC
We would like to thank the reviewer for the positive rating and detailed comments. We provide point-wise responses to address your concerns below. All discussions will be updated in the final revision to further improve the quality of our paper.
**[Main points, Q1. ]**
> In the case when $k\in PA_i$, what if $f_i$ is linear in $X_k$, $f_j$ is linear in both $X_i$ and $X_k$, and the constant terms happen to add up to 0?
This case can be viewed as a violation of the faithfulness assumption.
- Consider the Gaussian case as an example, wherein $X \sim \mathcal{N}(\mu, \Sigma)$ is a Gaussian vector with mean $\mu$ and non-singular covariance matrix $\Sigma$. In this case, we have $p(x) \propto \exp \left(\frac{-(x-\mu)^T \Sigma^{-1}(x-\mu)}{2}\right),$ which implies $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_k} = -(\Sigma^{-1})_{i,k},$ where the right-hand side denotes the corresponding entry of the inverse covariance matrix. Therefore, if the constant terms happen to add up to zero in the entry $(i,k)$ of the Jacobian of the score function (i.e., $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_k} = 0$), it means that variables $X_i$ and $X_k$ are independent given the rest, which is not true since $k \in \text{PA}_i$.
Therefore, the provided example serves as an illustration of a scenario where the faithfulness assumption is violated.
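For concreteness, the following is a minimal numerical sketch (assuming a hypothetical 3-node linear Gaussian SEM with coefficients chosen purely for illustration) of how such a cancellation makes the precision-matrix entry, and hence the corresponding entry of the score's Jacobian, vanish even though $k \in \text{PA}_i$:

```python
import numpy as np

# Hypothetical 3-node linear Gaussian SEM (indices: k=0, i=1, j=2) with
# edges k->i, k->j, i->j.  X = B X + N,  N ~ N(0, diag(d)).
B = np.zeros((3, 3))
B[1, 0] = 0.5   # coefficient of X_k in f_i
B[2, 0] = 1.0   # coefficient of X_k in f_j
B[2, 1] = 1.0   # coefficient of X_i in f_j
d = np.array([1.0, 1.0, 2.0])   # noise variances, chosen so the terms cancel

I = np.eye(3)
precision = (I - B).T @ np.diag(1.0 / d) @ (I - B)   # inverse covariance of X

# Entry (i, k) of the score Jacobian is -(precision)[i, k]; here it is ~0
# even though k is a parent of i, i.e. the distribution is unfaithful to the DAG.
print(precision[1, 0])   # -> ~0.0
```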
> In the case when $k \in CH_i$, what if $f_k$ is linear in $X_i$? (also should the numerator be $\partial f_k(x_{PA_k})$?)
1. If $f_k$ is a linear function of $X_i$, then $\frac{\partial s_{i}(x)}{\partial x_{k}}$ will be a non-zero constant, which means that $k \in \text{MB}(i)$.
2. Yes, it should be $\frac{\partial f_k(x_{\text{PA}_k})}{\partial x_i}$.
**[Main points, Q2. Real-world applications.]** According to your suggestion, we conducted a comparative analysis of algorithms using a well-known real-world dataset for causal discovery provided by Sachs et al. [1], with 11 nodes, 17 edges, and 853 observations. Our method demonstrated slightly better performance on this dataset, achieving a Structural Hamming Distance (SHD) of 11, outperforming SCORE, DAS, and CAM, which have an SHD of 12, and GraN-DAG, which has an SHD of 13. Examining the resulting DAG, we observed some edges that were either identified or missed by different approaches. An illustrative instance is the correct identification of the edge $1 \rightarrow 5$ by our method, which was not captured by SCORE, DAS, and CAM. Plotting the data revealed a nonlinear relationship between variable 1 and variable 5, while variable 5 and variable 6 exhibited a clear linear pattern. This accurate representation in the final graph highlights the capability of Lemma 2 in mixed linear and nonlinear scenarios. Several of the missing edges coincided with those of SCORE and CAM; upon plotting the corresponding data pairs, these missing edges were consistent with an absence of visible correlation. Furthermore, the correctly recovered partial graph consistently exhibited clear relationships between pairs of variables in all cases. However, utilizing a more refined real dataset, characterized by stronger associations between variables, could potentially yield improved results. This result will be included in our revised paper.
**[Small points, Q1.]** In Equation (3), the notation CH has been corrected and is no longer italicized. Similarly, the notations for PA, CH, and SP in the Proof of Lemma 1 have been made consistent.
**[Small points, Q2.]** The last term in the first line of Equation (3) has been revised to $\left( \frac{n_j}{\sigma_j}\right)^2$.
**[Small points, Q3.]** After Lemma 1, the sentence has been modified as follows: *It includes undirected edges connecting variables if there exists a directed edge between them, and if they are parents of the same node in the DAG.*
**[Small points, Q4.]** The score function of $X_j$ is $s_j(x) = -\frac{x_j - f_j({x_{\text{PA}_j}})}{\sigma_j^2} + \sum_{i \in \text{CH}_j} \frac{\partial f_i({x_{\text{PA}_i}})}{\partial x_j} \frac{x_i - f_i(x_{\text{PA}_i})}{\sigma_i^2}.$ With the assumption that all children have linear causal models, $\frac{\partial}{\partial x_k}\left(\frac{\partial f_i(x_{\text{PA}_i})}{\partial x_j} x_i\right)$ should be zero: under this assumption, $\frac{\partial f_i(x_{\text{PA}_i})}{\partial x_j}$ is a constant $c\in\mathbb{R}$ which does not depend on $x_k$, and therefore $\frac{\partial}{\partial x_k} \left( c x_i\right)$ is zero. Hence, $\forall k \in {\text{PA}}_j$, $\frac{\partial s_j(x)}{\partial x_k} = \frac{1}{{\sigma_j}^2}\frac{\partial f_j({x_{\text{PA}_j}})}{\partial x_k} - \sum_{i \in \{\text{CH}_j \cap \text{CH}_k\}} \frac{\partial}{\partial x_k} \left\{ \frac{\partial f_i({x_{\text{PA}_i}})}{\partial x_j} \frac{f_i(x_{\text{PA}_i})}{\sigma_i^2} \right\}.$
**[Small points, Q5.]** In Lemma 4, the missing backslash for log has been added.
**[Small points, Q6.]** The "=0" in (5) has been moved outside.
**[Small points, Q7.]** The notation of the kernel $\kappa$ between (5) and (6) has been introduced.
**References**
[1] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523–529, 2005.
# Response to Reviewer hxfS
We are very grateful for your time, insightful comments, and encouragement. Your suggestions have been invaluable in clarifying some confusing representations and further enhancing the quality of our paper.
**[Q1. Purely linear, nonlinear data, and non-additive noise.]**
> Pure linear or nonlinear case.
Theoretically, according to Lemma 1 and Lemma 5, our approach demonstrates the capability to identify the Markov Equivalence Class of the ground-truth DAG in scenarios characterized by purely linear data with Gaussian noise. Furthermore, in cases involving purely nonlinear data, our method can identify the true DAG.
According to your suggestion, we further conducted new experiments involving varying percentages of linear and nonlinear components, including the **purely linear case, a 50%-50% linear-nonlinear mixture, and the purely nonlinear case**. We report four metrics, namely SHD, Precision, Recall, and F1 score, for all experiments. Details of this comparison are presented in the following tables. (We conduct the experiment using an ER2 graph with 20 nodes; the results are averaged over 5 trials with different random seeds. A minimal sketch of the kind of data generator such an experiment might use is given after the tables.)
**Purely linear case.**
| | SHD ($\downarrow$) | Precision ($\uparrow$)| Recall ($\uparrow$) | F1 ($\uparrow$) |
| -------- | -------- | -------- | -------- | -------- |
| NoTears | **5.00±2.97** | 0.90±0.06 | 0.92±0.03 | **0.91±0.04** |
| SCORE_CAM | 14.00±5.55 | **0.94±0.07** | 0.66±0.13 | 0.77±0.10 |
| LNMIX (Ours) | 10.00±3.47 | 0.82±0.04 | **0.93±0.06** | 0.85±0.05 |
**Linear 50%; Nonlinear 50%**
| | SHD ($\downarrow$) | Precision ($\uparrow$)| Recall ($\uparrow$) | F1 ($\uparrow$) |
| -------- | -------- | -------- | -------- | -------- |
| NoTears | 30.80±3.76 | 0.80±0.13 | 0.28±0.17 | 0.39±0.09 |
| NoTearsMLP | 27.40±5.64 | 0.65±0.12 | 0.66±0.14 | 0.66±0.12 |
| SCORE_CAM | 7.00±2.86 | **0.97±0.06** | 0.83±0.22 | 0.90±0.12 |
| LNMIX (Ours) | **5.64±1.31** | 0.92±0.04 | **0.91±0.02** | **0.91±0.01** |
**Purely nonlinear case**
| | SHD ($\downarrow$) | Precision ($\uparrow$)| Recall ($\uparrow$) | F1 ($\uparrow$) |
| -------- | -------- | -------- | -------- | -------- |
| NoTearsMLP | 21.32±2.64 | 0.75±0.21 | 0.57±0.15 | 0.65±0.12 |
| SCORE_CAM | **6.00±1.43** | **0.97±0.09** | 0.85±0.02 | **0.90±0.04** |
| LNMIX (Ours) | 8.21±2.32 | 0.82±0.05 | **0.89±0.03** | 0.84±0.03 |
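As referenced above, the following is a minimal sketch of the kind of synthetic generator such an experiment might use; the graph sampler, the choice of nonlinearity ($\sin$), and the coefficient ranges are illustrative assumptions rather than our exact experimental configuration:

```python
import numpy as np

def sample_mixed_anm(d=20, p_edge=0.2, nonlinear_frac=0.5, n=1000, seed=0):
    """Sample data from a random DAG where a given fraction of the causal
    mechanisms is nonlinear and the rest is linear, with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    # Strictly lower-triangular adjacency => acyclic by construction (ER-style graph).
    A = np.tril(rng.random((d, d)) < p_edge, k=-1).astype(float)
    W = A * rng.uniform(0.5, 2.0, size=(d, d)) * rng.choice([-1, 1], size=(d, d))
    nonlinear = rng.random(d) < nonlinear_frac   # which mechanisms are nonlinear
    X = np.zeros((n, d))
    for j in range(d):
        parents = np.nonzero(A[j])[0]
        contrib = X[:, parents] @ W[j, parents] if len(parents) else 0.0
        mean = np.sin(contrib) if nonlinear[j] else contrib   # illustrative nonlinearity
        X[:, j] = mean + rng.normal(scale=1.0, size=n)
    return X, A

X, A = sample_mixed_anm(nonlinear_frac=0.5)   # 50%-50% mixture setting
print(X.shape, int(A.sum()), "edges")
```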
> Non-additive noise model.
According to your suggestion, we further tried to extend our method to several functional model classes in cases where the noises deviate from additivity, such as the **post-nonlinear model** and the **Heteroscedastic noise model**.
1. The post-nonlinear Model.
In the context of post-nonlinear models, $\forall i\in \mathbf{V}$, $X_i = f_i(g_i(X_{PA_i}) + N_i)$, where both $f$ and $g$ are nonlinear functions and $f$ is assumed to be invertible. The post-nonlinear transformation $f$ could represent sensor or measurement distortion in the system. If $f$ in the PNL model is the identity mapping, this model reduces to the nonlinear additive noise model. Let us now suppose that both $X$ and $Y$ are continuous and that $X$ is the direct cause of $Y$; for simplicity we assume that $X$ is one-dimensional and that there is no common cause of $X$ and $Y$. Assume $P(X,N)=P(X)P(N)$ and let $X=h_1(X,N)$ and $Y=h_2(X,N)=f(g(X)+N)$. The Jacobian matrix of the transformation from $(X, N)^T$ to $(X, Y)^T$ is
$J_{h} (X, N) = \begin{bmatrix} \frac{\partial h_1}{\partial X} & \frac{\partial h_1}{\partial N}\\ \frac{\partial h_2}{\partial X} & \frac{\partial h_2}{\partial N} \end{bmatrix} = \begin{bmatrix} 1 & 0\\ \frac{\partial h_2}{\partial X} & \frac{\partial h_2}{\partial N} \end{bmatrix}.$
The absolute value of its determinant is $|\mathbf{J}_h(X, N)|=\left|\frac{\partial f}{\partial N}\right|$, and hence we have $P(X,Y)=\frac{P(X,N)}{|\mathbf{J}_h(X, N)|}=P(X)P(N) \left| \frac{\partial f}{\partial N} \right| ^{-1}$ and $P(Y|X)=\frac{P(X,Y)}{P(X)}= P(N) | \frac{\partial f}{\partial N} | ^{-1}$
The same result will also apply if X contains multiple variables. Hence, the joint probability density function is given by $p(x) = \prod_{i=1}^d p(x_i|x_{PA_i}) = \prod_{i=1}^d p(n_i) \left| \frac{\partial f_i}{\partial n_i} \right| ^{-1}$
The log-likelihood is given by $\log p(x) = \log\prod_{i=1}^d p(x_i|x_{PA_i}) = \sum_{i=1}^d \left[ \log p(n_i) - \log \left| \frac{\partial f_i}{\partial n_i} \right| \right]$
The $i$-th component of the score function $s(x)\equiv\nabla_x \log p(x)$ is
$\begin{aligned}
s_{i}(x)
& = \frac{\partial}{\partial x_i} f_i^{-1}(x_i)\frac{\partial}{\partial n_i}\log p(n_i) - \frac{\partial}{\partial x_i}\log \left| \frac{\partial f_i(g_i(x_{PA_i})+n_i)}{\partial n_i} \right|\\
& + \sum_{k \in \text{CH}_i}\left[\frac{\partial}{\partial x_i}g_k(x_{PA_k})\frac{\partial}{\partial n_k}\log(p(n_k)) - \frac{\partial}{\partial x_i}\log \left| \frac{\partial f_k(g_k(x_{PA_k})+n_k)}{\partial n_k} \right| \right]\\
\end{aligned}$
For post-nonlinear models, we can ascertain the Markov Network structure, although extending the identification results of Lemmas 2 to 4 to this setting remains elusive and requires further investigation.
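As a sanity check of the Markov-network claim above, the following is a small symbolic sketch for a bivariate PNL model; the choices $f = \sinh$ (invertible) and $g(x) = x^2$ are illustrative assumptions, not taken from the paper. It confirms that the cross-derivative of the log-density is not identically zero across the edge, so the edge remains visible in the score's Jacobian:

```python
import sympy as sp

x, y = sp.symbols('x y', real=True)
# Illustrative PNL mechanism (an assumption):
#   X ~ N(0,1),  Y = f(g(X) + N),  f = sinh (invertible),  g(x) = x**2,  N ~ N(0,1)
n = sp.asinh(y) - x**2                      # recovered noise  n = f^{-1}(y) - g(x)
logp = (-x**2/2                             # log p(x) up to constants
        - n**2/2                            # log p_N(n) up to constants
        - sp.log(sp.sqrt(1 + y**2)))        # -log|f'(f^{-1}(y))| = -log cosh(asinh(y))

cross = sp.simplify(sp.diff(logp, x, y))
print(cross)   # 2*x/sqrt(y**2 + 1) -> not identically zero: the edge X - Y appears
```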
2. The Heteroscedastic noise model.
In the context of the **heteroscedastic noise model**, $\forall j\in \mathbf{V}$, $X_{j} = f_{j}(X_{\text{PA}_j}) + h_j(X_{\text{PA}_j})\epsilon_{j}$, where the scaled noise, which may depend on $X_{\text{PA}_j}$, is constructed from a standard Gaussian variable $\epsilon_j \sim \mathcal{N}(0,1)$ and a strictly positive scaling function $h_j: \mathcal{X} \rightarrow \mathbb{R}^{+}$; this means that the variance of the scaled noise variable $h_j(X_{\text{PA}_j})\epsilon_j$ is equal to $h_j^2(X_{\text{PA}_j})$.
The log-likelihood can be expressed by the residuals $r_i = h_i(X_{PA_i})\epsilon_{i} = X_{i} - f_{i}(X_{PA_i})$ as
$\begin{aligned}
\log p(x)
& = \log\prod_{i=1}^d p(x_{i}|x_{\text{PA}_{i}};h_i^2) \\
& = \log\prod_{i=1}^d p(r_{i}|x_{\text{PA}_{i}};h_i^2) \\
& = - \frac{1}{2} \sum_{i=1}^d \log h_i^2(x_{\text{PA}_{i}}) - \frac{1}{2} \sum_{i=1}^d \frac{(x_{i} - f_{i}(x_{\text{PA}_i}))^2}{h_i^2(x_{\text{PA}_{i}})} + d\log\frac{1}{\sqrt{2\pi}}
\end{aligned}$
$\begin{aligned}
s_{i}(x)
& = \frac{\partial}{\partial x_i} \left[-\frac{1}{2}\frac{(x_{i} - f_{i}(x_{\text{PA}_i}))^2}{h_i^2(x_{\text{PA}_{i}})}\right] - \frac{1}{2} \sum_{k \in \text{CH}_i} \frac{\partial}{\partial x_i}\left[ \log h_k^2(x_{\text{PA}_{k}}) + \frac{(x_{k} - f_{k}(x_{\text{PA}_k}))^2}{h_k^2(x_{\text{PA}_{k}})} \right]\\
& = -\frac{x_i - f_i (x_{\text{PA}_{i}})}{h_i^2(x_{\text{PA}_{i}})} - \frac{1}{2} \sum_{k \in \text{CH}_i} \left[\frac{\partial}{\partial x_{i}}\log h_k^2(x_{\text{PA}_{k}})+\frac{g^\prime(x_{\text{PA}_{k}})t(x_{\text{PA}_{k}})-t^\prime(x_{\text{PA}_{k}})g(x_{\text{PA}_{k}})}{t(x_{\text{PA}_{k}})^2}\right]\\
\end{aligned}$
where $g(x_{\text{PA}_{k}}) = (x_k - f_k (x_{\text{PA}_{k}}))^2$, $t(x_{\text{PA}_{k}}) = h_k^2(x_{\text{PA}_{k}})$, $g^\prime(x_{\text{PA}_{k}})= \frac{\partial}{\partial x_i}(x_k - f_k (x_{\text{PA}_{k}}))^2$, and $t^\prime(x_{\text{PA}_{k}}) = \frac{\partial}{\partial x_i} h_k^2(x_{\text{PA}_{k}})$.
Moreover, if $X_i$ is a terminal vertex, then $\frac{\partial}{\partial x_i}s_{i}(x) = -\frac{1}{h_i^2(x_{\text{PA}_{i}})}$ and $\frac{\partial^2}{\partial x_i^2}s_{i}(x) = 0$. Therefore, $\text{var}\left(\frac{\partial^2}{\partial x_i^2}s_{i}(x)\right) = 0$.
For the heteroscedastic noise model, we can therefore examine the derivative of the diagonal of the score's Jacobian to identify leaf nodes. This suggests an approach for recovering the DAG from datasets generated by a heteroscedastic noise model: first recover the topological order through iterative identification of leaf nodes, and then apply a subsequent pruning step.
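As a small symbolic check of the leaf criterion above, the following sketch uses an illustrative two-node heteroscedastic model (the mean and scale functions are assumptions) and verifies that $\frac{\partial^2}{\partial x_i^2} s_i(x)$ vanishes identically for the leaf but not for the root:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
# Illustrative 2-node heteroscedastic model (assumed functional forms):
#   X1 ~ N(0,1),  X2 = f(X1) + h(X1)*eps,  eps ~ N(0,1)
f = sp.sin(x1)          # mean function
h = 1 + x1**2           # strictly positive scale function
logp = (-x1**2/2                          # log p(x1) up to constants
        - sp.log(h)                       # -(1/2) log h^2 = -log h
        - (x2 - f)**2 / (2*h**2))         # residual term

s1 = sp.diff(logp, x1)   # score component of the root
s2 = sp.diff(logp, x2)   # score component of the leaf

print(sp.simplify(sp.diff(s2, x2)))     # -1/(x1**2 + 1)**2  =  -1/h(x1)**2
print(sp.diff(s2, x2, 2))               # 0 : vanishes identically for the leaf
print(sp.simplify(sp.diff(s1, x1, 2)))  # a non-zero expression in x1, x2 for the root
```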
**[Q2. Lower precision.]** The SCORE-CAM method deconstructs causal discovery into two key phases. (1) It establishes a topological ordering that ensures acyclicity, restricting nodes to be parents only to their successors. (2) A pruning step selects the correct subset of edges based on the inferred ordering, eliminating spurious connections.
Efficient recovery of the topological order begins with estimating the Jacobian of the score function ($\nabla_x \log p(x)$). In experiments, leaf nodes are identified by pinpointing the terms with the smallest variance in the Jacobian's diagonal, as opposed to targeting terms with exactly zero variance as in theory. SCORE-CAM may therefore lose precision only occasionally, when the estimated topological order is incorrect and this error propagates to the pruning step. However, SCORE's preference for the smallest variance enhances robustness against estimation errors in the score's Jacobian.
In contrast, our method employs local kernels, which introduces some estimation error and occasional inaccuracies in direction computation. Therefore, the precision of our approach may intermittently be lower than that of SCORE-CAM.
**[Q3. Latent variables.]** (**Difficulty**) When latent variables exist, the challenge arises because the factorization according to the Causal Markov Condition undergoes a shift, now incorporating unobserved variables whose nature is unknown beforehand. (**Possible Solutions**) Current research often focuses on specific scenarios or necessitates additional information from different sources. To the best of our knowledge, a feasible solution within the Jacobian framework remains elusive in the presence of latent variables, thus presenting a promising direction for future exploration. We highlight recent research that may be useful. With a single dataset, Kaltenpoth and Vreeken [1] introduce a method for discovering a nonlinear causal model, specifically a post-nonlinear model [2], with latent confounders based on Variational Autoencoders (VAEs). When dealing with multiple observational datasets from various environments, assuming independent causal mechanisms, Karlsson and Krijthe [3] introduce a theory to identify missing connections between variables due to hidden confounders through conditional independence tests. However, the application of this solution within our framework requires further investigation.
**[Q4. Why neglect those descendant variables of $x_i$?]** The reason for neglecting descendant variables of $x_i$ in the context of the Causal Markov Condition (CMC) and the subsequent factorization lies in the principle of the CMC itself. The **Causal Markov Condition** asserts that *every variable is independent of its non-descendants by conditioning on all of its parents*. In simpler terms, ignoring a variable's effects, all the relevant probabilistic information about a variable that can be obtained from a system is contained in its direct causes. Including descendants in the conditional probability would introduce dependencies beyond those specified by the CMC; hence, considering a variable's descendants in the conditioning is redundant. This condition allows us to rewrite the joint probability distribution $P$ via the well-known Markov factorization: $p({x}) = \prod_{i=1}^d p_i(x_i\, |\,x_{\text{PA}_i}).$ Hence, the log-likelihood of the joint distribution is $\log p(\mathbf{x}) = \log\prod_{i=1}^d p_i(x_i\, |\,x_{\text{PA}_i}) = \sum_{i=1}^d \log p_i(x_i\, |\,x_{\text{PA}_i}).$
**[Q5.]** The second term on the right side of Equation (3) has been revised to $\left( \frac{n_j}{\sigma_j}\right)^2$.
**[Q6.]** Lemma 1 is applicable for purely linear, purely nonlinear, and mixed scenarios. It is also worth noting the Gaussian case, where $X \sim \mathcal{N}(\mu, \Sigma)$ is a Gaussian vector with mean $\mu$ and non-singular covariance matrix $\Sigma$. In this case, we have
$p(x) \propto \exp \left(\frac{-(x-\mu)^T \Sigma^{-1}(x-\mu)}{2}\right),$ which implies $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j} = -(\Sigma^{-1})_{i,j},$ where $(\Sigma^{-1})_{i,j}$ denotes the corresponding entry of the inverse covariance matrix, or precision matrix. For Gaussian random variables, the conditional independence properties are encoded by the sparsity of the precision matrix. That is, the $(i,j)$ entry of the precision matrix is zero if and only if variables $X_i$ and $X_j$ are conditionally independent given the rest. Therefore, Lemma 1 is applicable in all scenarios with Gaussian noise.
Nonetheless, Lemma 2 does not apply in cases of pure linearity. However, if a node's causal model is exclusively nonlinear and the node is a leaf, it conforms to Lemma 3 and Lemma 4. When it is a leaf node, the corresponding diagonal entry of the Jacobian is constant and hence its variance is zero. Since a leaf node has no children, its non-zero entries in the Jacobian correspond only to its parents. This allows us to recursively construct the graph in purely nonlinear scenarios.
# Response to Reviewer amz7
We appreciate the reviewer for the dedicated time, constructive suggestions, and encouraging feedback. We are grateful for your valuable input. Below, we provide detailed responses to address your comments. All discussions will be updated in the final revision to further improve the quality of our paper.
**[Q1. Evaluating Causal Discovery Approaches: A Focus on the Jacobian]** Thanks for your insightful suggestion. We have incorporated the following discussion, complete with an illustrative table, to intuitively depict the relationships between our method and previous research studies.
The Jacobian matrix is widely utilized in generative $(N \rightarrow X)$ and inference $(X \rightarrow N)$ models for identifiability and causal discovery. Various approaches also leverage the Jacobian matrix for different aspects of causal inference. In the line of causal discovery, LiNGAM [4] uses the Jacobian for inferring DAGs in linear scenarios. In contrast, Lachapelle et al. [5] computes the Jacobian of the inference network for enforcing acyclicity in nonlinear additive models. Similarly, Rolland et al. [6] focuses on the Jacobian of the score function within the same model class. The use of Jacobian properties extends into identifiability, where Independent Mechanism Analysis (IMA) assumes the generative model's Jacobian has orthogonal columns [7]. Meanwhile, Zheng et al. [8] establishes identifiability for nonlinear Independent Component Analysis based on the Jacobian of the generative model. Additionally, Atanackovic et al. [9] proposes a Bayesian approach for causal discovery in dynamical systems using the Jacobian of the SEMs. Conversely, Reizinger et al. [10] employs the Jacobian of the inference model for causal models with unconstrained function classes and non-i.i.d. data, providing identifiability guarantees.
Our approach builds upon the Jacobian of the score function, extending the work of Rolland et al. [6] to encompass mixed linear and nonlinear data. For a concise summary, see the table below, an extension of Table 5 in [10].
| Method | $f$ | Data | $J$ | CD | Id. |
|--------------------------------|--------------------------|-----------|---------------------------|-----|-----|
| Shimizu et al. [4] | Linear | Non-Gaussian | $J_{f^{-1}}$ | ✔ | ✔ |
| Lachapelle et al. [5] | Additive | Gaussian | $J_{f^{-1}}$ | ✔ | ✘ |
| Gresele et al. [7] | IMA | ✔ | $J_{f}$ | ✘ | ✔ |
| Zheng et al. [8] | Sparse | ✔ | $J_{f}$ | ✘ | ✔ |
| Rolland et al. [6] | Additive | Gaussian | $J_{\nabla_x \log p(x)}$ | ✔ | ✘ |
| Atanackovic et al. [9] | Cyclic (ODE) | ✔ | $J_{f}$ | ✔ | ✘ |
| Reizinger et al. [10] | ✔ | Assums. 2, F.1 | $J_{f^{-1}}$ | ✔ | ✔ |
| **Ours** | Mixed Linear & Nonlinear | Gaussian | $J_{\nabla_x \log p(x)}$ | ✔ | Partial |
The table's columns are denoted as follows: the $f$ column signifies constraints on the function class of the SEMs, the Data column enumerates restrictions on the data distribution, $J$ describes the Jacobian of the employed function, CD indicates application to causal discovery, and the Id. column denotes whether the method provides identifiability guarantees. For details on Assums. 2 and F.1, please refer to [10].
**[Q2. Real-world applications.]** According to your suggestion, we have conducted a comparative analysis of algorithms using a well-known real-world dataset for causal discovery [11] with 11 nodes, 17 edges, and 853 observations. Our method demonstrated slightly better performance on the Sachs dataset, achieving a Structural Hamming Distance (SHD) of 11, outperforming SCORE, DAS, and CAM, which have an SHD of 12, and GraN-DAG, which has an SHD of 13. Examining the resulting DAG, we observed some edges that were either identified or missed by different approaches. An illustrative instance is the correct identification of the edge $1 \rightarrow 5$ by our method, which was not captured by SCORE, DAS, and CAM. Plotting the data revealed a nonlinear relationship between variable 1 and variable 5, while variable 5 and variable 6 exhibited a clear linear pattern. This accurate representation in the final graph highlights the capability of Lemma 2 in mixed linear and nonlinear scenarios. Several of the missing edges coincided with those of SCORE and CAM; upon plotting the corresponding data pairs, these missing edges were consistent with an absence of visible correlation. Furthermore, the correctly recovered partial graph consistently exhibited clear relationships between pairs of variables in all cases. However, utilizing a more refined real dataset, characterized by stronger associations between variables, could potentially yield improved results.
**[Q3. Definition of moral graph.]** The definition of the moral graph has been introduced in Section 3, under the heading "Preliminaries." While an adjustment to the current manuscript version is not feasible at this time, this update will be incorporated in the upcoming revision.
**[Q4. High-level overview of the Contributions.]** Our contributions can be summarized as follows:
- In a practical setting with mixed linear and nonlinear causal mechanisms, our approach surpasses traditional algorithms like the PC algorithm, expanding the scope of identified directions beyond the Markov Equivalence class.
- We introduce an algorithm for inferring causal graphs in ANMs, accommodating both non-identifiable and identifiable components. Our approach broadens applicability, offering a robust alternative with theoretical guarantees, especially in challenging settings where validating Gaussian noise assumptions is difficult.
- Our algorithm alleviates the need for exhaustive searches, addressing concerns related to multiple testing. It enables efficient parallel processing, reducing computational complexity with a larger number of variables.
While an adjustment to the current manuscript version is not feasible at this time, this update will be incorporated in the upcoming revision.
**[Q5. Rationale for Computing the Jacobian's Support.]** In Lemma 1, we establish that vertex $j$ is excluded from the Markov Blanket of vertex $i$ if and only if the Jacobian of the score function, denoted as $\frac{\partial s_i(x)}{\partial x_j}$, is zero. In Section 3, we introduce the notion that for any matrix $\mathbf{M}$, its support set is denoted as $\text{Supp}(\mathbf{M}) = \left\{(i,j) \in \mathbf{V} \times \mathbf{V} \, |\, \mathbf{M}_{i,j}\neq 0 \right\}$. When considering vertices $i$ and $j$, the Jacobian of the score function is computed, yielding entries that are either non-zero or nearly zero (attributed to estimation errors, and deemed zero if below a specified threshold). Utilizing the support set, if the entry $(i,j)$ in the Jacobian of the score function is zero, it signifies the absence of an edge between vertices $i$ and $j$. Conversely, a non-zero entry $(i,j)$ denotes the presence of an edge between vertices $i$ and $j$. The computation of $\text{Supp}(\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j})$ thus facilitates the derivation of the moral graph.
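A minimal sketch of this support computation, assuming an already-estimated matrix `J_hat` of (averaged absolute) score-Jacobian entries; the threshold value and the toy matrix are illustrative:

```python
import numpy as np

def moral_graph_from_score_jacobian(J_hat, threshold=1e-2):
    """Given an estimate J_hat[i, j] of |d s_i(x) / d x_j| (e.g. averaged over
    samples), return the adjacency matrix of the estimated moral graph."""
    support = np.abs(J_hat) > threshold          # Supp(M) = {(i,j) : M_ij != 0}
    support = support | support.T                # the moral graph is undirected
    np.fill_diagonal(support, False)             # no self-loops
    return support.astype(int)

# Toy example: entry (0, 2) is almost zero -> no edge between nodes 0 and 2.
J_hat = np.array([[1.0, 0.8, 1e-4],
                  [0.8, 1.2, 0.5 ],
                  [1e-4, 0.5, 0.9]])
print(moral_graph_from_score_jacobian(J_hat))
```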
**References**
[1] David Kaltenpoth and Jilles Vreeken. Nonlinear causal discovery with latent confounders. In International Conference on Machine Learning, pages 15639–15654. PMLR, 2023.
[2] Kun Zhang and Aapo Hyvarinen. On the identifiability of the post-nonlinear causal model. arXiv preprint arXiv:1205.2599, 2012.
[3] Rickard Karlsson and Jesse H Krijthe. Detecting hidden confounding in observational data using multiple environments. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[4] Shohei Shimizu, Patrik O Hoyer, Aapo Hyvärinen, Antti Kerminen, and Michael Jordan. A linear non-gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7(10), 2006.
[5] Sébastien Lachapelle, Philippe Brouillard, Tristan Deleu, and Simon Lacoste-Julien. Gradient- based neural dag learning. arXiv preprint arXiv:1906.02226, 2019.
[6] Paul Rolland, Volkan Cevher, Matthäus Kleindessner, Chris Russell, Dominik Janzing, Bernhard Schölkopf, and Francesco Locatello. Score matching enables causal discovery of nonlinear additive noise models. In International Conference on Machine Learning, pages 18741–18753. PMLR, 2022.
[7] Luigi Gresele, Julius Von Kügelgen, Vincent Stimper, Bernhard Schölkopf, and Michel Besserve. Independent mechanism analysis, a new concept? Advances in neural information processing systems, 34:28233–28248, 2021.
[8] Yujia Zheng, Ignavier Ng, Yewen Fan, and Kun Zhang. Generalized precision matrix for scalable estimation of nonparametric markov networks. arXiv preprint arXiv:2305.11379, 2023
[9] Lazar Atanackovic, Alexander Tong, Jason Hartford, Leo J Lee, Bo Wang, and Yoshua Bengio. Dyngfn: Bayesian dynamic causal discovery using generative flow networks. arXiv preprint arXiv:2302.04178, 2023.
[10] Patrik Reizinger, Yash Sharma, Matthias Bethge, Bernhard Schölkopf, Ferenc Huszár, and Wieland Brendel. Jacobian-based causal discovery with nonlinear ica. Transactions on Machine Learning Research, 2022.
[11] Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein signaling networks derived from multiparameter single-cell data. Science, 308(5721): 523–529, 2005.
# Response to Reviewer Ez9N
We greatly appreciate the reviewer’s time, insightful comments, and constructive suggestions. Your contributions have greatly enriched the quality and depth of our work. Thank you sincerely for your dedicated and thoughtful review.
**[Q1. Assumptions.]** In this study, all analyses fall within the assumption of causal sufficiency, i.e., there are no latent variables in the true causal model. A dedicated section for these assumptions will be included in the revised manuscript, outlined as follows:
Let $\mathcal{G}$ be a causal graph with vertex set $\mathbf{V}$ and $P$ be a probability distribution over the vertices in $\mathbf{V}$ generated by the causal structure represented by $\mathcal{G}$.
**Assumption 1** (Causal Sufficiency). All confounders of the relevant variables are observed in the given dataset.
**Assumption 2** (Causal Markov Condition). $\mathcal{G}$ and $P$ satisfy the Causal Markov Condition if and only if for every $W$ in $\mathbf{V}$, $W$ is independent of $\mathbf{V}\backslash(\text{DE}(W) \cup \text{PA}(W))$ given $\text{PA}(W)$.
**Assumption 3** (Faithfulness). $\mathcal{G}$ and $P$ satisfy the Faithfulness if and only if every conditional independence relation true in $P$ is entailed by the Causal Markov Condition applied to $\mathcal{G}$.
**[Q2. Terminology of ‘spouse’.]** The term "spouse" in describing the Markov Blanket is widely used in the field of causal discovery [1,2,3]. For instance, Pellet and Elisseeff [2] state that "*Assuming faithfulness, $\textbf{Mb}(X)$ is unique. In a DAG, it corresponds to the parents, children, and children's parents (spouses) of X.*" We appreciate your suggestion; however, our usage of the term aligns with these established definitions from prior work in the field.
**[Q3. Requirement of Gaussian Noise.]**
1. (Gaussian noise.) Yes, we assume Gaussian noise in our work resulting in a partial DAG in the context of Mixed Linear and Nonlinear Additive **Gaussian** Noise Models. We will update the title accordingly. It's essential to note that while Gaussianity is assumed for our approach, not all lemmas necessitate this assumption.
2. (Arbitrary noise.) When dealing with arbitrary noise, NoGAM [4] identifies the leaf node by seeking a consistent estimator capable of precisely predicting the score function associated with a leaf based on residual estimation. Although Lemma 1 and Lemma 5 are applicable to all distributions, Lemmas 2 to 4 are specifically tailored to the Gaussian case. Furthermore, our primary objective is to uncover the DAG within data generated by a mixed linear and nonlinear Gaussian additive noise model, where the identifiability of the linear Gaussian component is challenging. Therefore, this aspect is not currently within our scope but may serve as a potential direction for future exploration.
**[Q4. Enumeration of equations.]** We appreciate your suggestion. Although we are unable to modify the current version of the manuscript at this stage, we will ensure that all equations are enumerated in the upcoming revision.
**[Q5. Typo in Equation (3).]** The second term on the right side of Equation (3) has been revised to $\left( \frac{n_j}{\sigma_j}\right)^2$.
**[Q6. Summation cancels out in the calculation of the Jacobian and the unicity of MB under faithfulness assumption.]**
> Summation cancels out in the calculation of the Jacobian
This can be viewed as a violation of the faithfulness assumption. It is worth noting the Gaussian case, where $X \sim \mathcal{N}(\mu, \Sigma)$ is a Gaussian vector with mean $\mu$ and non-singular covariance matrix $\Sigma$. In this case, we have $p(x) \propto \exp \left(\frac{-(x-\mu)^T \Sigma^{-1}(x-\mu)}{2}\right),$ which implies $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_k} = -(\Sigma^{-1})_{i,k},$ where $(\Sigma^{-1})_{i,k}$ denotes the corresponding entry of the inverse covariance matrix. Therefore, if the constant terms happen to add up to zero in the entry $(i,k)$ of the Jacobian of the score function (i.e., $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_k} = 0$), it means that variables $X_i$ and $X_k$ are independent given the rest, which is not true since $k \in \text{PA}_i$. Therefore, the provided example serves as an illustration of a scenario where the faithfulness assumption is violated.
> The unicity of MB under faithfulness assumption
The if and only if condition requires the faithfulness assumption outlined in Lemma 1. Several studies, such as those by [5,6,7] and Nilsson et al. [8] have established that identifying the unique Markov blanket relies on the faithfulness assumption. In all these works, faithfulness serves as a crucial prerequisite for validating the correctness of the algorithms.
The proof can be summarized as follows: in the context of directed graphical models, the Markov blanket of a node $X$, denoted as $\text{MB}(X)$, consists of its parents, children, and children's parents (spouses). Notably, it is a straightforward property that $X \in \text{MB}(Y) \,\,\Longleftrightarrow\,\, Y \in \text{MB}(X).$ A key property of Markov blankets is stated as follows: **Property** In the context of a faithful causal graph $\mathcal{G}$, we have: $\forall X, Y \in \mathbf{V}: (X\in\text{MB}(Y)\,\,\Longleftrightarrow\,\,(X\not\!\perp\!\!\!\perp Y | \mathbf{V}\backslash\{X,Y\})).$ This property asserts that the Markov blanket of each node is the set of all variables that are dependent on it conditioned on all other variables, and the faithfulness assumption guarantees the unicity of $\text{MB}(Y)$.
**[Q7. Comparison to SCORE and DAS.]** SCORE algorithm initiates the process by efficiently recovering the topological order through the estimation of the Jacobian of the score function (i.e., $\nabla_x \log p(x)$). Leaf nodes are identified by locating terms with zero variance in the diagonal of the score's Jacobian. The fully connected DAG is then pruned using the method proposed in CAM [9]. Recognizing the computational demands of the pruning step, Montagna et al. [10] proposed DAS, which eliminates this step by solely relying on the Jacobian of the score, resulting in enhanced efficiency.
1. (**Purely linear case.**) Both SCORE [11] and DAS [10] fail to identify the causal graph if $f_i$ is linear in the model (Equation (1)), $\forall i = 1,...,d$. The topological ordering method on a set $X \in \mathbb{R}^d$ fails in the linear setting, as the topological ordering between variables itself does not disambiguate the direction of edges in the DAG. For example, in the case of an SCM defined in model (1) with linear functions $f_i$, the diagonal of the Jacobian of the score function satisfies $\frac{\partial s_i}{\partial x_i} = -\frac{1}{\sigma_i^2} + c$ for every node $i$ in the graph, where $c \in \mathbb{R}$ is a constant. Thus, $\text{var}\left[\frac{\partial s_i}{\partial x_i}\right]=0, \forall i = 1,...,d$. Since the variance vanishes for each node $i$ rather than for leaves only, the criterion for identifying the leaf node (Lemma 4 and Lemma 5) no longer holds, leading to a failure of the topological ordering method in the linear case (a numerical sketch of this failure mode is given after this list). However, with purely linear data, the Markov Equivalence class can still be attained in our work using Lemma 1 to derive the Markov Network and Lemma 5 to address colliders, similar to the methodologies employed in constraint-based and score-based approaches. Nevertheless, our approach demonstrates superior scalability as the number of nodes increases.
2. (**Purely nonlinear and the mixed cases.**) In the context of mixed linear and nonlinear data, both SCORE and DAS again encounter challenges due to the non-identifiability of linear Gaussian data, where the variance of the diagonal of the score's Jacobian can also vanish for non-leaf nodes with linear mechanisms, so that zero variance no longer singles out the leaf nodes. Consequently, our approach significantly broadens the scope of applicability for causal discovery, providing theoretical guarantees in situations that pose difficulties for other existing methods. Overall, both SCORE and DAS are only suitable for a nonlinear additive Gaussian noise model, whereas our method can handle linear (up to the Markov Equivalence Class), nonlinear, and mixed linear and nonlinear scenarios.
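As referenced in point 1 above, the following numerical sketch contrasts the two regimes using oracle (closed-form) scores for two hypothetical 3-node chains; the specific mechanisms and unit coefficients are illustrative assumptions. In the purely linear chain every diagonal entry of the score's Jacobian is constant, so the variance criterion is uninformative, whereas in the nonlinear chain only the leaf has zero variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)

# ---- Purely linear chain X1 -> X2 -> X3 (coefficients a = b = 1, unit noise). ----
# Oracle score-Jacobian diagonal: ds1/dx1 = -1 - a**2, ds2/dx2 = -1 - b**2,
# ds3/dx3 = -1, i.e. a constant for *every* node.
diag_lin = np.stack([np.full(n, -2.0), np.full(n, -2.0), np.full(n, -1.0)])
print("linear   :", diag_lin.var(axis=1))   # [0. 0. 0.] -> criterion cannot find the leaf

# ---- Nonlinear chain: X2 = sin(X1) + N2,  X3 = X2**3 + N3 (illustrative mechanisms). ----
x2 = np.sin(x1) + rng.normal(size=n)
x3 = x2**3 + rng.normal(size=n)
diag_nl = np.stack([-1 - np.sin(x1)*(x2 - np.sin(x1)) - np.cos(x1)**2,   # ds1/dx1
                    -1 + 6*x2*(x3 - x2**3) - 9*x2**4,                    # ds2/dx2
                    np.full(n, -1.0)])                                   # ds3/dx3 (leaf)
print("nonlinear:", diag_nl.var(axis=1))    # only the leaf X3 has zero variance
```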
**[Q8. PNL model.]** In our study, we restrict our focus on the identifiable Additive Noise Models (ANMs). However, it is worthwhile to undertake an exploration of the PNL model as you suggested.
For all $i\in \mathbf{V}$, $X_i = f_i(g_i(X_{PA_i}) + N_i)$, where both $f$ and $g$ are nonlinear functions and $f$ is assumed to be invertible. The post-nonlinear transformation $f$ could represent sensor or measurement distortion in the system. If $f$ in the PNL model is the identity mapping, this model reduces to the nonlinear additive noise model. Let us now suppose that both $X$ and $Y$ are continuous and that $X$ is the direct cause of $Y$; for simplicity we assume that $X$ is one-dimensional and that there is no common cause of $X$ and $Y$. Assume $P(X,N)=P(X)P(N)$ and let $X=h_1(X,N)$ and $Y=h_2(X,N)=f(g(X)+N)$. The Jacobian matrix of the transformation from $(X, N)^T$ to $(X, Y)^T$ is
$J_{h} (X, N) = \begin{bmatrix} \frac{\partial h_1}{\partial X} & \frac{\partial h_1}{\partial N}\\ \frac{\partial h_2}{\partial X} & \frac{\partial h_2}{\partial N} \end{bmatrix} = \begin{bmatrix} 1 & 0\\ \frac{\partial h_2}{\partial X} & \frac{\partial h_2}{\partial N} \end{bmatrix}.$
The absolute value of its determinant is $|\mathbf{J}_h(X, N)|=\left|\frac{\partial f}{\partial N}\right|$, and hence we have $P(X,Y)=\frac{P(X,N)}{|\mathbf{J}_h(X, N)|}=P(X)P(N) \left| \frac{\partial f}{\partial N} \right| ^{-1}$ and $P(Y|X)=\frac{P(X,Y)}{P(X)}= P(N) | \frac{\partial f}{\partial N} | ^{-1}$
The same result will also apply if X contains multiple variables. Hence, the joint probability density function is given by $p(x) = \prod_{i=1}^d p(x_i|x_{PA_i}) = \prod_{i=1}^d p(n_i) \left| \frac{\partial f_i}{\partial n_i} \right| ^{-1}$
The log-likelihood is given by $\log p(x) = \log\prod_{i=1}^d p(x_i|x_{PA_i}) = \sum_{i=1}^d \left[ \log p(n_i) - \log \left| \frac{\partial f_i}{\partial n_i} \right| \right]$
The $i$-th component of the score function $s(x)\equiv\nabla_x \log p(x)$ is
$\begin{aligned}
s_{i}(x)
& = \frac{\partial}{\partial x_i} f_i^{-1}(x_i)\frac{\partial}{\partial n_i}\log p(n_i) - \frac{\partial}{\partial x_i}\log \left| \frac{\partial f_i(g_i(x_{PA_i})+n_i)}{\partial n_i} \right|\\
& + \sum_{k \in \text{CH}_i}\left[\frac{\partial}{\partial x_i}g_k(x_{PA_k})\frac{\partial}{\partial n_k}\log(p(n_k)) - \frac{\partial}{\partial x_i}\log \left| \frac{\partial f_k(g_k(x_{PA_k})+n_k)}{\partial n_k} \right| \right]\\
\end{aligned}$
In the context of post-nonlinear models, we can ascertain the Markov Network structure, although Lemmas 2 to 4 are explicitly designed for situations involving non-identifiable linear Gaussian. Our main focus lies in uncovering the DAG from datasets generated by a mixed linear and nonlinear Gaussian additive noise model, presenting a challenge in identifying the linear Gaussian component. Regarding other causal models that are not identifiable within the PNL framework, as outlined in Table I in [12], we currently lack a definitive solution for them, and addressing this specific challenge is beyond the present scope of our study. Nevertheless, we plan to investigate this aspect further, considering it as a potential direction for future exploration.
**[Q9. Requirement of Gaussianity in the equation preceding 3.]** Your observation is accurate. Gaussianity is not a prerequisite for Lemma 1. The equation preceding Equation (3) does not require Gaussianity either, while Equation (3) itself does rely on Gaussianity. As an example, when the noise variable follows a Gumbel distribution, Lemma 1 remains valid. Suppose $N_{i}$ is non-Gaussian, for example $N_{i} \sim \text{Gumbel}(0,1)$. The $i$-th component of the score function is
$\begin{aligned}
s_{i}(x) &= -\frac{\partial}{\partial n_i} (n_i + e^{-n_i}) + \sum_{k \in CH_i} \frac{\partial}{\partial x_{i}}f_{k}(x_{PA_k}) \frac{\partial}{\partial n_k} (n_k + e^{-n_k})\\
& = - (1 - e^{-(x_i - f_i(x_{PA_i}))}) + \sum_{k \in CH_i} \frac{\partial}{\partial x_{i}}f_{k}(x_{PA_k})(1 - e^{-(x_k - f_k(x_{PA_k}))})\\
\end{aligned}$.
An observation from the equation is that if node $j$ is part of the Markov blanket of node $i$, then $\frac{\partial s_i(x)}{\partial x_j} \neq 0$. Hence, Lemma 1 remains valid. In the revised manuscript, the proof steps that hold without the assumption of Gaussianity are presented in general terms, and we substitute the Gaussian form of the score only when necessary.
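The following small symbolic sketch illustrates this for a two-node chain with Gumbel noise; the mechanism $f(x) = x^3$ is an illustrative assumption. The off-diagonal entry of the score's Jacobian is not identically zero, consistent with Lemma 1:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
# Illustrative 2-node chain X1 -> X2 with Gumbel(0,1) noise (an assumption):
#   log p_N(n) = -(n + exp(-n));  X2 = f(X1) + N2 with f(x) = x**3 for example.
f = x1**3
n1, n2 = x1, x2 - f
logp = -(n1 + sp.exp(-n1)) - (n2 + sp.exp(-n2))

s1 = sp.diff(logp, x1)               # matches the closed form given above
print(sp.simplify(sp.diff(s1, x2)))  # 3*x1**2*exp(x1**3 - x2): non-zero, as 2 is in MB(1)
```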
**[Q10. Use of "In other words".]** We have made the change to "Alternatively."
**[Q11. Adjustment of Q7]** Yes, it should indeed be $\text{PA}_j$, and what we intend to express there is that the function has no dependence on $(\text{SP}_i \cup i)\backslash \text{PA}_j$.
**[Q12. Typo in Appendix A.]** Yes, those were typos. They will be corrected to $x$ in the final version.
**[Q13. Clarification of Nonlinearity.]**
- Certainly, Equation 10 is applicable without the necessity for nonlinearity, making it suitable for both linear and nonlinear scenarios.
- Yes, it is crucial to emphasize that, in this study, linearity is not encompassed by nonlinearity. In Lemma 2, it is explicitly stated that the function must be nonlinear (excluding linearity) and twice continuously differentiable. In the **Preliminaries** section, we have provided a clear definition of a nonlinear function within the **Model definition**. We have also adjusted the sufficient condition in the proof of Lemma 2 in Appendix B accordingly.
**[Q14. Clarification of Equation 10.]**
- Considering the SEMs outlined above, the score function of $X_j$ is as follows, $s_j(x) = -\frac{x_j - f_j({x_{\text{PA}_j}})}{\sigma_j^2} + \sum_{i \in \text{CH}_j} \frac{\partial f_i({x_{\text{PA}_i}})}{\partial x_j} \frac{x_i - f_i(x_{\text{PA}_i})}{\sigma_i^2}$
When we do not assume that the $f_i$ are linear functions, then $\forall k \in {\text{PA}}_j$, it should be $\frac{\partial s_j(x)}{\partial x_k} = \frac{1}{{\sigma_j}^2}\frac{\partial f_j({x_{\text{PA}_j}})}{\partial x_k} + \sum_{i \in \{\text{CH}_j \cap \text{CH}_k\}} \frac{\partial}{\partial x_k} \left\{ \frac{\partial f_i({x_{\text{PA}_i}})}{\partial x_j} \frac{x_i - f_i(x_{\text{PA}_i})}{\sigma_i^2} \right\}$
But in Lemma 2, since we state that $f_j$ is a nonlinear function and all children of node $j$ have causal models that are linear functions of $X_j$ and their associated parents, the $f_i$ are linear functions; therefore, $\frac{\partial}{\partial x_k}\left(\frac{\partial f_i(x_{\text{PA}_i})}{\partial x_j} x_i\right)$ should be zero, since $\frac{\partial f_i(x_{\text{PA}_i})}{\partial x_j}$ is a constant $c\in\mathbb{R}$ which does not depend on $x_k$, and hence $\frac{\partial}{\partial x_k} \left( c x_i\right)$ is zero. This leads to $\forall k \in {\text{PA}}_j$, $\frac{\partial s_j(x)}{\partial x_k} = \frac{1}{{\sigma_j}^2}\frac{\partial f_j({x_{\text{PA}_j}})}{\partial x_k} - \sum_{i \in \{\text{CH}_j \cap \text{CH}_k\}} \frac{\partial}{\partial x_k} \left\{ \frac{\partial f_i({x_{\text{PA}_i}})}{\partial x_j} \frac{f_i(x_{\text{PA}_i})}{\sigma_i^2} \right\}$ and $\frac{\partial s_j(x)}{\partial x_k} = \frac{1}{{\sigma_j}^2}\frac{\partial f_j({x_{\text{PA}_j}})}{\partial x_k} + c$, where the second term on the right-hand side of the equation is a constant that does not depend on any variable. Since $f_j(x_{\text{PA}_j})$ is a nonlinear function of $x_k$, $\frac{\partial s_j(x)}{\partial x_k}$ depends on $x_k$ and hence $\text{Var}_x[\frac{\partial s_j(x)}{\partial x_k}] \neq 0$ (see the symbolic sketch after this list).
- Yes, it should be a negative sign in the equation below equation 10.
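As referenced above, the following is a small symbolic sketch of the Lemma 2 argument for a chain $X_k \rightarrow X_j \rightarrow X_i$ with unit noise variances; the specific mechanisms are illustrative assumptions. The entry $\frac{\partial s_j(x)}{\partial x_k}$ varies with $x_k$ when $f_j$ is nonlinear, and is constant when $f_j$ is linear:

```python
import sympy as sp

xk, xj, xi, a, c = sp.symbols('x_k x_j x_i a c', real=True)

def dsj_dxk(f_j):
    """Entry (j, k) of the score Jacobian for the chain X_k -> X_j -> X_i, with
    X_j = f_j(X_k) + N_j (unit-variance Gaussian) and linear child X_i = c*X_j + N_i."""
    logp = -xk**2/2 - (xj - f_j)**2/2 - (xi - c*xj)**2/2
    return sp.simplify(sp.diff(logp, xj, xk))

print(dsj_dxk(sp.sin(xk)))   # cos(x_k): depends on x_k -> non-zero variance (Lemma 2 case)
print(dsj_dxk(a*xk))         # a: a constant -> zero variance (purely linear mechanism)
```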
**[Q15. Clarification of Linearity and Nonlinearity.]** Throughout this study, **nonlinearity specifically excludes linearity**. Following the clarification in Q13, adjustments will be made to the Model definition in the Preliminary section. To provide clarity on when nonlinearity excludes linearity and the role of Gaussianity, we go through each lemma and its corresponding proof as follows:
- Lemma 1: The statement mentions that "*...a random variable $X$ defined via an additive **Gaussian** noise model*" but does not specify linearity or nonlinearity. This means it applies to both linear and nonlinear, as well as their combinations. The proof aligns with this notion. Although this lemma does not strictly require Gaussianity, the proof is conducted based on Gaussianity.
- Lemma 2: The statement involves "*...random variables $X \in \mathbb{R}^d$ defined via a model involving a combination of **linear and nonlinear** additive **Gaussian** noise ANMs. Suppose $X_j$'s causal model $f_j$ is nonlinear, and all children of node $j$ have linear causal models.*" Here, nonlinearity explicitly rules out linearity. In the context of this lemma, if i) were to incorporate the possibility of nonlinearity including linearity, then ii) would not be valid. Therefore, in this lemma, we consistently assume that nonlinearity does not encompass linearity. Furthermore, this lemma necessitates the assumption of Gaussianity.
- Lemma 3 \& Lemma 4: The statements indicate that "*...a random variable $X$ defined via a **nonlinear** additive **Gaussian** noise model*", where nonlinear excludes linearity, and Gaussianity is assumed.
- Lemma 5: Similar to Lemma 1, it is suitable for both linear and nonlinear cases, and Gaussianity is not a strict requirement.
**[Q16. Requirement of Gaussianity.]**
- Yes, all equations before Equation (3) do not require Gaussianity and are presented in a general form; please refer to the response to Q9 as well. Gaussianity is assumed only after stating, "When $N_{i} \sim \mathcal{N}(0,\sigma_i^2)$...".
- Although Equations (10) to (12) are indeed based on Gaussianity, their application is not strictly limited to it, as long as $p(n_j)$ belongs to some exponential family or includes a polynomial function. This condition ensures that $\text{PA}_i, \text{CH}_i$, and $\text{SP}_i$ remain in the score function, allowing for the derivation of the Jacobian of the score function. Please also refer to the answer to Q9. However, if Gaussianity is not assumed, the procedure for obtaining the DAG may differ from our work, and this aspect goes beyond the scope of our study. Exploring this possibility in a more general form could be considered as a future direction.
**[Q17&Q18. Comparison to PC algorithm.]** Q17 and Q18 both relate to the final partially oriented graph obtained in our work and involve a comparison with the PC algorithm, prompting us to address them collectively. Yes, Lemma 1, used to compute the moral graph, corresponds to the step of determining adjacencies in the PC algorithm. Lemma 5 utilizes the Jacobian of the score function to infer V-structures, equivalent to the orientation rule in the PC algorithm. However, Lemma 2 enhances our ability to ascertain directions, even in non-collider scenarios.
To illustrate this enhancement, consider a simple 3-node example: a chain $i \rightarrow j \rightarrow k$. Consider the corresponding Structural Causal Model (SCM) in which $f_j$ is a nonlinear function of $X_i$ and $f_k$ is a linear function of $X_j$. Lemma 2 allows us to recover the true underlying DAG in this case. However, the PC algorithm can only recover up to the Markov Equivalence class, resulting in $i - j - k$. This underscores the additional inferential power provided by our approach. Importantly, our method ensures that the set of causal structures compatible with our final partially oriented graph is a subset of the causal structures within the Markov equivalence class.
Furthermore, our approach exhibits superior scalability as the number of nodes increases. Constraint-based methods, such as PC [13], fast causal inference (FCI) [14], and SGS [15], assess the conditional independence among variables and seek graph structures that satisfy these conditions under a faithfulness assumption. However, the main bottleneck in these methods lies in the challenging nature of conditional independence testing [16]. On the other hand, score-based techniques involve defining a suitable score function and searching for the graph that best fits the data within an extensive graph space. Greedy approaches like greedy equivalence search (GES) [17,18] are employed for this exploration, but their scalability is hampered by the super-exponential growth of the space with the number of nodes.
While an adjustment to the current manuscript version is not feasible at this time, the upcoming revision will incorporate this update.
**References**
[1] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan kaufmann, 1988.
[2] Jean-Philippe Pellet and André Elisseeff. Finding latent causes in causal networks: an efficient approach based on markov blankets. Advances in Neural Information Processing Systems, 21, 2008.
[3] Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D Koutsoukos. Local causal and markov blanket induction for causal discovery and feature selection for classification part i: algorithms and empirical evaluation. Journal of Machine Learning Research, 11(1), 2010.
[4] Francesco Montagna, Nicoletta Noceti, Lorenzo Rosasco, Kun Zhang, and Francesco Locatello. Causal discovery with score matching on additive models with arbitrary noise. arXiv preprint arXiv:2304.03265, 2023a.
[5] Ioannis Tsamardinos, Constantin F Aliferis, Alexander R Statnikov, and Er Statnikov. Algorithms for large scale markov blanket discovery. In FLAIRS conference, volume 2, pages 376–380. St. Augustine, FL, 2003.
[6] Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. The max-min hill-climbing
bayesian network structure learning algorithm. Machine learning, 65:31–78, 2006.
[7] Constantin F Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. Hiton: a novel markov blanket algorithm for optimal variable selection. In AMIA annual symposium proceedings, volume 2003, page 21. American Medical Informatics Association, 2003.
[8] Roland Nilsson, José M Pena, Johan Björkegren, and Jesper Tegnér. Consistent feature selection for pattern recognition in polynomial time. The Journal of Machine Learning Research, 8: 589–612, 2007.
[9] Peter Bühlmann, Jonas Peters, and Jan Ernest. Cam: Causal additive models, high-dimensional order search and penalized regression. 2014.
[10] Francesco Montagna, Nicoletta Noceti, Lorenzo Rosasco, Kun Zhang, and Francesco Locatello. Scalable causal discovery with score matching. arXiv preprint arXiv:2304.03382, 2023b.
[11] Paul Rolland, Volkan Cevher, Matthäus Kleindessner, Chris Russell, Dominik Janzing, Bernhard Schölkopf, and Francesco Locatello. Score matching enables causal discovery of nonlinear additive noise models. In International Conference on Machine Learning, pages 18741–18753. PMLR, 2022.
[12] Kun Zhang and Aapo Hyvarinen. On the identifiability of the post-nonlinear causal model. arXiv preprint arXiv:1205.2599, 2012.
[13] Peter Spirtes and Clark Glymour. An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9(1):62–72, 1991.
[14] Peter Spirtes. An anytime algorithm for causal inference. In International Workshop on Artificial Intelligence and Statistics, pages 278–285. PMLR, 2001.
[15] Peter Spirtes, Clark N Glymour, and Richard Scheines. Causation, prediction, and search. MIT press, 2000.
[16] Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. 2020.
[17] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of machine learning research, 3(Nov):507–554, 2002.
[18] Biwei Huang, Kun Zhang, Yizhu Lin, Bernhard Schölkopf, and Clark Glymour. Generalized score functions for causal discovery. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 1551–1560, 2018.
**Follow up question**:
Q6: I understand your answer, but as you also indicate in your reply to Q1, faithfulness as standardly conceived, establishes an if and only if relation between the graph and conditional independencies. My point is that in general you can have cases in which there is a dependence but it does not reflect in a non-null score.
Just to make an analogy, you may have some method of causal learning that relies on linear dependencies, and the faithfulness assumption for conditional independencies does not guarantee that the existence of a dependence will be manifested in a linear one. It can be that the dependence occurs only in higher-order terms.
Same thing in your case, intuitively it seems to me that in general the score does not necessarily have to be sensitive to all sorts of dependence, you may have a case in which a dependence exists as will be captured by a nonnegative mutual information, and yet the score is insensitive to that particular order and structure of dependencies. Therefore, invoking faithfulness at the level of conditional independencies will not be enough to guarantee faithfulness at the level of scores.
But maybe you are saying that with Gaussian noises it is enough? This is not intuitive to me, because even if the noises are Gaussian the functions are allowed to be nonlinear, so you have higher-order dependencies involved, not everything depends on the structure of the joint covariance matrix.
It seems to me that you need to invoke a more stringent sort of faithfulness specific of the scores, or alternatively some assumption on modularity of the sort I was mentioning so that the parameters of different functional equation cannot be tuned in order to cancel scores.
As I was saying in Q6c the same issue likely requires some sort of additional faithfulness assumption in some if not all lemmas 2-5.
I also checked all the comments of all other Reviewers and your answers and I am glad to see that you gave insightful answers to those questions of others Reviewers that retrospectively I see were not clear to me either.
**ANSWER**
Thank you so much for your thoughtful follow-up question and for dedicating time to review the provided answers. We sincerely appreciate your positive feedback and are pleased to learn that you found the answers helpful.
> Relationship between Conditional Independencies and Score Function
The assumption of faithfulness is invoked between conditional independencies and the Markov blanket as discussed in **The unicity of MB under faithfulness assumption** of the answer in Q6, rather than directly associating conditional independencies with the score function.
Our findings on the relationship between conditional independence and the score function align with prior results in the context of Markov networks [1,2]. Given a collection of random variables $X = (X_1, X_2, \cdots, X_d)$ with joint density $p(x)$, the information about conditional independencies among these variables of $X$ can be embedded in a simple undirected Markov network, denoted as $\mathcal{G}^m = (\mathbf{V}, \mathbf{E})$. Here, edges $(i,j)$ encode some form of probabilistic interaction between the pairs of random variables $X_i$ and $X_j$. In particular, Spantini et al. [1] demonstrated how to construct a Markov graph by interpreting the conditional independence of pairs of random variables as follows:
$X_i \perp \!\!\! \perp X_j | X_{\mathbf{V}\backslash\{i,j\}} \Longleftrightarrow \frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}=0$
where $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}$ denotes the $ij-$th entry of the Jacobian of the score. By adding edges between nodes that do not satisfy this equation, we obtain an undirected graph encoding all and only the existing conditional independencies between the variables of $X$.
Here, we present the necessary and sufficient conditions for $X_i \perp \!\!\! \perp X_j |
X_{\mathbf{V}\backslash\{i,j\}}$. By definition, if $X_i$ is conditionally independent of $X_j$ given all remaining variables $X_{\mathbf{V}\backslash\{i,j\}}$, we can factor the probability density function (PDF) $p(x)$ as follows
$p(x) = p(x_i|x_{\mathbf{V}\backslash\{i,j\}})p(x_j|x_{\mathbf{V}\backslash\{i,j\}})p(x_{\mathbf{V}\backslash\{i,j\}}).$
Together with the assumption that $p$ has continuous derivatives up to second order w.r.t. the Lebesgue measure, we have
$\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}=0.$
Conversely, the solution of the right-hand side is given by $\log p(x) = g(x_{1:i-1}, x_{i+1:d}) + h(x_{1:j-1}, x_{j+1:d})$ for some functions $g,h: \mathbb{R}^{d-1}\rightarrow\mathbb{R}$. Thus, $X_i \perp \!\!\! \perp X_j | X_{\mathbf{V}\backslash\{i,j\}}$. In other words, as long as the log-likelihood function follows this form, the conditional independencies can be inferred from the score function. Additionally, Zheng et al. [3] leverage this condition to characterize conditional independence structures in general distributions, covering various data types (continuous, discrete, and mixed-type). The additive noise model with Gaussian noise considered in our study can be regarded as a specific instance within the continuous domain. Importantly, this is unrelated to the faithfulness assumption at this point.
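A small symbolic sketch of this equivalence on an illustrative three-node chain $X_1 \rightarrow X_2 \rightarrow X_3$ (the mechanisms are assumptions): the cross-derivative of the log-density vanishes exactly for the conditionally independent pair and not for the dependent one:

```python
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3', real=True)
# Illustrative chain X1 -> X2 -> X3 with Gaussian noise (functional forms are assumptions):
logp = -x1**2/2 - (x2 - x1**3)**2/2 - (x3 - sp.sin(x2))**2/2

print(sp.simplify(sp.diff(logp, x1, x3)))   # 0        -> X1 independent of X3 given X2
print(sp.simplify(sp.diff(logp, x1, x2)))   # 3*x1**2  -> X1 and X2 remain dependent
```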
However, when discussing the relation between conditional independencies and the Markov blanket, faithfulness is crucial for the unicity of the Markov Blanket, as demonstrated by several studies [4,5,6,7]. Therefore, the faithfulness assumption adopted here aligns with the previous studies and is the standard one as stated in the answer to Q1.
(Faithfulness). $\mathcal{G}$ and $P$ satisfy the Faithfulness Condition if and only if every conditional independence relation true in $P$ is entailed by the Causal Markov Condition applied to $\mathcal{G}$.
In essence, the faithfulness assumption ensures the connection between the conditional independencies and the Markov blanket, rather than establishing a guarantee between conditional independencies and the score function. The proof of the lemma will be revised to explicitly showcase the role of the faithfulness assumption in the updated manuscript. Your suggestions have been invaluable in clarifying some confusing representations and further enhancing the quality of our paper.
**References**
[1] Alessio Spantini, Daniele Bigoni, and Youssef Marzouk. Inference via low-dimensional couplings. The Journal of Machine Learning Research, 19(1):2639–2709, 2018.
[2] Rebecca Morrison, Ricardo Baptista, and Youssef Marzouk. Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. Advances in Neural Information Processing Systems, 30, 2017.
[3] Yujia Zheng, Ignavier Ng, Yewen Fan, and Kun Zhang. Generalized precision matrix for scalable estimation of nonparametric Markov networks. arXiv preprint arXiv:2305.11379, 2023.
[4] Ioannis Tsamardinos, Constantin F. Aliferis, Alexander R. Statnikov, and Er Statnikov. Algorithms for large scale Markov blanket discovery. In FLAIRS Conference, volume 2, pages 376–380, St. Augustine, FL, 2003.
[5] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65:31–78, 2006.
[6] Constantin F. Aliferis, Ioannis Tsamardinos, and Alexander Statnikov. HITON: a novel Markov blanket algorithm for optimal variable selection. In AMIA Annual Symposium Proceedings, volume 2003, page 21. American Medical Informatics Association, 2003.
[7] Roland Nilsson, José M. Pena, Johan Björkegren, and Jesper Tegnér. Consistent feature selection for pattern recognition in polynomial time. The Journal of Machine Learning Research, 8:589–612, 2007.
**ANSWER (Final version)**
Thank you so much for your thoughtful follow-up question; it contributes significantly to improving the clarity of our paper and preventing potential misunderstandings for future readers. We also sincerely appreciate the time you dedicated to reviewing the provided answers.
**[Key point: There is a dependence but it does not reflect in a non-null score.]**
*Indeed, for any pair of variables, a dependence between them leads to a non-null score entry, irrespective of whether the dependence is in linear or higher-order terms.* This claim is supported by $X_i \perp \!\!\! \perp X_j | X_{\mathbf{V}\backslash\{i,j\}} \Longleftrightarrow \frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}=0$, where $X_{\mathbf{V}\backslash\{i,j\}}$ comprises all variables other than $X_i$ and $X_j$. This equivalence has been well explored in prior research [1,2,3] (see Lemma 4.1 in [1]), although two additional assumptions are required: (1) the probability measure is not restricted to specific families but only needs to be strictly positive; and (2) the density has continuous derivatives up to second order w.r.t. the Lebesgue measure. The proof is elaborated in the subsequent part.
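To make this concrete, consider a toy example of our own (not taken from the paper) in which the dependence of $X_j$ on $X_i$ is purely quadratic: $X_j = X_i^2 + N_j$ with $N_j \sim \mathcal{N}(0, \sigma_j^2)$ independent of $X_i$. Then
$\log p(x_i, x_j) = \log p(x_i) - \frac{(x_j - x_i^2)^2}{2\sigma_j^2} + \text{const}, \qquad \frac{\partial^2 \log p(x_i, x_j)}{\partial x_i \partial x_j} = \frac{2 x_i}{\sigma_j^2},$
which is non-zero almost everywhere, so the dependence is reflected in the Jacobian of the score even though it contains no linear term.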
The relation between conditional independence and the Jacobian of the score function in our work relies fundamentally on this statement: all dependencies and (conditional) independencies encoded in the data distribution can be identified from the Jacobian of the score function.
We believe that your confusion largely arises from our oversight in elaborating this condition and clarifying the additional assumptions in the previous manuscript. To enhance the quality of our paper, we will provide detailed explanations of these aspects in the revised version.
<!-- Notice that the key emphasis in utilizing the Jacobian of the score function is its capacity to recover the inherent properties of the data distribution, saying recovering the dependencies and (conditional) independencies present in the data. -->
**[External discussion: Invoking faithfulness at the level of conditional independencies will not be enough to guarantee faithfulness at the level of scores.]**
As you mentioned, the faithfulness assumption establishes an if-and-only-if relation between the graph $\mathcal{G}$ and the conditional independencies in the data distribution $P$: every conditional independence relation true in $P$ is entailed by the Causal Markov Condition applied to $\mathcal{G}$.
We first illustrate the connections between our method and causal graph learning in the following flowchart.
The Jacobian of the score function $\xrightarrow{\text{identify}}$ the properties of the data distribution $\xrightarrow{\text{recover}}$ the causal relationships.
<!-- As outlined in the response to point 1, using our score method can identify all dependencies and (conditional) independencies in the data distribution. Therefore, there is no need to introduce further assumption. -->
In this paper, the Jacobian of the score function is used to construct the moral graph and remove the spouse edges, which is dedicated to **identifying the characteristics of the data distribution, specifically the conditional independencies**, as outlined in the response to point 1. Our emphasis is therefore on the first stage.
The faithfulness assumption is employed in the second stage to **constrain the relation between the causal graph and the data distribution**, ensuring the identifiability of the graph. In instances where the data distribution violates the faithfulness assumption, all methods reliant on it, such as constraint-based approaches, would fail to identify the true graph; this limitation is not exclusive to our approach.
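To make the two-stage view explicit, the sketch below (an illustration of our own with hypothetical helper names and a toy threshold, not the implementation in our paper) shows how stage one reads the moral graph off the Jacobian of the score, reusing the exact score Jacobian $-\Sigma^{-1}$ of a linear-Gaussian chain $X_1 \rightarrow X_2 \rightarrow X_3$; stage two, where faithfulness enters, is only summarized in the trailing comment.

```python
import numpy as np

def moral_graph_from_score_jacobian(score_jacobian: np.ndarray, tol: float = 1e-2) -> np.ndarray:
    """Stage 1: join X_i and X_j whenever the (i, j) score-Jacobian entry is non-zero,
    i.e. whenever X_i and X_j are dependent given all remaining variables."""
    adj = np.abs(score_jacobian) > tol
    np.fill_diagonal(adj, False)
    return adj | adj.T  # undirected moral graph / Markov network

# Toy input: covariance of the chain X1 -> X2 -> X3 with unit noise variances,
# so the exact Jacobian of the score is -Sigma^{-1}.
Sigma = np.array([[1.00, 0.80, 0.40],
                  [0.80, 1.64, 0.82],
                  [0.40, 0.82, 1.41]])
print(moral_graph_from_score_jacobian(-np.linalg.inv(Sigma)))

# Stage 2 (not sketched): under faithfulness, this moral graph is the moralization of
# the true DAG, so pruning spouse edges and orienting the remainder recovers the causal
# graph; violations of faithfulness affect this stage, not the score computation above.
```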
<!-- In summary, the primary purpose of the score function lies in recovering the characteristics of the data distribution, not in guaranteeing the faithfulness to the causal graph. -->
<!-- It is important to note that our contribution does not establish novel connections between the data distribution and the causal graph but utilizing the computation of the Jacobian to recover the characteristics of the data distribution. -->
Thanks again for this thoughtful question. We will incorporate the above discussion into the final version of our paper.
**[Proof of the necessary and sufficient conditions.]**
Our findings on the relationship between conditional independence and the score function align with prior results on Markov networks [1,2,3]. Given a collection of random variables $X = (X_1, X_2, \cdots, X_d)$ with joint density $p(x)$, the conditional independencies among the variables of $X$ can be encoded in a simple undirected Markov network, denoted as $\mathcal{G}^m = (\mathbf{V}, \mathbf{E})$, where an edge $(i,j)$ encodes a probabilistic interaction between the pair of random variables $X_i$ and $X_j$. In particular, Spantini et al. [1] demonstrated how to construct such a Markov graph by characterizing the conditional independence of pairs of random variables as follows:
$X_i \perp \!\!\! \perp X_j | X_{\mathbf{V}\backslash\{i,j\}} \Longleftrightarrow \frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}=0,$
where $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}$ denotes the $(i,j)$-th entry of the Jacobian of the score. By adding an edge between every pair of nodes that does not satisfy this equation, we obtain an undirected graph encoding all and only the existing conditional independencies between the variables of $X$.
Here, we present the necessary and sufficient conditions for $X_i \perp \!\!\! \perp X_j | X_{\mathbf{V}\backslash\{i,j\}}$. By definition, if $X_i$ is conditionally independent of $X_j$ given all remaining variables $X_{\mathbf{V}\backslash\{i,j\}}$, we can factor the probability density function (PDF) $p(x)$ as follows
$p(x) = p(x_i|x_{\mathbf{V}\backslash\{i,j\}})p(x_j|x_{\mathbf{V}\backslash\{i,j\}})p(x_{\mathbf{V}\backslash\{i,j\}}).$
Together with the assumption that $p$ has continuous derivatives up to second order w.r.t. the Lebesgue measure, we have
$\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}=0.$
Conversely, if $\frac{\partial^2 \log p(x)}{\partial x_i \partial x_j}=0$ holds everywhere, then the general solution is $\log p(x) = g(x_{1:i-1}, x_{i+1:d}) + h(x_{1:j-1}, x_{j+1:d})$ for some functions $g,h: \mathbb{R}^{d-1}\rightarrow\mathbb{R}$, and thus $X_i \perp \!\!\! \perp X_j | X_{\mathbf{V}\backslash\{i,j\}}$. In other words, as long as the log-density takes this form, the conditional independencies can be inferred from the score function. Additionally, Zheng et al. [3] leverage this condition to characterize conditional independence structures in general distributions, covering various data types (continuous, discrete, and mixed-type). The additive noise model with Gaussian noise considered in our study can be regarded as a specific instance within the continuous domain.
**References**
[1] Alessio Spantini, Daniele Bigoni, and Youssef Marzouk. Inference via low-dimensional couplings. The Journal of Machine Learning Research, 19(1):2639–2709, 2018.
[2] Rebecca Morrison, Ricardo Baptista, and Youssef Marzouk. Beyond normality: Learning sparse probabilistic graphical models in the non-Gaussian setting. Advances in Neural Information Processing Systems, 30, 2017.
[3] Yujia Zheng, Ignavier Ng, Yewen Fan, and Kun Zhang. Generalized precision matrix for scalable estimation of nonparametric markov networks. arXiv preprint arXiv:2305.11379, 2023.