# Edge Correction in Bayesian Networks With An Expert Prior
## Definitions
Refs: [Introduction to Graphical Models (Murphy, 2001)](https://www.cs.ubc.ca/~murphyk/Papers/intro_gm.pdf)
A **probabilistic graphical model** is a graph in which nodes represent random variables and edges represent probabilistic dependencies between them.
A **Bayesian network** or **directed graphical model** (Pearl, 1988) is a probabilistic graphical model whose edges are directed and encode conditional independence relationships between nodes. Each node must specify a **conditional probability distribution (CPD)** $p(x_i | \textbf{x}_{\pi(i)})$ relating node $x_i$ to its parents $\pi(i)$.
To specify a Bayesian network we need:
1. graph topology or DAG (structure)
2. parameters of each CPD (nodes)
## Problem Statement
Suppose a domain expert gives us the true graph topology together with a distribution for each edge. We aim to:
1) correctly identify misspecified edges
2) make the minimal number of changes to correct the graph <mark> (what does it mean to correct the edge? do we want the best fitting minimal representation?) </mark>
**Full Observability:** In the simple case where all nodes are observed, the problem reduces to fitting a probabilistic regression model on each edge, using the conditional independence relationships of the graph, and checking the residuals produced. Edges whose fitted models have large residuals are a natural indicator of misspecification.
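As a minimal sketch of this residual check (the chain, weights, and noise scale below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative fully observed chain X -> Y -> Z; the Y -> Z edge is quadratic.
n = 500
X = rng.normal(size=n)
Y = 2.0 * X + 1.0 + rng.normal(size=n)       # truly linear edge
Z = 0.5 * Y**2 + 3.0 + rng.normal(size=n)    # truly quadratic edge

def linear_residual_variance(parent, child):
    """Fit child = a*parent + b by least squares and return the residual variance."""
    A = np.column_stack([parent, np.ones_like(parent)])
    coef, *_ = np.linalg.lstsq(A, child, rcond=None)
    return np.var(child - A @ coef)

# Residual variance near the noise variance (1.0) suggests a well-specified
# edge; residual variance far above it flags a candidate misspecified edge.
print(linear_residual_variance(X, Y))  # close to 1
print(linear_residual_variance(Y, Z))  # much larger than 1
```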
**Partial Observability:** When some nodes are latent/unobserved, we can instead use methods such as Monte Carlo sampling or the EM algorithm to estimate the parameters of the joint probability distribution (the whole graph). However, *we can then only identify misspecified paths of edges, not individual edges*.
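To see why, consider an all-linear chain $X \rightarrow Y \rightarrow Z$ with $Y$ latent (using the weights of the example below). Marginalizing out $Y$ gives
$$
Z = w_3(w_1X + w_0 + \epsilon_1) + w_2 + \epsilon_2 = (w_3w_1)X + (w_3w_0 + w_2) + (w_3\epsilon_1 + \epsilon_2),
$$
so the observed pair $(X, Z)$ constrains only the products $w_3w_1$ and $w_3w_0 + w_2$: a poor fit can be attributed to the path $X \rightarrow Z$ as a whole, but not to either edge individually.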
## A Simple Example
We have a Bayesian network whose true graph is a DAG. We assume only partial observability of nodes and aim to learn the parameters of each CPD.
Suppose we have the Markov chain $X \rightarrow Y \rightarrow Z$ as the ground truth model, with the following true CPDs for each node:
* $f_{Y} = w_1X + w_0 + \epsilon$
* $f_{Z} = w_3Y^2 + w_2 + \epsilon$
where $\epsilon \sim \mathcal{N}(0, 1)$.
In this experiment, $Y$ is unobserved.
*The expert gives us the same graph but with $f_Z$ misspecified as linear.*
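A minimal simulation of this setup (the weight values are assumptions chosen for illustration; only $X$ and $Z$ would be visible to the learner):

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed ground-truth weights (not specified above; chosen for illustration).
w0, w1 = 1.0, 2.0   # f_Y = w1*X + w0 + eps
w2, w3 = 3.0, 0.5   # f_Z = w3*Y**2 + w2 + eps

n = 1000
X = rng.normal(size=n)
Y = w1 * X + w0 + rng.normal(size=n)      # latent: never shown to the learner
Z = w3 * Y**2 + w2 + rng.normal(size=n)   # true quadratic edge

# The learner receives only (X, Z) plus the expert's (misspecified) claim
# that f_Z is linear in Y.
data = np.column_stack([X, Z])
```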
## A Naive GP Idea
For each edge $e \in E$ of the graph $G = (V, E)$, we assign the following model:
$$
y_e := x_e^T\beta + GP(0, K_{\theta}(\cdot, \cdot))
$$
where $GP(0, K_{\theta}(\cdot, \cdot))$ denotes a 0 mean Gaussian Process (GP) with kernel function $K$ and hyperparameters $\theta$, and $x_e, y_e$ are the input and output of edge $e$.
We know that by definition of the GP, $y_e \sim \mathcal{N}(x_e^T\beta, K_{\theta} + \sigma_nI)$ where $\sigma_n$ is the noise variance. By definition of the pdf of a multivariate normal,
$$
\log p(y_e | x_e, \beta, \theta) = -\frac{1}{2}(y_e - x_e^T\beta)^T[K_{\theta} + \sigma_nI]^{-1}(y_e - x_e^T\beta) - \frac{1}{2}\log \det(K_{\theta} + \sigma_nI) - \frac{n}{2}\log 2\pi
$$
We want to jointly optimize $\theta, \beta$ and compute:
$$
\theta^*, \beta^* = \arg \max_{\theta, \beta} \log p(y_e | x_e, \beta, \theta)
$$
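Below is a sketch of this optimization for a single edge, assuming an RBF kernel $K_\theta(x, x') = \sigma_f^2\exp(-(x - x')^2/2\ell^2)$, a one-dimensional input, and log-parameterized hyperparameters; the kernel choice and `scipy` optimizer are assumptions, not part of the method above.

```python
import numpy as np
from scipy.optimize import minimize

def nll(params, x, y):
    """Negative log marginal likelihood of y ~ N(x*beta, K_theta + sigma_n*I)."""
    beta, log_sf, log_ell, log_sn = params
    sf2, ell, sn = np.exp(2 * log_sf), np.exp(log_ell), np.exp(2 * log_sn)
    sq_dists = (x[:, None] - x[None, :]) ** 2
    K = sf2 * np.exp(-0.5 * sq_dists / ell**2) + sn * np.eye(len(x))
    r = y - beta * x                           # residual of the linear mean
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return 0.5 * r @ alpha + np.log(np.diag(L)).sum() + 0.5 * len(x) * np.log(2 * np.pi)

# Toy edge data: quadratic response, so the linear mean alone must misfit.
rng = np.random.default_rng(0)
x_e = np.linspace(-3, 3, 100)
y_e = 0.5 * x_e**2 + rng.normal(size=100)

res = minimize(nll, x0=np.zeros(4), args=(x_e, y_e), method="L-BFGS-B")
print("fitted sigma_f:", np.exp(res.x[1]))  # expected to be large here
```

Since the data here are quadratic in $x_e$, the GP component must absorb the misfit of the linear mean, so we would expect a large fitted $\sigma_f$, which is exactly the signal the conjecture below relies on.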
**Conjecture: We believe that if an edge is misspecified, then when optimizing the data log likelihood we will see a large value for the signal variance hyperparameter $\sigma_f$. If an edge is not misspecified (i.e., the linear component is robust enough to fit the data with small residuals), then $\sigma_f$ must be small or go to 0.**
<!-- ## A Naive Idea
Can we approximate the unobserved variable Y to get residuals? -->