# Document Level T2G (Big-G)
## Introduction
Document-level knowledge graph construction aims to extract a complete KG from a long document. The main challenges in this task are long-range dependencies between entities and cross-sentence reasoning, which requires multi-hop reasoning. The following example is taken from the DocRED dataset.

### The Autoregressive approach
We approach this problem with an autoregressive graph generation framework, in which the model builds the KG step by step. At each step, the model reads a sentence and updates the KG, so the KG grows as the model reads the sentences one by one.
First we define some notations:
| Notation | Description |
|--------|---------|
|$X$|The document|
|$X_t$|The $t$-th sentence/text span of the document |
|$T$|Number of sentences/text spans in $X$|
|$G$|The knowledge graph extracted from $X$|
|$G_t$|The graph constructed up to the $t$-th step|
|$\mathbf{G_t}$|Hidden state of $G_t$|
|$V_t$|Nodes in $G_t$|
|$E_t$|New entities in $X_t$. An entity is new if it is mentioned for the first time in $X_t$|
|$M_t$|New mentions in $X_t$ of entities that have already been mentioned in $X_{<t}$|
|$\Delta G_t$|Increments of the graph at the $t$-th step. The increments may be: 1) new entities $E_t$; 2) new relations between subject $s$ and object $o$, for $s,o\in E_t \cup V_{t-1}, s \neq o$|
|$A_t$|Adjacency matrix of $G_t$|
|$\Delta A_t$|Adjacency matrix of $\Delta G_t$|
We define the autoregressive model (under a Markov assumption) as follows:
$$ \begin{align} P(G|X) &= \prod_{t=1}^T P(G_t|X_{\leq t}) \tag{1} \\
&= \prod_{t=1}^{T} P(G_t|G_{<t};X_{\leq t}) \tag{2} \\
&= \prod_{t=1}^{T} P(G_t|G_{t-1};X_{\leq t}) \tag{3}
\end{align}
$$
and $G_0$ is an empty graph.
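To make the stepwise factorization concrete, here is a minimal sketch of the training/decoding loop it implies. The names `GraphState` and `step_model.log_prob` are hypothetical placeholders, not part of any existing implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GraphState:
    """G_t: the graph constructed up to step t."""
    nodes: set = field(default_factory=set)   # V_t
    edges: set = field(default_factory=set)   # (s, r, o) triples

def document_log_prob(sentences, gold_graphs, step_model):
    """log P(G|X) = sum_t log P(G_t | G_{t-1}, X_{<=t}), i.e. Eq. (3).

    `step_model.log_prob(g_t, g_prev, context)` is a hypothetical interface
    scoring the step-t graph given the previous graph and the sentences read so far.
    """
    g_prev = GraphState()                     # G_0 is the empty graph
    context, total = [], 0.0
    for x_t, g_t in zip(sentences, gold_graphs):
        context.append(x_t)                   # X_{<=t}
        total += step_model.log_prob(g_t, g_prev, context)
        g_prev = g_t                          # Markov: only G_{t-1} is carried forward
    return total
```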
==**Assumption 1**== Given $G_{t-1}$, $\Delta G_t$ does not depend on $X_{<t}$; in other words, the update at step $t$ needs only the previous graph and the current sentence. Then:
$$P(G|X)=\prod_{t=1}^T P(G_t|G_{t-1};X_t) \tag{4}$$
Exceptions:
- Example: "X is a friend of Y's brother. Y's brother is Z." The first sentence does not imply any relation. The second sentence by itself implies the relation (Z, brother_of, Y). But the two sentences together imply a new relation (X, friend_of, Z).
- When the necessary entities or relations are not extracted. For example, "X works at company Y. Z works at the same company." The second sentence implies the relation (X, colleague_of, Z). But if the entity "Y" or the relation (X, work_at, Y) is not labeled, the extraction must depend on the first sentence.
> - This assumption holds when all entities and relations defined in the ontology are labeled, including all coreferences in the document.
> - If this assumption holds, the model can first extract single-hop relations, then do coreference resolution to construct a simple KG, and finally predict multi-hop relations on this KG (see the sketch below).
> - **TBD**: If the relations are extracted sentence by sentence, we may get conflicting facts. How do we define the true facts?
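To illustrate the last point, here is a minimal sketch of predicting a multi-hop relation on a single-hop KG by composing edges; the rule `father_of ∘ father_of → grandfather_of` is only an illustrative example, not part of any particular ontology.

```python
def compose_two_hop(edges, rule):
    """Apply a two-hop composition rule ((r1, r2), r3) to a set of (s, r, o) triples."""
    (r1, r2), r3 = rule
    inferred = set()
    for s, r, mid in edges:
        if r != r1:
            continue
        for mid2, r2_, o in edges:
            if r2_ == r2 and mid2 == mid and s != o:
                inferred.add((s, r3, o))
    return inferred

# Single-hop facts extracted sentence by sentence:
kg = {("X", "father_of", "Y"), ("Y", "father_of", "Z")}
print(compose_two_hop(kg, (("father_of", "father_of"), "grandfather_of")))
# {('X', 'grandfather_of', 'Z')}
```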
==**Assumption 2**== $G_t$ does not retrospectively modify $G_{t-1}$. In other words, nodes and edges are immutable once created. Then,
$$P(G|X)=\prod_{t=1}^T P(\Delta G_t|G_{t-1}; X_{\leq t}) \tag{5}$$
$$G_t = G_{t-1} \cup \Delta G_t \tag{6}$$
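A minimal sketch of this union update, reusing the hypothetical `GraphState` defined above:

```python
def apply_increment(g_prev, new_entities, new_triples):
    """G_t = G_{t-1} ∪ ΔG_t: nodes and edges are only added, never modified."""
    g_t = GraphState(nodes=set(g_prev.nodes), edges=set(g_prev.edges))
    g_t.nodes |= set(new_entities)            # E_t
    for s, r, o in new_triples:               # new relations in ΔG_t
        assert s in g_t.nodes and o in g_t.nodes and s != o
        g_t.edges.add((s, r, o))
    return g_t
```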
Exceptions:
- Remove nodes: TBD
- Remove edges: "X got a job at company Y. But he quit this job a year later."
> **TBD**: conflicts with ==**Assumption 1**==
==**Assumption 3**== The entities are given at each step, then:
$$P(G|X)=\prod_{t=1}^T P(A_t|G_{t-1};X_{\leq t}) \tag{7}$$
Discussion: this assumes that entity detection has already been done and its output is given as input; the eventual system will need to perform detection as well. It also requires coreference resolution, e.g., "Bob is married to Alice. He likes hunting."
==**Assumption 4**== New relations in $\Delta G_t$ all involve new entities or new mentions in $X_t$:
$$ \forall (s, r, o) \in \Delta G_t : \; s \in E_t\cup M_t \;\lor\; o \in E_t\cup M_t $$
> Relations that have $X_t$ as their last evidence all involve entities in $E_t$ or $M_t$. Example: "X is the father of Y. Y is the father of Z." There is a multi-hop relation "X is the grandfather of Z." Another example: "A went to X for school... B is a classmate of A." Then "B was schooled at X" is an inferred, multi-hop relation.
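In practice, this assumption shrinks the set of candidate pairs scored at step $t$: at least one side of every candidate pair must be in $E_t \cup M_t$. A small sketch of the candidate enumeration (a hypothetical helper, not tied to any specific model):

```python
from itertools import permutations

def candidate_pairs(prev_nodes, new_entities, new_mentions):
    """Ordered pairs (s, o) that ΔG_t may contain under Assumption 4."""
    touched = set(new_entities) | set(new_mentions)   # E_t ∪ M_t
    all_nodes = set(prev_nodes) | set(new_entities)   # V_{t-1} ∪ E_t
    return [(s, o) for s, o in permutations(all_nodes, 2)
            if s in touched or o in touched]
```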
## Preliminary stepwise graph construction
We first do stepwise graph construction to verify the framework.
We run experiments on the DocRED dataset, in which ==**Assumption 2, 3, 4**== hold. We therefore compare the results under the following settings (a sketch of how the step-wise labels are built follows the list):
1. $P(A_t|X_{\leq t})$
Refer to DocRED paper.
2. $P(\Delta A_t|X_{\leq t})$
Refer to DocRED paper.
   - Under this setting, there are more negative labels than in the first setting, so it could be harder to train.
   - The input is $X_{\leq t}$, but the output is the increment of the graph, so I don't think this will work on its own. We may need to add additional inputs to the model.
3. $P(A_t|G_{t-1};X_{\leq t})$
4. $P(\Delta A_t|G_{t-1};X_{\leq t})$
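The four settings differ only in whether the step-$t$ label matrix is cumulative ($A_t$) or incremental ($\Delta A_t$) and whether the encoder additionally receives $G_{t-1}$. Below is a sketch of one way to build the step-wise labels from DocRED-style annotations, where a relation is assigned to the step of its last evidence; the field names (`h`, `t`, `r`, `evidence`) follow DocRED's format, and the function itself is an assumed preprocessing choice, not a fixed recipe.

```python
def step_labels(relations, t, incremental):
    """Positive cells of A_t (incremental=False) or ΔA_t (incremental=True).

    `relations` is a list of DocRED-style dicts {"h": head_idx, "t": tail_idx,
    "r": relation, "evidence": [sentence ids]}; the *last evidence* decides
    the step at which a relation becomes a positive label.
    """
    labels = {}                                  # (head, tail) -> set of relations
    for rel in relations:
        if not rel["evidence"]:
            continue
        last = max(rel["evidence"])
        keep = (last == t) if incremental else (last <= t)
        if keep:
            labels.setdefault((rel["h"], rel["t"]), set()).add(rel["r"])
    return labels
```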
Results:
|Setting|Step-wise F1 on dev|
|---|----|
|$P(A_t\|X_{\leq t})$||
|$P(\Delta A_t\|X_{\leq t})$||
|$P(A_t\|G_{t-1};X_{\leq t})$||
|$P(\Delta A_t\|G_{t-1};X_{\leq t})$||
<!--
## Method
### Symbols:
- Document $D = (S_1, \cdots, S_t, \cdots, S_n)$, where $S_t$ is the $t$-th sentence;
- Each sentence $S_t$ contains $m_t$ entities $E_t = (e_{t}^{(1)}, \cdots, e_{t}^{(m_t)})$. Note that each entity may have multiple mentions.
### Input and output at $t$-th step
- Inputs
- $G_{t-1}$: graph representation constructed with $S_{<t}$. $n_i$ is the $i$-th node in $G_{t-1}$
- $S_t$
- $E_t$: may have overlap with the entities in $G_{t-1}$, i.e. $E_{<t}$.
- Output
- New relations supported by $S_t$ as the *last evidence*, represented with an adjacency matrix $A_t$
- Requirements
- Entities and their mentions are extracted beforehand.
- Each relation is annotated with supporting evidences; supporting evidences are the sentences that support the relation.
### The Basic Algorithm
Inputs at $t$-th step:
- $G_{t-1}=(v_1^{t-1},\dots,v_n^{t-1})$, $v_i^{t-1}\in V^{t-1}$
- $S_1,\dots, S_{t-1}$
- $S_t$, with $k$ entities $\Delta V^t =({\Delta v_1^t, \dots, \Delta v_k^t})$
Algorithm at $t$-th step:
1. $\mathbf{S_1}, \dots, \mathbf{S_{t-1}}=BiLSTM(S_1), \dots, BiLSTM(S_{t-1})$
2. $\mathbf{v_i^{t-1}}=Avg(\text{mentions of } v_i^{t-1} \text{ in } \mathbf{S_1}, \dots, \mathbf{S_{t-1}})$
3. $V^t=V^{t-1} \bigcup \Delta V^t$
4. $\mathbf{V^{t-1}}, \Delta \mathbf{V^t} = Encoder(\mathbf{G_{t-1}}, S_t)$, contains interactions between text and graph
5. $\mathbf{V^t}=\mathbf{V^{t-1}}\bigcup \Delta \mathbf{V^t}$
$\mathbf{v_i^t}=\begin{cases} \mathbf{v_i^{t-1}}+\Delta \mathbf{v_j^t} & \text{if } \Delta v_j^t=v_i^{t-1} \\ \Delta \mathbf{v_i^{t}} & \text{otherwise} \end{cases}$
6. $A_t = Classifier(\mathbf{V^t})$
### Autoregressive Graph Generation (also related to Temporal Graph Network)
1. Sentence encoding: $\mathbf{S_t} = Encoder(S_t, G_{t-1})$
- This encoder may have a LSTM/Transformer with attention on $G_{t-1}$
2. Get representations of entities in $S_t$.
- For each entity in $E_t$, sum their word representations in $\mathbf{S_t}$
3. Update nodes in $G_{t-1}$
- Update isolated nodes with $\mathbf{S_t}$, e.g. GRU
- Add new entities to $G_{t-1}$
4. Update nodes with new mentions, e.g. GRU
- Message passing on $G_{t-1}$
5. Link prediction: $A_t = Classifier(G_{t-1})$
6. $G_t = Merge(A_t, G_{t-1})$
7. Message passing on $G_{t}$
-->
## Datasets
#### DocRED
This dataset is annotated with:
- Entities: each entity may have multiple mentions across the document
- Relations: each relation is a triple (h, r, t) together with its supporting evidences; the supporting evidences are sentences in the document
Let's call the last sentence in the evidences 'the last evidence'. In our framework, we use the last evidence to construct the training data. However, for about 2% of relations the last evidence contains neither h nor t, so we remove this evidence in preprocessing.
A few documents do not have any relations; we remove them as well. This leaves 3020 documents in the training set and 984 in the dev set.
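One possible reading of this preprocessing as code (a sketch; it assumes DocRED's `vertexSet`/`labels` fields and drops only the uninformative last evidence):

```python
def clean_last_evidence(doc):
    """Drop a relation's last evidence when it contains neither h nor t
    (~2% of cases); return None if the document ends up with no relations."""
    def sent_ids(entity_idx):
        # Sentences in which the entity is mentioned.
        return {m["sent_id"] for m in doc["vertexSet"][entity_idx]}

    kept = []
    for rel in doc.get("labels", []):
        evidence = sorted(rel["evidence"])
        if evidence and evidence[-1] not in sent_ids(rel["h"]) | sent_ids(rel["t"]):
            evidence.pop()                       # last evidence is uninformative
        if evidence:
            kept.append({**rel, "evidence": evidence})
    return {**doc, "labels": kept} if kept else None
```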
##### Coreference Resolution
We augment the dataset with coreference resolution using **AllenNLP**. We check whether existing entities have coreferent mentions; if so, we add the mentions that are not in the original training dataset. Note that we do not add new entities, only new mentions of existing entities.
For example, as shown in the paragraph below, "Nicola Lake Indian Reserve" and "its" refer to the same entity. However, in the original training dataset, the mention "its" is missing; after coreference resolution, it is added.
**sent_2:** *The lake is important in the history of the local Nicola people as the location of one of their major communities, ==Nicola Lake Indian Reserve== No.*
**sent_3:** *1, which lies on ==its== eastern shore and is the home of the Upper Nicola Indian Band.*

As a result, we add 4434 mentions to the training set in total.
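A minimal sketch of obtaining the coreference clusters with AllenNLP's public SpanBERT coreference model (requires the `allennlp` and `allennlp-models` packages); mapping the predictor's token spans back to DocRED's sentence-level mention offsets is omitted here:

```python
from allennlp.predictors.predictor import Predictor

# Public SpanBERT-based coreference model distributed with allennlp-models.
COREF_MODEL = ("https://storage.googleapis.com/allennlp-public-models/"
               "coref-spanbert-large-2021.03.10.tar.gz")
predictor = Predictor.from_path(COREF_MODEL)

def coref_clusters(text):
    """Return coreference clusters as lists of (start, end) token spans,
    together with the predictor's own tokenization of the document."""
    output = predictor.predict(document=text)
    return output["clusters"], output["document"]
```

Clusters that overlap an annotated mention of an existing entity can then contribute their remaining spans (e.g. "its" above) as new mentions of that entity.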
#### SciREX
#### Other candidates
- NL2SQL
- Math problems
- Meeting notes summarization
### Using synthetic dataset
If we have a big knowledge graph $G$, the generation of a synthetic dataset follows this procedure (a code sketch follows the list):
1. To generate the first text $T_1$, sample a subgraph $g_1$ from $G$ and do regular G2T.
2. At the $i$-th generation step, we have the generated text $T_{<i}$ and the subgraph used so far, $G_{i-1}$. Sample a subgraph $g_i$ from $G$, subject to: 1) $g_i \cap G_{i-1} \neq \emptyset$; 2) $g_i \not\subset G_{i-1}$.
3. $T_i = f(g_i, T_{<i}, G_{i-1})$
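A sketch of this procedure with rejection sampling, where subgraphs are represented as sets of triples; `sample_subgraph` (the subgraph sampler) and `g2t` (the graph-to-text generator) are hypothetical components:

```python
def generate_synthetic(G, num_steps, sample_subgraph, g2t):
    """Generate (T_1, ..., T_n) from a big KG G by repeatedly sampling subgraphs."""
    texts, used = [], set()                        # T_{<i} and G_{i-1}
    for i in range(1, num_steps + 1):
        while True:
            g_i = sample_subgraph(G)
            if i == 1:                             # step 1: any subgraph, regular G2T
                break
            if (g_i & used) and not (g_i <= used):  # g_i ∩ G_{i-1} ≠ ∅ and g_i ⊄ G_{i-1}
                break
        texts.append(g2t(g_i, texts, used))        # T_i = f(g_i, T_{<i}, G_{i-1})
        used |= g_i
    return texts, used
```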