# NLP (Coreference Resolution) Preparatory Reading Note
###### tags: `NLP` `Coref Resolution`
## List of Preparatory Reading Materials
Current preparatory reading material for beginners in Natural Language Processing (NLP) with a background in basic machine learning (ML). Note that the research papers listed are intended for people looking to work on coreference resolution, but they may also cover basic concepts relevant to all of NLP.
### Textbooks
- [**Speech and Language Processing**](https://web.stanford.edu/~jurafsky/slp3/), Dan Jurafsky and James H. Martin, *3rd ed.*
### Research Papers
- [**End-to-end Neural Coreference Resolution**](https://www.aclweb.org/anthology/D17-1018/), Lee et al., 2017.
- [**Higher-order Coreference Resolution with Coarse-to-fine Inference**](https://www.aclweb.org/anthology/N18-2108/), Lee et al., 2018.
- [**BERT for Coreference Resolution: Baselines and Analysis**](https://www.aclweb.org/anthology/D19-1588/), Joshi et al., 2019.
## Reading Note
### Vector Representations and Embedding
Detailed summary note: [**Vector Representations and Embedding**](https://hackmd.io/@bachnguyenTE/NLP_embeddings).
---
### Deep Learning Architectures for Sequence Processing
Detailed summary note: [**Deep Learning Architectures for Sequence Processing**](https://hackmd.io/@bachnguyenTE/ML_SequenceProc).
---
### Machine Translation and Encoder-Decoder Models
Detailed summary note: [**Machine Translation and Encoder-Decoder Models**](https://hackmd.io/pBHKp0LNTWa8cEx_pwpi_w)
---
### Constituency Grammars
Detailed summary note: [**Constituency Grammars**](https://hackmd.io/@bachnguyenTE/consti_grammars)
---
### Coreference Resolution
From [**Chapter 22: Coreference Resolution**](https://web.stanford.edu/~jurafsky/slp3/22.pdf) in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*
#### Introduction to Coreference Resolution
> ++Albert Einstein++ was a ++German-born theoretical physicist++ who developed the theory of relativity, one of the two pillars of modern physics. ++His++ work is also known for its influence on the philosophy of science.
- Linguistic expressions such as *Albert Einstein* or *his* are called **mentions** or **referring expressions**, and the discourse entity that is referred to (Albert Einstein) the **referent**.
- Two or more referring expressions that are used to refer to the same discourse entity are said to **corefer** (such as *Albert Einstein* and *his*).
- A **discourse model** is a mental model that the system (or a human hearer) builds incrementally as it interprets a text;
- A model contains representations of the entities referred to in the text, as well as the properties of the entities and relations among them.
- When a referent is first mentioned in a discourse, a representation for it is **evoked** into the model. Upon subsequent mention, this representation is **accessed** from the model.
- Reference in a text to an entity that *has been previously introduced* into the discourse is called **anaphora**,
- The referring expression used is said to be an **anaphor**/anaphoric.
- The anaphor corefers with a prior mention, which is called the **antecedent**.
- Not every referring expression is an antecedent. An entity that has only a single mention in a text is called a **singleton**.
- **Coreference resolution** is the task of determining whether two mentions *corefer*, by which we mean they refer to the same entity in the discourse model (the same *discourse entity*).
- The set of coreferring expressions is often called a **coreference chain** or a **cluster**.
- **Entity linking**, or *entity resolution*, is the task of mapping a discourse entity to some real-world individual.
---
### End-to-end Neural Coreference Resolution
- The approach mentioned in the paper does not require existing resources such as syntactic parsers.
- *Syntactic parses* used as input to hand-engineered mention proposal algorithms are not needed.
- Their solution is an end-to-end neural network that learns both which spans are entity mentions and how best to cluster them.
- The model is a span-ranking system that considers the space of all spans (up to a maximum length) and determines whether any span is a likely antecedent of another.
- A *span-ranking* model determines whether any of the preceding spans is a good antecedent.
- The core of the model is therefore vector embeddings representing spans of text in a document.
- Scoring all span pairs in the model is impractical due to the sheer number of spans.
- The model is therefore factored over unary mention scores and pairwise antecedent scores.
- *Unary mention scores* are used to prune the space of spans and antecedents.
#### Related Work
- Previous high-performance models, built as pipelines of syntactic parsers and heavily engineered mention proposal algorithms, have two major drawbacks:
- *Cascading errors* caused by parsing mistakes, and
- *Hand-engineered rules do not generalize* to new languages or domains.
- The method described in the paper is non-pipelined; its span-ranking approach is most similar to *mention ranking*, but it reasons over a larger space by jointly detecting mentions and predicting coreference.
#### Task
- **Task**: Assigning each span $i$ in the document $D$ an antecedent $y_i$.
#### Model
- The model aims to learn a conditional probability distribution over all antecedents given the document $D$.
- For each span, a series of antecedents are considered and calculated with a pairwise score $s(i,j)$ for a reference link between span $i$ and span $j$ in $D$.
- There are three factors for the pairwise coreference score:
- Whether span $i$ is a mention ($s_m(i)$),
- Whether span $j$ is a mention ($s_m(j)$), and
- Whether $j$ is an antecedent of $i$ ($s_a(i, j)$).
- The above factoring enables aggressive pruning of spans that are unlikely to belong to a coreference cluster according to the mention score $s_m(i)$.
- **Scoring Architecture**: At the core of the scoring model are vector representations $g_i$ for each possible span $i$.
- The scoring functions are computed via standard feedforward neural networks:
\begin{align}
s_m(i) &= w_m \cdot \text{FFNN}_m(g_i) \\
s_a(i,j) &= w_a \cdot \text{FFNN}_a([g_i, g_j, g_i \odot g_j, \phi(i,j)])
\end{align}
- $\phi(i,j)$ is a feature vector encoding speaker and genre information from the metadata and the distance between two spans.
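To make the factored scoring concrete, here is a minimal numpy sketch, assuming toy dimensions, randomly initialized weights, and a one-hidden-layer `ffnn` standing in for the paper's $\text{FFNN}_m$ / $\text{FFNN}_a$; none of the names come from the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8   # span embedding size (toy)
F = 4   # size of the feature vector phi(i, j) (toy)

def ffnn(x, W1, W2):
    # One hidden layer with ReLU, standing in for the paper's FFNNs.
    return np.maximum(W1 @ x, 0.0) @ W2

# Toy, randomly initialized parameters (not trained).
W1_m, W2_m, w_m = rng.standard_normal((16, D)), rng.standard_normal((16, 16)), rng.standard_normal(16)
W1_a, W2_a, w_a = rng.standard_normal((16, 3 * D + F)), rng.standard_normal((16, 16)), rng.standard_normal(16)

def s_m(g):
    # Unary mention score s_m(i) = w_m . FFNN_m(g_i)
    return w_m @ ffnn(g, W1_m, W2_m)

def s_a(g_i, g_j, phi_ij):
    # Pairwise antecedent score on [g_i, g_j, g_i * g_j, phi(i, j)]
    return w_a @ ffnn(np.concatenate([g_i, g_j, g_i * g_j, phi_ij]), W1_a, W2_a)

def s(g_i, g_j, phi_ij):
    # Full coreference score factored into the three terms above.
    return s_m(g_i) + s_m(g_j) + s_a(g_i, g_j, phi_ij)

g_i, g_j, phi_ij = rng.standard_normal(D), rng.standard_normal(D), rng.standard_normal(F)
print(s(g_i, g_j, phi_ij))   # a single scalar score
```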
- **Span Representations**:
- Two types of information are crucial to accurately predicting coreference links:
- The context surrounding the mention span, and
- The internal structure within the span.
- This information is captured by the *boundary representations* together with the head-finding attention described below.
- The model uses a bidirectional LSTM to encode the lexical information of both the inside and outside of each span.
- First, use bidirectional LSTMs to *encode every word* in its context.
\begin{align}
f_{t, \delta} &= \sigma(W_f[x_t, h_{t+\delta,\delta}] + b_f) \\
o_{t, \delta} &= \sigma(W_o[x_t, h_{t+\delta,\delta}] + b_o) \\
\tilde{c}_{t, \delta} &= \tanh(W_c[x_t, h_{t+\delta,\delta}] + b_c) \\
c_{t, \delta} &= f_{t, \delta} \odot \tilde{c}_{t, \delta} + (1 - f_{t, \delta}) \odot c_{t+\delta, \delta} \\
h_{t, \delta} &= o_{t, \delta} \odot \tanh(c_{t, \delta}) \\
x^*_t &= [h_{t,1}, h_{t, -1}]
\end{align}
- Independent LSTMs are used for every sentence, since cross-sentence context was not helpful.
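Below is a small numpy sketch of this LSTM variant, whose input gate is tied to the forget gate as $(1 - f)$; the toy dimensions, random initialization, and weight sharing across directions are simplifications for illustration, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(1)
D_x, D_h = 6, 5                       # toy input / hidden sizes
W_f, W_o, W_c = (rng.standard_normal((D_h, D_x + D_h)) for _ in range(3))
b_f, b_o, b_c = np.zeros(D_h), np.zeros(D_h), np.zeros(D_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z + b_f)            # forget gate f_{t, delta}
    o = sigmoid(W_o @ z + b_o)            # output gate o_{t, delta}
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c = f * c_tilde + (1.0 - f) * c_prev  # input gate tied to (1 - f), as in the equations above
    h = o * np.tanh(c)
    return h, c

def encode_sentence(xs):
    # x*_t = [h_{t,1}, h_{t,-1}]; weights are shared across directions here
    # only to keep the sketch short (the model uses separate parameters).
    T = len(xs)
    h_fwd, h_bwd = [None] * T, [None] * T
    h, c = np.zeros(D_h), np.zeros(D_h)
    for t in range(T):
        h, c = lstm_step(xs[t], h, c)
        h_fwd[t] = h
    h, c = np.zeros(D_h), np.zeros(D_h)
    for t in reversed(range(T)):
        h, c = lstm_step(xs[t], h, c)
        h_bwd[t] = h
    return [np.concatenate([h_fwd[t], h_bwd[t]]) for t in range(T)]

x_star = encode_sentence([rng.standard_normal(D_x) for _ in range(4)])
```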
- Instead of relying on syntactic parses to find head words, the model includes an *attention mechanism over words in each span* to model head words:
\begin{align}
\alpha_t &= w_\alpha \cdot \text{FFNN}_\alpha(x^*_t) \\
a_{i, t} &= \frac{\exp(\alpha_t)}{\sum^{\text{END}(i)}_{k = \text{START}(i)}\exp(\alpha_k)} \\
\hat{x}_i &= \sum^{\text{END}(i)}_{t = \text{START}(i)} a_{i,t} \cdot x_t
\end{align}
- $\hat{x}_i$ is a weighted sum of word vectors in span $i$.
- The weights $a_{i,t}$ are automatically learned and correlate strongly with traditional definitions of head words.
- The attention mechanism is not comparing any two entities; rather, it takes a unary score for each word and measures how important that word is to the rest of the span (see the sketch after the span representation below).
- The above span information is concatenated to produce the final representation $g_i$ of span $i$:
\begin{align}
g_i = [x^*_{\text{START}(i)}, x^*_{\text{END}(i)}, \hat{x}_i, \phi(i)]
\end{align}
- $x^*_{\text{START}(i)}, x^*_{\text{END}(i)}$ are boundary representations.
- $\hat{x}_i$ is the soft head word vector, and
- $\phi(i)$ is a feature vector encoding the size of span $i$.
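A hedged numpy sketch of the head-finding attention and the resulting span representation $g_i$; `w_alpha`, `ffnn_alpha`, and `phi_i` are illustrative stand-ins, and the usage at the bottom assumes random inputs and an identity $\text{FFNN}_\alpha$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def span_representation(x_star, x, start, end, w_alpha, ffnn_alpha, phi_i):
    # alpha_t = w_alpha . FFNN_alpha(x*_t) for every word t inside the span.
    alphas = np.array([w_alpha @ ffnn_alpha(x_star[t]) for t in range(start, end + 1)])
    a = softmax(alphas)                                        # a_{i,t}, normalized over the span
    x_hat = sum(a[k] * x[start + k] for k in range(len(a)))    # soft head word vector
    # g_i = [x*_START(i), x*_END(i), x_hat_i, phi(i)]
    return np.concatenate([x_star[start], x_star[end], x_hat, phi_i])

# Toy usage: random "contextualized" and raw word vectors, identity FFNN_alpha,
# and a placeholder span-width feature phi_i.
rng = np.random.default_rng(2)
T, D = 7, 6
x_star = rng.standard_normal((T, 2 * D))
x = rng.standard_normal((T, D))
g = span_representation(x_star, x, 2, 4, rng.standard_normal(2 * D), lambda v: v, np.ones(3))
```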
#### Inference
- The size of the full model described above is $O(T^4)$ in the document length $T$.
- To maintain efficiency, candidate spans are pruned greedily during both training and evaluation.
- The model only considers spans with up to $L$ words and computes their unary mention scores $s_m(i)$.
- The model keeps only up to $\theta T$ spans with the highest mention scores and considers only up to $K$ antecedents for each.
- Spans are accepted in decreasing order of mention score, unless the span $i$ under consideration partially overlaps (crosses brackets with) an already accepted span $j$ (sketched below).
- For the remaining mentions, the joint distribution of antecedents for each document is computed in a forward pass over a single computation graph.
- The final prediction is the clustering produced by the most likely configuration.
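The greedy pruning over candidate spans might look roughly like the following sketch, assuming `spans` is a list of `(start, end, mention_score)` tuples and that the overlap check is the crossing-brackets condition; this is a simplified reading of the procedure, not the released code.

```python
def prune_spans(spans, keep):
    # spans: list of (start, end, mention_score); keep: e.g. int(theta * T).
    accepted = []
    for start, end, score in sorted(spans, key=lambda s: -s[2]):
        if len(accepted) >= keep:
            break
        # Reject span i if it crosses brackets with an already accepted span j.
        crosses = any((s < start <= e < end) or (start < s <= end < e)
                      for s, e, _ in accepted)
        if not crosses:
            accepted.append((start, end, score))
    return accepted

# Example: keep at most 2 of 4 candidate spans.
print(prune_spans([(0, 2, 1.5), (1, 3, 1.2), (4, 4, 0.9), (0, 1, 0.1)], keep=2))
```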
#### Learning
- The learning process optimizes the likelihood of all correct antecedents implied by the gold clustering (a toy sketch appears at the end of this subsection):
\begin{align}
\log \prod^N_{i=1} \sum_{\hat{y} \in \mathcal{Y}(i) \cap \text{GOLD}(i)} P(\hat{y})
\end{align}
- Basically, considering all spans in a document,
- For each span, you sum up the probabilities of its potentially correct antecedents.
- If the sum is 1, the model is not incurring any loss.
- Related to Durrett and Klein (2013).
- $\text{GOLD}(i)$ is the set of spans in the gold clustering containing span $i$.
- If span $i$ does not belong to a gold cluster or all gold antecedents have been pruned, $\text{GOLD}(i) = \{\epsilon\}$
- By optimizing the above objective, the model naturally learns to prune spans accurately.
- Only gold mentions receive positive updates after the initial random pruning.
- The model quickly leverages the learning signal for appropriate credit assignment to the different factors, such as the mention score $s_m$ used for pruning.
- Note that if a span is either not a mention or has no good antecedent, it is assigned a dummy antecedent $\epsilon$, signalling that it is not in any coreference relationship.
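As a sketch of the objective for a single span, assuming `antecedent_scores` holds $s(i, j)$ for the surviving candidate antecedents, the dummy antecedent $\epsilon$ is fixed at score 0, and `gold_mask` marks which candidates lie in span $i$'s gold cluster:

```python
import numpy as np

def span_log_likelihood(antecedent_scores, gold_mask):
    # antecedent_scores: s(i, j) for each surviving candidate antecedent j.
    # gold_mask: boolean array, True where j is in span i's gold cluster.
    scores = np.concatenate([[0.0], antecedent_scores])          # index 0 = dummy antecedent epsilon
    gold = np.concatenate([[not gold_mask.any()], gold_mask])    # epsilon is "gold" iff no gold antecedent survives
    # log P(y) = log softmax over all candidates (including epsilon).
    log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    log_probs = scores - log_z
    # log of the summed probability of the gold candidates (log-sum-exp for stability).
    gold_log_probs = log_probs[gold.astype(bool)]
    m = gold_log_probs.max()
    return m + np.log(np.exp(gold_log_probs - m).sum())

# Toy: one span with two candidate antecedents, the first of which is gold.
# The training objective sums this quantity over every span in the document.
print(span_log_likelihood(np.array([2.0, -1.0]), np.array([True, False])))
```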
---
### Higher-order Coreference Resolution with Coarse-to-fine Inference
- Recent coreference resolution systems have heavily relied on **first-order models**, where only pairs of entity mentions are scored by the model.
- That includes the recent *e2e-coref* solution (`Lee et al., 2017`).
- These models are computationally efficient and scalable to long documents.
- However, because they make independent decisions about coreference links, they are susceptible to predicting clusters that are *locally consistent but globally inconsistent*.
- The new higher-order inference uses the span-ranking architecture from `Lee et al. (2017)` in an iterative manner.
- At each iteration, the **antecedent distribution is used as an attention mechanism** to optionally update span representations,
- Enabling later coreference decisions to softly condition on earlier coreference decisions.
- The new model also includes a **coarse-to-fine approach** that is learned with the objective, enabling an extra pruning step during inference.
- This reduces the number of antecedents considered by the more accurate but inefficient fine factor.
#### Background
- The coreference resolution task formulation is identical to the mechanisms described in `Lee et al. (2017)` above.
- That is, spans are scored and ranked based on a unary mention score and pairwise antecedent scores with earlier spans.
- The baseline model is factored to enable a two-stage beam search.
- A beam of up to $M$ potential mentions is computed based on the spans with the highest mention scores $s_m(i)$.
- Pairwise coreference scores are only computed between surviving mentions during both training and inference.
- The baseline model is a first-order model, since it only considers pairs of spans.
#### Higher-order Coreference Resolution
- First-order models are susceptible to consistency errors.
- Modeling higher-order decisions at the document level requires explicit inference.
- This is due to the potentially very large surface distance between mentions.
- The new higher-order inference procedure addresses these problems while keeping the model differentiable.
- This inference involves $N$ iterations of *refining span representations*.
- $\pmb{g}^n_i$ denotes the representation of span $i$ at iteration $n$.
- At iteration $n$, $\pmb{g}^n_i$ is computed with an attention mechanism that averages over previous representations of $\pmb{g}^{n-1}_j$ weighted according to how likely each mention $j$ is to be an antecedent to $i$.
- The baseline model is used to initialize the span representation at $\pmb{g}^1_i$.
- The refined span representations allow the model to also iteratively refine the antecedent distribution $P_n(y_i)$, where $s$ is the coreference scoring function:
\begin{align}
P_n(y_i) = \frac{e^{s(\pmb{g}^n_i, \pmb{g}^n_{y_i})}}{\sum_{y \in \mathcal{Y}(i)} e^{s(\pmb{g}^n_i, \pmb{g}^n_y)}}
\end{align}
- The scoring function uses the same parameters at every iteration, but it is given different span representations.
- At each iteration, we first compute the expected antecedent representation $\pmb{a}^n_i$ of each span $i$ by using the current antecedent distribution $P_n(y_i)$ as an **attention mechanism**:
\begin{align}
\pmb{a}^n_i = \sum_{y_i \in \mathcal{Y}(i)} P_n(y_i) \cdot \pmb{g}^n_{y_i}
\end{align}
- In other words, the model integrates new information from a span's expected antecedents to refine the span representation itself.
- The current span representation $\pmb{g}^n_i$ is then updated via interpolation with its expected antecedent representation $\pmb{a}^n_i$:
\begin{align}
\pmb{f}^n_i &= \sigma(\textbf{W}_f [\pmb{g}^n_i, \pmb{a}^n_i]) \\
\pmb{g}^{n+1}_i &= \pmb{f}^n_i \odot \pmb{g}^n_i + (1 - \pmb{f}^n_i) \odot \pmb{a}^n_i
\end{align}
- The learned gate vector $\pmb{f}^n_i$ determines, for each dimension, whether to keep the current span information or to integrate new information from its expected antecedent.
- At iteration $n$, $\pmb{g}^n_i$ is an element-wise weighted average of approximately $n$ span representations (assuming $P_n(y_i)$ is peaked), allowing $P_n(y_i)$ to softly condition on up to $n$ other spans in the predicted cluster.
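One refinement iteration could be sketched as follows, assuming `G` stacks the span representations $\pmb{g}^n_i$ row-wise in document order, `pairwise_scores` holds the coreference scores, and `W_f` is the gate projection; the masking scheme and the omission of the dummy antecedent are simplifications, not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def refine(G, pairwise_scores, W_f):
    # G: (num_spans, d) span representations g_i^n (spans in document order).
    # pairwise_scores: (num_spans, num_spans) coreference scores s(g_i^n, g_j^n).
    # Only earlier spans may be antecedents, so mask out j >= i before the softmax
    # (the dummy antecedent is omitted here to keep the sketch short).
    n = G.shape[0]
    masked = np.where(np.tril(np.ones((n, n)), k=-1) == 1, pairwise_scores, -1e9)
    P = softmax_rows(masked)                               # P_n(y_i)
    A = P @ G                                              # expected antecedent a_i^n
    F = sigmoid(np.concatenate([G, A], axis=1) @ W_f)      # gate f_i^n
    return F * G + (1.0 - F) * A                           # g_i^{n+1}

# Toy usage: three spans, d = 4, two refinement iterations.
rng = np.random.default_rng(3)
G = rng.standard_normal((3, 4))
W_f = rng.standard_normal((8, 4))
for _ in range(2):
    G = refine(G, G @ G.T, W_f)   # a dot-product score stands in for s(i, j)
```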
- Span ranking can be viewed as predicting latent antecedent trees, where the predicted antecedent is the parent of a span and each tree is a predicted cluster.
- Another way to interpret this model, with iterative span representations and antecedent distributions refinement, is that the joint distribution $\prod_i P_N(y_i)$ implicitly models every directed path of up to length $N + 1$ in the latent antecedent tree.
#### Coarse-to-fine Inference
- The model above scales poorly to long documents, bottlenecked by the pairwise antecedent score computation, whose tensor grows quadratically in the number of spans.
- Pairwise antecedent matrix size $M \times M \times (3|\pmb{g}| + |\phi|)$.
- The computational challenge is made worse with the iterative inference process.
##### Heuristic antecedent pruning
- The original paper considers only the nearest $K$ antecedents of each span, reducing the matrix size to $M \times K \times (3|\pmb{g}| + |\phi|)$.
- The main drawback is that this imposes an a priori limit on the maximum distance of a coreference link.
##### Coarse-to-fine antecedent pruning
- The key component of the coarse-to-fine approach that does not establish a maximum coreference distance is an **alternate bilinear scoring function**:
\begin{align}
s_c(i, j) = \pmb{g}^\top_i \textbf{W}_c \pmb{g}_j
\end{align}
- $\textbf{W}_c$ is a learned weight matrix.
- The bilinear $s_c(i, j)$ is far less accurate than the concatenation-based $s_a(i, j)$, but is much more efficient to compute (matrix size $M \times |\pmb{g}|$ and $M \times M$).
- The above bilinear function $s_c(i, j)$ is used to compute a rough sketch of *likely antecedents*, by including it as an additional factor in the model:
\begin{align}
s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j)
\end{align}
- This additional factor is leveraged for an extra beam pruning step.
- The final inference procedure involves a three-stage beam search:
- **First stage**: Keep the top $M$ spans based on mention score $s_m(i)$ of each span.
- **Second stage**: Keep the top $K$ antecedents of each remaining span $i$ based on the first three factors, $s_m(i) + s_m(j) + s_c(i, j)$.
- **Third stage**: The overall coreference $s(i, j)$ is computed based on the remaining span pairs, with the soft higher-order inference procedure described previously.
- While the maximum-likelihood objective is computed over only the span pairs from the final stage, this coarse-to-fine approach expands the set of coreference links that the model can learn.
- It achieves better performance while using a much smaller $K$ than heuristic pruning.
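A rough numpy sketch of the three-stage beam search, with `s_m` as precomputed mention scores, `W_c` as the learned bilinear weight matrix, and `s_a_fn` as a stand-in for the expensive concatenation-based factor; the constraint that antecedents must precede their span is omitted for brevity.

```python
import numpy as np

def coarse_to_fine(G, s_m, W_c, s_a_fn, M, K):
    # Stage 1: keep the top-M spans by unary mention score s_m(i).
    top = np.argsort(-s_m)[:M]
    G_top, s_m_top = G[top], s_m[top]

    # Stage 2: cheap bilinear factor s_c(i, j) = g_i^T W_c g_j for all kept pairs,
    # then keep the top-K antecedents per span by s_m(i) + s_m(j) + s_c(i, j).
    s_c = G_top @ W_c @ G_top.T
    rough = s_m_top[:, None] + s_m_top[None, :] + s_c
    antecedent_beam = np.argsort(-rough, axis=1)[:, :K]

    # Stage 3: the expensive s_a(i, j) only on the surviving pairs.
    final = {}
    for i in range(len(top)):
        for j in antecedent_beam[i]:
            final[(top[i], top[j])] = rough[i, j] + s_a_fn(G_top[i], G_top[j])
    return final

# Toy usage with random spans and a dot product standing in for s_a.
rng = np.random.default_rng(4)
G = rng.standard_normal((10, 6))
scores = coarse_to_fine(G, rng.standard_normal(10), rng.standard_normal((6, 6)),
                        lambda gi, gj: float(gi @ gj), M=5, K=2)
```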
---
### BERT for Coreference Resolution: Baselines and Analysis
- Recent BERT-based models have reported dramatic gains on multiple semantic benchmarks.
- One of BERT's major improvements over previous methods is passage-level training, which allows it to *better model longer sequences*.
- In the paper, BERT is fine-tuned to coreference resolution, achieving strong improvements on coref benchmarks.
- There are two ways of extending the c2f-coref model in `Lee et al. (2018)` with BERT.
- The *independent* variant uses non-overlapping segments each of which acts as an independent instance for BERT.
- The *overlap* variant splits the document into overlapping segments so as to provide the model with context beyond 512 tokens.
- Analysis shows that BERT-large is remarkably better at distinguishing related yet distinct entities or concepts (unlike BERT-base).
- However, both models struggle to resolve coreferences for cases that require world knowledge.
- Likewise, modeling pronouns remains difficult, especially in conversations.
- BERT-large also benefits from using longer context windows (384 word pieces), while BERT-base performs better with shorter contexts (128 word pieces). Both variants perform much worse with longer context windows (512 tokens) in spite of being trained on 512-size contexts.
- The *overlap* variant with extended context window (beyond 512) provides no improvement.
- This indicates that using larger context windows for pretraining might not translate into effective long-range features for a downstream task.
- Larger models also exacerbate the memory-intensive nature of span representations (`Lee et al., 2017`).
- In conclusion, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing.
#### Method
- For the experiments, the paper uses the higher-order coreference model from `Lee et al. (2018)`, referred to as *c2f-coref*.
- A description of *c2f-coref* is found in the paper summary above.
- To apply BERT, the paper replaces the entire LSTM-based encoder (with ELMo and GloVe embeddings as input) in c2f-coref with the BERT transformer.
- The first and last word-pieces (concatenated with the attended version of all the word pieces in the span) are treated as span representations.
- The paper experiments with two variants of splitting:
- **Independent**: The *independent* variant uses non-overlapping segments each of which acts as an independent instance for BERT.
- As BERT is trained on sequences of 512 word pieces, this variant has limited encoding capacity especially for end of segment tokens.
- **Overlap**: The *overlap* variant splits the document into overlapping segments by creating a $T$-sized segment after every $T/2$ tokens.
- These segments are then passed on to the BERT encoder independently, and
- The final token representation is derived by element-wise interpolation of representations from both overlapping segments.
- Let $\pmb{r_1} \in \mathbb{R}^d$ and $\pmb{r_2} \in \mathbb{R}^d$ be the token representations from the overlapping BERT segments.
- The final representations $\pmb{r} \in \mathbb{R}^d$ is given by:
\begin{align}
\pmb{f} &= \sigma(\pmb{w}^T[\pmb{r_1}; \pmb{r_2}]) \\
\pmb{r} &= \pmb{f} \cdot \pmb{r_1} + (\pmb{1} - \pmb{f}) \cdot \pmb{r_2}
\end{align}
- $\pmb{w} \in \mathbb{R}^{2d \times d}$ is a trained parameter and $[;]$ represents concatenation.
- This variant allows the model to artificially increase the context window beyond the `max_segment_len` hyperparameter.
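A sketch of how the *overlap* variant's splitting and interpolation could be implemented, assuming `encode(segment)` returns one $d$-dimensional vector per token (a BERT encoder in the paper, a random stub here) and `w` is the trained gate projection above; segmentation details are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def overlap_encode(tokens, encode, w, T):
    # Create a T-sized segment after every T/2 tokens, so each token is covered
    # by at most two segments.
    step = T // 2
    starts = range(0, max(len(tokens) - step, 1), step)
    segment_reps = [(s, encode(tokens[s:s + T])) for s in starts]   # (offset, (len, d) array)

    out = [None] * len(tokens)
    for offset, reps in segment_reps:
        for t in range(reps.shape[0]):
            pos = offset + t
            if out[pos] is None:
                out[pos] = reps[t]
            else:
                # Token seen by two overlapping segments: gated element-wise interpolation.
                r1, r2 = out[pos], reps[t]
                f = sigmoid(np.concatenate([r1, r2]) @ w)   # w: (2d, d)
                out[pos] = f * r1 + (1.0 - f) * r2
    return out

# Toy usage: a "BERT encoder" stub returning random per-token vectors.
rng = np.random.default_rng(5)
d, doc = 4, list(range(10))
reps = overlap_encode(doc, lambda seg: rng.standard_normal((len(seg), d)),
                      rng.standard_normal((2 * d, d)), T=6)
```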
- All layers in both model variants are then finetuned following [`Devlin et al. (2019)`](https://www.aclweb.org/anthology/N19-1423/).
#### Analysis
##### Strengths
- There are no qualitative or quantitative differences between the ELMo and BERT-base models.
- However, BERT-large improves over BERT-base in several ways including pronoun resolution and lexical matching.
- It is better at distinguishing related, but distinct, entities.
##### Weaknesses
- Better modeling of document-level context, conversations, and entity paraphrasing might further improve the state of the art.
- The models perform distinctly worse on longer documents.
- Larger models might better encode longer context, but are memory-intensive due to span representations.
- Future research in pretraining methods should look at more *effectively* encoding document-level context using sparse representations.
- Modeling pronouns (especially in conversations) continues to be difficult for all models.
- This is perhaps partly because c2f-coref does very little to *model the dialog structure of the document*.
- A considerable number of errors also suggest that models are still unable to resolve cases requiring *mention paraphrasing*.