# NLP (Coreference Resolution) Preparatory Reading Note

###### tags: `NLP` `Coref Resolution`

## List of Preparatory Reading Materials

Current preparatory reading material for beginners in Natural Language Processing (NLP) with a background in basic Machine Learning (ML). Note that the research papers mentioned are intended for people looking to work on coreference resolution, but the material also covers basic concepts relevant to all of NLP.

### Textbooks

- [**Speech and Language Processing**](https://web.stanford.edu/~jurafsky/slp3/), Dan Jurafsky and James H. Martin, *3rd ed.*

### Research Papers

- [**End-to-end Neural Coreference Resolution**](https://www.aclweb.org/anthology/D17-1018/), Lee et al., 2017.
- [**Higher-order Coreference Resolution with Coarse-to-fine Inference**](https://www.aclweb.org/anthology/N18-2108/), Lee et al., 2018.
- [**BERT for Coreference Resolution: Baselines and Analysis**](https://www.aclweb.org/anthology/D19-1588/), Joshi et al., 2019.

## Reading Note

### Vector Representations and Embedding

Detailed summary note: [**Vector Representations and Embedding**](https://hackmd.io/@bachnguyenTE/NLP_embeddings).

---

### Deep Learning Architectures for Sequence Processing

Detailed summary note: [**Deep Learning Architectures for Sequence Processing**](https://hackmd.io/@bachnguyenTE/ML_SequenceProc).

---

### Machine Translation and Encoder-Decoder Models

Detailed summary note: [**Machine Translation and Encoder-Decoder Models**](https://hackmd.io/pBHKp0LNTWa8cEx_pwpi_w).

---

### Constituency Grammars

Detailed summary note: [**Constituency Grammars**](https://hackmd.io/@bachnguyenTE/consti_grammars).

---

### Coreference Resolution

From [**Chapter 22: Coreference Resolution**](https://web.stanford.edu/~jurafsky/slp3/22.pdf) in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*

#### Introduction to Coreference Resolution

> ++Albert Einstein++ was a ++German-born theoretical physicist++ who developed the theory of relativity, one of the two pillars of modern physics. ++His++ work is also known for its influence on the philosophy of science.

- Linguistic expressions such as *Albert Einstein* or *his* are called **mentions** or **referring expressions**, and the discourse entity that is referred to (Albert Einstein) is the **referent**.
- Two or more referring expressions that are used to refer to the same discourse entity are said to **corefer** (such as *Albert Einstein* and *his*).
- A **discourse model** is a mental model that the system (or a human hearer) builds incrementally as it interprets a text.
    - The model contains representations of the entities referred to in the text, as well as the properties of the entities and the relations among them.
    - When a referent is first mentioned in a discourse, a representation for it is **evoked** into the model. Upon subsequent mention, this representation is **accessed** from the model.
- Reference in a text to an entity that *has been previously introduced* into the discourse is called **anaphora**.
    - The referring expression used is said to be an **anaphor** (anaphoric).
    - The anaphor corefers with a prior mention that is called the **antecedent**.
    - Not every referring expression is an antecedent. An entity that has only a single mention in a text is called a **singleton**.
- **Coreference resolution** is the task of determining whether two mentions *corefer*, by which we mean they refer to the same entity in the discourse model (the same *discourse entity*).
- The set of coreferring expressions is often called a **coreference chain** or a **cluster** (see the small illustration after this list).
- **Entity linking**, or *entity resolution*, is the task of mapping a discourse entity to some real-world individual.
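
To make the terminology concrete, here is a minimal Python sketch (not from the textbook; the inclusive `(start, end)` token-span convention and variable names are assumptions for illustration) of a document whose mentions form a single coreference chain:

```python
# A minimal sketch of how mentions and coreference chains can be represented:
# each mention is a (start, end) token span, and a cluster (coreference chain)
# is the list of spans that corefer. Singletons would be one-span clusters.

tokens = ["Albert", "Einstein", "was", "a", "German-born", "theoretical",
          "physicist", "...", "His", "work", "is", "also", "known", "..."]

clusters = [
    [(0, 1), (8, 8)],   # {"Albert Einstein", "His"} refer to the same entity
]

# Map each mention back to its cluster id; recovering this mapping from the
# raw tokens alone is what a coreference resolver has to do.
mention_to_cluster = {span: cid for cid, chain in enumerate(clusters) for span in chain}

for (start, end), cid in mention_to_cluster.items():
    print(" ".join(tokens[start:end + 1]), "-> cluster", cid)
```
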
---

### End-to-end Neural Coreference Resolution

- The approach presented in the paper does not require existing resources such as syntactic parsers.
    - *Syntactic parsing* (what is it?) as input to hand-engineered mention proposal algorithms is not needed.
- Their solution is an end-to-end neural network that learns both which spans are entity mentions and how to best cluster them.
- The model is a span-ranking and optimizing system that considers the space of all spans (up to a maximum length) and determines whether any span is a likely antecedent of another.
    - A *span-ranking* model determines whether any of the previous spans is a good antecedent.
- The core of the model is therefore the vector embeddings representing spans of text in a document.
- Scoring all span pairs in the model is impractical due to the sheer number of spans.
    - The model is therefore factored over unary mention scores and pairwise antecedent scores.
    - *Unary mention scores* are used to prune the space of spans and antecedents.

#### Related Work

- Previous high-performance models, built as pipelines of head parsers and heavily engineered mention proposal algorithms, have two major drawbacks:
    - *Cascading errors* caused by parsing mistakes, and
    - *Hand-engineered rules do not generalize* to new languages or domains.
- The method described in the study produces non-pipelined results, with a span-ranking approach that is most similar to *mention ranking*, while reasoning over a larger space by jointly detecting mentions and predicting coreference.

#### Task

- **Task**: Assign each span $i$ in the document $D$ an antecedent $y_i$.

#### Model

- The model aims to learn a conditional probability distribution over all antecedents given the document $D$.
- For each span, a series of antecedents is considered, each with a pairwise score $s(i,j)$ for a coreference link between span $i$ and span $j$ in $D$.
- There are three factors in the pairwise coreference score:
    - Whether span $i$ is a mention ($s_m(i)$),
    - Whether span $j$ is a mention ($s_m(j)$), and
    - Whether $j$ is an antecedent of $i$ ($s_a(i, j)$).
- The above factoring enables aggressive pruning of spans that are unlikely to belong to a coreference cluster according to the mention score $s_m(i)$.
- **Scoring Architecture**: At the core of the scoring model are vector representations $g_i$ for each possible span $i$.
- The scoring functions are computed via standard feedforward neural networks (see the sketch below):
\begin{align} s_m(i) &= w_m \cdot \text{FFNN}_m(g_i) \\ s_a(i,j) &= w_a \cdot \text{FFNN}_a([g_i, g_j, g_i \odot g_j, \phi(i,j)]) \end{align}
- $\phi(i,j)$ is a feature vector encoding speaker and genre information from the metadata and the distance between the two spans.
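
To make the factored scoring concrete, here is a hedged PyTorch sketch of $s(i, j) = s_m(i) + s_m(j) + s_a(i, j)$. It is not the authors' implementation: the hidden sizes, FFNN depth, and names (`mention_scorer`, `antecedent_scorer`, `phi_ij`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def ffnn(in_dim, hidden=150):
    # Small feedforward scorer; the final Linear plays the role of w_m / w_a.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

span_dim, feat_dim = 64, 20                        # toy sizes, not the paper's
mention_scorer = ffnn(span_dim)                    # s_m(.)
antecedent_scorer = ffnn(3 * span_dim + feat_dim)  # s_a(., .)

def coref_score(g_i, g_j, phi_ij):
    """s(i, j) = s_m(i) + s_m(j) + s_a(i, j) for one candidate pair."""
    s_m_i = mention_scorer(g_i)
    s_m_j = mention_scorer(g_j)
    pair = torch.cat([g_i, g_j, g_i * g_j, phi_ij], dim=-1)  # [g_i, g_j, g_i ⊙ g_j, φ(i, j)]
    return s_m_i + s_m_j + antecedent_scorer(pair)

g_i, g_j = torch.randn(span_dim), torch.randn(span_dim)   # span representations
phi_ij = torch.randn(feat_dim)                             # speaker/genre/distance features
print(coref_score(g_i, g_j, phi_ij))
```

Because $s_m$ is a unary function of a single span, it can be computed once per span and reused for pruning before any of the expensive pairwise $s_a$ scores are evaluated.
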
- **Span Representations**:
    - Two types of information are crucial to accurately predicting coreference links: the context surrounding the mention span, and the internal structure within the span.
    - This information is represented by the *boundary representations*.
    - The model uses a bidirectional LSTM to encode the lexical information of both the inside and the outside of each span.
    - First, bidirectional LSTMs are used to *encode every word* in its context:
\begin{align} f_{t, \delta} &= \sigma(W_f[x_t, h_{t+\delta,\delta}] + b_f) \\ o_{t, \delta} &= \sigma(W_o[x_t, h_{t+\delta,\delta}] + b_o) \\ \tilde{c}_{t, \delta} &= \tanh(W_c[x_t, h_{t+\delta,\delta}] + b_c) \\ c_{t, \delta} &= f_{t, \delta} \cdot \tilde{c}_{t, \delta} + (1 - f_{t, \delta}) \cdot c_{t+\delta, \delta} \\ h_{t, \delta} &= o_{t, \delta} \cdot \tanh(c_{t, \delta}) \\ x^*_t &= [h_{t,1}, h_{t, -1}] \end{align}
- Independent LSTMs are used for every sentence, since cross-sentence context was not found to be helpful.
- Instead of relying on syntactic parses to find head words, the model includes an *attention mechanism over words in each span* to model head words:
\begin{align} \alpha_t &= w_\alpha \cdot \text{FFNN}_\alpha(x^*_t) \\ a_{i, t} &= \frac{\exp(\alpha_t)}{\sum^{\text{END}(i)}_{k = \text{START}(i)}\exp(\alpha_k)} \\ \hat{x}_i &= \sum^{\text{END}(i)}_{t = \text{START}(i)} a_{i,t} \cdot x_t \end{align}
- $\hat{x}_i$ is a weighted sum of the word vectors in span $i$.
    - The weights $a_{i,t}$ are learned automatically and correlate strongly with traditional definitions of head words.
    - The attention mechanism does not compare any two entities; rather, it takes the unary score of each word and measures how important that word is to the rest of the span.
- The above span information is concatenated to produce the final representation $g_i$ of span $i$:
\begin{align} g_i = [x^*_{\text{START}(i)}, x^*_{\text{END}(i)}, \hat{x}_i, \phi(i)] \end{align}
- $x^*_{\text{START}(i)}$ and $x^*_{\text{END}(i)}$ are the boundary representations,
- $\hat{x}_i$ is the soft head word vector, and
- $\phi(i)$ is a feature vector encoding the size of span $i$.

#### Inference

- The size of the full model described above is $O(T^4)$ in the document length $T$.
- To maintain efficiency, candidate spans are pruned greedily during both training and evaluation.
    - The model only considers spans with up to $L$ words and computes their unary mention scores $s_m(i)$.
    - The model keeps only up to $\theta T$ spans with the highest mention scores and considers only up to $K$ antecedents for each.
    - Spans are accepted in decreasing order of the mention score, unless a span $i$ under consideration overlaps an already accepted span $j$.
- For the remaining mentions, the joint distribution of antecedents for each document is computed in a forward pass over a single computation graph.
- The final prediction is the clustering produced by the most likely configuration.

#### Learning

- The learning process optimizes the likelihood of all correct antecedents implied by the gold clustering (see the sketch at the end of this section):
\begin{align} \log \prod^N_{i=1} \sum_{\hat{y} \in \mathcal{Y}(i) \cap \text{GOLD}(i)} P(\hat{y}) \end{align}
    - Basically, considering all spans in a document: for each span, you sum up the probabilities of its potentially correct antecedents. If the sum is 1, the model does not incur any loss.
    - Related to Durrett & Klein (2013).
- $\text{GOLD}(i)$ is the set of spans in the gold cluster containing span $i$.
    - If span $i$ does not belong to a gold cluster, or if all of its gold antecedents have been pruned, $\text{GOLD}(i) = \{\epsilon\}$.
- By optimizing the above objective, the model naturally learns to prune spans accurately.
    - Only gold mentions receive positive updates after the initial random pruning.
    - The model quickly leverages this learning signal for appropriate credit assignment to the different factors, such as the mention score $s_m$ used for pruning.
- Note that if a span is either not a mention or has no good antecedent, it is assigned a dummy antecedent $\epsilon$, signalling that it is not in any coreference relationship.
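
As a worked example of the marginal log-likelihood above, the sketch below (illustrative PyTorch, not the paper's code; the scores and gold mask are made up) computes $\log \sum_{\hat{y} \in \mathcal{Y}(i) \cap \text{GOLD}(i)} P(\hat{y})$ for a single span $i$, with index 0 standing for the dummy antecedent $\epsilon$:

```python
import torch

def marginal_gold_log_likelihood(scores, gold_mask):
    """scores: (K+1,) antecedent scores s(i, y) for span i, index 0 = dummy ε (score fixed to 0).
    gold_mask: (K+1,) boolean, True where the antecedent is in GOLD(i).
    Returns log Σ_{ŷ ∈ Y(i) ∩ GOLD(i)} P(ŷ) under a softmax over all antecedents."""
    log_z = torch.logsumexp(scores, dim=-1)                      # log Σ_y exp s(i, y)
    gold_scores = scores.masked_fill(~gold_mask, float("-inf"))  # keep only gold antecedents
    return torch.logsumexp(gold_scores, dim=-1) - log_z

scores = torch.tensor([0.0, 2.3, -0.7, 1.1])        # [ε, three candidate antecedents]
gold = torch.tensor([False, True, False, True])      # gold antecedents of span i
loss = -marginal_gold_log_likelihood(scores, gold)   # summed over all spans in practice
print(loss)
```
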
---

### Higher-order Coreference Resolution with Coarse-to-fine Inference

- Recent coreference resolution systems have relied heavily on **first-order models**, where only pairs of entity mentions are scored by the model.
    - That includes the recent *e2e-coref* solution (`Lee et al., 2017`).
    - These models are computationally efficient and scalable to long documents.
    - However, because they make independent decisions about coreference links, they are susceptible to predicting clusters that are *locally consistent but globally inconsistent*.
- The new higher-order inference uses the span-ranking architecture from `Lee et al. (2017)` in an iterative manner.
    - At each iteration, the **antecedent distribution is used as an attention mechanism** to optionally update span representations, enabling later coreference decisions to softly condition on earlier coreference decisions.
- The new model also includes a **coarse-to-fine approach** that is learned with the objective and enables an extra pruning step during inference.
    - This reduces the number of antecedents considered by the more accurate but inefficient fine factor.

#### Background

- The coreference resolution task formulation is identical to the one described in `Lee et al. (2017)` above.
    - That is, spans are scored and ranked based on a unary mention score and pairwise antecedent scores with any earlier span.
- The baseline model is factored to enable a two-stage beam search.
    - A beam of up to $M$ potential mentions is computed based on the spans with the highest mention scores $s_m(i)$.
    - Pairwise coreference scores are only computed between surviving mentions during both training and inference.
- The baseline model is a first-order model, since it only considers pairs of spans.

#### Higher-order Coreference Resolution

- First-order models are susceptible to consistency errors.
    - Modeling higher-order decisions at the document level would otherwise require explicit inference, due to the potentially very large surface distance between mentions.
- The new higher-order inference procedure addresses these problems while keeping the model differentiable.
- This inference involves $N$ iterations of *refining span representations* (see the sketch at the end of this subsection).
    - $\pmb{g}^n_i$ denotes the representation of span $i$ at iteration $n$.
    - At iteration $n$, $\pmb{g}^n_i$ is computed with an attention mechanism that averages over the previous representations $\pmb{g}^{n-1}_j$, weighted according to how likely each mention $j$ is to be an antecedent of $i$.
    - The baseline model is used to initialize the span representations at $\pmb{g}^1_i$.
- The refined span representations allow the model to also iteratively refine the antecedent distribution $P_n(y_i)$, where $s$ is the coreference scoring function:
\begin{align} P_n(y_i) = \frac{e^{s(\pmb{g}^n_i, \pmb{g}^n_{y_i})}}{\sum_{y \in \mathcal{Y}(i)} e^{s(\pmb{g}^n_i, \pmb{g}^n_y)}} \end{align}
- The scoring function uses the same parameters at every iteration, but it is given different span representations.
- At each iteration, we first compute the expected antecedent representation $\pmb{a}^n_i$ of each span $i$ by using the current antecedent distribution $P_n(y_i)$ as an **attention mechanism**:
\begin{align} \pmb{a}^n_i = \sum_{y_i \in \mathcal{Y}(i)} P_n(y_i) \cdot \pmb{g}^n_{y_i} \end{align}
    - In other words, the model integrates new information from the expected antecedent distribution of a span to refine the span itself.
- The current span representation $\pmb{g}^n_i$ is then updated via interpolation with its expected antecedent representation $\pmb{a}^n_i$:
\begin{align} \pmb{f}^n_i &= \sigma(\textbf{W}_f [\pmb{g}^n_i, \pmb{a}^n_i]) \\ \pmb{g}^{n+1}_i &= \pmb{f}^n_i \odot \pmb{g}^n_i + (1 - \pmb{f}^n_i) \odot \pmb{a}^n_i \end{align}
- The learned gate vector $\pmb{f}^n_i$ determines, for each dimension, whether to keep the current span information or to integrate new information from its expected antecedent.
- At iteration $n$, $\pmb{g}^n_i$ is an element-wise weighted average of approximately $n$ span representations (assuming $P_n(y_i)$ is peaked), allowing $P_n(y_i)$ to softly condition on up to $n$ other spans in the predicted cluster.
- Span ranking can be viewed as predicting latent antecedent trees, where the predicted antecedent is the parent of a span and each tree is a predicted cluster.
- Another way to interpret this model, with its iterative refinement of span representations and antecedent distributions, is that the joint distribution $\prod_i P_N(y_i)$ implicitly models every directed path of up to length $N + 1$ in the latent antecedent tree.
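
The refinement loop above can be sketched in a few lines of PyTorch. This is a simplified illustration, not the released implementation: sizes are toy values, and the antecedent scores are kept fixed across iterations, whereas the full model recomputes them from the refined span representations.

```python
import torch
import torch.nn as nn

# One refinement iteration: the current antecedent distribution P_n(y_i) is used
# as attention over antecedent span vectors, and a learned gate interpolates the
# expected antecedent into g_i^{n+1}.

span_dim, num_spans, iterations = 64, 5, 2
gate = nn.Linear(2 * span_dim, span_dim)          # produces the gate f_i^n
g = torch.randn(num_spans, span_dim)              # g_i^1 from the baseline model

# Only earlier spans may be antecedents; keep the diagonal as a stand-in for the
# dummy antecedent ε so every span has at least one valid choice.
mask = torch.ones(num_spans, num_spans).triu(diagonal=1).bool()
scores = torch.randn(num_spans, num_spans).masked_fill(mask, float("-inf"))

for _ in range(iterations):
    p = torch.softmax(scores, dim=-1)             # P_n(y_i) for every span i
    a = p @ g                                     # expected antecedent a_i^n
    f = torch.sigmoid(gate(torch.cat([g, a], dim=-1)))
    g = f * g + (1 - f) * a                       # g_i^{n+1}
```
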
#### Coarse-to-fine Inference

- The model above scales poorly to long documents, bottlenecked by the pairwise antecedent score computation.
    - The pairwise antecedent score tensor has size $M \times M \times (3|\pmb{g}| + |\phi|)$.
    - The computational challenge is made worse by the iterative inference process.

##### Heuristic antecedent pruning

- The original paper considers only the nearest $K$ antecedents of each span, reducing the tensor size to $M \times K \times (3|\pmb{g}| + |\phi|)$.
    - The main drawback is that this imposes an a priori limit on the maximum distance of a coreference link.

##### Coarse-to-fine antecedent pruning

- The key component of the coarse-to-fine approach, which does not impose a maximum coreference distance, is an **alternate bilinear scoring function** (see the sketch at the end of this section):
\begin{align} s_c(i, j) = \pmb{g}^\top_i \textbf{W}_c \pmb{g}_j \end{align}
    - $\textbf{W}_c$ is a learned weight matrix.
    - The bilinear $s_c(i, j)$ is far less accurate than the concatenation-based $s_a(i, j)$, but it is much more efficient to compute (requiring only matrices of size $M \times |\pmb{g}|$ and $M \times M$).
- The bilinear function $s_c(i, j)$ is used to compute a rough sketch of *likely* antecedents by including it as an additional factor in the model:
\begin{align} s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j) \end{align}
    - This additional factor is leveraged as an additional beam pruning step.
- The final inference procedure involves a three-stage beam search:
    - **First stage**: Keep the top $M$ spans based on the mention score $s_m(i)$ of each span.
    - **Second stage**: Keep the top $K$ antecedents of each remaining span $i$ based on the first three factors, $s_m(i) + s_m(j) + s_c(i, j)$.
    - **Third stage**: The overall coreference score $s(i, j)$ is computed for the remaining span pairs, with the soft higher-order inference procedure described previously.
- While the maximum-likelihood objective is computed over only the span pairs from the final stage, this coarse-to-fine approach expands the set of coreference links that the model can learn.
    - It achieves better performance while using a much smaller $K$ than heuristic pruning.
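
A hedged sketch of the coarse pruning stage, assuming toy sizes and names: the bilinear score is computed for all pairs with a single matrix product, combined with the mention scores, and only the top $K$ antecedents per span survive for the expensive $s_a(i, j)$.

```python
import torch

M, K, span_dim = 6, 3, 64                 # M surviving mentions, K kept antecedents
g = torch.randn(M, span_dim)              # span representations of kept mentions
s_m = torch.randn(M)                      # unary mention scores
W_c = torch.randn(span_dim, span_dim)     # learned weight matrix of s_c

s_c = g @ W_c @ g.t()                              # (M, M) coarse pairwise scores
rough = s_m.unsqueeze(1) + s_m.unsqueeze(0) + s_c  # s_m(i) + s_m(j) + s_c(i, j)

# Antecedents must precede the span; mask self and later spans before the top-K.
invalid = torch.ones(M, M).triu().bool()
rough = rough.masked_fill(invalid, float("-inf"))
top_scores, top_antecedents = rough.topk(K, dim=-1)
print(top_antecedents)                    # indices handed to the fine scorer s_a
```
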
---

### BERT for Coreference Resolution: Baselines and Analysis

- Recent BERT-based models have reported dramatic gains on multiple semantic benchmarks.
    - One of BERT's major improvements over previous methods is passage-level training, which allows it to *better model longer sequences*.
- In the paper, BERT is fine-tuned for coreference resolution, achieving strong improvements on coreference benchmarks.
- There are two ways of extending the c2f-coref model of `Lee et al. (2018)` with BERT.
    - The *independent* variant uses non-overlapping segments, each of which acts as an independent instance for BERT.
    - The *overlap* variant splits the document into overlapping segments so as to provide the model with context beyond 512 tokens.
- Analysis shows that BERT-large is remarkably better than BERT-base at distinguishing related yet distinct entities or concepts.
    - However, both models struggle to resolve coreferences in cases that require world knowledge.
    - Likewise, modeling pronouns remains difficult, especially in conversations.
- BERT-large also benefits from using longer context windows (384 word pieces), while BERT-base performs better with shorter contexts (128 word pieces). Both models perform much worse with longer context windows (512 word pieces), in spite of being pretrained on 512-token contexts.
    - The *overlap* variant, which extends the context window beyond 512, provides no improvement.
    - This indicates that using larger context windows for pretraining might not translate into effective long-range features for a downstream task.
- Larger models also exacerbate the memory-intensive nature of span representations (`Lee et al., 2017`).
- In conclusion, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing.

#### Method

- The experiments use the higher-order coreference model of `Lee et al. (2018)`, referred to as *c2f-coref*.
    - A description of *c2f-coref* is found in the paper summary above.
- To apply BERT, the paper replaces the entire LSTM-based encoder (with ELMo and GloVe embeddings as input) in c2f-coref with the BERT transformer.
    - The first and last word pieces of a span (concatenated with the attended version of all the word pieces in the span) are treated as the span representation.
- The paper experiments with two variants of splitting documents into segments:
- **Independent**: The *independent* variant uses non-overlapping segments, each of which acts as an independent instance for BERT.
    - As BERT is trained on sequences of 512 word pieces, this variant has limited encoding capacity, especially for tokens near the ends of segments.
- **Overlap**: The *overlap* variant splits the document into overlapping segments by creating a $T$-sized segment after every $T/2$ tokens.
    - These segments are then passed to the BERT encoder independently, and the final token representation is derived by element-wise interpolation of the representations from both overlapping segments.
    - Let $\pmb{r_1} \in \mathbb{R}^d$ and $\pmb{r_2} \in \mathbb{R}^d$ be the representations of a token from the two overlapping BERT segments. The final representation $\pmb{r} \in \mathbb{R}^d$ is given by (see the sketch at the end of this section):
\begin{align} \pmb{f} &= \sigma(\pmb{w}^T[\pmb{r_1}; \pmb{r_2}]) \\ \pmb{r} &= \pmb{f} \cdot \pmb{r_1} + (\pmb{1} - \pmb{f}) \cdot \pmb{r_2} \end{align}
    - $\pmb{w} \in \mathbb{R}^{2d \times d}$ is a trained parameter and $[;]$ represents concatenation.
    - This variant allows the model to artificially increase the context window beyond the `max_segment_len` hyperparameter.
- All layers in both model variants are then fine-tuned following [`Devlin et al. (2019)`](https://www.aclweb.org/anthology/N19-1423/).
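
The overlap merging above amounts to a learned element-wise gate over the two segment representations. A minimal PyTorch sketch, assuming a toy hidden size and treating $\pmb{w}$ as a bias-free linear layer (the name `gate` is an assumption), is:

```python
import torch
import torch.nn as nn

d = 768                                   # BERT hidden size per word piece (toy choice)
gate = nn.Linear(2 * d, d, bias=False)    # plays the role of w ∈ R^{2d×d}

def merge(r1, r2):
    """r1, r2: (d,) representations of the same token from two overlapping segments."""
    f = torch.sigmoid(gate(torch.cat([r1, r2], dim=-1)))   # f = σ(wᵀ[r1; r2])
    return f * r1 + (1 - f) * r2                           # r = f ⊙ r1 + (1 − f) ⊙ r2

r1, r2 = torch.randn(d), torch.randn(d)
r = merge(r1, r2)
```
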
#### Analysis

##### Strengths

- There are no qualitative or quantitative differences between the ELMo and BERT-base models.
- However, BERT-large improves over BERT-base in several ways, including pronoun resolution and lexical matching.
    - It is better at distinguishing related, but distinct, entities.

##### Weaknesses

- Better modeling of document-level context, conversations, and entity paraphrasing might further improve the state of the art.
- The models perform distinctly worse on longer documents.
    - Larger models might better encode longer context, but they are memory-intensive due to span representations.
    - Future research in pretraining methods should look at more *effectively* encoding document-level context using sparse representations.
- Modeling pronouns (especially in conversations) continues to be difficult for all models.
    - This is perhaps partly because c2f-coref does very little to *model the dialog structure of the document*.
- A considerable number of errors also suggest that the models are still unable to resolve cases requiring *mention paraphrasing*.