# NLP (Coreference Resolution) Preparatory Reading Note
###### tags: `NLP` `Coref Resolution`
## List of Preparatory Reading Materials
Current preparatory reading material for beginners in Natural Language Processing (NLP) with a background in basic machine learning (ML). Note that the research papers listed are intended for people looking to work on coreference resolution, but they may also cover basic concepts relevant to all of NLP.
### Textbooks
- [**Speech and Language Processing**](https://web.stanford.edu/~jurafsky/slp3/), Dan Jurafsky and James H. Martin, *3rd ed.*
### Research Papers
- [**End-to-end Neural Coreference Resolution**](https://www.aclweb.org/anthology/D17-1018/), Lee et al., 2017.
- [**Higher-order Coreference Resolution with Coarse-to-fine Inference**](https://www.aclweb.org/anthology/N18-2108/), Lee et al., 2018.
- [**BERT for Coreference Resolution: Baselines and Analysis**](https://www.aclweb.org/anthology/D19-1588/), Joshi et al., 2019.
## Reading Note
### Vector Representations and Embedding
Detailed summary note: [**Vector Representations and Embedding**](https://hackmd.io/@bachnguyenTE/NLP_embeddings).
---
### Deep Learning Architectures for Sequence Processing
Detailed summary note: [**Deep Learning Architectures for Sequence Processing**](https://hackmd.io/@bachnguyenTE/ML_SequenceProc).
---
### Machine Translation and Encoder-Decoder Models
Detailed summary note: [**Machine Translation and Encoder-Decoder Models**](https://hackmd.io/pBHKp0LNTWa8cEx_pwpi_w)
---
### Constituency Grammars
Detailed summary note: [**Constituency Grammars**](https://hackmd.io/@bachnguyenTE/consti_grammars)
---
### Coreference Resolution
From [**Chapter 22: Coreference Resolution**](https://web.stanford.edu/~jurafsky/slp3/22.pdf) in **Speech and Language Processing**, Dan Jurafsky and James H. Martin, *3rd ed.*
#### Introduction to Coreference Resolution
> ++Albert Einstein++ was a ++German-born theoretical physicist++ who developed the theory of relativity, one of the two pillars of modern physics. ++His++ work is also known for its influence on the philosophy of science.
- Linguistic expressions such as *Albert Einstein* or *his* are called **mentions** or **referring expressions**, and the discourse entity that is referred to (Albert Einstein) the **referent**.
- Two or more referring expressions that are used to refer to the same discourse entity are said to **corefer** (such as *Albert Einstein* and *his*).
- A **discourse model** is a mental model that the system (or a human hearer) builds incrementally as it interprets a text;
- A model contains representations of the entities referred to in the text, as well as the properties of the entities and relations among them.
- When a referent is first mentioned in a discourse, a representation for it is **evoked** into the model. Upon subsequent mention, this representation is **accessed** from the model.
- Reference in a text to an entity that *has been previously introduced* into the discourse is called **anaphora**,
- The referring expression used is said to be an **anaphor**/anaphoric.
- The anaphor corefers with a prior mention, which is called the **antecedent**.
- Not every referring expression is an antecedent. An entity that has only a single mention in a text is called a **singleton**.
- **Coreference resolution** is the task of determining whether two mentions *corefer*, by which we mean they refer to the same entity in the discourse model (the same *discourse entity*).
- The set of coreferring expressions is often called a **coreference chain** or a **cluster**.
- **Entity linking**, or *entity resolution*, is the task of mapping a discourse entity to some real-world individual.
---
### End-to-end Neural Coreference Resolution
- The approach mentioned in the paper does not require existing resources such as syntactic parsers.
- *Syntactic parses* used as input to hand-engineered mention proposal algorithms are not needed.
- Their solution is an end-to-end neural network that learns both which spans are entity mentions and how best to cluster them.
- The model is a span-ranking system that considers the space of all spans (up to a maximum length) and determines whether any span is a likely antecedent of another.
- A *span-ranking* model determines whether any of the preceding spans is a good antecedent.
- The core of the model is therefore vector embeddings representing spans of text in a document.
- Scoring all span pairs in the model is impractical due to the sheer number of spans.
- The model is therefore factored over unary mention scores and pairwise antecedent scores.
- *Unary mention scores* are used to prune the space of spans and antecedents.
#### Related Work
- Previous high-performance models, built as pipelines of syntactic parsers and heavily engineered mention proposal algorithms, have two major drawbacks:
- *Cascading errors* caused by parsing mistakes, and
- *Hand-engineered rules do not generalize* to new languages or domains.
- The method described in the paper is non-pipelined; its span-ranking approach is most similar to *mention ranking*, but it reasons over a larger space by jointly detecting mentions and predicting coreference.
#### Task
- **Task**: Assigning each span $i$ in the document $D$ an antecedent $y_i$.
#### Model
- The model aims to learn a conditional probability distribution over all antecedents given the document $D$.
- For each span, a series of antecedents are considered and calculated with a pairwise score $s(i,j)$ for a reference link between span $i$ and span $j$ in $D$.
- There are three factors for the pairwise coreference score:
- Whether span $i$ is a mention ($s_m(i)$),
- Whether span $j$ is a mention ($s_m(j)$), and
- Whether $j$ is an antecedent of $i$ ($s_a(i, j)$).
- The above factoring enables aggressive pruning of spans that are unlikely to belong to a coreference cluster according to the mention score $s_m(i)$.
- **Scoring Architecture**: At the core of the scoring model are vector representations $g_i$ for each possible span $i$.
- The scoring functions are computed via standard feedforward neural networks:
\begin{align}
s_m(i) &= w_m \cdot \text{FFNN}_m(g_i) \\
s_a(i,j) &= w_a \cdot \text{FFNN}_a([g_i, g_j, g_i \odot g_j, \phi(i,j)])
\end{align}
- $\phi(i,j)$ is a feature vector encoding speaker and genre information from the metadata and the distance between two spans.
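To make the factored scoring concrete, here is a minimal numpy sketch, assuming toy dimensions, randomly initialized weights, and a one-hidden-layer `ffnn` standing in for the paper's $\text{FFNN}_m$ / $\text{FFNN}_a$; none of the names come from the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8   # span embedding size (toy)
F = 4   # size of the feature vector phi(i, j) (toy)

def ffnn(x, W1, W2):
    # One hidden layer with ReLU, standing in for the paper's FFNNs.
    return np.maximum(W1 @ x, 0.0) @ W2

# Toy, randomly initialized parameters (not trained).
W1_m, W2_m, w_m = rng.standard_normal((16, D)), rng.standard_normal((16, 16)), rng.standard_normal(16)
W1_a, W2_a, w_a = rng.standard_normal((16, 3 * D + F)), rng.standard_normal((16, 16)), rng.standard_normal(16)

def s_m(g):
    # Unary mention score s_m(i) = w_m . FFNN_m(g_i)
    return w_m @ ffnn(g, W1_m, W2_m)

def s_a(g_i, g_j, phi_ij):
    # Pairwise antecedent score on [g_i, g_j, g_i * g_j, phi(i, j)]
    return w_a @ ffnn(np.concatenate([g_i, g_j, g_i * g_j, phi_ij]), W1_a, W2_a)

def s(g_i, g_j, phi_ij):
    # Full coreference score factored into the three terms above.
    return s_m(g_i) + s_m(g_j) + s_a(g_i, g_j, phi_ij)

g_i, g_j, phi_ij = rng.standard_normal(D), rng.standard_normal(D), rng.standard_normal(F)
print(s(g_i, g_j, phi_ij))   # a single scalar score
```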
- **Span Representations**:
- Two types of information are crucial to accurately predicting coreference links:
- The context surrounding the mention span, and
- The internal structure within the span.
- This information is captured by the *boundary representations* together with the head-finding attention described below.
- The model uses a bidirectional LSTM to encode the lexical information of both the inside and outside of each span.
- First, use bidirectional LSTMs to *encode every word* in its context.
\begin{align}
f_{t, \delta} &= \sigma(W_f[x_t, h_{t+\delta,\delta}] + b_f) \\
o_{t, \delta} &= \sigma(W_o[x_t, h_{t+\delta,\delta}] + b_o) \\
\tilde{c}_{t, \delta} &= \tanh(W_c[x_t, h_{t+\delta,\delta}] + b_c) \\
c_{t, \delta} &= f_{t, \delta} \odot \tilde{c}_{t, \delta} + (1 - f_{t, \delta}) \odot c_{t+\delta, \delta} \\
h_{t, \delta} &= o_{t, \delta} \odot \tanh(c_{t, \delta}) \\
x^*_t &= [h_{t,1}, h_{t, -1}]
\end{align}
- Independent LSTMs are used for every sentence, since cross-sentence context was not helpful.
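Below is a small numpy sketch of this LSTM variant, whose input gate is tied to the forget gate as $(1 - f)$; the toy dimensions, random initialization, and weight sharing across directions are simplifications for illustration, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(1)
D_x, D_h = 6, 5                       # toy input / hidden sizes
W_f, W_o, W_c = (rng.standard_normal((D_h, D_x + D_h)) for _ in range(3))
b_f, b_o, b_c = np.zeros(D_h), np.zeros(D_h), np.zeros(D_h)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W_f @ z + b_f)            # forget gate f_{t, delta}
    o = sigmoid(W_o @ z + b_o)            # output gate o_{t, delta}
    c_tilde = np.tanh(W_c @ z + b_c)      # candidate cell state
    c = f * c_tilde + (1.0 - f) * c_prev  # input gate tied to (1 - f), as in the equations above
    h = o * np.tanh(c)
    return h, c

def encode_sentence(xs):
    # x*_t = [h_{t,1}, h_{t,-1}]; weights are shared across directions here
    # only to keep the sketch short (the model uses separate parameters).
    T = len(xs)
    h_fwd, h_bwd = [None] * T, [None] * T
    h, c = np.zeros(D_h), np.zeros(D_h)
    for t in range(T):
        h, c = lstm_step(xs[t], h, c)
        h_fwd[t] = h
    h, c = np.zeros(D_h), np.zeros(D_h)
    for t in reversed(range(T)):
        h, c = lstm_step(xs[t], h, c)
        h_bwd[t] = h
    return [np.concatenate([h_fwd[t], h_bwd[t]]) for t in range(T)]

x_star = encode_sentence([rng.standard_normal(D_x) for _ in range(4)])
```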
- Instead of relying on syntactic parses to find head words, the model includes an *attention mechanism over words in each span* to model head words:
\begin{align}
\alpha_t &= w_\alpha \cdot \text{FFNN}_\alpha(x^*_t) \\
a_{i, t} &= \frac{\exp(\alpha_t)}{\sum^{\text{END}(i)}_{k = \text{START}(i)}\exp(\alpha_k)} \\
\hat{x}_i &= \sum^{\text{END}(i)}_{t = \text{START}(i)} a_{i,t} \cdot x_t
\end{align}
- $\hat{x}_i$ is a weighted sum of word vectors in span $i$.
- The weights $a_{i,t}$ are automatically learned and correlate strongly with traditional definitions of head words.
- The attention mechanism is not comparing any two entities; rather, it takes a unary score for each word and measures how important that word is to the rest of the span (see the sketch after the span representation below).
- The above span information is concatenated to produce the final representation $g_i$ of span $i$:
\begin{align}
g_i = [x^*_{\text{START}(i)}, x^*_{\text{END}(i)}, \hat{x}_i, \phi(i)]
\end{align}
- $x^*_{\text{START}(i)}, x^*_{\text{END}(i)}$ are boundary representations.
- $\hat{x}_i$ is the soft head word vector, and
- $\phi(i)$ is a feature vector encoding the size of span $i$.
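A hedged numpy sketch of the head-finding attention and the resulting span representation $g_i$; `w_alpha`, `ffnn_alpha`, and `phi_i` are illustrative stand-ins, and the usage at the bottom assumes random inputs and an identity $\text{FFNN}_\alpha$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def span_representation(x_star, x, start, end, w_alpha, ffnn_alpha, phi_i):
    # alpha_t = w_alpha . FFNN_alpha(x*_t) for every word t inside the span.
    alphas = np.array([w_alpha @ ffnn_alpha(x_star[t]) for t in range(start, end + 1)])
    a = softmax(alphas)                                        # a_{i,t}, normalized over the span
    x_hat = sum(a[k] * x[start + k] for k in range(len(a)))    # soft head word vector
    # g_i = [x*_START(i), x*_END(i), x_hat_i, phi(i)]
    return np.concatenate([x_star[start], x_star[end], x_hat, phi_i])

# Toy usage: random "contextualized" and raw word vectors, identity FFNN_alpha,
# and a placeholder span-width feature phi_i.
rng = np.random.default_rng(2)
T, D = 7, 6
x_star = rng.standard_normal((T, 2 * D))
x = rng.standard_normal((T, D))
g = span_representation(x_star, x, 2, 4, rng.standard_normal(2 * D), lambda v: v, np.ones(3))
```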
#### Inference
- The size of the full model described above is $O(T^4)$ in the document length $T$.
- To maintain efficiency, candidate spans are pruned greedily during both training and evaluation.
- The model only considers spans with up to $L$ words and computes their unary mention scores $s_m(i)$.
- The model keeps only up to $\theta T$ spans with the highest mention scores and considers only up to $K$ antecedents for each.
- Spans are accepted in decreasing order of mention score, unless the span $i$ under consideration partially overlaps (crosses brackets with) an already accepted span $j$ (sketched below).
- For the remaining mentions, the joint distribution of antecedents for each document is computed in a forward pass over a single computation graph.
- The final prediction is the clustering produced by the most likely configuration.
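The greedy pruning over candidate spans might look roughly like the following sketch, assuming `spans` is a list of `(start, end, mention_score)` tuples and that the overlap check is the crossing-brackets condition; this is a simplified reading of the procedure, not the released code.

```python
def prune_spans(spans, keep):
    # spans: list of (start, end, mention_score); keep: e.g. int(theta * T).
    accepted = []
    for start, end, score in sorted(spans, key=lambda s: -s[2]):
        if len(accepted) >= keep:
            break
        # Reject span i if it crosses brackets with an already accepted span j.
        crosses = any((s < start <= e < end) or (start < s <= end < e)
                      for s, e, _ in accepted)
        if not crosses:
            accepted.append((start, end, score))
    return accepted

# Example: keep at most 2 of 4 candidate spans.
print(prune_spans([(0, 2, 1.5), (1, 3, 1.2), (4, 4, 0.9), (0, 1, 0.1)], keep=2))
```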
#### Learning
- The learning process optimizes the likelihood of all correct antecedents implied by the gold clustering (a toy sketch appears at the end of this subsection):
\begin{align}
\log \prod^N_{i=1} \sum_{\hat{y} \in \mathcal{Y}(i) \cap \text{GOLD}(i)} P(\hat{y})
\end{align}
- Basically, considering all spans in a document,
- For each span, you sum up the probabilities of its potentially correct antecedents.
- If the sum is 1, the model is not incurring any loss.
- Related to Durrett and Klein (2013).
- $\text{GOLD}(i)$ is the set of spans in the gold clustering containing span $i$.
- If span $i$ does not belong to a gold cluster or all gold antecedents have been pruned, $\text{GOLD}(i) = \{\epsilon\}$
- By optimizing the above objective, the model naturally learns to prune spans accurately.
- Only gold mentions receive positive updates after the initial random pruning.
- The model quickly leverages the learning signal for appropriate credit assignment to the different factors, such as the mention score $s_m$ used for pruning.
- Note that if a span is either not a mention or has no good antecedent, it is assigned a dummy antecedent $\epsilon$, signalling that it is not in any coreference relationship.
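As a sketch of the objective for a single span, assuming `antecedent_scores` holds $s(i, j)$ for the surviving candidate antecedents, the dummy antecedent $\epsilon$ is fixed at score 0, and `gold_mask` marks which candidates lie in span $i$'s gold cluster:

```python
import numpy as np

def span_log_likelihood(antecedent_scores, gold_mask):
    # antecedent_scores: s(i, j) for each surviving candidate antecedent j.
    # gold_mask: boolean array, True where j is in span i's gold cluster.
    scores = np.concatenate([[0.0], antecedent_scores])          # index 0 = dummy antecedent epsilon
    gold = np.concatenate([[not gold_mask.any()], gold_mask])    # epsilon is "gold" iff no gold antecedent survives
    # log P(y) = log softmax over all candidates (including epsilon).
    log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
    log_probs = scores - log_z
    # log of the summed probability of the gold candidates (log-sum-exp for stability).
    gold_log_probs = log_probs[gold.astype(bool)]
    m = gold_log_probs.max()
    return m + np.log(np.exp(gold_log_probs - m).sum())

# Toy: one span with two candidate antecedents, the first of which is gold.
# The training objective sums this quantity over every span in the document.
print(span_log_likelihood(np.array([2.0, -1.0]), np.array([True, False])))
```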
---
### Higher-order Coreference Resolution with Coarse-to-fine Inference
- Recent coreference resolution systems have heavily relied on **first-order models**, where only pairs of entity mentions are scored by the model.
- That includes the recent *e2e-coref* solution (`Lee et al., 2017`).
- These models are computationally efficient and scalable to long documents.
- However, because they make independent decisions about coreference links, they are susceptible to predicting clusters that are *locally consistent but globally inconsistent*.
- The new higher-order inference uses the span-ranking architecture from `Lee et al. (2017)` in an iterative manner.
- At each iteration, the **antecedent distribution is used as an attention mechanism** to optionally update span representations,
- Enabling later coreference decisions to softly condition on earlier coreference decisions.
- The new model also includes a **coarse-to-fine approach** that is learned with the objective, enabling an extra pruning step during inference.
- This reduces the number of antecedents considered by the more accurate but inefficient fine factor.
#### Background
- The coreference resolution task formulation is identical to the mechanisms described in `Lee et al. (2017)` above.
- That is, spans are scored and ranked based on a unary mention score and pairwise antecedent scores with earlier spans.
- The baseline model is factored to enable a two-stage beam search.
- A beam of up to $M$ potential mentions is computed based on the spans with the highest mention scores $s_m(i)$.
- Pairwise coreference scores are only computed between surviving mentions during both training and inference.
- The baseline model is a first-order model, since it only considers pairs of spans.
#### Higher-order Coreference Resolution
- First-order models are susceptible to consistency errors.
- Modeling higher-order decisions at the document level requires explicit inference.
- This is due to the potentially very large surface distance between mentions.
- The new higher-order inference procedure addresses these problems while keeping the model differentiable.
- This inference involves $N$ iterations of *refining span representations*.
- $\pmb{g}^n_i$ denotes the representation of span $i$ at iteration $n$.
- At iteration $n$, $\pmb{g}^n_i$ is computed with an attention mechanism that averages over previous representations of $\pmb{g}^{n-1}_j$ weighted according to how likely each mention $j$ is to be an antecedent to $i$.
- The baseline model is used to initialize the span representation at $\pmb{g}^1_i$.
- The refined span representations allow the model to also iteratively refine the antecedent distribution $P_n(y_i)$, where $s$ is the coreference scoring function:
\begin{align}
P_n(y_i) = \frac{e^{s(\pmb{g}^n_i, \pmb{g}^n_{y_i})}}{\sum_{y \in \mathcal{Y}(i)} e^{s(\pmb{g}^n_i, \pmb{g}^n_y)}}
\end{align}
- The scoring function uses the same parameters at every iteration, but it is given different span representations.
- At each iteration, we first compute the expected antecedent representation $\pmb{a}^n_i$ of each span $i$ by using the current antecedent distribution $P_n(y_i)$ as an **attention mechanism**:
\begin{align}
\pmb{a}^n_i = \sum_{y_i \in \mathcal{Y}(i)} P_n(y_i) \cdot \pmb{g}^n_{y_i}
\end{align}
- In other words, the model integrates new information from a span's expected antecedents to refine the span representation itself.
- The current span representation $\pmb{g}^n_i$ is then updated via interpolation with its expected antecedent representation $\pmb{a}^n_i$:
\begin{align}
\pmb{f}^n_i &= \sigma(\textbf{W}_f [\pmb{g}^n_i, \pmb{a}^n_i]) \\
\pmb{g}^{n+1}_i &= \pmb{f}^n_i \odot \pmb{g}^n_i + (1 - \pmb{f}^n_i) \odot \pmb{a}^n_i
\end{align}
- The learned gate vector $\pmb{f}^n_i$ determines, for each dimension, whether to keep the current span information or to integrate new information from its expected antecedent.
- At iteration $n$, $\pmb{g}^n_i$ is an element-wise weighted average of approximately $n$ span representations (assuming $P_n(y_i)$ is peaked), allowing $P_n(y_i)$ to softly condition on up to $n$ other spans in the predicted cluster.
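One refinement iteration could be sketched as follows, assuming `G` stacks the span representations $\pmb{g}^n_i$ row-wise in document order, `pairwise_scores` holds the coreference scores, and `W_f` is the gate projection; the masking scheme and the omission of the dummy antecedent are simplifications, not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_rows(S):
    E = np.exp(S - S.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def refine(G, pairwise_scores, W_f):
    # G: (num_spans, d) span representations g_i^n (spans in document order).
    # pairwise_scores: (num_spans, num_spans) coreference scores s(g_i^n, g_j^n).
    # Only earlier spans may be antecedents, so mask out j >= i before the softmax
    # (the dummy antecedent is omitted here to keep the sketch short).
    n = G.shape[0]
    masked = np.where(np.tril(np.ones((n, n)), k=-1) == 1, pairwise_scores, -1e9)
    P = softmax_rows(masked)                               # P_n(y_i)
    A = P @ G                                              # expected antecedent a_i^n
    F = sigmoid(np.concatenate([G, A], axis=1) @ W_f)      # gate f_i^n
    return F * G + (1.0 - F) * A                           # g_i^{n+1}

# Toy usage: three spans, d = 4, two refinement iterations.
rng = np.random.default_rng(3)
G = rng.standard_normal((3, 4))
W_f = rng.standard_normal((8, 4))
for _ in range(2):
    G = refine(G, G @ G.T, W_f)   # a dot-product score stands in for s(i, j)
```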
- Span ranking can be viewed as predicting latent antecedent trees, where the predicted antecedent is the parent of a span and each tree is a predicted cluster.
- Another way to interpret this model, with iterative span representations and antecedent distributions refinement, is that the joint distribution $\prod_i P_N(y_i)$ implicitly models every directed path of up to length $N + 1$ in the latent antecedent tree.
#### Coarse-to-fine Inference
- The model above scales poorly to long documents, bottlenecked by the pairwise antecedent score computation, whose tensor grows quadratically in the number of spans.
- Pairwise antecedent matrix size $M \times M \times (3|\pmb{g}| + |\phi|)$.
- The computational challenge is made worse with the iterative inference process.
##### Heuristic antecedent pruning
- The original paper considers only the nearest $K$ antecedents of each span, reducing the matrix size to $M \times K \times (3|\pmb{g}| + |\phi|)$.
- The main drawback is that this imposes an a priori limit on the maximum distance of a coreference link.
##### Coarse-to-fine antecedent pruning
- The key component of the coarse-to-fine approach that does not establish a maximum coreference distance is an **alternate bilinear scoring function**:
\begin{align}
s_c(i, j) = \pmb{g}^\top_i \textbf{W}_c \pmb{g}_j
\end{align}
- $\textbf{W}_c$ is a learned weight matrix.
- The bilinear $s_c(i, j)$ is far less accurate than the concatenation-based $s_a(i, j)$, but is much more efficient to compute (matrix size $M \times |\pmb{g}|$ and $M \times M$).
- The above bilinear function $s_c(i, j)$ is used to compute a rough sketch of *likely antecedents*, by including it as an additional factor in the model:
\begin{align}
s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j)
\end{align}
- This additional factor is leveraged for an extra beam pruning step.
- The final inference procedure involves a three-stage beam search:
- **First stage**: Keep the top $M$ spans based on mention score $s_m(i)$ of each span.
- **Second stage**: Keep the top $K$ antecedents of each remaining span $i$ based on the first three factors, $s_m(i) + s_m(j) + s_c(i, j)$.
- **Third stage**: The overall coreference $s(i, j)$ is computed based on the remaining span pairs, with the soft higher-order inference procedure described previously.
- While the maximum-likelihood objective is computed over only the span pairs from the final stage, this coarse-to-fine approach expands the set of coreference links that the model can learn.
- It achieves better performance while using a much smaller $K$ than heuristic pruning.
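A rough numpy sketch of the three-stage beam search, with `s_m` as precomputed mention scores, `W_c` as the learned bilinear weight matrix, and `s_a_fn` as a stand-in for the expensive concatenation-based factor; the constraint that antecedents must precede their span is omitted for brevity.

```python
import numpy as np

def coarse_to_fine(G, s_m, W_c, s_a_fn, M, K):
    # Stage 1: keep the top-M spans by unary mention score s_m(i).
    top = np.argsort(-s_m)[:M]
    G_top, s_m_top = G[top], s_m[top]

    # Stage 2: cheap bilinear factor s_c(i, j) = g_i^T W_c g_j for all kept pairs,
    # then keep the top-K antecedents per span by s_m(i) + s_m(j) + s_c(i, j).
    s_c = G_top @ W_c @ G_top.T
    rough = s_m_top[:, None] + s_m_top[None, :] + s_c
    antecedent_beam = np.argsort(-rough, axis=1)[:, :K]

    # Stage 3: the expensive s_a(i, j) only on the surviving pairs.
    final = {}
    for i in range(len(top)):
        for j in antecedent_beam[i]:
            final[(top[i], top[j])] = rough[i, j] + s_a_fn(G_top[i], G_top[j])
    return final

# Toy usage with random spans and a dot product standing in for s_a.
rng = np.random.default_rng(4)
G = rng.standard_normal((10, 6))
scores = coarse_to_fine(G, rng.standard_normal(10), rng.standard_normal((6, 6)),
                        lambda gi, gj: float(gi @ gj), M=5, K=2)
```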
---
### BERT for Coreference Resolution: Baselines and Analysis
- Recent BERT-based models have reported dramatic gains on multiple semantic benchmarks.
- One of BERT's major improvements over previous methods is passage-level training, which allows it to *better model longer sequences*.
- In the paper, BERT is fine-tuned to coreference resolution, achieving strong improvements on coref benchmarks.
- There are two ways of extending the c2f-coref model in `Lee et al. (2018)` with BERT.
- The *independent* variant uses non-overlapping segments each of which acts as an independent instance for BERT.
- The *overlap* variant splits the document into overlapping segments so as to provide the model with context beyond 512 tokens.
- Analysis shows that BERT-large is remarkably better at distinguishing related yet distinct entities or concepts (unlike BERT-base).
- However, both models struggle to resolve coreferences for cases that require world knowledge.
- Likewise, modeling pronouns remains difficult, especially in conversations.
- BERT-large also benefits from using longer context windows (384 word pieces), while BERT-base performs better with shorter contexts (128 word pieces). Both variants perform much worse with longer context windows (512 tokens) in spite of being trained on 512-size contexts.
- The *overlap* variant with extended context window (beyond 512) provides no improvement.
- This indicates that using larger context windows for pretraining might not translate into effective long-range features for a downstream task.
- Larger models also exacerbate the memory-intensive nature of span representations (`Lee et al., 2017`).
- In conclusion, there is still room for improvement in modeling document-level context, conversations, and mention paraphrasing.
#### Method
- For the experiments, the paper uses the higher-order coreference model from `Lee et al. (2018)`, referred to as *c2f-coref*.
- A description of *c2f-coref* is found in the paper summary above.
- To apply BERT, the paper replaces the entire LSTM-based encoder (with ELMo and GloVe embeddings as input) in c2f-coref with the BERT transformer.
- The first and last word-pieces (concatenated with the attended version of all the word pieces in the span) are treated as span representations.
- The paper experiments with two variants of splitting:
- **Independent**: The *independent* variant uses non-overlapping segments each of which acts as an independent instance for BERT.
- As BERT is trained on sequences of 512 word pieces, this variant has limited encoding capacity especially for end of segment tokens.
- **Overlap**: The *overlap* variant splits the document into overlapping segments by creating a $T$-sized segment after every $T/2$ tokens.
- These segments are then passed on to the BERT encoder independently, and
- The final token representation is derived by element-wise interpolation of representations from both overlapping segments.
- Let $\pmb{r_1} \in \mathbb{R}^d$ and $\pmb{r_2} \in \mathbb{R}^d$ be the token representations from the overlapping BERT segments.
- The final representations $\pmb{r} \in \mathbb{R}^d$ is given by:
\begin{align}
\pmb{f} &= \sigma(\pmb{w}^T[\pmb{r_1}; \pmb{r_2}]) \\
\pmb{r} &= \pmb{f} \cdot \pmb{r_1} + (\pmb{1} - \pmb{f}) \cdot \pmb{r_2}
\end{align}
- $\pmb{w} \in \mathbb{R}^{2d \times d}$ is a trained parameter and $[;]$ represents concatenation.
- This variant allows the model to artificially increase the context window beyond the `max_segment_len` hyperparameter.
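A sketch of how the *overlap* variant's splitting and interpolation could be implemented, assuming `encode(segment)` returns one $d$-dimensional vector per token (a BERT encoder in the paper, a random stub here) and `w` is the trained gate projection above; segmentation details are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def overlap_encode(tokens, encode, w, T):
    # Create a T-sized segment after every T/2 tokens, so each token is covered
    # by at most two segments.
    step = T // 2
    starts = range(0, max(len(tokens) - step, 1), step)
    segment_reps = [(s, encode(tokens[s:s + T])) for s in starts]   # (offset, (len, d) array)

    out = [None] * len(tokens)
    for offset, reps in segment_reps:
        for t in range(reps.shape[0]):
            pos = offset + t
            if out[pos] is None:
                out[pos] = reps[t]
            else:
                # Token seen by two overlapping segments: gated element-wise interpolation.
                r1, r2 = out[pos], reps[t]
                f = sigmoid(np.concatenate([r1, r2]) @ w)   # w: (2d, d)
                out[pos] = f * r1 + (1.0 - f) * r2
    return out

# Toy usage: a "BERT encoder" stub returning random per-token vectors.
rng = np.random.default_rng(5)
d, doc = 4, list(range(10))
reps = overlap_encode(doc, lambda seg: rng.standard_normal((len(seg), d)),
                      rng.standard_normal((2 * d, d)), T=6)
```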
- All layers in both model variants are then finetuned following [`Devlin et al. (2019)`](https://www.aclweb.org/anthology/N19-1423/).
#### Analysis
##### Strengths
- There are no qualitative or quantitative differences between the ELMo and BERT-base models.
- However, BERT-large improves over BERT-base in several ways including pronoun resolution and lexical matching.
- It is better at distinguishing related, but distinct, entities.
##### Weaknesses
- Better modeling of document-level context, conversations, and entity paraphrasing might further improve the state of the art.
- The models perform distinctly worse on longer documents.
- Larger models might better encode longer context, but are memory-intensive due to span representations.
- Future research in pretraining methods should look at more *effectively* encoding document-level context using sparse representations.
- Modeling pronouns (especially in conversations) continues to be difficult for all models.
- This is perhaps partly because c2f-coref does very little to *model the dialog structure of the document*.
- A considerable number of errors also suggest that models are still unable to resolve cases requiring *mention paraphrasing*.