# Sentence Encoder with Parsers
## Intro
### Background
* Problems with existing encoders
1. sequential encoders
* lack of ability to generalize to out-of-distribution data
* it's nearly impossible to cover every case in natural language
2. tree-structured encoders (Tree-LSTM)
* only use the top-1 inferred derivation as the "gold" derivation; however, derivations given by a parser are likely to contain many errors (e.g. wrong attachment, mislabeling)
* don't seem to give much performance gain over sequential encoders
### Core idea
* For problem 1, build the sentence embedding by following the syntactic derivation, similar to a Tree-LSTM
* For problem 2, aggregate parses within a neighborhood ~~two ways to look at the information provided by the parsers~~
1. ~~A distribution of all possible derivation~~
2. ~~A cluster within the neighbor of the gold parse (top-k)~~
3. Top-1 derivations from different parsers
### Observations
#### Within parsers
##### RNNG -- incrementality?
1. Prefers local attachment
* strong at discovering NP-modifying PPs
* weak when an NP-modifying PP is followed by a VP-modifying PP
* a strong tendency toward recursive PPs
##### ELMo-enhanced span parser
1. Prefers global attachment
* weak at discovering NP-modifying PPs
* strong when an NP-modifying PP is followed by a VP-modifying PP
2. Weak on PP recursion
##### PCFG?
##### Benepar (Berkeley's self-attentive parser)
1. significantly stronger than the ELMo span parser & RNNG
* makes fewer mistakes
* partially recognizes recursive constructions
###### reasons?
#### Cross parsers
### Two viewpoints, two approaches
1. ~~A distribution of all possible derivation~~
* ~~Take the derivation as a latent variable and marginalize it (by sampling)~~
2. A cluster within the neighbor of the gold parse (top-k)
* Take each derivation in the top-k set as a noisy observation
* aggregate them for a more robust "representation" of the true derivation (see the sketch after this list)
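One way to write this down, as a sketch under the assumption that $\mathrm{enc}(x, z)$ is some tree encoder (e.g. a Tree-LSTM run over derivation $z$) and the $w_i$ are aggregation weights (e.g. renormalized parser scores, or uniform):
$$\tilde{h}(x) = \sum_{i=1}^{k} w_i \,\mathrm{enc}(x, z_i), \qquad \sum_{i=1}^{k} w_i = 1, \quad z_1,\dots,z_k \in \text{top-}k(x)$$
This is the "aggregation" that the Details section below makes concrete.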
#### Definition of a neighborhood
1. parses that are close in tree-edit distance? (see the sketch after this list)
2. parses with similar probability?
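A minimal sketch of option 1, assuming the `zss` package (Zhang-Shasha tree edit distance) is available and that parses arrive as simple `(label, children)` tuples; `to_zss`, `neighborhood`, and the `radius` value are illustrative, not fixed design choices:

```python
# Sketch: define the neighborhood of the top-1 parse by tree-edit distance.
# Assumes `pip install zss`; helper names and the radius are illustrative.
from zss import Node, simple_distance

def to_zss(tree):
    """Convert a (label, children) tuple tree into a zss Node."""
    label, children = tree
    node = Node(label)
    for child in children:
        node.addkid(to_zss(child))
    return node

def neighborhood(top_k_parses, radius=3):
    """Keep parses within `radius` edit operations of the top-1 parse."""
    anchor = to_zss(top_k_parses[0])
    return [p for p in top_k_parses
            if simple_distance(anchor, to_zss(p)) <= radius]

# Toy usage: NP- vs. VP-attachment of a PP differs by one tree edit,
# so both bracketings fall in the same neighborhood.
np_attach = ("S", [("NP", []), ("VP", [("V", []), ("NP", [("NP", []), ("PP", [])])])])
vp_attach = ("S", [("NP", []), ("VP", [("V", []), ("NP", []), ("PP", [])])])
print(len(neighborhood([np_attach, vp_attach], radius=4)))  # -> 2
```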
<!-- Case study:
Correct Derivation

Incorrect
 -->
<!-- predicate-argument structure
(NP_1(VP NP_2)) -> (agent VP NP_1) (patient VP NP_2)
(hedgehog (VP NP_2)) -> (agent VP hedgehog)
(NP_1 (VP hedgehog)) !-> (patient VP hedgehog)
(NP_1 (VP hedgehog)) -> (patient VP hedgehog) () -->
### Imaginable merits
* Compared to sequential models
* Structural information can be beneficial when generalizing to out-of-distribution data
* Compared to Tree-LSTM
* Reducing errors from the parsing stage by aggregating multiple erroneous variants of the same derivation.
* (?) May be able to handle ambiguous sentences
## Details
### How to aggregate derivations
* sequential baseline
* Attention LSTM
* Tree-Structure models
* Tree-LSTM (put the embedding of each phrase alongside the embeddings of the surface words, and attend over them)
* Tree-GRU
* GCN
* Top-k aggregation (see the sketch after this list)
* summing the context vectors of the top-k derivations, with weights given by
* normalized probabilities from the parser
* a uniform distribution
* one more layer of attention (?)
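A minimal sketch of the top-k aggregation options above, assuming some tree encoder has already produced one context vector per derivation; `aggregate_topk` and `attend_topk` are illustrative names, not an existing API:

```python
# Sketch: combine the context vectors of the top-k derivations.
# All function and argument names are illustrative.
import torch
import torch.nn.functional as F

def aggregate_topk(context_vecs, parser_scores=None):
    """context_vecs: (k, d) tensor, one vector per top-k derivation.
    parser_scores: (k,) raw parser (log-)scores; if None, weight uniformly."""
    k = context_vecs.size(0)
    if parser_scores is None:
        weights = torch.full((k,), 1.0 / k)          # uniform weighting
    else:
        weights = F.softmax(parser_scores, dim=0)    # renormalized over the top-k set
    return weights @ context_vecs                    # (d,)

def attend_topk(context_vecs, query, proj):
    """The 'one more layer of attention' option: score each derivation
    against a learned query instead of trusting the parser scores."""
    scores = context_vecs @ (proj @ query)           # (k,)
    return F.softmax(scores, dim=0) @ context_vecs   # (d,)
```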
### How to sample over CFG space
* NCE
* Importance sampling
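A hedged sketch of the importance-sampling route, assuming the proposal $q(z|x)$ is something cheap to sample from (e.g. the parser's own derivation distribution) and $P(z|x)$ is the target distribution over the CFG space:
$$E_{z \sim P(z|x)}\big[P(l|x,z)\big] \approx \frac{1}{N}\sum_{n=1}^{N} \frac{P(z_n|x)}{q(z_n|x)}\,P(l|x,z_n), \qquad z_n \sim q(z|x)$$
If $P(z|x)$ is only known up to a normalizing constant (e.g. unnormalized inside scores), the self-normalized variant applies: renormalize the weights $P(z_n|x)/q(z_n|x)$ so they sum to 1.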
<!-- #### Question for this section
* Whether a Tree-LSTM given the gold parse can beat sequential models in downstream tasks
* Not necessarily, as semantics-oriented tasks are not always syntax-sensitive (e.g. bag-of-words for text classification)
-->
<!-- ## Modeling
### Text classification
want to find $\underset{\theta}{\mathrm{argmax}}\,P(l_i|x_i;\theta)$ for each sample
### How to model text classification with possible parses behind the surface -->
<!-- * Suppose $x_i$ is drawn from some distribution i.i.d -->
<!-- * Assumption: -->
<!-- * labels are assigned by the “meaning” rather than words of the surface. -->
<!-- * meanings can be uniquely determined by the deep structure and surface word (for structural ambiguity) -->
<!-- * (P.S.) Most lexical ambiguity would be resolved by 1) the context and 2) the grammatical role (POS tag, role on predicate-argument structure) it undertakes -->
<!-- * Under the two assumption, rewrite the objective as -->
<!-- * taking derivations as a latent variable
$$P(l|x;\theta)\approx E_{z \sim T(x)}[P(l|x,z)]$$
* taking derivations as a noisy value
$$P(l|x;\theta)\approx P(l|x,z_{true};\theta)$$ -->
<!-- * On the approximation of
* $S$
* RNN composition function
* Tree-LSTM
* $P(z|x)$
* ~~probability output by the parser~~
* sampling over the CFG space
* taking expectation over top-k space -->
<!-- ## Problems
### Gap between human grammar and Context-Free grammar
#### General explanation
* As assumption 2, the meanings of a sentence can be composed by a derivation licensable by human grammar
* nothing can be said about the derivation outside human grammar
* Derivation space licensable by CFG is much larger than that by human grammar
* space of CFG derivation is exponential to the sentence length
* derivations licensable by human grammar is rather limited
* $\implies$
* meaning $\subseteq$ sentence code
#### Solutions?
##### top-k truncation (space shrinking)
* conjectures :
* derivations in top-k space are similar to the gold derivation -->
<!-- * all human licensable derivations will appear at the top-k derivations, though not necessary top-1 -->
<!-- * What to do :
* Consider the derivations in top-k space as the observable with noise rather than the inferred $\rightarrow$ eliminate the need for sampling -->
<!-- * derivations at top-k space as positive and the remaining as negative $\rightarrow$ allow negative sampling? as a normalization technique -->
<!-- * How to do:
* do aggregation
* combine embeddings of different derivations with probability given by the parser
* probability $:=$ normalized score of the derivation?
* change objective from $E_{z \sim T(x)}[P(l|x,z)]$ to $P(l|x, z_1,...,z_k)$ (a bit strange here)
-->
<!-- CYK -> sharpened probability -->
<!-- * training with negative samples
* Changing the objective to ${\mathrm{argmax}}\lambda P_\theta(l|C(\hat{z},x))+(1-\lambda)\sum_{z' \in T(x), |{z'}|=N} P(l|z',x)\frac{1}{N}$ where $\hat{z}=(z_1,...,z_k)$ is from the top-k derivation space of $T(x)$ -->
<!-- ##### sharpening
moving distribution mass to the more salient derivations while still computing over the CFG derivation space
* how to define saliency?
* top-k?
* above certain probability threshold?
Equivalent to sampling? -->
-------------------------------------
Does structural information help with compositional generalization?
Experiment to conduct: TreeLSTM + structural attention on COGS
Baseline: seq2seq with LSTM & Transformer
Mod: encoder only
* Sequential LSTM -> TreeLSTM
* add structural attention (see the sketch below)
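A minimal sketch of the "structural attention" modification, under the assumption that the Tree-LSTM encoder exposes one hidden state per phrase node in addition to the per-token states; all names are illustrative:

```python
# Sketch: let the decoder attend over phrase-node states as well as token states.
# `decoder_state`, `token_states`, and `phrase_states` are illustrative names.
import torch
import torch.nn.functional as F

def structural_attention(decoder_state, token_states, phrase_states):
    """decoder_state: (d,); token_states: (T, d); phrase_states: (P, d).
    Returns a context vector that can address whole constituents directly."""
    memory = torch.cat([token_states, phrase_states], dim=0)  # (T + P, d)
    scores = memory @ decoder_state                           # (T + P,)
    weights = F.softmax(scores, dim=0)
    return weights @ memory                                   # (d,)
```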
----------------------------
Truncation vs. sharpening
Truncation:
* cut down the derivation space to top-k
* take the others as negative samples
Sharpening:
The objective of sharpening is to move probability mass toward the more salient derivations
* saliency?
* top-k?
* above a certain probability threshold?
Not very persuasive... since almost all derivations carry near-zero probability anyway, how about sampling instead?
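One possible formalization of "sharpening" (my assumption, not something fixed above): temperature-scale the parser's derivation distribution,
$$q_\tau(z|x) \propto P(z|x)^{1/\tau}, \qquad 0 < \tau \le 1,$$
so $\tau = 1$ keeps the parser distribution unchanged and $\tau \to 0$ collapses onto the top-1 derivation; sampling from $q_\tau$ then sits between full marginalization and top-k truncation.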
Experiment
* On tasks requiring structural generalization
* semantic parsing: COGS
* expect structural models to beat sequential models
* expect top-k to show an advantage over top-1
* acceptability test: CoLA
* other datasets
------------------------------------
### Is Tree-LSTM really suitable for processing the "meaning" of a sentence?
* PTB $\in$ the so-called phrasal approach to predicate-argument structure (formal semantics)
* each applied production rule establishes some predicate-argument relation
* What does an embedding for a phrase mean?
* the set of predicate-argument relations establishable within the phrase?
* or others?
* $\implies$ does it mean we need separate kernels for each production rule, so that we can capture the "semantics" of that rule? (see the sketch after this list)
* Then, under this argument, why do we need a recurrent-style network (Tree-LSTM/RNN) to build phrase embeddings? Under a CFG, long-term dependencies no longer exist, since they are captured by the production rules, while the whole point of an LSTM is to capture long-term dependencies
* does that mean a GCN is better than a Tree-LSTM for building phrase embeddings?
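A minimal sketch of the "separate kernels per production rule" idea, assuming a binarized grammar; the rule inventory, class name, and initialization are illustrative:

```python
# Sketch: one composition matrix ("kernel") per production rule,
# instead of a single shared recurrent cell. Names are illustrative.
import torch
import torch.nn as nn

class RuleSpecificComposer(nn.Module):
    def __init__(self, rules, dim):
        super().__init__()
        # one kernel per binarized rule, e.g. "NP->DT_NN"
        self.W = nn.ParameterDict({
            r: nn.Parameter(0.01 * torch.randn(dim, 2 * dim)) for r in rules
        })

    def compose(self, rule, left, right):
        """Build the parent phrase embedding from its two children."""
        children = torch.cat([left, right], dim=-1)   # (2d,)
        return torch.tanh(self.W[rule] @ children)    # (d,)

# Toy usage with an illustrative rule inventory
composer = RuleSpecificComposer(["S->NP_VP", "VP->V_NP", "NP->NP_PP"], dim=8)
np_vec, vp_vec = torch.randn(8), torch.randn(8)
sentence_vec = composer.compose("S->NP_VP", np_vec, vp_vec)
```

Under this view the per-node transformation is purely rule-indexed, which is closer in spirit to a GCN/recursive network over the derivation than to a gated recurrent cell.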