# Sentence Encoder with Parsers

## Intro

### Some background

* Problems with
  1. sequential encoders
     * lack of ability to generalize to out-of-distribution data
     * it's nearly impossible to cover every case in natural language
  2. tree-structure encoders (Tree-LSTM)
     * only use the top-1 inferred derivation as the "gold" derivation; however, derivations given by the parser are likely to contain many errors (e.g. wrong attachment, mislabeling)
     * don't seem to gain much in performance compared to sequential encoders

### Core idea

* For problem 1, build the sentence embedding following the syntactic derivation, similar to Tree-LSTM
* For problem 2, aggregate parses within a neighborhood

~~two ways to look at the information provided by the parsers~~
1. ~~A distribution over all possible derivations~~
2. ~~A cluster within the neighborhood of the gold parse (top-k)~~
3. Top-1 derivations from different parsers

### Observations

#### Within parsers

##### RNNG -- incrementality?

1. Preferred local attachment
   * strong at discovering NP-modifying PPs
   * weak at an NP-modifying PP followed by a VP-modifying PP
   * a strong tendency toward recursive PPs

##### ELMo-enhanced span parser

1. Preferred global attachment
   * weak at discovering NP-modifying PPs
   * strong at an NP-modifying PP followed by a VP-modifying PP
2. Weak against PP recursion

##### PCFG?

##### Benepar - Berkeley's self-attentive parser

1. significantly stronger than the ELMo-span parser & RNNG
   * makes fewer mistakes
   * partially recognizes the recursive construction

###### reasons?

#### Across parsers

### Two viewpoints, two approaches

1. ~~A distribution over all possible derivations~~
   * ~~Take the derivation as a latent variable and marginalize it out (by sampling)~~
2. A cluster within the neighborhood of the gold parse (top-k)
   * Take each derivation in the top-k set as a noisy value
   * aggregate them for a more robust "representation" of the true derivation

#### Definition of a neighborhood

1. parses that are similar in tree-edit distance?
2. parses with similar probability?

<!-- Case study:
Correct derivation
![](https://i.imgur.com/Uf8aDpM.png)
Incorrect
![](https://i.imgur.com/hKrG8OV.png) -->

<!-- predicate-argument structure
(NP_1 (VP NP_2)) -> (agent VP NP_1) (patient VP NP_2)
(hedgehog (VP NP_2)) -> (agent VP hedgehog)
(NP_1 (VP hedgehog)) !-> (patient VP hedgehog)
(NP_1 (VP hedgehog)) -> (patient VP hedgehog)
() -->

### Imaginable merits

* Compared to sequential models
  * Structural information can be beneficial when generalizing to out-of-distribution data
* Compared to Tree-LSTM
  * Reducing errors from the parsing stage by taking multiple erroneous variants of the same derivation
  * (?) May be able to handle ambiguous sentences

## Details

### How to aggregate derivations

* sequential baseline
  * Attention LSTM
* Tree-structure models
  * Tree-LSTM (putting the embeddings of each phrase alongside the embeddings of the surface words, and doing attention over them)
  * Tree-GRU
  * GCN
* Top-k aggregation (a rough sketch follows at the end of this section)
  * summing the context vectors of the top-k derivations by weight
    * normalized probability from the parsers
    * uniform
    * one more layer of attention (?)

### How to sample over the CFG space

* NCE
* Importance sampling

<!-- #### Question for this section
* Whether a Tree-LSTM given the gold parse can beat sequential models in downstream tasks
  * Not necessarily, as semantics-oriented tasks are not always syntax-sensitive (e.g. bag of words for text classification) -->
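
As a concrete reference for the top-k aggregation option above, here is a minimal PyTorch sketch of weighting the per-derivation context vectors by normalized parser scores (or uniformly) and summing them. The `TopKAggregator` name and the `tree_encoder` callable are placeholders introduced for illustration only; any Tree-LSTM/Tree-GRU/GCN encoder that maps a (tokens, parse) pair to a fixed-size vector would fit.

```python
import torch
import torch.nn as nn

class TopKAggregator(nn.Module):
    """Aggregate sentence vectors built from the top-k parses of one sentence.

    `tree_encoder` is assumed to map (tokens, parse) -> a vector of size `dim`;
    it could be a Tree-LSTM, Tree-GRU, or GCN over the derivation.
    """

    def __init__(self, tree_encoder, weighting="parser"):
        super().__init__()
        self.tree_encoder = tree_encoder
        self.weighting = weighting  # "parser" (normalized scores) or "uniform"

    def forward(self, tokens, parses, parser_scores):
        # One context vector per candidate derivation: (k, dim)
        vecs = torch.stack([self.tree_encoder(tokens, p) for p in parses])

        if self.weighting == "parser":
            # Normalize the parser's scores over the k candidates
            scores = torch.as_tensor(parser_scores, dtype=vecs.dtype)
            weights = torch.softmax(scores, dim=0)
        else:
            k = len(parses)
            weights = torch.full((k,), 1.0 / k, dtype=vecs.dtype)

        # Weighted sum over the k noisy derivations -> one sentence embedding
        return (weights.unsqueeze(1) * vecs).sum(dim=0)
```

The "one more layer of attention" variant would replace the fixed `weights` with scores computed from the candidate vectors themselves (e.g. a small scorer followed by a softmax).
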
<!-- ## Modeling
### Text classification
want to find $\underset{\theta}{\mathrm{argmax}}\,P(l_i|x_i;\theta)$ for each sample
### How to model text classification with possible parses behind the surface -->
<!-- * Suppose each $x_i$ is drawn i.i.d. from some distribution -->
<!-- * Assumptions: -->
<!--   * labels are assigned by the "meaning" rather than by the surface words -->
<!--   * meanings can be uniquely determined by the deep structure and the surface words (for structural ambiguity) -->
<!--   * (P.S.) most lexical ambiguity would be resolved by 1) the context and 2) the grammatical role (POS tag, role in the predicate-argument structure) the word takes -->
<!-- * Under the two assumptions, rewrite the objective as -->
<!--   * taking derivations as a latent variable
      $$P(l|x;\theta) \approx E_{z \sim T(x)}[P(l|x,z)]$$
    * taking derivations as a noisy value
      $$P(l|x;\theta) \approx P(l|x,z_{true};\theta)$$ -->
<!-- * On the approximation of
  * $S$
    * RNN composition function
    * Tree-LSTM
  * $P(z|x)$
    * ~~probability output by the parser~~
    * sampling over the CFG space
    * taking the expectation over the top-k space -->

<!-- ## Problems
### Gap between human grammar and context-free grammar
#### General explanation
* Per assumption 2, the meaning of a sentence can be composed along a derivation licensable by human grammar
  * nothing can be said about derivations outside human grammar
* The derivation space licensable by a CFG is much larger than that of human grammar
  * the space of CFG derivations is exponential in the sentence length
  * the derivations licensable by human grammar are rather limited
* $\implies$ meaning $\subseteq$ sentence code
#### Solutions?
##### top-k truncation (space shrinking)
* conjectures:
  * derivations in the top-k space are similar to the gold derivation -->
<!--   * all derivations licensable by human grammar will appear among the top-k derivations, though not necessarily at top-1 -->
<!-- * What to do:
  * consider the derivations in the top-k space as observations with noise rather than inferences $\rightarrow$ eliminates the need for sampling -->
<!--   * derivations in the top-k space as positive and the remaining as negative $\rightarrow$ allows negative sampling? as a normalization technique -->
<!-- * How to do it:
  * do aggregation
    * combine embeddings of different derivations with the probability given by the parser
    * probability $:=$ normalized score of the derivation?
    * change the objective from $E_{z \sim T(x)}[P(l|x,z)]$ to $P(l|x, z_1,...,z_k)$ (a bit strange here) -->
<!-- CYK -> sharpened probability -->
<!--   * training with negative samples
    * change the objective to $\mathrm{argmax}\ \lambda P_\theta(l|C(\hat{z},x)) + (1-\lambda)\frac{1}{N}\sum_{z' \in T(x), |z'|=N} P(l|z',x)$, where $\hat{z}=(z_1,...,z_k)$ is from the top-k derivation space of $T(x)$ -->
<!-- ##### sharpening
moving distribution mass to the more salient derivations while still computing over the CFG derivation space
* how to define saliency?
  * top-k?
  * above a certain probability threshold? equivalent to sampling? -->

-------------------------------------

Does structure help with compositional generalization?

Experiment to conduct: Tree-LSTM + structural attention on COGS
baseline: Seq2seq with LSTM & Transformer
Mod: encoder
* Sequential LSTM -> Tree-LSTM (a minimal Tree-LSTM composition cell is sketched below)
* add structural attention
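
To make the proposed encoder swap concrete, below is a minimal sketch of a child-sum Tree-LSTM composition step (in the style of Tai et al., 2015), applied bottom-up over the parse in place of a sequential LSTM cell. Class and argument names are illustrative, not taken from an existing codebase; the phrase states it produces are what the "structural attention" would attend over, alongside the token states.

```python
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    """One composition step of a child-sum Tree-LSTM (Tai et al., 2015).

    Given the input embedding of a node and the (h, c) states of its
    children, produce the (h, c) state of the node itself.
    For leaves, pass empty (0, hidden_dim) tensors as child_h and child_c.
    """

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.W = nn.Linear(input_dim, 4 * hidden_dim)          # input projections for i, o, u, f
        self.U_iou = nn.Linear(hidden_dim, 3 * hidden_dim, bias=False)
        self.U_f = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.hidden_dim = hidden_dim

    def forward(self, x, child_h, child_c):
        # x: (input_dim,); child_h, child_c: (num_children, hidden_dim)
        h_sum = child_h.sum(dim=0)

        wx = self.W(x)
        w_iou, w_f = wx[: 3 * self.hidden_dim], wx[3 * self.hidden_dim:]

        i, o, u = torch.chunk(w_iou + self.U_iou(h_sum), 3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)

        # One forget gate per child, applied to that child's cell state
        f = torch.sigmoid(w_f + self.U_f(child_h))             # (num_children, hidden_dim)

        c = i * u + (f * child_c).sum(dim=0)
        h = o * torch.tanh(c)
        return h, c
```
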
----------------------------

Truncation vs. sharpening

Truncation:
* cut down the derivation space to the top-k
* take the others as negative samples

Sharpening: the objective of sharpening is to move distribution mass to the more salient derivations
* saliency?
  * top-k?
  * above a certain probability threshold?

Not very persuasive... since the individual derivations all have probability close to 0, how about sampling?

Experiment
* On tasks requiring structural generalization
  * semantic parsing: COGS
    * structural model better than sequential model
    * top-k -> advantage over top-1
  * acceptability test: CoLA
  * other datasets

------------------------------------

### Is a Tree-LSTM really suitable for processing the "meaning" of a sentence?

* PTB $\in$ the so-called phrasal approach to predicate-argument structure (formal semantics)
  * each production rule applied establishes some predicate-argument relation
* What does an embedding for a phrase mean?
  * the set of predicate-argument relations establishable within the phrase?
  * or something else?
* $\implies$ does it mean we need separate kernels for each production rule, so that we can capture the "semantics" of the production rule? (a sketch of this idea follows below)
* Then, under this argument, why do we need a recurrent-like network (Tree-LSTM/RNN) to build the embedding of a phrase? Under the CFG, long-term dependencies no longer exist, since they are captured by the production rules, whereas the point of an LSTM is to capture long-term dependencies.
  * does it mean a GCN is better than a Tree-LSTM for building phrasal embeddings?
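
As a purely hypothetical illustration of the "separate kernels per production rule" idea, the sketch below selects a composition matrix by rule label instead of reusing one shared recurrent cell; the class name and rule inventory are invented for the example.

```python
import torch
import torch.nn as nn

class RuleSpecificComposer(nn.Module):
    """Compose phrase embeddings with one kernel per production rule.

    Instead of a single cell shared across the tree (as in Tree-LSTM),
    each rule such as "NP -> DT NN" gets its own weight matrix, so the
    composition itself can encode that rule's predicate-argument semantics.
    Binary rules are assumed for simplicity.
    """

    def __init__(self, rules, dim):
        super().__init__()
        self.rule_index = {rule: i for i, rule in enumerate(rules)}
        self.kernels = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in rules])

    def forward(self, rule, left_child, right_child):
        # left_child, right_child: (dim,) embeddings of the two daughters
        kernel = self.kernels[self.rule_index[rule]]
        return torch.tanh(kernel(torch.cat([left_child, right_child], dim=-1)))

# Usage with a made-up rule inventory:
# composer = RuleSpecificComposer(["S -> NP VP", "NP -> DT NN", "VP -> V NP"], dim=128)
```

Whether rule-specific kernels (or a GCN over the tree) actually beat a shared Tree-LSTM cell is exactly the open question raised in the bullets above.
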