# Sentence Encoder with Parsers
## Intro
### Background
* Problems with existing encoders
1. sequential encoders
* lack of ability to generalize to out-of-distribution data
* it's nearly impossible to cover every case in natural language
2. tree-structured encoders (Tree-LSTM)
* only use the top-1 inferred derivation as the "gold" derivation; however, derivations given by a parser are likely to contain many errors (e.g. wrong attachment, mislabeling)
* don't seem to give much performance gain over sequential encoders
### Core idea
* For problem 1, build the sentence embedding by following the syntactic derivation, similar to a Tree-LSTM
* For problem 2, aggregate parses within a neighborhood ~~two ways to look at the information provided by the parsers~~
1. ~~A distribution of all possible derivation~~
2. ~~A cluster within the neighbor of the gold parse (top-k)~~
3. Top-1 derivations from different parsers
### Observations
#### Within parsers
##### RNNG -- incrementality?
1. Prefers local attachment
* strong at discovering NP-modifying PPs
* weak when an NP-modifying PP is followed by a VP-modifying PP
* a strong tendency toward recursive PPs
##### ELMo-enhanced span parser
1. Prefers global attachment
* weak at discovering NP-modifying PPs
* strong when an NP-modifying PP is followed by a VP-modifying PP
2. Weak on PP recursion
##### PCFG?
##### Benepar (Berkeley's self-attentive parser)
1. significantly stronger than the ELMo span parser & RNNG
* makes fewer mistakes
* partially recognizes recursive constructions
###### reasons?
#### Cross parsers
### Two viewpoints, two approaches
1. ~~A distribution of all possible derivation~~
* ~~Take the derivation as a latent variable and marginalize it (by sampling)~~
2. A cluster within the neighbor of the gold parse (top-k)
* Take each derivation in the top-k set as a noisy observation
* aggregate them for a more robust "representation" of the true derivation (see the sketch after this list)
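One way to write this down, as a sketch under the assumption that $\mathrm{enc}(x, z)$ is some tree encoder (e.g. a Tree-LSTM run over derivation $z$) and the $w_i$ are aggregation weights (e.g. renormalized parser scores, or uniform):
$$\tilde{h}(x) = \sum_{i=1}^{k} w_i \,\mathrm{enc}(x, z_i), \qquad \sum_{i=1}^{k} w_i = 1, \quad z_1,\dots,z_k \in \text{top-}k(x)$$
This is the "aggregation" that the Details section below makes concrete.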
#### Definition of a neighborhood
1. parses that are close in tree-edit distance? (see the sketch after this list)
2. parses with similar probability?
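A minimal sketch of option 1, assuming the `zss` package (Zhang-Shasha tree edit distance) is available and that parses arrive as simple `(label, children)` tuples; `to_zss`, `neighborhood`, and the `radius` value are illustrative, not fixed design choices:

```python
# Sketch: define the neighborhood of the top-1 parse by tree-edit distance.
# Assumes `pip install zss`; helper names and the radius are illustrative.
from zss import Node, simple_distance

def to_zss(tree):
    """Convert a (label, children) tuple tree into a zss Node."""
    label, children = tree
    node = Node(label)
    for child in children:
        node.addkid(to_zss(child))
    return node

def neighborhood(top_k_parses, radius=3):
    """Keep parses within `radius` edit operations of the top-1 parse."""
    anchor = to_zss(top_k_parses[0])
    return [p for p in top_k_parses
            if simple_distance(anchor, to_zss(p)) <= radius]

# Toy usage: NP- vs. VP-attachment of a PP differs by one tree edit,
# so both bracketings fall in the same neighborhood.
np_attach = ("S", [("NP", []), ("VP", [("V", []), ("NP", [("NP", []), ("PP", [])])])])
vp_attach = ("S", [("NP", []), ("VP", [("V", []), ("NP", []), ("PP", [])])])
print(len(neighborhood([np_attach, vp_attach], radius=4)))  # -> 2
```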
<!-- Case study:
Correct Derivation

Incorrect
 -->
<!-- predicate-argument structure
(NP_1(VP NP_2)) -> (agent VP NP_1) (patient VP NP_2)
(hedgehog (VP NP_2)) -> (agent VP hedgehog)
(NP_1 (VP hedgehog)) !-> (patient VP hedgehog)
(NP_1 (VP hedgehog)) -> (patient VP hedgehog) () -->
### Imaginable merits
* Compared to sequential models
* Structural information can be beneficial when generalizing to out-of-distribution data
* Compared to Tree-LSTM
* Reducing errors from the parsing stage by aggregating multiple erroneous variants of the same derivation.
* (?) May be able to handle ambiguous sentences
## Details
### How to aggregate derivations
* sequential baseline
* Attention LSTM
* Tree-Structure models
* Tree-LSTM (put the embedding of each phrase alongside the embeddings of the surface words, and attend over them)
* Tree-GRU
* GCN
* Top-k aggregation (see the sketch after this list)
* summing the context vectors of the top-k derivations, with weights given by
* normalized probabilities from the parser
* a uniform distribution
* one more layer of attention (?)
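A minimal sketch of the top-k aggregation options above, assuming some tree encoder has already produced one context vector per derivation; `aggregate_topk` and `attend_topk` are illustrative names, not an existing API:

```python
# Sketch: combine the context vectors of the top-k derivations.
# All function and argument names are illustrative.
import torch
import torch.nn.functional as F

def aggregate_topk(context_vecs, parser_scores=None):
    """context_vecs: (k, d) tensor, one vector per top-k derivation.
    parser_scores: (k,) raw parser (log-)scores; if None, weight uniformly."""
    k = context_vecs.size(0)
    if parser_scores is None:
        weights = torch.full((k,), 1.0 / k)          # uniform weighting
    else:
        weights = F.softmax(parser_scores, dim=0)    # renormalized over the top-k set
    return weights @ context_vecs                    # (d,)

def attend_topk(context_vecs, query, proj):
    """The 'one more layer of attention' option: score each derivation
    against a learned query instead of trusting the parser scores."""
    scores = context_vecs @ (proj @ query)           # (k,)
    return F.softmax(scores, dim=0) @ context_vecs   # (d,)
```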
### How to sample over CFG space
* NCE
* Importance sampling
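A hedged sketch of the importance-sampling route, assuming the proposal $q(z|x)$ is something cheap to sample from (e.g. the parser's own derivation distribution) and $P(z|x)$ is the target distribution over the CFG space:
$$E_{z \sim P(z|x)}\big[P(l|x,z)\big] \approx \frac{1}{N}\sum_{n=1}^{N} \frac{P(z_n|x)}{q(z_n|x)}\,P(l|x,z_n), \qquad z_n \sim q(z|x)$$
If $P(z|x)$ is only known up to a normalizing constant (e.g. unnormalized inside scores), the self-normalized variant applies: renormalize the weights $P(z_n|x)/q(z_n|x)$ so they sum to 1.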
<!-- #### Question for this section
* Whether a Tree-LSTM given the gold parse can beat sequential models in downstream tasks
* Not necessarily, as semantics-oriented tasks are not always syntax-sensitive (e.g. bag-of-words for text classification)
-->
<!-- ## Modeling
### Text classification
want to find $\underset{\theta}{\mathrm{argmax}}\,P(l_i|x_i;\theta)$ for each sample
### How to model text classification with possible parses behind the surface -->
<!-- * Suppose $x_i$ is drawn from some distribution i.i.d -->
<!-- * Assumption: -->
<!-- * labels are assigned by the “meaning” rather than words of the surface. -->
<!-- * meanings can be uniquely determined by the deep structure and surface word (for structural ambiguity) -->
<!-- * (P.S.) Most lexical ambiguity would be resolved by 1) the context and 2) the grammatical role (POS tag, role on predicate-argument structure) it undertakes -->
<!-- * Under the two assumption, rewrite the objective as -->
<!-- * taking derivations as a latent variable
$$P(l|x;\theta)\approx E_{z \sim T(x)}[P(l|x,z)]$$
* taking derivations as a noisy value
$$P(l|x;\theta)\approx P(l|x,z_{true};\theta)$$ -->
<!-- * On the approximation of
* $S$
* RNN composition function
* Tree-LSTM
* $P(z|x)$
* ~~probability output by the parser~~
* sampling over the CFG space
* taking expectation over top-k space -->
<!-- ## Problems
### Gap between human grammar and Context-Free grammar
#### General explanation
* As assumption 2, the meanings of a sentence can be composed by a derivation licensable by human grammar
* nothing can be said about the derivation outside human grammar
* Derivation space licensable by CFG is much larger than that by human grammar
* space of CFG derivation is exponential to the sentence length
* derivations licensable by human grammar is rather limited
* $\implies$
* meaning $\subseteq$ sentence code
#### Solutions?
##### top-k truncation (space shrinking)
* conjectures :
* derivations in top-k space are similar to the gold derivation -->
<!-- * all human licensable derivations will appear at the top-k derivations, though not necessary top-1 -->
<!-- * What to do :
* Consider the derivations in top-k space as the observable with noise rather than the inferred $\rightarrow$ eliminate the need for sampling -->
<!-- * derivations at top-k space as positive and the remaining as negative $\rightarrow$ allow negative sampling? as a normalization technique -->
<!-- * How to do:
* do aggregation
* combine embeddings of different derivations with probability given by the parser
* probability $:=$ normalized score of the derivation?
* change objective from $E_{z \sim T(x)}[P(l|x,z)]$ to $P(l|x, z_1,...,z_k)$ (a bit strange here)
-->
<!-- CYK -> sharpened probability -->
<!-- * training with negative samples
* Changing the objective to ${\mathrm{argmax}}\lambda P_\theta(l|C(\hat{z},x))+(1-\lambda)\sum_{z' \in T(x), |{z'}|=N} P(l|z',x)\frac{1}{N}$ where $\hat{z}=(z_1,...,z_k)$ is from the top-k derivation space of $T(x)$ -->
<!-- ##### sharpening
moving distribution mass to the more salient derivations while still computing over the CFG derivation space
* how to define saliency?
* top-k?
* above certain probability threshold?
Equivalent to sampling? -->
-------------------------------------
Does structural information help with compositional generalization?
Experiment to conduct: TreeLSTM + structural attention on COGS
Baseline: seq2seq with LSTM & Transformer
Mod: encoder only
* Sequential LSTM -> TreeLSTM
* add structural attention (see the sketch below)
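A minimal sketch of the "structural attention" modification, under the assumption that the Tree-LSTM encoder exposes one hidden state per phrase node in addition to the per-token states; all names are illustrative:

```python
# Sketch: let the decoder attend over phrase-node states as well as token states.
# `decoder_state`, `token_states`, and `phrase_states` are illustrative names.
import torch
import torch.nn.functional as F

def structural_attention(decoder_state, token_states, phrase_states):
    """decoder_state: (d,); token_states: (T, d); phrase_states: (P, d).
    Returns a context vector that can address whole constituents directly."""
    memory = torch.cat([token_states, phrase_states], dim=0)  # (T + P, d)
    scores = memory @ decoder_state                           # (T + P,)
    weights = F.softmax(scores, dim=0)
    return weights @ memory                                   # (d,)
```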
----------------------------
Truncation vs. sharpening
Truncation:
* cut down the derivation space to top-k
* take the others as negative samples
Sharpening:
The objective of sharpening is to move probability mass toward the more salient derivations
* saliency?
* top-k?
* above a certain probability threshold?
Not very persuasive... since almost all derivations carry near-zero probability anyway, how about sampling instead?
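One possible formalization of "sharpening" (my assumption, not something fixed above): temperature-scale the parser's derivation distribution,
$$q_\tau(z|x) \propto P(z|x)^{1/\tau}, \qquad 0 < \tau \le 1,$$
so $\tau = 1$ keeps the parser distribution unchanged and $\tau \to 0$ collapses onto the top-1 derivation; sampling from $q_\tau$ then sits between full marginalization and top-k truncation.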
Experiment
* On tasks requiring structural generalization
* semantic parsing: COGS
* expect structural models to beat sequential models
* expect top-k to show an advantage over top-1
* acceptability test: CoLA
* other datasets
------------------------------------
### Is Tree-LSTM really suitable for processing the "meaning" of a sentence?
* PTB $\in$ the so-called phrasal approach to predicate-argument structure (formal semantics)
* each applied production rule establishes some predicate-argument relation
* What does an embedding for a phrase mean?
* the set of predicate-argument relations establishable within the phrase?
* or others?
* $\implies$ does it mean we need separate kernels for each production rule, so that we can capture the "semantics" of that rule? (see the sketch after this list)
* Then, under this argument, why do we need a recurrent-style network (Tree-LSTM/RNN) to build phrase embeddings? Under a CFG, long-term dependencies no longer exist, since they are captured by the production rules, while the whole point of an LSTM is to capture long-term dependencies
* does that mean a GCN is better than a Tree-LSTM for building phrase embeddings?
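A minimal sketch of the "separate kernels per production rule" idea, assuming a binarized grammar; the rule inventory, class name, and initialization are illustrative:

```python
# Sketch: one composition matrix ("kernel") per production rule,
# instead of a single shared recurrent cell. Names are illustrative.
import torch
import torch.nn as nn

class RuleSpecificComposer(nn.Module):
    def __init__(self, rules, dim):
        super().__init__()
        # one kernel per binarized rule, e.g. "NP->DT_NN"
        self.W = nn.ParameterDict({
            r: nn.Parameter(0.01 * torch.randn(dim, 2 * dim)) for r in rules
        })

    def compose(self, rule, left, right):
        """Build the parent phrase embedding from its two children."""
        children = torch.cat([left, right], dim=-1)   # (2d,)
        return torch.tanh(self.W[rule] @ children)    # (d,)

# Toy usage with an illustrative rule inventory
composer = RuleSpecificComposer(["S->NP_VP", "VP->V_NP", "NP->NP_PP"], dim=8)
np_vec, vp_vec = torch.randn(8), torch.randn(8)
sentence_vec = composer.compose("S->NP_VP", np_vec, vp_vec)
```

Under this view the per-node transformation is purely rule-indexed, which is closer in spirit to a GCN/recursive network over the derivation than to a gated recurrent cell.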