# Dependency/Stripped Matrix Aggregation
## Background
### Revisiting "syntax-awareness"
* Means: the model knows the syntax and can proceed with later processing accordingly
1. The model has a syntax analyzer built in
2. The model needs an external analyzer to tell it what the syntactic structure should look like ✅
### Background: Pipeline vs. end-to-end
* The recent trend in Natural Language Processing has been to train a model to solve problems in an end-to-end manner
* Good in terms of quantitative measures
* Bad: heavily overfits the target task & domain
* An end-to-end model pretrained on a large corpus can spontaneously recover the dependency structure
* BERT does "agree" with humans on the dependency structure. -> At least the dependency analysis is useful?
* BERT-like models are very resource-hungry and computationally heavy (drawback)
### Goal?
#### What I want to do
* Reconnect computational-linguistic analyses and Natural Language Processing models
* Improve the generalizability of end-to-end approaches
* Improve the data efficiency of language-model pretraining methods
#### Via?
* Inject dependency structure as an "inductive bias" and let the model learn the rest.
----------------
#### Another way to look at this: Replacing trainable NN with known operator
* [This paper](https://arxiv.org/abs/1907.01992) argues that replacing a trainable NN with a known operator can reduce the error bound...
* BERT demonstrates behavior similar to dependency parsing via the attention matrices in some of its middle layers. (A learned dependency-parsing operator?)
* -> Can we replace the "learned" dependency-parsing operator with an external parser, for data efficiency?
## Motivations
### Model-Data Parallelism
#### Dependency Graph - Self-Attention
- A dependency graph is a directed acyclic graph
- Self-attention is a computational model over fully connected graphs
##### Q: what information does a dependency label carry?
##### Q: what information does a dependency arc carry?
### Aggregation may reduce parsing errors
[A preliminary study](/NAfqGmqMTi6Pq_njJn7-IA)
## Generating soft dependency constraints
### Reason
- Allowing information to pass despite wrong dependency predictions
- Allowing gradients to be computed
### Goal quality
- subject to optimization
- approximates the distribution over heads
### applying a distribution over heads
- Gaussian distribution [Dep-enhanced MT](http://arxiv.org/abs/1909.03149)
#### Denoising
- Parent ignoring [Dep-enhanced MT](http://arxiv.org/abs/1909.03149)
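A minimal sketch of both ideas above, assuming numpy; `sigma` and `parent_ignore_p` are hypothetical knobs for illustration, not the exact parameterization of the cited paper:

```python
import numpy as np

def soft_heads(head_ids, n_tokens, sigma=1.0, parent_ignore_p=0.0, rng=None):
    """Turn hard head indices into a soft distribution over heads.

    head_ids[i] is the predicted head position of token i. Placing a Gaussian
    over the predicted head lets probability mass leak to nearby positions, so
    a slightly wrong parse still passes information and the resulting matrix
    is a smooth target for gradient-based training. With probability
    `parent_ignore_p` a token's row is replaced by a uniform distribution
    ("parent ignoring"-style denoising).
    """
    rng = rng or np.random.default_rng(0)
    positions = np.arange(n_tokens)
    D = np.empty((n_tokens, n_tokens))
    for i, h in enumerate(head_ids):
        if rng.random() < parent_ignore_p:
            D[i] = 1.0 / n_tokens                          # drop the (possibly wrong) parent
        else:
            scores = -0.5 * ((positions - h) / sigma) ** 2
            D[i] = np.exp(scores) / np.exp(scores).sum()   # softmax of Gaussian scores
    return D  # row i approximates p(head of token i = j)

# toy example: 5 tokens, token 0 attached to itself as root
print(soft_heads([0, 0, 3, 1, 3], n_tokens=5, sigma=1.0, parent_ignore_p=0.2))
```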
### aggregating multiple parsers
- Operates under some context, i.e. the aggregator can see a feature vector $v(s)$
#### Candidate parsers
* biaffine
* CRFNP
* CRF2O
* GNN parser
<!-- * Dyer's LSTM -->
--------
* Dyer's LSTM dependency parser
* Stanford MLP transition parser
#### Linear aggregator
- Integrate multiple dependency graphs (see the sketch after this list)
- Per-graph: do some parsers do better on some sentences while worse on others?
- Per-token: do some parsers do better on some tokens while worse on others?
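A minimal sketch of the linear aggregator (numpy). In practice the per-graph or per-token weights would be produced from the context feature $v(s)$; here they are passed in directly and the names are illustrative:

```python
import numpy as np

def aggregate_heads(parser_dists, graph_weights=None, token_weights=None):
    """Linearly combine head distributions from several parsers.

    parser_dists: (n_parsers, n_tokens, n_tokens);
        parser_dists[p, i, j] = parser p's probability that j is the head of i.
    graph_weights: (n_parsers,) "per-graph" weights -- one weight per parser
        for the whole sentence.
    token_weights: (n_parsers, n_tokens) "per-token" weights -- one weight per
        parser and token. Weights are softmax-normalized over parsers.
    """
    n_parsers, n_tokens, _ = parser_dists.shape
    if token_weights is not None:
        w = np.exp(token_weights)
        w = w / w.sum(axis=0, keepdims=True)                  # per-token mixture
        return np.einsum('pi,pij->ij', w, parser_dists)
    w = np.ones(n_parsers) if graph_weights is None else np.exp(graph_weights)
    w = w / w.sum()                                           # per-graph mixture
    return np.einsum('p,pij->ij', w, parser_dists)

# toy usage: 3 parsers, a 4-token sentence; uniform weights give mean aggregation
dists = np.random.default_rng(0).dirichlet(np.ones(4), size=(3, 4))
print(aggregate_heads(dists))                                  # per-graph mean
print(aggregate_heads(dists, token_weights=np.zeros((3, 4))))  # per-token, uniform here
```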
#### Bandit selector
- can only select one dependency graph at a time
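A minimal epsilon-greedy sketch of the bandit selector; the scalar reward (e.g. a per-sentence SRL score) and the class/field names are assumptions for illustration:

```python
import numpy as np

class EpsilonGreedyParserSelector:
    """Pick exactly one parser's graph per sentence, bandit-style.

    Unlike the linear aggregator, only a single dependency graph is forwarded,
    so feedback is a scalar reward for the chosen arm only.
    """
    def __init__(self, n_parsers, epsilon=0.1, seed=0):
        self.eps = epsilon
        self.counts = np.zeros(n_parsers)
        self.values = np.zeros(n_parsers)      # running mean reward per parser
        self.rng = np.random.default_rng(seed)

    def select(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.values)))  # explore
        return int(np.argmax(self.values))                   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# usage: 4 candidate parsers, reward = downstream score of the chosen parse
selector = EpsilonGreedyParserSelector(n_parsers=4)
arm = selector.select()
selector.update(arm, reward=0.82)
```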
### $k$-reachability
- Extension of the dependency graph
- Enables tokens to see tokens a few arcs away?
- Content-head (UD) vs. function-head grammar (Stanford)
- Should be useful if there are mediator tokens
## Dependency-injected transformer
### Experiment
- ! relative embedding
#### reproducing LISA performance
- TODO: Exact reproduction of LISA/SA
- goal: ~82 f1 on dev set
- Attempts
- Parameter tuning ❌
- barely moves
- Increasing model size ❓︎
- LISA improves from 80.80 -> 81.20
- Jumping back and forth between optimizers ❌
- [Sebastian Ruder's notes](https://ruder.io/optimizing-gradient-descent/)
- An F1 of ~0.82 is usually reached by models trained with a small batch_size?
- A batch_size-aware optimizer?
- Related to token-based batch size selection (see the sketch after this list)
- ✅
- Conclusion: simply not enough training steps?
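A minimal sketch of token-based batching (illustrative only, not the batching code of the runs below): each batch is capped by a token budget rather than a sentence count, so the effective batch size stays comparable across short- and long-sentence batches.

```python
def token_bucketed_batches(sentences, max_tokens=5000):
    """Yield batches whose padded size (max length x batch count) stays under max_tokens."""
    batch, batch_len = [], 0
    for sent in sorted(sentences, key=len):          # sort by length to minimize padding
        new_len = max(batch_len, len(sent))
        if batch and new_len * (len(batch) + 1) > max_tokens:
            yield batch
            batch, batch_len, new_len = [], 0, len(sent)
        batch.append(sent)
        batch_len = new_len
    if batch:
        yield batch

# usage: sentences as token lists
sents = [["tok"] * n for n in (3, 7, 12, 40, 41, 90)]
for b in token_bucketed_batches(sents, max_tokens=100):
    print(len(b), "sentences, max length", max(len(s) for s in b))
```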
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
| `sa-conll05` | 83.52 | 81.28 | 82.39 | | 84.17 | 83.28 | 83.72 | | 72.98 | 70.10 | 71.51 |
| `lisa-conll05` | 83.10 | 81.39 | 82.24 | | 84.07 | 83.16 | 83.61 | | 73.32 | 70.56 | 71.91 |
| `lisa-conll05` +D&M | 84.44 | 82.89 | 83.66 | | 85.98 | 84.85 | 85.41 | | 75.93 | 73.45 | 74.67 |
| `lisa-conll05` *+Gold* | *87.91* | *85.73* | *86.81* | | --- | --- | --- | | --- | --- | --- |
|SA|||80.18||||83.78||||71.96|
|g-LISA|||80.80||||85.65||||71.58|
##### LISA
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
| `lisa` -10layers -base| | | 81.65 | | | |83.39 | | | |72.01 |
| `lisa` -10layers -large| | | 81.89 | | | | 83.86| | | | *70.92* |
| `lisa` -12layers -base| | | | | | | | | | | |
| `lisa` -12layers -large| | | | | | | | | | | |
- model-map
- `lisa` (10layers, base) : .model-lisa/run-base
- `lisa` (10layers, large) : .model-lisa/run-large
- LISA-base configurations reaching over 0.82 F1:
- random_seed=1613120601,epsilon=1e-12,learning_rate=0.06,moving_average_decay=0.0,batch_size=160,beta2=0.98,beta1=0.9,average_norms=False,optimizer=lazyadam,gradient_clip_norm=1.2,early_stopping=True (best?)
- random_seed=1613120602,epsilon=1e-12,learning_rate=0.06,moving_average_decay=0.0,batch_size=160,beta2=0.98,beta1=0.9,average_norms=False,optimizer=adam,gradient_clip_norm=1.2,early_stopping=True (2)
##### LISA + parses from external parsers
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`LISA-base` + gold injection|||86.20||||88.52||||76.83|
|LISA|||81.65||||83.39||||72.01|
|`l-LISA-base` + l-injection|||82.52||||84.80||||73.07|
||||||||||
|`LISA-base` + biaffine injection|||*81.07*||||83.97||||72.17|
|`LISA-base` + mean injection|||*81.15*||||84.06||||72.43|
|`LISA-base` + crf2o injection|||*81.49*||||84.06||||72.52|
|`LISA-base` + crfnp injection|||81.45||||83.53||||71.99|
|`LISA-large` + biaffine injection|||81.78||||84.25 ||||72.30|
##### LISA + internal predicted parse + injection/discounting
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`Experiment Name`|||-||||-||||-|
|`LISA-base`|||82.17/82.19||||83.94/83.83||||72.04/72.90|
|`LISA-large`|||82.41||||84.18||||72.31|
|`LISA-base-12l`|||81.99||||83.65||||71.28|
|`LISA-base` + my discounting|||81.95/82.11||||83.77/83.91||||72.71/72.48|
|`LISA-base` + my discounting +mh|||81.90||||84.01||||72.68|
|`LISA-base` + okazaki discounting|||82.14||||83.67||||71.53|
|`LISA-base` + okazaki discounting +mh|||81.60/81.89||||83.68/83.70||||71.84/71.35|
- `LISA-base-12l`: run2
- `LISA-large`: run1
- `LISA-base` + my discounting +mh: run-2
- `LISA-base` + okazaki discounting: run-1
##### Dependency Injection
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`Experiment Name`|||-||||-||||-|
|`l-LISA-base` + l-injection|||82.52||||84.80||||73.07|
|`l-LISA-large` + l-injection|||82.80||||85.04||||72.56|
|`l-LISA-base-10` + l-injection|||-||||-||||-|
||||||||||
#### ELMos
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
| `sa-conll05-elmo` -reported | 85.78 | 84.74 | 85.26 | | 86.21 | 85.98 | 86.09 | | 77.10 | 75.61 | 76.35 |
| `lisa-conll05-elmo` | 86.07 | 84.64 | 85.35 | | 86.69 | 86.42 | 86.55 | | 78.95 | 77.17 | 78.05 |
||||||||||
|`lisa-elmo` |||85.67||||86.68||||78.90|
|`lisa-elmo` + l-injection |||85.22||||86.67||||78.04|
||||||||||
|`bi-lstm-elmo` + syn-MTL |||----||||88.2||||79.3|
|`bi-lstm-elmo` |||----||||87.7||||78.1|
#### Others
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|MTL|||---||||---||||---|
|LISA|||81.65||||83.39||||72.01|
|`LISA` + gold injection|||86.20||||88.52||||---|
##### Details
#### Injection
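For reference, a minimal sketch of what "injection" means in these experiments: one attention head's probabilities are overwritten with a (soft) parse-head matrix, in the spirit of LISA's syntactically-informed attention. The function and array shapes are illustrative, not the actual codebase:

```python
import numpy as np

def inject_parse_head(attn_probs, parse_dist, head_idx=0):
    """Overwrite one attention head with a distribution over syntactic heads.

    attn_probs: (n_heads, n_tokens, n_tokens) softmaxed attention of one layer.
    parse_dist: (n_tokens, n_tokens) head distribution -- gold arcs, an external
        parser's output, or an aggregated soft matrix.
    """
    out = attn_probs.copy()
    out[head_idx] = parse_dist
    return out

# toy usage: 8 heads, 5 tokens, inject gold arcs as one-hot head rows
n_heads, n = 8, 5
attn = np.full((n_heads, n, n), 1.0 / n)   # dummy uniform attention
gold_heads = [0, 0, 3, 1, 3]               # head index per token
parse = np.eye(n)[gold_heads]              # row i is one-hot at head(i)
attn = inject_parse_head(attn, parse, head_idx=0)
```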
<!-- Semantic role labeling F1
|Model|DEV|TEST-WSJ*|TEST-BROWN*|
|-|-|-|-|
|||||
|SA|80.18|83.78|71.96|
|MTL|80.83|85.67|72.69|
|||||
|g-LISA|80.80|85.65|71.58|
|g-LISA w/o label supervision*|80.84|85.03|71.63|
|||||
|g-LISA w/ gold injection|86.11|88.09|77.37|
|g-LISA w/ gold injection w/o label supervision|86.15|87.95|77.10|
|g-LISA w/ gold injection w/o parsing supervision|86.68|88.88|77.51|
|||||
|g-LISA w/ biaffine injection|81.58|85.60|72.26|
|g-LISA w/ biaffine injection w/o label supervision|81.47|84.78|71.71|
|g-LISA w/ biaffine injection w/o parse supervision|81.35|85.54|71.59|
|||||
|g-LISA w/ crfnp injection|81.50|83.23|71.35|
|g-LISA w/ crfnp injection injection w/o label supervision|81.16|83.80|71.97|
|g-LISA w/ crfnp injection injection w/o parse supervision|81.53|83.97|71.44| -->
##### Conclusion?
- No need for parsing supervision if external injection is viable
<!-- |g-LISA w/o label supervision (strangely good)|81.42|85.27|72.12| -->
##### Questions:
- Why does LISA do worse than biaffine injection on PTB but better on Brown?
- What's the relation between no supervision and dependency-head supervision only?
- What's the linking hypothesis for the model's performance?
- UAS/LAS? (F1)
- Cross-entropy? (Accuracy)
#### Q: individual vs. aggregation vs. gold
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`g-LISA` + gold injection|||86.68||||88.88||||77.51|
|`a-LISA` + mean aggregation|||~82.2||||-||||-|
|`a-LISA` + mean aggregation|||~82.4||||-||||-|
|`g-LISA` + biaffine injection|||81.35||||85.54||||71.59|
|`g-LISA` + crfnp injection|||81.53||||83.97||||71.44|
#### Q: source dependency graph?
- from individual parsers
- fully connected graph
<!-- ||||||||||
|`LISA` -label supervision|||80.84||||85.03||||71.63|
|`LISA` -label supervision + gold injection|||86.15||||87.95||||77.10|
|`LISA` -label supervision + biaffine injection|||81.47||||84.78||||71.71|
|`LISA` -label supervision + crfnp injection|||81.16||||83.80||||71.97|
||||||||||
|`LISA` -parsing supervision + gold injection|||86.68||||88.88||||77.51|
|`LISA` -parsing supervision + biaffine injection|||81.35||||85.54||||71.59|
|`LISA` -parsing supervision + crfnp injection|||81.53||||83.97||||71.44| -->
<!-- #### Question: How does the injection quality influence the performance?
Semantic role labeling F1
|Model|DEV|TEST-WSJ*|TEST-BROWN*|
|-|-|-|-|
|||||
|g-LISA w/ gold injection|86.11|88.09|77.37|
|g-LISA w/ gold injection w/o parsing supervision|86.68|88.88|77.51|
|||||
|b-LISA w/ biaffine-injection|81.71|84.90|71.28|
|b-LISA w/ biaffine-injection w/o parse supervision|81.75|85.60|70.96|
|||||
|b-LISA w/ crfnp-injection|81.16|84.52|72.55|
|b-LISA w/ crfnp-injection w/o parse supervision|81.75|85.60|72.04|
|||||
|b-LISA w/ gold-injection|85.20|87.17|75.93|
|b-LISA w/ gold-injection w/o parse supervision|-|-|-|
|||||
|m-LISA w/ mean-aggregation |-|-|-|
|m-LISA w/ mean-aggregation w/o parse supervision|-|-|-|
|||||
|l-LISA w/ linear-aggregation |-|-|-|
|l-LISA w/ linear-aggregation w/o parse supervision|-|-|-|
|||||
|s-LISA w/ stochastic-aggregation|-|-|-|
|s-LISA w/ stochastic-aggregation w/o parse supervision|-|-|-|
#### Conjectures:
- Parse supervision has little effect, injection is the biggest contributor
- Higher quality parse -> better srl performance
-->
<!-- #### Baseline: Self-attention
#### LISA: what role does label supervision do?
* the following experiment assumes LISA will use the prediction from either its own prediction or an external parser, or better off a gold parse.
##### Experiment 1. LISA (done by LISA)
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |80.62|93.05|
|test.wsj* |83.17 | 94.72|
|test.brown*|72.13 |90|
##### Experiment 2. LISA-gold (done by LISA)
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |86.11| 93.20|
|test.wsj* |88.40 | 95.03|
|test.brown*|76.73 |89.84|
##### Experiment 3. LISA-biaffine (no pre-training)
##### Experiment 3. LISA without label supervision
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |80.68| 93.1|
##### Experiment 4. LISA without label supervision & gold parse supplied in eval
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |?| ?|
|test.wsj* |? | ?|
|test.brown*|? |?|
-->
<!-- #### LISA: Can parse supervision be omitted if gold parse is supplied?
-->
<!-- ##### Experiment 3. LISA with stanford SR (or better off SR parser) parser
##### Experiment 4. LISA with CRF parser
##### Experiment 5. LISA with aggregated prediction
#### LISA without dependency-head/label objective,
* the following experiment assumes LISA will use the prediction from either its own prediction or an external parser, or better off a gold parse.
##### Experiment 2. LISA with biaffine prediction
##### Experiment 3. LISA with stanford SR (or better off SR parser) parser
##### Experiment 4. LISA with CRF parser
##### Experiment 5. LISA with aggregated prediction -->
<!-- ### Notes
* [Linguistically-informed self-attention](https://arxiv.org/pdf/1804.08199.pdf)
#### LISA-basic
`{
"joint_pos_predicate": 2,
"parse_head": 4,
"parse_label": 4,
"parse_attention": 5,
"srl": 11
}`
* do joint pos-predicate supervision at 2nd layer
* do parse_head & label supervision at 4th layer
* do srl supervision at 11th layer
* Use predicted dependency graph as "special attention" to layer 5, using a special attention head
##### Strange things
* ~~why in the world did they just literally "inject" the dependency for the "dependency attention matrix", leaving everything else intact (since the output of the 4-th layer is also computed using the "attention matrix" for prediction) ?~~
-->
## Dependency-discounting LISA
### Central idea:
* discounting the attention matrix via one of two constructions (see the sketch below)
1. first compute the directed $k$-reachable matrix, then symmetrize it into an undirected graph
* $D_{r} := Dep^{k}$
* $D := D_r + D_r^T$
2. first symmetrize into an undirected graph, then compute the $k$-reachable matrix
* $D_{b} := Dep + Dep^T$
* $D := D_b^k$
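A minimal numpy sketch of the two constructions and of how the resulting reachability mask could discount attention; accumulating paths of length up to $k$ and the `alpha` down-weighting factor are assumptions of this sketch:

```python
import numpy as np

def k_reach(dep, k, symmetrize_first=True):
    """Boolean k-reachability matrix for a dependency graph.

    dep[i, j] = True iff j is the head of i.
    symmetrize_first=True : D_b = Dep + Dep^T, then D = D_b^k  (construction 2)
    symmetrize_first=False: D_r = Dep^k, then D = D_r + D_r^T  (construction 1)
    Paths of length 0..k are accumulated, so every token also reaches itself.
    """
    A = (dep | dep.T).astype(int) if symmetrize_first else dep.astype(int)
    reach = np.eye(len(dep), dtype=int)
    step = np.eye(len(dep), dtype=int)
    for _ in range(k):
        step = step @ A
        reach = reach + step
    reach = reach > 0
    if not symmetrize_first:
        reach = reach | reach.T
    return reach

def discount_attention(attn, reach, alpha=0.1):
    """Down-weight attention to tokens outside the k-reachable set, then renormalize."""
    scaled = np.where(reach, attn, alpha * attn)
    return scaled / scaled.sum(axis=-1, keepdims=True)

# toy usage: heads [0, 0, 3, 1, 3] -> arc i -> head(i)
heads = [0, 0, 3, 1, 3]
dep = np.zeros((5, 5), dtype=bool)
dep[np.arange(5), heads] = True
mask = k_reach(dep, k=2)
attn = np.full((5, 5), 0.2)
print(discount_attention(attn, mask))
```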
### Experiments
#### Q: Discounting vs. Injection
#### Q: Layer to perform discounting?
#### Q: source dependency graph?
<!-- ### Motivation
* One arc at a time is too slow, let's allow transformer to look at tokens multiple arcs away
* Transformer matrix is basically on 2-$\infty$ configuration
* LISA-dependency injection is basically 1-1 configuration
---------------------
* Regulating transformer by not allowing it to attend to unrelated tokens
* She eats a beautiful fish
* (She, beautiful) isn't linked in any sense
* BERT learns dependency via attention, can we do the opposite? (injecting dependency so that the model don't have to discover the structure by itself -> data efficiency)
* Forcing the transformer to attend in accordance to dependency structure (Of course, it knows the dependency structure, though the structure may not be true)
* May give an explaination to many of the syntactic phenomenon
* Subject-verb agreement
* Filler-gap dependency
* etc.
* In linguistic theories, the requirements/aggrements occur between dependent & head -->
<!-- ### Contribution:
* study to what extend should one collapse the dependency length.
* One selling point of transformer is that it can "capture" long dependency in O(1) time -->
<!-- ### Deficit on the motivation
Loss in flexibility
* subject/object "relation" established via copula (may be solvable via 2-configuration)
* She is beautiful
* (She, beautiful) is linked via copula
* relation(dependent, head)
* cop(beautiful, is)
* nsubj(She, is)
* while in semantics, should it be is(she, beautiful)
* coreference?
* -> does it mean it can only apply to several sub-layers?
-->