# Dependency/Stripped Matrix Aggregation
## Background
### Revisiting "syntax-awareness"
* Means: the model knows the syntax and can proceed with later processing accordingly
1. The model has a syntax analyzer built in
2. The model needs an external analyzer to tell it what the syntactic structure should look like ✅
### Background: Pipeline vs. end-to-end
* The recent trend in Natural Language Processing has been to train a model to solve problems in an end-to-end manner
* Good in terms of quantitative measures
* Bad: heavily overfits the target task & domain
* An end-to-end model pretrained on a large corpus can spontaneously recover the dependency structure
* BERT does "agree" with humans on the dependency structure. -> At least the dependency analysis is useful?
* BERT-like models are very resource-hungry and computationally heavy (drawback)
### Goal?
#### What I want to do
* Reconnect computational-linguistic analyses and Natural Language Processing models
* Improve the generalizability of end-to-end approaches
* Improve the data efficiency of language-model pretraining methods
#### Via?
* Inject dependency structure as an "inductive bias" and let the model learn the rest.
----------------
#### Another way to look at this: Replacing trainable NN with known operator
* [This paper](https://arxiv.org/abs/1907.01992) argues that replacing a trainable NN with a known operator can reduce the error bound...
* BERT demonstrates behavior similar to dependency parsing via the attention matrices in some of its middle layers. (A learned dependency-parsing operator?)
* -> Can we replace the "learned" dependency-parsing operator with an external parser, for data efficiency?
## Motivations
### Model-Data Parallelism
#### Dependency Graph - Self-Attention
- A dependency graph is a directed acyclic graph
- Self-attention is a computational model over fully connected graphs
##### Q: what information does a dependency label carry?
##### Q: what information does a dependency arc carry?
### Aggregation may reduce parsing errors
[A preliminary study](/NAfqGmqMTi6Pq_njJn7-IA)
## Generating soft dependency constraints
### Reason
- Allowing information to pass despite wrong dependency predictions
- Allowing gradients to be computed
### Goal quality
- subject to optimization
- approximates the distribution over heads
### applying a distribution over heads
- Gaussian distribution [Dep-enhanced MT](http://arxiv.org/abs/1909.03149)
#### Denoising
- Parent ignoring [Dep-enhanced MT](http://arxiv.org/abs/1909.03149)
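A minimal sketch of both ideas above, assuming numpy; `sigma` and `parent_ignore_p` are hypothetical knobs for illustration, not the exact parameterization of the cited paper:

```python
import numpy as np

def soft_heads(head_ids, n_tokens, sigma=1.0, parent_ignore_p=0.0, rng=None):
    """Turn hard head indices into a soft distribution over heads.

    head_ids[i] is the predicted head position of token i. Placing a Gaussian
    over the predicted head lets probability mass leak to nearby positions, so
    a slightly wrong parse still passes information and the resulting matrix
    is a smooth target for gradient-based training. With probability
    `parent_ignore_p` a token's row is replaced by a uniform distribution
    ("parent ignoring"-style denoising).
    """
    rng = rng or np.random.default_rng(0)
    positions = np.arange(n_tokens)
    D = np.empty((n_tokens, n_tokens))
    for i, h in enumerate(head_ids):
        if rng.random() < parent_ignore_p:
            D[i] = 1.0 / n_tokens                          # drop the (possibly wrong) parent
        else:
            scores = -0.5 * ((positions - h) / sigma) ** 2
            D[i] = np.exp(scores) / np.exp(scores).sum()   # softmax of Gaussian scores
    return D  # row i approximates p(head of token i = j)

# toy example: 5 tokens, token 0 attached to itself as root
print(soft_heads([0, 0, 3, 1, 3], n_tokens=5, sigma=1.0, parent_ignore_p=0.2))
```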
### aggregating multiple parsers
- Operates under some context, i.e. the aggregator can see a feature vector $v(s)$
#### Candidate parsers
* biaffine
* CRFNP
* CRF2O
* GNN parser
<!-- * Dyer's LSTM -->
--------
* Dyer's LSTM dependency parser
* Stanford MLP transition parser
#### Linear aggregator
- Integrate multiple dependency graphs (see the sketch after this list)
- Per-graph: do some parsers do better on some sentences while worse on others?
- Per-token: do some parsers do better on some tokens while worse on others?
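A minimal sketch of the linear aggregator (numpy). In practice the per-graph or per-token weights would be produced from the context feature $v(s)$; here they are passed in directly and the names are illustrative:

```python
import numpy as np

def aggregate_heads(parser_dists, graph_weights=None, token_weights=None):
    """Linearly combine head distributions from several parsers.

    parser_dists: (n_parsers, n_tokens, n_tokens);
        parser_dists[p, i, j] = parser p's probability that j is the head of i.
    graph_weights: (n_parsers,) "per-graph" weights -- one weight per parser
        for the whole sentence.
    token_weights: (n_parsers, n_tokens) "per-token" weights -- one weight per
        parser and token. Weights are softmax-normalized over parsers.
    """
    n_parsers, n_tokens, _ = parser_dists.shape
    if token_weights is not None:
        w = np.exp(token_weights)
        w = w / w.sum(axis=0, keepdims=True)                  # per-token mixture
        return np.einsum('pi,pij->ij', w, parser_dists)
    w = np.ones(n_parsers) if graph_weights is None else np.exp(graph_weights)
    w = w / w.sum()                                           # per-graph mixture
    return np.einsum('p,pij->ij', w, parser_dists)

# toy usage: 3 parsers, a 4-token sentence; uniform weights give mean aggregation
dists = np.random.default_rng(0).dirichlet(np.ones(4), size=(3, 4))
print(aggregate_heads(dists))                                  # per-graph mean
print(aggregate_heads(dists, token_weights=np.zeros((3, 4))))  # per-token, uniform here
```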
#### Bandit selector
- can only select one dependency graph at a time
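A minimal epsilon-greedy sketch of the bandit selector; the scalar reward (e.g. a per-sentence SRL score) and the class/field names are assumptions for illustration:

```python
import numpy as np

class EpsilonGreedyParserSelector:
    """Pick exactly one parser's graph per sentence, bandit-style.

    Unlike the linear aggregator, only a single dependency graph is forwarded,
    so feedback is a scalar reward for the chosen arm only.
    """
    def __init__(self, n_parsers, epsilon=0.1, seed=0):
        self.eps = epsilon
        self.counts = np.zeros(n_parsers)
        self.values = np.zeros(n_parsers)      # running mean reward per parser
        self.rng = np.random.default_rng(seed)

    def select(self):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(len(self.values)))  # explore
        return int(np.argmax(self.values))                   # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# usage: 4 candidate parsers, reward = downstream score of the chosen parse
selector = EpsilonGreedyParserSelector(n_parsers=4)
arm = selector.select()
selector.update(arm, reward=0.82)
```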
### $k$-reachability
- Extension of the dependency graph
- Enables tokens to see tokens a few arcs away?
- Content-head (UD) vs. function-head grammar (Stanford)
- Should be useful if there are mediator tokens
## Dependency-injected transformer
### Experiment
- ! relative embedding
#### reproducing LISA performance
- TODO: Exact reproduction of LISA/SA
- goal: ~82 f1 on dev set
- Attempts
- Parameter tuning ❌
- barely moves
- Increasing model size ❓︎
- LISA improves from 80.80 -> 81.20
- Jumping back and forth between optimizers ❌
- [Sebastian Ruder's notes](https://ruder.io/optimizing-gradient-descent/)
- An F1 of ~0.82 is usually reached by models trained with a small batch_size?
- A batch_size-aware optimizer?
- Related to token-based batch size selection (see the sketch after this list)
- ✅
- Conclusion: simply not enough training steps?
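A minimal sketch of token-based batching (illustrative only, not the batching code of the runs below): each batch is capped by a token budget rather than a sentence count, so the effective batch size stays comparable across short- and long-sentence batches.

```python
def token_bucketed_batches(sentences, max_tokens=5000):
    """Yield batches whose padded size (max length x batch count) stays under max_tokens."""
    batch, batch_len = [], 0
    for sent in sorted(sentences, key=len):          # sort by length to minimize padding
        new_len = max(batch_len, len(sent))
        if batch and new_len * (len(batch) + 1) > max_tokens:
            yield batch
            batch, batch_len, new_len = [], 0, len(sent)
        batch.append(sent)
        batch_len = new_len
    if batch:
        yield batch

# usage: sentences as token lists
sents = [["tok"] * n for n in (3, 7, 12, 40, 41, 90)]
for b in token_bucketed_batches(sents, max_tokens=100):
    print(len(b), "sentences, max length", max(len(s) for s in b))
```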
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
| `sa-conll05` | 83.52 | 81.28 | 82.39 | | 84.17 | 83.28 | 83.72 | | 72.98 | 70.10 | 71.51 |
| `lisa-conll05` | 83.10 | 81.39 | 82.24 | | 84.07 | 83.16 | 83.61 | | 73.32 | 70.56 | 71.91 |
| `lisa-conll05` +D&M | 84.44 | 82.89 | 83.66 | | 85.98 | 84.85 | 85.41 | | 75.93 | 73.45 | 74.67 |
| `lisa-conll05` *+Gold* | *87.91* | *85.73* | *86.81* | | --- | --- | --- | | --- | --- | --- |
|SA|||80.18||||83.78||||71.96|
|g-LISA|||80.80||||85.65||||71.58|
##### LISA
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
| `lisa` -10layers -base| | | 81.65 | | | |83.39 | | | |72.01 |
| `lisa` -10layers -large| | | 81.89 | | | | 83.86| | | | *70.92* |
| `lisa` -12layers -base| | | | | | | | | | | |
| `lisa` -12layers -large| | | | | | | | | | | |
- model-map
- `lisa` (10layers, base) : .model-lisa/run-base
- `lisa` (10layers, large) : .model-lisa/run-large
- LISA-base configurations reaching over 0.82 F1:
- random_seed=1613120601,epsilon=1e-12,learning_rate=0.06,moving_average_decay=0.0,batch_size=160,beta2=0.98,beta1=0.9,average_norms=False,optimizer=lazyadam,gradient_clip_norm=1.2,early_stopping=True (best?)
- random_seed=1613120602,epsilon=1e-12,learning_rate=0.06,moving_average_decay=0.0,batch_size=160,beta2=0.98,beta1=0.9,average_norms=False,optimizer=adam,gradient_clip_norm=1.2,early_stopping=True (2)
##### LISA + parses from external parsers
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`LISA-base` + gold injection|||86.20||||88.52||||76.83|
|LISA|||81.65||||83.39||||72.01|
|`l-LISA-base` + l-injection|||82.52||||84.80||||73.07|
||||||||||
|`LISA-base` + biaffine injection|||*81.07*||||83.97||||72.17|
|`LISA-base` + mean injection|||*81.15*||||84.06||||72.43|
|`LISA-base` + crf2o injection|||*81.49*||||84.06||||72.52|
|`LISA-base` + crfnp injection|||81.45||||83.53||||71.99|
|`LISA-large` + biaffine injection|||81.78||||84.25 ||||72.30|
##### LISA + internal predicted parse + injection/discounting
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`Experiment Name`|||-||||-||||-|
|`LISA-base`|||82.17/82.19||||83.94/83.83||||72.04/72.90|
|`LISA-large`|||82.41||||84.18||||72.31|
|`LISA-base-12l`|||81.99||||83.65||||71.28|
|`LISA-base` + my discounting|||81.95/82.11||||83.77/83.91||||72.71/72.48|
|`LISA-base` + my discounting +mh|||81.90||||84.01||||72.68|
|`LISA-base` + okazaki discounting|||82.14||||83.67||||71.53|
|`LISA-base` + okazaki discounting +mh|||81.60/81.89||||83.68/83.70||||71.84/71.35|
- `LISA-base-12l`: run2
- `LISA-large`: run1
- `LISA-base` + my discounting +mh: run-2
- `LISA-base` + okazaki discounting: run-1
##### Dependency Injection
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`Experiment Name`|||-||||-||||-|
|`l-LISA-base` + l-injection|||82.52||||84.80||||73.07|
|`l-LISA-large` + l-injection|||82.80||||85.04||||72.56|
|`l-LISA-base-10` + l-injection|||-||||-||||-|
||||||||||
#### ELMos
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
| `sa-conll05-elmo` -reported | 85.78 | 84.74 | 85.26 | | 86.21 | 85.98 | 86.09 | | 77.10 | 75.61 | 76.35 |
| `lisa-conll05-elmo` | 86.07 | 84.64 | 85.35 | | 86.69 | 86.42 | 86.55 | | 78.95 | 77.17 | 78.05 |
||||||||||
|`lisa-elmo` |||85.67||||86.68||||78.90|
|`lisa-elmo` + l-injection |||85.22||||86.67||||78.04|
||||||||||
|`bi-lstm-elmo` + syn-MTL |||----||||88.2||||79.3|
|`bi-lstm-elmo` |||----||||87.7||||78.1|
#### Others
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|MTL|||---||||---||||---|
|LISA|||81.65||||83.39||||72.01|
|`LISA` + gold injection|||86.20||||88.52||||---|
##### Details
#### Injection
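For reference, a minimal sketch of what "injection" means in these experiments: one attention head's probabilities are overwritten with a (soft) parse-head matrix, in the spirit of LISA's syntactically-informed attention. The function and array shapes are illustrative, not the actual codebase:

```python
import numpy as np

def inject_parse_head(attn_probs, parse_dist, head_idx=0):
    """Overwrite one attention head with a distribution over syntactic heads.

    attn_probs: (n_heads, n_tokens, n_tokens) softmaxed attention of one layer.
    parse_dist: (n_tokens, n_tokens) head distribution -- gold arcs, an external
        parser's output, or an aggregated soft matrix.
    """
    out = attn_probs.copy()
    out[head_idx] = parse_dist
    return out

# toy usage: 8 heads, 5 tokens, inject gold arcs as one-hot head rows
n_heads, n = 8, 5
attn = np.full((n_heads, n, n), 1.0 / n)   # dummy uniform attention
gold_heads = [0, 0, 3, 1, 3]               # head index per token
parse = np.eye(n)[gold_heads]              # row i is one-hot at head(i)
attn = inject_parse_head(attn, parse, head_idx=0)
```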
<!-- Semantic role labeling F1
|Model|DEV|TEST-WSJ*|TEST-BROWN*|
|-|-|-|-|
|||||
|SA|80.18|83.78|71.96|
|MTL|80.83|85.67|72.69|
|||||
|g-LISA|80.80|85.65|71.58|
|g-LISA w/o label supervision*|80.84|85.03|71.63|
|||||
|g-LISA w/ gold injection|86.11|88.09|77.37|
|g-LISA w/ gold injection w/o label supervision|86.15|87.95|77.10|
|g-LISA w/ gold injection w/o parsing supervision|86.68|88.88|77.51|
|||||
|g-LISA w/ biaffine injection|81.58|85.60|72.26|
|g-LISA w/ biaffine injection w/o label supervision|81.47|84.78|71.71|
|g-LISA w/ biaffine injection w/o parse supervision|81.35|85.54|71.59|
|||||
|g-LISA w/ crfnp injection|81.50|83.23|71.35|
|g-LISA w/ crfnp injection injection w/o label supervision|81.16|83.80|71.97|
|g-LISA w/ crfnp injection injection w/o parse supervision|81.53|83.97|71.44| -->
##### Conclusion?
- No need for parsing supervision if external injection is viable
<!-- |g-LISA w/o label supervision (strangely good)|81.42|85.27|72.12| -->
##### Questions:
- Why does LISA do worse than biaffine injection on PTB but better on Brown?
- What's the relation between no supervision and dependency-head supervision only?
- What's the linking hypothesis for the model's performance?
- UAS/LAS? (F1)
- Cross-entropy? (Accuracy)
#### Q: individual vs. aggregation vs. gold
| Model | Dev P | Dev R | Dev F1 | | WSJ P | WSJ R | WSJ F1 | | Brown P | Brown R | Brown F1 |
| --------------------------- | ------- | ------- | ------- | --- | ----- | ----- | ----- | --- | ----- | ----- | ----- |
|`g-LISA` + gold injection|||86.68||||88.88||||77.51|
|`a-LISA` + mean aggregation|||~82.2||||-||||-|
|`a-LISA` + mean aggregation|||~82.4||||-||||-|
|`g-LISA` + biaffine injection|||81.35||||85.54||||71.59|
|`g-LISA` + crfnp injection|||81.53||||83.97||||71.44|
#### Q: source dependency graph?
- from individual parsers
- fully connected graph
<!-- ||||||||||
|`LISA` -label supervision|||80.84||||85.03||||71.63|
|`LISA` -label supervision + gold injection|||86.15||||87.95||||77.10|
|`LISA` -label supervision + biaffine injection|||81.47||||84.78||||71.71|
|`LISA` -label supervision + crfnp injection|||81.16||||83.80||||71.97|
||||||||||
|`LISA` -parsing supervision + gold injection|||86.68||||88.88||||77.51|
|`LISA` -parsing supervision + biaffine injection|||81.35||||85.54||||71.59|
|`LISA` -parsing supervision + crfnp injection|||81.53||||83.97||||71.44| -->
<!-- #### Question: How does the injection quality influence the performance?
Semantic role labeling F1
|Model|DEV|TEST-WSJ*|TEST-BROWN*|
|-|-|-|-|
|||||
|g-LISA w/ gold injection|86.11|88.09|77.37|
|g-LISA w/ gold injection w/o parsing supervision|86.68|88.88|77.51|
|||||
|b-LISA w/ biaffine-injection|81.71|84.90|71.28|
|b-LISA w/ biaffine-injection w/o parse supervision|81.75|85.60|70.96|
|||||
|b-LISA w/ crfnp-injection|81.16|84.52|72.55|
|b-LISA w/ crfnp-injection w/o parse supervision|81.75|85.60|72.04|
|||||
|b-LISA w/ gold-injection|85.20|87.17|75.93|
|b-LISA w/ gold-injection w/o parse supervision|-|-|-|
|||||
|m-LISA w/ mean-aggregation |-|-|-|
|m-LISA w/ mean-aggregation w/o parse supervision|-|-|-|
|||||
|l-LISA w/ linear-aggregation |-|-|-|
|l-LISA w/ linear-aggregation w/o parse supervision|-|-|-|
|||||
|s-LISA w/ stochastic-aggregation|-|-|-|
|s-LISA w/ stochastic-aggregation w/o parse supervision|-|-|-|
#### Conjectures:
- Parse supervision has little effect, injection is the biggest contributor
- Higher quality parse -> better srl performance
-->
<!-- #### Baseline: Self-attention
#### LISA: what role does label supervision do?
* the following experiment assumes LISA will use the prediction from either its own prediction or an external parser, or better off a gold parse.
##### Experiment 1. LISA (done by LISA)
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |80.62|93.05|
|test.wsj* |83.17 | 94.72|
|test.brown*|72.13 |90|
##### Experiment 2. LISA-gold (done by LISA)
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |86.11| 93.20|
|test.wsj* |88.40 | 95.03|
|test.brown*|76.73 |89.84|
##### Experiment 3. LISA-biaffine (no pre-training)
##### Experiment 3. LISA without label supervision
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |80.68| 93.1|
##### Experiment 4. LISA without label supervision & gold parse supplied in eval
|Section | srl_f1|head_accuracy|
|--------|-------|--|
|dev |?| ?|
|test.wsj* |? | ?|
|test.brown*|? |?|
-->
<!-- #### LISA: Can parse supervision be omitted if gold parse is supplied?
-->
<!-- ##### Experiment 3. LISA with stanford SR (or better off SR parser) parser
##### Experiment 4. LISA with CRF parser
##### Experiment 5. LISA with aggregated prediction
#### LISA without dependency-head/label objective,
* the following experiment assumes LISA will use the prediction from either its own prediction or an external parser, or better off a gold parse.
##### Experiment 2. LISA with biaffine prediction
##### Experiment 3. LISA with stanford SR (or better off SR parser) parser
##### Experiment 4. LISA with CRF parser
##### Experiment 5. LISA with aggregated prediction -->
<!-- ### Notes
* [Linguistically-informed self-attention](https://arxiv.org/pdf/1804.08199.pdf)
#### LISA-basic
`{
"joint_pos_predicate": 2,
"parse_head": 4,
"parse_label": 4,
"parse_attention": 5,
"srl": 11
}`
* do joint pos-predicate supervision at 2nd layer
* do parse_head & label supervision at 4th layer
* do srl supervision at 11th layer
* Use predicted dependency graph as "special attention" to layer 5, using a special attention head
##### Strange things
* ~~why in the world did they just literally "inject" the dependency for the "dependency attention matrix", leaving everything else intact (since the output of the 4-th layer is also computed using the "attention matrix" for prediction) ?~~
-->
## Dependency-discounting LISA
### Central idea:
* discounting the attention matrix via one of two constructions (see the sketch below)
1. first compute the directed $k$-reachable matrix, then symmetrize it into an undirected graph
* $D_{r} := Dep^{k}$
* $D := D_r + D_r^T$
2. first symmetrize into an undirected graph, then compute the $k$-reachable matrix
* $D_{b} := Dep + Dep^T$
* $D := D_b^k$
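A minimal numpy sketch of the two constructions and of how the resulting reachability mask could discount attention; accumulating paths of length up to $k$ and the `alpha` down-weighting factor are assumptions of this sketch:

```python
import numpy as np

def k_reach(dep, k, symmetrize_first=True):
    """Boolean k-reachability matrix for a dependency graph.

    dep[i, j] = True iff j is the head of i.
    symmetrize_first=True : D_b = Dep + Dep^T, then D = D_b^k  (construction 2)
    symmetrize_first=False: D_r = Dep^k, then D = D_r + D_r^T  (construction 1)
    Paths of length 0..k are accumulated, so every token also reaches itself.
    """
    A = (dep | dep.T).astype(int) if symmetrize_first else dep.astype(int)
    reach = np.eye(len(dep), dtype=int)
    step = np.eye(len(dep), dtype=int)
    for _ in range(k):
        step = step @ A
        reach = reach + step
    reach = reach > 0
    if not symmetrize_first:
        reach = reach | reach.T
    return reach

def discount_attention(attn, reach, alpha=0.1):
    """Down-weight attention to tokens outside the k-reachable set, then renormalize."""
    scaled = np.where(reach, attn, alpha * attn)
    return scaled / scaled.sum(axis=-1, keepdims=True)

# toy usage: heads [0, 0, 3, 1, 3] -> arc i -> head(i)
heads = [0, 0, 3, 1, 3]
dep = np.zeros((5, 5), dtype=bool)
dep[np.arange(5), heads] = True
mask = k_reach(dep, k=2)
attn = np.full((5, 5), 0.2)
print(discount_attention(attn, mask))
```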
### Experiments
#### Q: Discounting vs. Injection
#### Q: Layer to perform discounting?
#### Q: source dependency graph?
<!-- ### Motivation
* One arc at a time is too slow, let's allow transformer to look at tokens multiple arcs away
* Transformer matrix is basically on 2-$\infty$ configuration
* LISA-dependency injection is basically 1-1 configuration
---------------------
* Regulating transformer by not allowing it to attend to unrelated tokens
* She eats a beautiful fish
* (She, beautiful) isn't linked in any sense
* BERT learns dependency via attention, can we do the opposite? (injecting dependency so that the model don't have to discover the structure by itself -> data efficiency)
* Forcing the transformer to attend in accordance to dependency structure (Of course, it knows the dependency structure, though the structure may not be true)
* May give an explaination to many of the syntactic phenomenon
* Subject-verb agreement
* Filler-gap dependency
* etc.
* In linguistic theories, the requirements/aggrements occur between dependent & head -->
<!-- ### Contribution:
* study to what extend should one collapse the dependency length.
* One selling point of transformer is that it can "capture" long dependency in O(1) time -->
<!-- ### Deficit on the motivation
Loss in flexibility
* subject/object "relation" established via copula (may be solvable via 2-configuration)
* She is beautiful
* (She, beautiful) is linked via copula
* relation(dependent, head)
* cop(beautiful, is)
* nsubj(She, is)
* while in semantics, should it be is(she, beautiful)
* coreference?
* -> does it mean it can only apply to several sub-layers?
-->