# "Double CoNLL-U" format
Two files: one for the original text (CoNLL-U Plus with correction annotations) and one for the target (regular CoNLL-U).
Example:
> *Vi will på att du tryggt känner därför vi har ökat över vakningen.
This contains several errors:
- _will_ instead of _vill_ (O)
- unnecessary _på_ (what label does it get? S? L?)
- _tryggt_ instead of _trygg_ (M)
- _känner_ instead of _ska känna dig_ (M+S?)
- word order issues: _trygg_ should go last, _har_ should be swapped with the second _vi_ (S)
- missing punctuation before _därför_ (P). __This also means that _därför_ should be capitalized, but I don't remember whether we are supposed to keep track of that, and how__
- compounding problem: _över vakningen_ should be written as a single word (L, probably)
A possible corrected version of the sentence is:
> Vi vill att du ska känna dig trygg. Därför har vi ökat övervakningen.
## Original (CoNLL-U+)
```
# global.columns = ID FORM A D R TID
# sent_id = e1s1
1 Vi _ _ _ 1:1
2 will _ _ O 1:2
3 på _ S _ _
4 att _ _ _ 1:3
5 du _ _ _ 1:4
6 tryggt _ _ M|S 1:8
7 känner SS _ M|S 1:5-7
8 därför P _ _ 2:1
9 vi _ _ S:1 2:3
10 har _ _ S:2 2:2
11 ökat _ _ _ 2:4
12 över _ _ L:1 2:5
13 vakningen _ _ L:2 2:5
14 . _ _ _ 2:6
```
Where:
- __`sent_id`__ is in the form `ensm`, meaning "essay n, sentence m"
- __`TID`__ stands for Target ID(s) and indicates the IDs of the (corrected) word(s) __in the target file__. The format is `n:m`, where `n` refers to the sentence and `m` to the token. The essay number can also, but does not need to, be included.
For the shared task, this last column can be simply dropped.
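
A minimal sketch of how the original file could be read and how `TID` could be dropped, in Python. It assumes whitespace-separated columns as in the example above (real CoNLL-U files are tab-separated) and that every sentence starts with a `# sent_id` comment. The function names `parse_conllu_plus` and `drop_column` are hypothetical, not part of any existing tool.

```python
def parse_conllu_plus(text):
    """Parse a CoNLL-U Plus string into (columns, sentences).

    Each sentence is a dict with a "sent_id" and a list of "tokens",
    where every token is a dict keyed by the names in global.columns.
    Assumption: columns are separated by whitespace.
    """
    columns, sentences = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# global.columns"):
            columns = line.split("=", 1)[1].split()
        elif line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            sentences.append({"sent_id": sent_id, "tokens": []})
        elif not line.startswith("#"):
            # Token line: pair each field with its column name.
            sentences[-1]["tokens"].append(dict(zip(columns, line.split())))
    return columns, sentences


def drop_column(columns, sentences, name):
    """Return copies of columns and sentences without the column `name`."""
    kept = [c for c in columns if c != name]
    stripped = [
        {
            "sent_id": s["sent_id"],
            "tokens": [{c: t.get(c, "_") for c in kept} for t in s["tokens"]],
        }
        for s in sentences
    ]
    return kept, stripped
```

Calling `drop_column(columns, sentences, "TID")` on the parsed example would give the version without the alignment column.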
## Target (CoNLL-U "Minus")
```
# sent_id = e1s1
1 Vi ...
2 vill ...
3 att ...
4 du ...
5 ska ...
6 känna ...
7 dig ...
8 trygg ...
9 . ...
# sent_id = e1s2
1 Därför ...
2 har ...
3 vi ...
4 ökat ...
5 övervakningen ...
6 . ...
```
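
To illustrate how a `TID` links an original token to the target file, here is a rough Python sketch. It handles the two shapes used in the example, `n:m` and `n:m-k`, plus `_` for tokens without a counterpart; the optional essay number is not handled. The helper names `parse_tid` and `target_forms`, and the representation of the target as a dict from sentence number to list of forms, are assumptions made for the example.

```python
def parse_tid(tid):
    """Turn a TID such as "1:3" or "1:5-7" into (sentence, [token ids]).

    Returns None for "_" (no counterpart in the target file).
    """
    if tid == "_":
        return None
    sentence, tokens = tid.split(":")
    first, _, last = tokens.partition("-")
    last = last or first  # a single token is a range of length one
    return int(sentence), list(range(int(first), int(last) + 1))


def target_forms(tid, target):
    """Look up the target word forms that a TID points to.

    `target` maps a sentence number to the list of its forms, in order.
    """
    parsed = parse_tid(tid)
    if parsed is None:
        return []
    sentence, token_ids = parsed
    # Token IDs are 1-based, Python lists are 0-based.
    return [target[sentence][i - 1] for i in token_ids]
```

With the target sentences above, `target_forms("1:5-7", target)` would yield the three tokens that replace _känner_.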
## Double CoNLL-U for error patterns
A variant of this "Double CoNLL-U" format can also be used as an output format for the pattern extraction step of whatever I'm trying to do.
Desired properties:
- named patterns (with names generated automatically from the A/D/R columns, the error label and the morphosyntactic annotation). Names can be stored as metadata
- abstract patterns, i.e. with irrelevant subtrees replaced with placeholders (placeholder names could be their deplabels, so that this info can be used for GF generation)
- link between original and target via TIDs
- target file must be plain CoNLL-U for later conversion to GF
### Tentative example
Inspired by the sentence:
```
# sent_id = eG69GT4s145
1 Paris _ _ _ 145:1 _ Paris nsubj Case=Nom|Definite=Ind|Gender=Neut|Number=Sing _ 5 NOUN _
2 är _ _ _ 145:2 _ vara cop Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act _ 5 AUX _
2.1 _ M-Def _ _ 145:3 _ _ _ _ _ _ _ _
3 fin _ _ _ 145:4 _ fin amod Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing _ 5 ADJ _
4 stad _ _ _ 145:5 _ stad root Case=Nom|Definite=Ind|Gender=Com|Number=Sing _ 0 NOUN _
5 . _ _ _ 145:6 _ . punct _ _ 5 PUNCT _
```
#### Original (CoNLL-U Plus)
```
# global.columns = ID POS HEAD DEPREL TID
# pattern_id = missing_det_noun_cop
1 NOUN 3 nsubj 1
2 AUX 3 cop 2
2.1 _ _ _ 3
3 NOUN 0 root 4
```
Note that:
- `TID` becomes a lot simpler because the two files are always aligned pattern by pattern, so `TID=2` here means the second token of pattern `missing_det_noun_cop`
- this is surely far too minimal: at the very least, feats have to be kept for replacements. But the idea is to put an underscore wherever a generalization is possible
- another idea could be to keep only the correction info, i.e. to remove morphosyntactic annotations from all tokens except the incorrect ones
#### Target (plain CoNLL-U)
```
# pattern_id = missing_det_noun_cop
1 <nsubj_NOUN> _ NOUN _ _ 4 nsubj _ _
2 <cop_AUX> _ AUX _ _ 4 cop _ _
3 <det_DET> _ DET _ _ 4 det _ _
4 <root_NOUN> _ NOUN _ _ 0 root _ _
```
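
A possible way to generate such placeholder rows, sketched in Python. The `<deprel_POS>` naming with an underscore separator and the function name `placeholder_row` are assumptions; the ten fields follow the standard CoNLL-U column order (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC) and are joined with tabs, as plain CoNLL-U requires.

```python
def placeholder_row(token_id, upos, head, deprel):
    """Build one plain CoNLL-U row whose FORM is a placeholder.

    The placeholder combines the dependency label and the POS tag,
    e.g. "<det_DET>"; every other generalizable field is left as "_".
    """
    form = f"<{deprel}_{upos}>"
    fields = [
        str(token_id),  # ID
        form,           # FORM (placeholder)
        "_",            # LEMMA
        upos,           # UPOS
        "_",            # XPOS
        "_",            # FEATS
        str(head),      # HEAD
        deprel,         # DEPREL
        "_",            # DEPS
        "_",            # MISC
    ]
    return "\t".join(fields)
```

Keeping the deplabel inside the placeholder means the information needed for later GF generation stays available even after the subtree itself has been abstracted away.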
###### tags: `phd` `gec`