"Double CoNNL-U" format

# "Double CoNLL-U" format

2 files: one for the original (CoNLL-U Plus with correction annotations), one for the target (regular CoNLL-U).

Example:

> *Vi will på att du tryggt känner därför vi har ökat över vakningen.

This contains several errors:

- _will_ instead of _vill_ (O)
- unnecessary _på_ (what label does it get? S? L?)
- _tryggt_ instead of _trygg_ (M)
- _känner_ instead of _ska känna dig_ (M+S?)
- word order issues: _trygg_ should go last, _har_ should be swapped with the second _vi_ (S)
- missing punctuation before _därför_ (P). __This also means that _därför_ should be capitalized, but I don't remember if we are to keep track of that and how__
- problems with compounding: _över vakningen_ should be written as a single word (L, probably)

A possible corrected version of the sentence is:

> Vi vill att du ska känna dig trygg. Därför har vi ökat övervakningen.

## Original (CoNLL-U+)

```
# global.columns = ID FORM A D R TID
# sent_id = e1s1
1	Vi	_	_	_	1:1
2	will	_	_	O	1:2
3	på	_	S	_	_
4	att	_	_	_	1:3
5	du	_	_	_	1:4
6	tryggt	_	_	M|S	1:8
7	känner	SS	_	M|S	1:5-7
8	därför	P	_	_	2:1
9	vi	_	_	S:1	2:3
10	har	_	_	S:2	2:2
11	ökat	_	_	_	2:4
12	över	_	_	L:1	2:5
13	vakningen	_	_	L:2	2:5
14	.	_	_	_	2:6
```

Where:

- __`sent_id`__ is in the form `ensm`, meaning "essay n, sentence m"
- __`TID`__ stands for Target ID(s) and indicates the IDs of the (corrected) word(s) __in the target file__. The format is `n:m`, where `n` refers to the sentence and `m` to the token. The essay number can also, but does not need to, be included.

For the shared task, this last column can be simply dropped.

## Target (CoNLL-U "Minus")

```
# sent_id = e1s1
1	Vi	...
2	vill	...
3	att	...
4	du	...
5	ska	...
6	känna	...
7	dig	...
8	trygg	...
9	.	...

# sent_id = e1s2
1	Därför	...
2	har	...
3	vi	...
4	ökat	...
5	övervakningen	...
6	.	...
```

## Double CoNLL-U for error patterns

A variant of this "Double CoNLL-U" format can also be used as an output format for the pattern extraction step of whatever I'm trying to do. Desired properties:

- named patterns (with names generated automatically using the A/D/R columns, error label and morphosyntactic annotation). Names can be metadata
- abstract patterns, i.e. with irrelevant subtrees replaced with placeholders (placeholder names could be their deplabels, so that that info can be used for GF generation)
- link between original and target via TIDs
- target file must be plain CoNLL-U for later conversion to GF

### Tentative example

Inspired by the sentence:

```
# sent_id = eG69GT4s145
1	Paris	_	_	_	145:1	_	Paris	nsubj	Case=Nom|Definite=Ind|Gender=Neut|Number=Sing	_	5	NOUN	_
2	är	_	_	_	145:2	_	vara	cop	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act	_	5	AUX	_
2.1	_	M-Def	_	_	145:3	_	_	_	_	_	_	_	_
3	fin	_	_	_	145:4	_	fin	amod	Case=Nom|Definite=Ind|Degree=Pos|Gender=Com|Number=Sing	_	5	ADJ	_
4	stad	_	_	_	145:5	_	stad	root	Case=Nom|Definite=Ind|Gender=Com|Number=Sing	_	0	NOUN	_
5	.	_	_	_	145:6	_	.	punct	_	_	5	PUNCT	_
```

#### Original (CoNLL-U Plus)

```
# global.columns = ID POS HEAD DEPREL TID
# pattern_id = missing_det_noun_cop
1	NOUN	3	nsubj	1
2	AUX	3	cop	2
2.1	_	_	_	3
3	NOUN	0	root	4
```

Note that:

- `TID` becomes a lot simpler because the two files are always pattern-aligned, so `TID=2` here means the second token of pattern `missing_det_noun_cop`
- this is surely way too minimal: at least, feats have to be kept for replacements. But the idea is to put an underscore wherever there can be a generalization
- another idea could be to only keep the correction info, i.e. removing morphosyntactic annotations from all tokens except the incorrect ones

#### Target (plain CoNLL-U)

```
# pattern_id = missing_det_noun_cop
1	<nsubj_NOUN>	_	NOUN	_	_	4	nsubj	_	_
2	<cop_AUX>	_	AUX	_	_	4	cop	_	_
3	<det-DET>	_	DET	_	_	4	det	_	_
4	<root-NOUN>	_	NOUN	_	_	0	root	_	_
```

###### tags: `phd` `gec`
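
To make the `TID` convention of the first example (essay `e1s1`) concrete, here is a minimal Python sketch that expands `TID` values and lists the target tokens that no original token points at. It is only an illustration under a few assumptions that are not part of the format description: columns are whitespace-separated, a `TID` is either `_`, `n:m` or a range `n:m-k`, and the optional essay number is not used. The helper names `parse_tid`, `read_rows` and `uncovered_targets` are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Assumed TID shapes: "n:m" or "n:m-k" (no essay prefix handled here).
TID_RE = re.compile(r"^(\d+):(\d+)(?:-(\d+))?$")

# The original-side example, copied from the first section of this note.
ORIGINAL = """
# global.columns = ID FORM A D R TID
# sent_id = e1s1
1 Vi _ _ _ 1:1
2 will _ _ O 1:2
3 på _ S _ _
4 att _ _ _ 1:3
5 du _ _ _ 1:4
6 tryggt _ _ M|S 1:8
7 känner SS _ M|S 1:5-7
8 därför P _ _ 2:1
9 vi _ _ S:1 2:3
10 har _ _ S:2 2:2
11 ökat _ _ _ 2:4
12 över _ _ L:1 2:5
13 vakningen _ _ L:2 2:5
14 . _ _ _ 2:6
"""


def parse_tid(tid: str) -> List[Tuple[int, int]]:
    """Expand a TID into (sentence, token) pairs; "_" means no target token."""
    if tid == "_":
        return []
    match = TID_RE.match(tid)
    if match is None:
        raise ValueError(f"unrecognised TID: {tid!r}")
    sent = int(match.group(1))
    start = int(match.group(2))
    end = int(match.group(3) or start)  # single token when no range is given
    return [(sent, tok) for tok in range(start, end + 1)]


def read_rows(block: str) -> List[Dict[str, str]]:
    """Read token rows, naming the fields after the global.columns comment."""
    columns: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in block.strip().splitlines():
        line = line.strip()
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line and not line.startswith("#"):
            rows.append(dict(zip(columns, line.split())))
    return rows


def uncovered_targets(
    rows: List[Dict[str, str]], target_lengths: Dict[int, int]
) -> List[Tuple[int, int]]:
    """Return target (sentence, token) pairs that no original token links to."""
    covered = {pair for row in rows for pair in parse_tid(row["TID"])}
    return [
        (sent, tok)
        for sent, length in sorted(target_lengths.items())
        for tok in range(1, length + 1)
        if (sent, tok) not in covered
    ]


rows = read_rows(ORIGINAL)
# Target sentence lengths taken from the target example: 9 and 6 tokens.
print(uncovered_targets(rows, {1: 9, 2: 6}))  # [(1, 9)]
```

The only uncovered target token is `1:9`, the full stop that closes the first target sentence. This matches the `P` annotation on _därför_, so a check of this kind could perhaps serve as a sanity test that every inserted target token is accounted for by some annotation.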