# Baselines for MAIA Dataset -- Chat MT

## Pretrained models - w/out finetuning

### Evaluation on agent data -- sentence level w/out context

In this first table, I present the results for the single-language-pair MarianMT models for the LPs for which they are available (`EN-DE`, `EN-FR`, `EN-ZH`).

| Language Pair | Model | BLEU | COMET | Observations |
| -------- | -------- | -------- | ------ | ------ |
| `en-de` `client 01` | `Helsinki-NLP/opus-mt-en-de` | 46.69 | 0.6192 | 900 |
| `en-de` `client 02` | `Helsinki-NLP/opus-mt-en-de` | 14.23 | 0.4992 | 2812 |
| `en-fr` | `Helsinki-NLP/opus-mt-en-fr` | 25.92 | 0.7322 | 1441 |
| `en-zh` | `Helsinki-NLP/opus-mt-en-zh` | 4.24 | 0.3479 | 647 |

Finally, I used the MarianMT `en-multi` pretrained model, which supports all LPs except `en-pt_br`. For `en-pt_br`, I am reporting results with `en-pt`.

| Language Pair | BLEU | COMET | Observations |
| -------- | -------- | ------ | ------ |
| `en-de` `client 01` | 28.76 | 0.2678 | 900 |
| `en-de` `client 02` | 37.31 | 0.4929 | 2812 |
| `en-fr` | 43.95 | 0.5703 | 1441 |
| `en-zh` | 12.25 | 0.0905 | 647 |
| `en-pt` | 33.11 | 0.5947 | 549 |
| `en-pt_br` `client 02` | 37.29 | 0.7207 | 671 |
| `en-pt_br` `client 04` | 32.90 | 0.5953 | 893 |

## Pretrained models - w/ finetuning

Finetuning was done for the following LPs: `en-de client-01`, `en-fr`, and `en-zh`.

| Language Pair | Pretrained model | BLEU | COMET | Observations |
| -------- | -------- | -------- | ------ | ------ |
| `en-de` `client 01` | `Helsinki-NLP/opus-mt-en-de` | 78.16 | 0.8452 | 900 |
| `en-fr` | `Helsinki-NLP/opus-mt-en-fr` | 83.82 | 1.1524 | 1441 |
| `en-zh` | `Helsinki-NLP/opus-mt-en-zh` | 18.12 | 0.9311 | 647 |

The `EN-DE` model finetuned on `en-de` `client 01` was also evaluated on `en-de` `client 02`: BLEU -- 60.07; COMET -- 0.8608.

## Checklist

- [ ] Finetune a multilingual model on all the data?
- [ ] It seems that little data is needed to improve performance (e.g., the `EN-ZH` data used to finetune the original pretrained model contained only ~2-2.5k samples).
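
## Reproduction sketches

Below is a minimal, untested sketch of how the sentence-level baselines above could be reproduced with Hugging Face `transformers`, `sacrebleu`, and `unbabel-comet`. The toy data, the COMET checkpoint (`Unbabel/wmt22-comet-da`), and the decoding settings are assumptions; the reported scores may come from a different COMET version, so absolute values will not match exactly.

```python
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer
from comet import download_model, load_from_checkpoint

# Hypothetical test set; the MAIA agent-side data would be loaded from its own files.
sources = ["Hello, how can I help you today?"]
references = ["Hallo, wie kann ich Ihnen heute helfen?"]

# Translate sentence by sentence (no context), as in the tables above.
model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
batch = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
hypotheses = tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

# Corpus-level BLEU.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET: the checkpoint below is an assumption; older unbabel-comet releases
# (e.g. wmt20-comet-da) score on a different scale than newer ones.
comet_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_path)
comet_out = comet_model.predict(
    [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)],
    batch_size=8,
    gpus=0,
)
print(f"COMET: {comet_out.system_score:.4f}")
```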
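
Similarly, a rough sketch of the finetuning setup, assuming the standard `Seq2SeqTrainer` recipe from `transformers`. The in-memory dataset, output path, and hyperparameters (batch size, learning rate, epochs) are placeholders, not the settings used for the results above.

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-en-zh"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Placeholder parallel data; the actual ~2-2.5k MAIA EN-ZH pairs would be loaded here.
train = Dataset.from_dict({
    "src": ["Thanks for contacting us, how can I help?"],
    "tgt": ["感谢您联系我们，我能帮您什么？"],
})

def preprocess(batch):
    model_inputs = tokenizer(batch["src"], truncation=True, max_length=256)
    # text_target requires a recent transformers; older versions use as_target_tokenizer().
    labels = tokenizer(text_target=batch["tgt"], truncation=True, max_length=256)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = train.map(preprocess, batched=True, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-zh-maia",   # placeholder output path
    per_device_train_batch_size=16,    # placeholder hyperparameters
    learning_rate=2e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("opus-mt-en-zh-maia")
```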