# Baselines for MAIA Dataset -- Chat MT
## Pretrained models - w/out finetuning
### Evaluation on agent data -- sentence level w/out context
In this first table, I present the results for the bilingual (single-language-pair) MarianMT models for the LPs where such models are available (`EN-DE`, `EN-FR`, `EN-ZH`); a minimal evaluation sketch follows the table.

| Language Pair | Model | BLEU | COMET | Observations
| -------- | -------- | -------- | ------| ------|
| `en-de` `client 01` | `Helsinki-NLP/opus-mt-en-de` | 46.69 | 0.6192 | 900
| `en-de` `client 02` | `Helsinki-NLP/opus-mt-en-de` | 14.23 | 0.4992 | 2812
| `en-fr` | `Helsinki-NLP/opus-mt-en-fr` | 25.92 | 0.7322 | 1441
| `en-zh` | `Helsinki-NLP/opus-mt-en-zh` | 4.24 | 0.3479 | 647
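
For reference, below is a minimal sketch of how these sentence-level scores can be reproduced with Hugging Face `transformers`, `sacrebleu`, and the Unbabel `comet` package. The toy data, batch sizes, and the COMET checkpoint (`wmt20-comet-da`) are assumptions, not necessarily the exact setup behind the numbers above.

```python
from transformers import MarianMTModel, MarianTokenizer
import sacrebleu

model_name = "Helsinki-NLP/opus-mt-en-de"  # swap per LP (en-fr, en-zh, ...)
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(sentences, batch_size=16):
    """Translate each sentence independently (no dialogue context)."""
    hypotheses = []
    for i in range(0, len(sentences), batch_size):
        batch = tokenizer(sentences[i:i + batch_size], return_tensors="pt",
                          padding=True, truncation=True)
        generated = model.generate(**batch)
        hypotheses.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return hypotheses

# Toy example; in practice these are the agent-side source/reference segments.
sources = ["Hello, how can I help you today?"]
references = ["Hallo, wie kann ich Ihnen heute helfen?"]
hypotheses = translate(sources)

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET (assumption: unbabel-comet with an unbounded checkpoint such as
# "wmt20-comet-da"; recent versions return a prediction object, older 1.x
# versions return a (segment_scores, system_score) tuple).
from comet import download_model, load_from_checkpoint
comet_model = load_from_checkpoint(download_model("wmt20-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
prediction = comet_model.predict(data, batch_size=8, gpus=0)
print("COMET:", getattr(prediction, "system_score", prediction))
```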
Finally, I used the MarianMT `en-multi` pretrained model, which supports all LPs except `en-pt_br`. For `en-pt_br`, I report results obtained with the `en-pt` direction. A usage sketch follows the table.

| Language Pair | BLEU | COMET | Observations
| -------- | -------- | ------| ------|
| `en-de` `client 01` | 28.76 | 0.2678 | 900
| `en-de` `client 02` | 37.31 | 0.4929 | 2812
| `en-fr` | 43.95 | 0.5703 | 1441
| `en-zh` | 12.25 | 0.0905 | 647
| `en-pt` | 33.11 | 0.5947 | 549
| `en-pt_br` `client 02` | 37.29 | 0.7207 | 671
| `en-pt_br` `client 04` | 32.90 | 0.5953 | 893
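
Below is a possible sketch for the multilingual checkpoint, assuming it is a Helsinki-NLP multi-target model (e.g. `Helsinki-NLP/opus-mt-en-mul`) that selects the target language with a `>>xxx<<` prefix token; the exact checkpoint name and language codes are assumptions and should be checked against the model card.

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed multilingual checkpoint; the supported >>xxx<< target-language
# tokens are listed on the model card.
model_name = "Helsinki-NLP/opus-mt-en-mul"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate_to(sentence, lang_token):
    """Prefix the source with the target-language token, e.g. '>>deu<<'."""
    inputs = tokenizer(f"{lang_token} {sentence}", return_tensors="pt")
    generated = model.generate(**inputs)
    return tokenizer.decode(generated[0], skip_special_tokens=True)

print(translate_to("How can I help you today?", ">>deu<<"))  # en-de
print(translate_to("How can I help you today?", ">>fra<<"))  # en-fr
# en-pt_br is not supported directly, so the generic Portuguese direction
# is used instead (hence the en-pt results reported above).
print(translate_to("How can I help you today?", ">>por<<"))  # en-pt
```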
## Pretrained models - w/ finetuning
Finetuning was done for the following LPs: `en-de client-01`, `en-fr`, and `en-zh`.

| Language Pair | Pretrained model | BLEU | COMET | Observations
| -------- | -------- | -------- | ------| ------|
| `en-de` `client 01` | `Helsinki-NLP/opus-mt-en-de` | 78.16 | 0.8452 | 900
| `en-fr` | `Helsinki-NLP/opus-mt-en-fr` | 83.82 | 1.1524 | 1441
| `en-zh` | `Helsinki-NLP/opus-mt-en-zh` | 18.12 | 0.9311 | 647
The `EN-DE` model finetuned on `en-de` `client 01` was also evaluated on `en-de` `client 02`: BLEU -- 60.07; COMET -- 0.8608. A finetuning sketch follows.
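
A minimal finetuning sketch with `Seq2SeqTrainer` is shown below. The training pairs, hyperparameters, and output directory are illustrative assumptions, not the exact settings behind the numbers above, and `text_target=` requires a reasonably recent `transformers` version.

```python
from transformers import (MarianMTModel, MarianTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)
from datasets import Dataset

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Hypothetical in-domain pairs; in practice these come from the
# MAIA en-de client-01 training split.
train_pairs = [{"src": "How can I help you?", "tgt": "Wie kann ich Ihnen helfen?"}]

def preprocess(example):
    model_inputs = tokenizer(example["src"], truncation=True, max_length=128)
    labels = tokenizer(text_target=example["tgt"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_ds = Dataset.from_list(train_pairs).map(preprocess, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-de-maia-client01",  # assumed output path
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```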
## Checklist
- [ ] Finetune a multilingual model on all the data?
- [ ] Note: a large amount of data does not seem necessary to improve performance (e.g., the `EN-ZH` data used to finetune the original pretrained model contained only ~2-2.5k samples).