## Evaluation Results (06-27-2022)

## Data description

The dataset used for evaluation is composed of cross-annotated data from sessions 1 to 3. It contains 105 different strings and 408 annotations. Each annotation was discussed and validated as a “golden standard label.”

> Important note: We agreed that the final dataset split should be as follows: train on the first four sessions, test on the fifth session, and cross-validate by time. However, we still need the clean version of the data to do that. For evaluation, we currently use golden data from sessions 1 to 3. The training step samples data that includes part of that set, so there could still be some overlap between training and testing data.

## Evaluation Results

Notes regarding ELQ models:

- ELQ fine-tuned version uses a "joint" threshold of -1.5
- ELQ fine-tuned latest version uses a "thresholded_entity_by_mention" threshold of -1.5
- ELQ off-the-shelf uses a "joint" threshold of -4.5

**"Higher thresholds decrease the number of output candidates (up to 0.0). Lower thresholds increase the number of output candidates."**

The PR curves are evaluated over thresholds from 0.0 down to -10.

### Tagging

![](https://i.imgur.com/Wu9xfBb.png)

### Linking

![](https://i.imgur.com/rr1JkXG.png)

### End-to-end

![](https://i.imgur.com/0WdiFJc.png)

## PR curves

### Tagging

**Off-the-shelf model**

![](https://i.imgur.com/U6Jv4EI.png)

**Fine-tuned model**

![](https://i.imgur.com/S9LBZme.png)

### End-to-end

**Off-the-shelf model**

![](https://i.imgur.com/Ustr13j.png)

**Fine-tuned model**

![](https://i.imgur.com/kArNwWL.png)
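
### Threshold sweep sketch

For reference, here is a minimal sketch of how a PR curve can be computed by sweeping a score threshold from 0.0 down to -10, as described in the notes. It is not the ELQ API or our actual evaluation script: the `(mention, entity, score)` tuple layout, the function names, the 0.5 step size, and the convention of reporting precision as 1.0 when no candidate survives are all assumptions made for illustration.

```python
# Hypothetical data layout: each prediction is (mention, entity, score),
# where score is a non-positive value such as a log-probability.
Prediction = tuple[str, str, float]
Label = tuple[str, str]


def precision_recall_at(
    predictions: list[Prediction], gold: set[Label], threshold: float
) -> tuple[float, float]:
    """Keep candidates scoring at or above `threshold`, then score them."""
    kept = {(mention, entity) for mention, entity, score in predictions
            if score >= threshold}
    true_positives = len(kept & gold)
    # Assumed convention: precision is 1.0 when nothing is predicted.
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def pr_curve(
    predictions: list[Prediction], gold: set[Label], thresholds: list[float]
) -> list[tuple[float, float, float]]:
    """Return (threshold, precision, recall) for every threshold."""
    return [(t, *precision_recall_at(predictions, gold, t)) for t in thresholds]


# Sweep from 0.0 down to -10.0 in steps of 0.5 (step size is an assumption).
THRESHOLDS = [-i / 2 for i in range(21)]

if __name__ == "__main__":
    toy_predictions = [("a", "E1", -0.5), ("b", "E2", -2.0), ("c", "E3", -5.0)]
    toy_gold = {("a", "E1"), ("b", "E2"), ("d", "E4")}
    for threshold, precision, recall in pr_curve(toy_predictions, toy_gold, THRESHOLDS):
        print(f"{threshold:6.1f}  P={precision:.2f}  R={recall:.2f}")
```

On the toy data, lowering the threshold admits more candidates, so recall can only stay flat or rise while precision may drop. This matches the behavior noted for the ELQ thresholds.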