# Evaluation Results

# I. Test Data (complete dataset)

## 1. Data

![](https://i.imgur.com/HMeG1fq.png)

## 2. Qualitative reports

<i class="fa fa-file" style="font-size:24px"></i> **[Qualitative report - ELQ Fine-tuned 500 epochs](https://drive.google.com/drive/folders/1ka7MwnU31jILxuhzVysALzZkwdZQIE7d)**

## 3. Metrics

### 3.1 Tagging

![](https://i.imgur.com/29S2J0t.png)

### 3.2 Linking

We select the items that ELQ (FT) tagged correctly (a subset of the golden dataset) and pass them to each annotation system for the linking task. This gives every annotator the same inputs.

#### 3.2.1 Tag distribution

We divide the tagged items as follows:

- **Seen Tags**: Tags present in annotations of both the training set and the test set, so the memorization baseline already has some data for them.
- **Unseen Tags**: AATs present only in annotations of the test set. Here the memorization baseline should perform like a random baseline, since it has no information about them.
- **Mixed Tags**: Includes both Seen and Unseen tags.

![](https://i.imgur.com/YR6sVgB.png)

> **Example**
>
> **Dataset**
>
> | ID                | chunk_text | text                                         |
> |-------------------|------------|----------------------------------------------|
> | BM-A_1936-1012-44 | figure     | figure (woman) wearing rainbow dance costume |
> | BM-A_1936-1012-44 | costume    | figure (woman) wearing rainbow dance costume |
> | WCMA-0003401      | amulet     | amulet of isis with infant horus             |
> | BM-A_1992-1214-52 | bronze     | figure. made of bronze.                      |
> | BM-A_1992-1214-52 | figure     | figure. made of bronze.                      |
>
> **Train dataset**
>
> | ID                | chunk_text | text                                         |
> |-------------------|------------|----------------------------------------------|
> | BM-A_1936-1012-44 | figure     | figure (woman) wearing rainbow dance costume |
> | BM-A_1936-1012-44 | costume    | figure (woman) wearing rainbow dance costume |
> | WCMA-0003401      | amulet     | amulet of isis with infant horus             |
>
> **Test dataset**
>
> | ID                | chunk_text | text                    |
> |-------------------|------------|-------------------------|
> | BM-A_1992-1214-52 | bronze     | figure. made of bronze. |
> | BM-A_1992-1214-52 | figure     | figure. made of bronze. |
>
> **Unseen tags**: bronze
>
> **Seen tags**: figure, amulet, costume

#### 3.2.2 Results

![](https://i.imgur.com/hQbSuq6.png)

### 3.3 End-to-End

![](https://i.imgur.com/2a9Wriy.png)

## 4. PR curves

PR curves are computed on the validation dataset with the following hyper-parameters:

- Number of candidate mentions: 20
- Number of candidate entities: 10
- Threshold type: "thresholded entity by mention"
- Mention threshold: -1

The evaluation threshold ranges from -10 to 0 in steps of 1. Lower thresholds increase the number of output candidates.

### 4.1 ELQ fine-tuning

#### 4.1.1 Tagging

![](https://i.imgur.com/WWkl5R0.png)

#### 4.1.2 End-to-end

![](https://i.imgur.com/Pcxgpzp.png)

### 4.2 ELQ Off-the-shelf

#### 4.2.1 Tagging

![](https://i.imgur.com/NvrOMHI.png)

#### 4.2.2 End-to-end

![](https://i.imgur.com/byXrnHd.png)

---

# II. Test Data (session 5)

**Date:** 07/07/2022

## 1. Data

In this report, annotation systems are evaluated using single-annotated data from session 5, together with single-annotated and cross-annotated data from sessions 1 to 4.

![](https://i.imgur.com/wVOMza2.png)

## 2. Qualitative reports

<i class="fa fa-file" style="font-size:24px"></i> **[Qualitative report](https://drive.google.com/drive/folders/1ka7MwnU31jILxuhzVysALzZkwdZQIE7d?usp=sharing)**

{%pdf https://documentcloud.adobe.com/gsuiteintegration/index.html?state=%7B%22ids%22%3A%5B%221J9mHbHf2ILa9MDhUYXkS6Ga3vYkfC8ZE%22%5D%2C%22action%22%3A%22open%22%2C%22userId%22%3A%22106892680254741911727%22%2C%22resourceKeys%22%3A%7B%7D%7D %}

## 3. Metrics

### 3.1 Tagging

![](https://i.imgur.com/16O4DBT.png)

### 3.2 Linking

#### 3.2.1 Tag distribution

![](https://i.imgur.com/d7rUVfL.png)

#### 3.2.2 Results

![](https://i.imgur.com/EEnCMed.png)

### 3.3 End-to-End

![](https://i.imgur.com/0OZoAkr.png)

## 4. PR curves

### 4.1 ELQ fine-tuning

#### 4.1.1 Tagging

![](https://i.imgur.com/KKKz7Oq.png)

#### 4.1.2 End-to-end

![](https://i.imgur.com/mXEJXHf.png)

### 4.2 ELQ Off-the-shelf

#### 4.2.1 Tagging

![](https://i.imgur.com/vf9zM2O.png)

#### 4.2.2 End-to-end

![](https://i.imgur.com/PVvmTne.png)

---

# III. CV Evaluation

## 1. Qualitative reports

<i class="fa fa-file" style="font-size:24px"></i> **[Qualitative report](https://drive.google.com/drive/folders/1io4qTteh-iObBwuSgWYaK4zmDLLlyUH_?usp=sharing)**

## 2. Metrics per fold

### 2.1 Tagging

- **P**: Precision
- **R**: Recall
- **F**: F-score

![](https://i.imgur.com/sITAtE2.png)

### 2.2 Linking

- **P**: Precision

![](https://i.imgur.com/5cxxkj1.png)

### 2.3 End-to-end

- **P**: Precision
- **R**: Recall
- **F**: F-score

![](https://i.imgur.com/rssMErW.png)
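
---

The Seen/Unseen split defined under "Tag distribution" (section I, 3.2.1) can be sketched with plain set operations. This is a minimal illustration, not the code used to produce the reported figures: the `TagSplit` type and `split_tags` function are hypothetical names, and the rows are the ones from the example tables. It assumes a strict reading of the definition, where a tag counts as seen only if it occurs in both splits; tags that occur only in training (here `amulet` and `costume`) are reported separately as `train_only`.

```python
from typing import Iterable, NamedTuple, Set


class TagSplit(NamedTuple):
    """Hypothetical container for the tag groups."""
    seen: Set[str]        # tags found in both train and test annotations
    unseen: Set[str]      # tags found only in test annotations
    train_only: Set[str]  # tags found only in train annotations


def split_tags(train_tags: Iterable[str], test_tags: Iterable[str]) -> TagSplit:
    """Partition tags by the splits in which they occur (strict reading)."""
    train = set(train_tags)
    test = set(test_tags)
    return TagSplit(
        seen=test & train,
        unseen=test - train,
        train_only=train - test,
    )


# (ID, chunk_text) pairs copied from the example tables.
train_rows = [
    ("BM-A_1936-1012-44", "figure"),
    ("BM-A_1936-1012-44", "costume"),
    ("WCMA-0003401", "amulet"),
]
test_rows = [
    ("BM-A_1992-1214-52", "bronze"),
    ("BM-A_1992-1214-52", "figure"),
]

split = split_tags(
    (tag for _, tag in train_rows),
    (tag for _, tag in test_rows),
)
print("seen:", sorted(split.seen))
print("unseen:", sorted(split.unseen))
print("train only:", sorted(split.train_only))
```

Since linking is scored on test items, only `seen` and `unseen` affect the per-group results; `train_only` is kept to show which part of the training vocabulary the memorization baseline knows but is never tested on.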