## Evaluation Results (06-27-2022)
## Data description
The evaluation dataset consists of cross-annotated data from sessions
1 to 3. It contains 105 distinct strings and 408 annotations.
Each annotation was discussed and validated as a “gold standard” label.
> Important note: We agreed on the following final dataset split: train on the first four sessions, test on the fifth session, and cross-validate by time. However, we still need the clean version of the data to do that.
For now, we evaluate on the gold data from sessions 1 to 3. The training step samples from data that includes part of this set, so the training and test data may still overlap.
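The planned session-based split and the overlap concern can be checked with a few lines of code. The sketch below is a minimal illustration, assuming each annotation is a dictionary with hypothetical `session` and `text` fields; the real data layout may differ, and the toy records are made up for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical record layout: {"session": int, "text": str}.
Annotation = Dict[str, object]


def split_by_session(
    annotations: List[Annotation],
    train_sessions: Set[int],
    test_sessions: Set[int],
) -> Tuple[List[Annotation], List[Annotation]]:
    """Assign each annotation to train or test based on its session number."""
    train = [a for a in annotations if a["session"] in train_sessions]
    test = [a for a in annotations if a["session"] in test_sessions]
    return train, test


def overlapping_strings(train: List[Annotation], test: List[Annotation]) -> Set[str]:
    """Return the annotated strings that appear in both train and test."""
    return {a["text"] for a in train} & {a["text"] for a in test}


# Toy data, invented for illustration only.
annotations = [
    {"session": 1, "text": "example string a"},
    {"session": 2, "text": "example string b"},
    {"session": 5, "text": "example string a"},
    {"session": 5, "text": "example string c"},
]

train, test = split_by_session(annotations, {1, 2, 3, 4}, {5})
overlap = overlapping_strings(train, test)
print(f"{len(train)} train, {len(test)} test, {len(overlap)} shared strings")
```

A non-empty `overlap` set would indicate that some test strings were also seen during training, which is the kind of leakage described above.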
## Evaluation Results
Notes regarding ELQ models:
- ELQ fine-tuned version uses a "joint" threshold of -1.5
- ELQ fine-tuned latest version uses a "thresholded_entity_by_mention" of -1.5
- ELQ off-the-shelf uses a "joint" threshold of -4.5
**Higher thresholds (up to 0.0) decrease the number of output candidates. Lower thresholds increase the number of output candidates.**
The PR curves sweep the threshold from 0.0 to -10.
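
To make the threshold behaviour concrete, here is a minimal sketch of a threshold sweep. It is not the ELQ implementation: the `(mention, entity, score)` candidate layout, the `>=` comparison, the 0.5 step size, and the convention that precision is 1.0 when nothing is predicted are all assumptions made for illustration, and the toy candidates are invented.

```python
from typing import List, Set, Tuple

# Hypothetical candidate layout: (mention, entity, score), with score <= 0.0.
Candidate = Tuple[str, str, float]


def filter_candidates(candidates: List[Candidate], threshold: float) -> List[Candidate]:
    """Keep candidates whose score is at or above the threshold (assumed rule)."""
    return [c for c in candidates if c[2] >= threshold]


def precision_recall(
    predicted: List[Candidate], gold: Set[Tuple[str, str]]
) -> Tuple[float, float]:
    """Compute precision and recall over (mention, entity) pairs."""
    pred_pairs = {(mention, entity) for mention, entity, _ in predicted}
    true_positives = len(pred_pairs & gold)
    # Assumed convention: precision is 1.0 when there are no predictions.
    precision = true_positives / len(pred_pairs) if pred_pairs else 1.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def pr_curve(
    candidates: List[Candidate],
    gold: Set[Tuple[str, str]],
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each threshold in the sweep."""
    return [
        (t, *precision_recall(filter_candidates(candidates, t), gold))
        for t in thresholds
    ]


# Toy data, invented for illustration only.
candidates = [("m1", "e1", -0.5), ("m2", "e2", -2.0), ("m3", "e9", -5.0)]
gold = {("m1", "e1"), ("m2", "e2"), ("m3", "e3")}

# Sweep from 0.0 down to -10.0 in steps of 0.5 (step size is an assumption).
thresholds = [-0.5 * i for i in range(21)]
curve = pr_curve(candidates, gold, thresholds)
```

In this toy setup, a threshold of -1.5 keeps one candidate and -4.5 keeps two, which mirrors the note above: lowering the threshold admits more candidates, typically trading precision for recall.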
### Tagging

### Linking

### End-to-end

## PR curves
### Tagging
**Off-the-shelf model**

**Fine-tuned model**

### End-to-end
**Off-the-shelf model**

**Fine-tuned model**
