# ML Baseline Evaluation
Split the whole process into two tasks (see the sketch after this list):
- **TAGGING:** Given a sequence of tokens, return a list of spans that we think refer to an AAT entity.
- **LINKING:** Given a span (start, end) of tokens, return the AAT ID that we think it corresponds to.
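A minimal sketch of what the two tasks consume and produce, assuming (start, end) token offsets with an exclusive end; the names `Span`, `tag`, and `link` are illustrative, not the project's actual code:

```python
# Illustrative interfaces only; names and signatures are assumptions.
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed convention)

def tag(tokens: List[str]) -> List[Span]:
    """TAGGING: return the token spans we think refer to an AAT entity."""
    raise NotImplementedError

def link(tokens: List[str], span: Span) -> str:
    """LINKING: return the AAT ID we think this span corresponds to."""
    raise NotImplementedError
```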
For both of the above tasks, we would like to have two types of evaluation results:
**SINGLE_ANNOTATED data**
On the data where we only have one annotation (1,200 so far), we will take that annotation as the gold label and evaluate the following against it:
- String matching (see the lookup sketch after this list):
  - For TAGGING: "is this span a surface form for any AAT ID?"
  - For LINKING: "is the single AAT ID we left in the index the right one?"
- SBERT baseline
- GENRE baseline
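A rough sketch of how the string-matching baseline could work, assuming a surface-form index that maps normalized surface forms to candidate AAT IDs; the index layout, the `max_len` window, and all names here are assumptions rather than the project's implementation:

```python
# Assumed surface-form index and matching logic; not the actual baseline code.
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[int, int]
SurfaceIndex = Dict[str, Set[str]]  # normalized surface form -> candidate AAT IDs (assumed layout)

def tag_by_string_match(tokens: List[str], index: SurfaceIndex, max_len: int = 5) -> List[Span]:
    """TAGGING baseline: keep every span whose text matches a known surface form."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            if " ".join(tokens[start:end]).lower() in index:
                spans.append((start, end))
    return spans

def link_by_string_match(tokens: List[str], span: Span, index: SurfaceIndex) -> Optional[str]:
    """LINKING baseline: return the single AAT ID left in the index for this surface form, if any."""
    candidates = index.get(" ".join(tokens[span[0]:span[1]]).lower(), set())
    return next(iter(candidates)) if len(candidates) == 1 else None
```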
**CROSS_ANNOTATED data**
On this data (roughly 100 strings), we can take the agreed-upon version as the gold label and evaluate the following against it:
- String matching (same as above)
- SBERT baseline (same as above)
- GENRE baseline (same as above)
- Each rater's individual annotation (4×) - this is what's new in this set!
Overall, this should leave us with 2 × 2 = 4 different tables of evaluation results (two tasks × two datasets).
----
## Results
## 0 Notes
### 0.1 Metrics
- Precision
$$\text{Precision} = \frac{COR_a}{PRED_a}$$
- Recall
$$\text{Recall} = \frac{COR_a}{TOTAL_{gs}}$$
Where:
- $COR_a$: number of correct predictions provided by the annotator
- $PRED_a$: number of predictions provided by the annotator
- $TOTAL_{gs}$: total number of observations in the golden set
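A small sketch of how these two metrics are computed from span counts, assuming exact-match scoring of (start, end) spans; function and variable names are ours, not project code:

```python
# Exact-span-match precision and recall, mirroring the formulas above.
from typing import Set, Tuple

Span = Tuple[int, int]

def precision_recall(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float]:
    cor = len(predicted & gold)   # COR_a: correct predictions
    pred = len(predicted)         # PRED_a: all predictions by the annotator
    total_gs = len(gold)          # TOTAL_gs: observations in the golden set
    precision = cor / pred if pred else 0.0
    recall = cor / total_gs if total_gs else 0.0
    return precision, recall
```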
### 0.2 Tasks
- Tagging Evaluation
The annotation system is provided with a string, and its task is to tag the corresponding entities.
We can use both precision and recall to evaluate annotators, since the number of predicted tags can differ from the number of tags in the golden set.
- Linking Evaluation
The annotation system is provided with a span (a tagged entity), and its task is to identify the corresponding AAT ID.
Since the annotation system links exactly the tags provided by the golden set, $PRED_a = TOTAL_{gs}$, so $COR_a/TOTAL_{gs}$ and $COR_a/PRED_a$ coincide and we report this single number (sketched below).
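A minimal sketch of that linking score, assuming predictions and gold labels are keyed by the same gold spans; names are illustrative:

```python
# Linking accuracy: COR_a / TOTAL_gs, which equals COR_a / PRED_a
# because the gold spans are exactly the inputs the system links.
from typing import Dict, Tuple

Span = Tuple[int, int]

def linking_accuracy(predicted: Dict[Span, str], gold: Dict[Span, str]) -> float:
    correct = sum(1 for span, aat_id in gold.items() if predicted.get(span) == aat_id)
    return correct / len(gold) if gold else 0.0
```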
## 1 Cross-annotated data
This dataset includes common annotations from sessions 1, 2, and 3.
**Data composition:**
- Number of strings: 101
- Consensus annotations: 408
### 1.1 Tagging Evaluation
#### 1.1.1 Average results for all sessions (1st to 4th)
<center>
| annotator / system | precision (%) | recall (%) | perc_potential_aats | correct_tags | total_predicted |
|:-------------------|--------------:|-----------:|--------------------:|-------------:|----------------:|
| khalil             |         92.03 |      90.99 |                   1 |          404 |             439 |
| rafa               |         93.86 |      82.66 |                   1 |          367 |             391 |
| average human      |         90.59 |     79.505 |                   1 |          353 |             389 |
| sebastian          |         88.97 |      78.15 |                   1 |          347 |             390 |
| juan               |         87.5  |      66.22 |                   1 |          294 |             336 |
| string matching    |         60.34 |      40.09 |              0.9559 |          178 |             295 |
</center>


#### 1.1.2 Results per session

### 1.2 Linking Evaluation
#### 1.2.1 Average results for all sessions (1st to 4th)
<center>
| annotator / system | correct (%) | correct (count) |
|:-------------------|------------:|----------------:|
| khalil             |          84 |             374 |
| rafa               |          75 |             335 |
| average human      |          73 |             324 |
| sebastian          |          72 |             318 |
| juan               |          61 |             269 |
| SBERT              |          60 |             265 |
| string matching    |          44 |             195 |
| GENRE              |          30 |             131 |
</center>
> Correct cases (%) are computed out of 444 gold links.

#### 1.2.2 Results per session

> Per-session plots: GENRE, SBERT, string-matching, average-human, and each individual rater (juan, khalil, rafa, sebastian).
## 2 Single-annotated data [status: currently updating data to include session 4]
### 2.1 Tagging Evaluation

- Precision and Recall

- Prediction counts

### 2.2 Linking Evaluation

- Correct cases (%)

- Correct cases (counts)
