# Texterra
## Links
+ [Texterra Description](https://www.ispras.ru/en/projects/texterra/)
+ [Demo](https://texterra.ispras.ru/demo)
## Modules
+ Information Extraction (Извлечение информации)
    + NER (Именованные сущности)
    + Wikification (Викификация): Wikipedia-based Entity Linking
    + Key Named Entities :question: (Ключевые сущности)
    + Lurkification ([Луркификация](https://en.wikipedia.org/wiki/Lurkmore)): Lurkmore-based Entity Linking
+ Linguistic Analysis (Лингвистика)
    + Morphological Analysis (Морфология)
    + Syntactic Analysis (Синтаксис)
    + Coreference Resolution (Разрешение кореферентности)
+ Error Correction (Исправление ошибок)
+ Sentiment Analysis (Определение тональности)
+ Opinion Mining (Анализ мнений)
## Tasks
### Named Entity Recognition
> Named Entity Recognition (NER) is the task of finding named entities in a given text and then classifying them into pre-defined tags.

There are two common approaches to the NER task. The first frames NER as a `Sequence Labeling` task, whose purpose is to label each token in a given text with one (or several) pre-defined tags. For example, given the text

==Burtsev Mikhail Sergeevich lives in Saint Petersburg==

and four pre-defined tags `PER`, `ORG`, `LOC`, and `MISC` with the `IOB` tagging scheme, a Sequence Labeling-based model should label each token as follows:
```
| Burtsev | Mikhail | Sergeevich | lives | in | Saint | Petersburg |
| B-PER   | I-PER   | I-PER      | O     | O  | B-LOC | I-LOC      |
```
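A minimal, dependency-free sketch of how such IOB output is decoded back into entity spans (the function name is illustrative, not part of any particular library):

```python
def iob_to_spans(tokens, tags):
    """Decode IOB tags into (entity_text, entity_type) spans."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new entity starts here
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens:
            current_tokens.append(token)    # continue the current entity
        else:                               # "O": close any open entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:                      # flush an entity ending the sentence
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["Burtsev", "Mikhail", "Sergeevich", "lives", "in", "Saint", "Petersburg"]
tags   = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_spans(tokens, tags))
# [('Burtsev Mikhail Sergeevich', 'PER'), ('Saint Petersburg', 'LOC')]
```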
The second approach treats NER as a `Sequence-to-Sequence` task, like Machine Translation or Text Summarization: the model reads the raw sentence and generates a tagged version of it, e.g. `<per> Burtsev Mikhail Sergeevich </per> lives in <loc> Saint Petersburg </loc>`.
#### Evaluation Metrics
[Precision, Recall, and F-measure](https://en.wikipedia.org/wiki/Precision_and_recall).
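For NER these are typically computed at the entity level: a prediction counts as a true positive only when both its span and its type exactly match a gold entity. A minimal sketch, reusing the span format produced by the decoder above:

```python
def span_f1(gold, pred):
    """Exact-match precision/recall/F1 over sets of (span_text, type) pairs."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # correctly predicted entities
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [("Burtsev Mikhail Sergeevich", "PER"), ("Saint Petersburg", "LOC")]
pred = [("Mikhail Sergeevich", "PER"), ("Saint Petersburg", "LOC")]
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5): boundary errors get no credit
```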
#### SOTA NER Models
| Model | Architecture | F1[^1] | Paper | Code |
| --- | --- | :---: | --- | --- |
| ACE | Automated Concatenation of Embeddings + Document Context | 94.6 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [code](https://github.com/Alibaba-NLP/ACE) |
| LUKE | Deep Contextualized Representation + Entity-aware Self-attention | 94.3 | [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://www.aclweb.org/anthology/2020.emnlp-main.523/) | [code](https://github.com/studio-ousia/luke) |
| CL-KL | External Context + Transformer + CRF | 93.9 | [Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning](https://arxiv.org/abs/2105.03654) | [code](https://github.com/Alibaba-NLP/CLNER) |
[^1]: F1 (%) on the CoNLL-2003 dataset.
#### Vietnamese NER models
1. [Bi-LSTM + CNN + CRF](https://paperswithcode.com/paper/a-deep-neural-network-model-for-the-task-of#code)
2. [PhoNLP](https://github.com/VinAIResearch/PhoNLP)
3. [Underthesea](https://github.com/undertheseanlp/underthesea)
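
Of the three, Underthesea is the quickest to try. A usage sketch, assuming the package is installed (`pip install underthesea`) and that its `ner` function behaves as in the project's README, returning (token, POS tag, chunk tag, NER tag) tuples:

```python
# pip install underthesea
from underthesea import ner

sentence = "Chưa tiết lộ lịch trình tới Việt Nam của Tổng thống Mỹ Donald Trump"
# Each item should be a (token, POS tag, chunk tag, NER tag) tuple,
# with NER tags in the IOB scheme described above (e.g. B-LOC, B-PER).
for token in ner(sentence):
    print(token)
```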
### Entity Linking
> Entity Linking is the task of locating (NER) and linking (Named Entity Disambiguation) named entities to a knowledge base (e.g. Wikidata, DBpedia, ...).

The process of linking entities to Wikipedia is called Wikification. Entity Linking is also known as Named Entity Recognition and Disambiguation (NERD). For example, given the sentence:

==According to the Times Higher Education, Moscow Institute of Physics and Technology is one of the top three universities in Russia.==

a Wikipedia-based entity linking model should locate the named entities and link them to the corresponding entries in Wikipedia as follows:
| Entity | Wikipedia Entry |
| --- | --- |
| According | |
| to | |
| the | |
| Times Higher Education | https://en.wikipedia.org/wiki/Times_Higher_Education |
| , | |
| Moscow Institute of Physics and Technology | https://en.wikipedia.org/wiki/Moscow_Institute_of_Physics_and_Technology |
| is | |
| one | |
| of | |
| the | |
| top | |
| three | |
| universities | https://en.wikipedia.org/wiki/University |
| in | |
| Russia | https://en.wikipedia.org/wiki/Russia |
| . | |
#### Approaches
+ End-to-end approach: build one model that performs both NER and NE Disambiguation.
+ Disambiguation-only approach: take named entities as input and disambiguate them to the correct entities in a given knowledge base, as sketched below.
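
A toy sketch of the disambiguation-only setting, with a hypothetical two-entry candidate dictionary standing in for a real knowledge base; real systems replace the word-overlap score with a learned similarity model:

```python
# Candidate entities for the ambiguous surface form "Washington".
KB = {
    "Washington": {
        "George Washington": "first president of the United States army general",
        "Washington (state)": "state in the Pacific Northwest of the United States",
    },
}

def disambiguate(mention, context, kb):
    """Pick the candidate whose description shares the most words with the
    mention's context -- a crude stand-in for a learned similarity model."""
    candidates = kb.get(mention, {})
    ctx = set(context.lower().split())
    return max(candidates,
               key=lambda name: len(ctx & set(candidates[name].lower().split())),
               default=None)

print(disambiguate("Washington", "the president crossed the Delaware", KB))
# -> 'George Washington' (its description overlaps on 'the', 'president')
```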
#### SOTA Entity Linking Models
##### BLINK (Facebook AI Research)
+ Paper: [Scalable Zero-shot Entity Linking with Dense Entity Retrieval](https://arxiv.org/pdf/1911.03814.pdf)
+ Code: https://github.com/facebookresearch/BLINK
###### Problem Definition
+ Input:
    + $D = \{w_1, \dots, w_r\}$: input text document
    + $M_D = \{m_1, \dots, m_n\}$: list of entity mentions in $D$
+ Output:
    + $\{(m_i, e_i)\}_{i \in [1, n]}$: list of mention-entity pairs, where each entity $e_i \in \mathcal{E}$ is an entry in the knowledge base (e.g. Wikipedia).
+ Assumptions:
    + The title and description of each entity are available.
    + Each mention has a valid gold entity in the KB (in-KB evaluation, as opposed to out-of-KB or nil prediction).
###### Architecture

###### Datasets
+ [The Zero-shot EL dataset](https://github.com/lajanugen/zeshel): published by Logeswaran et al., constructed from [Wikia](https://www.wikia.com).
    + The task is to link entity mentions in text to an entity dictionary with provided entity descriptions, in a set of domains.
    + There are 49K/10K/10K examples in the train/validation/test sets, respectively.
#### Resources
+ http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
+ https://github.com/izuna385/Entity-Linking-Recent-Trends