# Texterra

## Links

+ [Texterra Description](https://www.ispras.ru/en/projects/texterra/)
+ [Demo](https://texterra.ispras.ru/demo)

## Modules

+ Information Extraction (Извлечение информации)
  + NER (Именованные сущности)
  + Wikification (Викификация): Wikipedia-based Entity Linking
  + Key Named Entities :question: (Ключевые сущности)
  + Lurkification ([Луркификация](https://en.wikipedia.org/wiki/Lurkmore)): Lurkmore-based Entity Linking
+ Linguistic Analysis (Лингвистика)
  + Morphological Analysis (Морфология)
  + Syntactic Analysis (Синтаксис)
  + Coreference Resolution (Разрешение кореферентности)
+ Error Correction (Исправление ошибок)
+ Sentiment Analysis (Определение тональности)
+ Opinion Mining (Анализ мнений)

## Tasks

### Named Entity Recognition

> Named Entity Recognition (NER) is the task of finding named entities in a given text and then classifying them into pre-defined tags.

There are two common approaches to the NER task. The first frames NER as a ```Sequence Labeling``` task, whose purpose is to label each token in a given text with one (or several) pre-defined tags. For example, given the text ==Burtsev Mikhail Sergeevich lives in Saint Petersburg== and four pre-defined tags ```PER```, ```ORG```, ```LOC```, and ```MISC``` with the ```IOB``` tagging scheme, a Sequence Labeling-based model should label each token as follows (a sketch of decoding these tags back into entity spans appears after the model lists below):

```
| Burtsev | Mikhail | Sergeevich | lives | in | Saint | Petersburg |
| B-PER   | I-PER   | I-PER      | O     | O  | B-LOC | I-LOC      |
```

The second approach treats NER as a ```Sequence-to-Sequence``` task, like Machine Translation or Text Summarization.

#### Evaluation Metrics

[Precision, Recall, and F-measure](https://en.wikipedia.org/wiki/Precision_and_recall).

#### SOTA NER Models

| Model | Architecture | F1[^1] | Paper | Code |
| --- | --- | :---: | --- | --- |
| ACE | Automated Concatenation of Embeddings + Document Context | 94.6 | [Automated Concatenation of Embeddings for Structured Prediction](https://arxiv.org/pdf/2010.05006.pdf) | [code](https://github.com/Alibaba-NLP/ACE) |
| LUKE | Deep Contextualized Representation + Entity-aware Self-attention | 94.3 | [LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention](https://www.aclweb.org/anthology/2020.emnlp-main.523/) | [code](https://github.com/studio-ousia/luke) |
| CL-KL | External Context + Transformer + CRF | 93.9 | [Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning](https://arxiv.org/abs/2105.03654) | [code](https://github.com/Alibaba-NLP/CLNER) |

[^1]: F1 (%) on the CoNLL-2003 dataset

#### Vietnamese NER Models

1. [Bi-LSTM + CNN + CRF](https://paperswithcode.com/paper/a-deep-neural-network-model-for-the-task-of#code)
2. [PhoNLP](https://github.com/VinAIResearch/PhoNLP)
3. [Underthesea](https://github.com/undertheseanlp/underthesea)
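#### Decoding IOB Tags into Entities

To make the sequence-labeling view above concrete, here is a minimal sketch, not taken from Texterra or any of the models listed; the function name `iob_to_spans` is illustrative. It decodes a model's IOB tag sequence back into entity spans:

```python
# A minimal sketch (not from Texterra or the models above): decoding an
# IOB tag sequence produced by a sequence-labeling NER model back into
# (entity_text, entity_type) spans.

def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                # a new entity begins
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current:  # the open entity continues
            current.append(token)
        else:                                   # "O": close any open entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:                                 # flush a span ending at EOS
        spans.append((" ".join(current), current_type))
    return spans

tokens = ["Burtsev", "Mikhail", "Sergeevich", "lives", "in", "Saint", "Petersburg"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(iob_to_spans(tokens, tags))
# -> [('Burtsev Mikhail Sergeevich', 'PER'), ('Saint Petersburg', 'LOC')]
```

Real decoders also handle malformed sequences (e.g. an ```I-``` tag whose type differs from the open span); this sketch simply treats any ```I-``` tag after an open span as a continuation.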
### Entity Linking

> Entity Linking is the task of locating (NER) and linking (Named Entity Disambiguation) named entities to a knowledge base (e.g. Wikidata, DBpedia, etc.). The process of linking entities to Wikipedia is called Wikification. Entity Linking is also known as Named Entity Recognition and Disambiguation.

For example, given the sentence ==According to the Times Higher Education, Moscow Institute of Physics and Technology is one of the top three universities in Russia.==, a Wikipedia-based entity linking model should locate the named entities and link them to the corresponding entries in Wikipedia as follows (tokens without an entry are left unlinked):

| Entity | Wikipedia Entry |
| --- | --- |
| According | |
| to | |
| the | |
| Times Higher Education | https://en.wikipedia.org/wiki/Times_Higher_Education |
| , | |
| Moscow Institute of Physics and Technology | https://en.wikipedia.org/wiki/Moscow_Institute_of_Physics_and_Technology |
| is | |
| one | |
| of | |
| the | |
| top | |
| three | |
| universities | https://en.wikipedia.org/wiki/University |
| in | |
| Russia | https://en.wikipedia.org/wiki/Russia |
| . | |

#### Approaches

+ End-to-end approach: build one model for both the NER and NE Disambiguation tasks.
+ Disambiguation-only approach: take named entities as input and disambiguate them to the correct entities in a given knowledge base.

### SOTA Entity Linking Models

#### BLINK (Facebook AI Research)

+ Paper: [Scalable Zero-shot Entity Linking with Dense Entity Retrieval](https://arxiv.org/pdf/1911.03814.pdf)
+ Code: https://github.com/facebookresearch/BLINK

##### Problem Definition

+ Input:
  + $D = \{w_1, \dots, w_r\}$: input text document
  + $M_D = \{m_1, \dots, m_n\}$: list of entity mentions in $D$
+ Output:
  + $\{(m_i, e_i)\}_{i \in [1, n]}$: list of mention-entity pairs, where each entity $e_i \in \mathcal{E}$ is an entry in the knowledge base (e.g. Wikipedia).
+ Assumptions:
  + The title and description of each entity are available.
  + Each mention has a valid gold entity in the KB (in-KB evaluation; no out-of-KB or nil predictions).

##### Architecture

![](https://i.imgur.com/rJntyxe.png)

(A toy illustration of the bi-encoder scoring used here is sketched at the end of this document.)

##### Datasets

+ [The Zero-shot EL dataset](https://github.com/lajanugen/zeshel): published by Logeswaran et al., built from [Wikia](https://www.wikia.com).
  + The task is to link entity mentions in text to an entity dictionary with provided entity descriptions, in a set of domains.
  + There are 49K/10K/10K examples in the train/val/test sets, respectively.

### Resources

+ http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/
+ https://github.com/izuna385/Entity-Linking-Recent-Trends
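### Appendix: Bi-encoder Scoring Sketch

The BLINK architecture above retrieves candidates with a bi-encoder: mention context and entity descriptions are encoded independently, and candidates are ranked by inner product. The sketch below is a toy illustration of that idea, not BLINK's actual code or API; `encode_entity` and `encode_mention` are hypothetical stand-ins for the two BERT encoders described in the paper, and random vectors keep the sketch self-contained and runnable.

```python
# A toy illustration, NOT BLINK's actual code or API: bi-encoder entity
# ranking. encode_entity / encode_mention are hypothetical stand-ins for
# the BERT encoders in the paper; random vectors make the sketch runnable.

import numpy as np

rng = np.random.default_rng(seed=0)
DIM = 8  # toy dimensionality; the real model uses BERT-sized vectors

def encode_entity(title, description):
    # Stand-in for BERT over the entity title and description.
    return rng.standard_normal(DIM)

def encode_mention(left_context, mention, right_context):
    # Stand-in for BERT over the context with the mention span marked.
    return rng.standard_normal(DIM)

# Offline step: encode every entity in the knowledge base once.
kb = {
    "Times_Higher_Education": "British magazine reporting on higher education",
    "Russia": "Country spanning Eastern Europe and Northern Asia",
}
entity_vecs = {title: encode_entity(title, desc) for title, desc in kb.items()}

# Online step: encode the mention and rank entities by inner product.
mention_vec = encode_mention("According to the", "Times Higher Education", ",")
scores = {title: float(vec @ mention_vec) for title, vec in entity_vecs.items()}
print(max(scores, key=scores.get))  # highest-scoring candidate entity
```

Because the entity encodings do not depend on the mention, they can be pre-computed once for the whole KB and served from a nearest-neighbour index (the BLINK repository uses FAISS), which is what makes inner-product retrieval fast at linking time.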