# Systematic Literature Review: Complex Word Identification
# Abstract
Complex word identification (CWI) is the task of deciding which words in a text are likely to be difficult for a given target population, such as second language learners or readers with low literacy, and it is usually the first step of a lexical simplification pipeline. This systematic literature review surveys recent mono-lingual and cross-lingual approaches to CWI and summarizes which combinations of features and models performed best: frequency- and length-based features with ensemble classifiers (Voting Classifier, AdaBoost, Random Forest) for mono-lingual systems, and word-embedding features with SVM, CNN, or Extra Trees classifiers for cross-lingual systems.
# Introduction
Language is one of the oldest instruments of communication. As Melina Marchetta, the author of "Finnikin of the Rock", once said: "Without our language, we have lost ourselves. Who are we without our words?". Indeed, without language, people would never have reached the point where we are now.
On the other hand, there are thousands of languages across the world. How do people understand each other across cultures? The answer is that they learn each other's languages. For example, according to the statistics, all or nearly all (99-100 %) primary school pupils in Cyprus, Malta, Austria, and Spain learned English as a foreign language in 2018. [reference](https://ec.europa.eu/eurostat/statistics-explained/index.php/Foreign_language_learning_statistics) Learning a language means memorizing its structure, rules, and vocabulary, from the simplest words to the hardest.
Also, according to Learning Disabilities Statistics 2020, around 5-9% of the population has a learning disability. This means that over 300 million people cannot read their mother tongue in the usual way. Their learning process requires a gradual progression from simple sentences to complex ones.
This is where natural language processing steps in. Computer algorithms can perform various operations on text to transform it into a new set of sentences, and the problems described above can also be addressed by such processing.
Several NLP systems have been developed to simplify texts for second language learners and for native speakers with low literacy levels or reading disabilities. Usually, the first step in the lexical simplification pipeline is identifying which words are considered complex by a given target population. [ref](https://www.researchgate.net/publication/301404409_Reliable_Lexical_Simplification_for_Non-Native_Speakers/figures?lo=1) This task is known as complex word identification (CWI), and it has been attracting attention from the research community in the past few years.

# Background
Natural language processing (NLP) is a branch of linguistics, computer science, and artificial intelligence concerned with the interaction between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
A feature is an observable property or attribute of a phenomenon under investigation. For a word, features can be the number of letters, n-grams, synonyms, etc.
A model is a mathematical operation that consists of several important parts: taking an input, processing it, and producing an output. The difference between models lies in the operations performed on the input. In NLP, the input is a set of features extracted from the raw data.
Usually, models require learning. This phase of the work is called training. The evaluation phase is usually performed on data the model has never seen, i.e. data that was not used in the training phase.
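To make these definitions concrete, below is a minimal sketch of the feature, model, training, and evaluation workflow described above. The toy word list, the `extract_features` function, and the Random Forest classifier are illustrative assumptions and are not taken from any reviewed paper.

```python
# A minimal sketch of the feature -> model -> training -> evaluation workflow.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Toy labelled data: (word, is_complex) pairs. A real CWI dataset would be
# annotated by the target population (e.g. non-native speakers).
words = ["cat", "serendipity", "run", "obfuscate", "house", "perspicacious",
         "dog", "antidisestablishmentarianism", "tree", "ameliorate"]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

def extract_features(word):
    """Turn a raw word into a numeric feature vector (its observable properties)."""
    return [
        len(word),                          # word length
        sum(ch in "aeiou" for ch in word),  # vowel count as a rough syllable proxy
        len(set(word)),                     # number of distinct letters
    ]

X = [extract_features(w) for w in words]

# Training phase: the model learns its parameters from one part of the data ...
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, stratify=labels, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ... evaluation phase: performance is measured on data the model has never seen.
print("F1 on unseen words:", f1_score(y_test, model.predict(X_test)))
```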
Complex word identification (CWI) itself has been explored in various studies, and the underlying approaches can be split into two main categories: mono-lingual and cross-lingual.
# Objectives
The objective of this literature review is to systematically review and analyze the current research on the best strategies that can be used for complex word identification.
The objective can be reformulated as the following research questions:
* What makes CWI a complex task?
* What are the methods for CWI?
* What combination of features and model is the best for CWI?
This paper describes different strategies that can be used for complex word identification. For each strategy, the methodology and the ideas behind the particular solution are presented. To sum up, information is provided about the best features that can be used for the task.
# Study selection
The Google Scholar search engine was used to find the required papers. Articles were rejected if it was clear from their content that the study failed to meet the inclusion criteria. Every remaining paper was analyzed, and the important information from the articles (methods, results, possible shortcomings, further development) was added to the literature log. After that, papers that met the exclusion criteria were rejected.
## Search query
The search query used to find the articles was:
* "CWI" && ("mono-lingual" || "cross-lingual") && year>2015
The search query returned 67 different articles, of which 30 were chosen for processing.
## Inclusion/Exclusion criteria
* The article should contain information about the features used to train the model
* The article should contain the results of the model evaluation
* The article should contain a comparison with other approaches
* The article should not have been published before 2015
## Methods
The main instrument for the systematic literature review was the literature log, where every article taken from the query results was described. The attributes that were recorded were divided into two groups: non-content-based and content-based, as shown in the tables below.
Non-content-based attributes are:

| Attribute | Description |
| --- | --- |
| Title | Main title of the article |
| Author | All authors listed in the author section |
| Year | When the article was published |

Content-based attributes are:

| Attribute | Description |
| --- | --- |
| Key themes | Main themes of the article |
| Model | Model used in the particular solution |
| Features | Features extracted from the initial data |
| Major findings | The most important information extracted from the reading |
| Possible shortcomings | Points of criticism |
| Possible developments | Points that should be observed in future work |
Every attribute was derived from the article and written into the literature log table (Supporting information #1). After that, every metric mentioned in the articles was written into the metrics table (Supporting information #2). Because every paper uses its own set of metrics, the top article for each metric was highlighted for the next analysis step. Also, if only one article used a particular metric, it was analyzed individually in comparison with the other papers. After that, the features and methods were extracted and written into the table of features (Supporting information #3) and the table of methods.
This information gives us a set of features and a pool of methods used in the best-performing solutions.
# Results
This section is divided into two parts:
* Results for mono-lingual models
* Results for cross-lingual models
## Mono-lingual results
### Features
Common features used for CWI are:
* N-gram corpora frequency
* Language model probability
* Term frequency in corpora
* Frequency in different vocabularies
Other important features mentioned are:
* Target word length
* POS tag
* Whether the word is a lemma, and the lemma itself
### Classifiers
The most common methods used in mono-lingual CWI are listed below; a minimal sketch combining these features and classifiers follows the list:
* Voting Classifier
* AdaBoost
* Random Forest
* Combined threshold
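As a rough illustration, the sketch below combines frequency- and length-based features with a Voting Classifier over Random Forest and AdaBoost. The tiny `corpus_freq` table, the word lists, and the labels are toy assumptions standing in for a real corpus and annotated dataset; none of the values come from the reviewed papers.

```python
# A minimal sketch of a mono-lingual CWI classifier using the feature types
# and methods listed above.
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)

# Toy per-million corpus frequencies. A real system would use counts from a
# large corpus and possibly a language model probability as well.
corpus_freq = {"house": 5200, "cat": 4100, "run": 6900,
               "ameliorate": 3, "serendipity": 2, "perspicacious": 1}

def features(word):
    return [
        len(word),                       # target word length
        corpus_freq.get(word, 0),        # term frequency in the corpus
        float(word not in corpus_freq),  # out-of-vocabulary indicator
    ]

train_words = ["house", "cat", "run", "ameliorate", "serendipity", "perspicacious"]
train_labels = [0, 0, 0, 1, 1, 1]        # 1 = complex

# Voting over Random Forest and AdaBoost, two of the methods listed above.
clf = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("ab", AdaBoostClassifier(random_state=0))],
    voting="soft")
clf.fit([features(w) for w in train_words], train_labels)

# A "combined threshold" baseline would instead flag any word whose corpus
# frequency falls below a tuned cut-off.
print(clf.predict([features("obfuscate"), features("cat")]))
```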
## Cross-lingual results
### Features
In cross-lingual solutions, word embeddings were the most frequent feature. This is motivated by the fact that word embeddings can be treated as language-independent features that can be used to derive information about a word in any language.
Common mono-lingual features such as word length and n-gram frequencies are less important for cross-lingual solutions, because they are usually language-dependent.
Other features used are:
* cosine similarities with mono- and cross-lingual synonyms,
* cosine similarity with the target-language translation,
* POS tag.
### Classifiers
Methods used for cross-lingual CWI are listed below; an illustrative sketch follows the list:
* SVM,
* CNN,
* Extra Trees.
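The sketch below illustrates this cross-lingual setup under toy assumptions: an aligned-embedding vector plus a cosine similarity with the target-language translation, fed to an SVM. The `embed` function, the word pairs, and the labels are stand-ins; a real system would load pre-trained aligned cross-lingual embeddings rather than the random vectors used here.

```python
# A minimal sketch of language-independent features for cross-lingual CWI.
import numpy as np
from sklearn.svm import SVC

DIM = 50  # embedding dimensionality

def embed(word):
    """Stand-in for a lookup in an aligned cross-lingual embedding space."""
    seeded = np.random.default_rng(abs(hash(word)) % (2**32))
    return seeded.normal(size=DIM)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def features(word, translation):
    vec = embed(word)
    # Embedding vector plus cosine similarity with the target-language translation.
    return np.append(vec, cosine(vec, embed(translation)))

# Toy training pairs: (source word, target-language translation, is_complex).
train = [("casa", "house", 0), ("gato", "cat", 0), ("perro", "dog", 0),
         ("inefable", "ineffable", 1), ("perspicaz", "perspicacious", 1),
         ("mejorar", "ameliorate", 1)]

X = np.stack([features(w, t) for w, t, _ in train])
y = [label for _, _, label in train]

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(features("serendipia", "serendipity").reshape(1, -1)))
```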
# Discussion
One of the main problems in NLP tasks is the difference between languages. Because of this, models usually have to be created for each language independently; otherwise, models can become so large that computation is feasible only on supercomputers. That is why, for the best results, NLP practitioners build training data in the same language the system is intended for. Otherwise, as in the cross-lingual approach to CWI, the strategy (features, models, dataset languages) has to be chosen carefully to obtain satisfactory results. This is why such NLP systems are hard to build.
It is also important to mention that dataset collection, especially for complex word identification, has to be performed carefully. The "complexity" of a given word is highly subjective and depends on the people chosen to answer the question "Is this word complex?". The answer will differ between native and non-native speakers, and between children and adults. This factor therefore affects the dataset, and in turn the quality of the solution.
# Conclusion
The CWI task is important for many aspects of language understanding, but it is hard to build such a system because languages are very large and there is not enough data for a fully working solution. Because of this, existing solutions fall into two groups: mono-lingual and cross-lingual systems. The best strategy for a mono-lingual system is to use language-dependent features such as n-gram corpus frequency, language model probability, and term frequency in corpora, with methods such as a Voting Classifier, AdaBoost, or Random Forest. For cross-lingual solutions, the best strategy is to use language-independent features such as word embeddings, with an SVM, a CNN, or Extra Trees as the method.
# References
# Supporting information
## Supporting information 1
Literature log
## Supporting information 2
Metrics table
## Supporting information 3
Features table