maevgor
# SLR

# Abstract

* short problem
* what is CWI
* what we showed in the end

# Introduction/background

Language is one of the oldest instruments of communication. As Melina Marchetta, the author of "Finnikin of the Rock", wrote: "Without our language, we have lost ourselves. Who are we without our words?" Indeed, without language, people would never have reached the point where we are now. On the other hand, there are thousands of languages across the world. How do people understand each other across cultures? They learn each other's languages. For example, according to Eurostat statistics, all or nearly all (99-100%) primary school pupils in Cyprus, Malta, Austria, and Spain learned English as a foreign language in 2018. [reference](https://ec.europa.eu/eurostat/statistics-explained/index.php/Foreign_language_learning_statistics)

Learning a language means remembering the structure, rules, and vocabulary of the foreign language, from the simplest words to the hardest. Also, according to Learning Disabilities Statistics 2020, around 5-9% of the population has a learning disability. This means that over 300M people cannot read their mother tongue in the usual way. Their learning process requires gradual progress from simple sentences to complex ones.

This is where natural language processing steps in. Computer algorithms can perform different operations on a text to transform it into a new set of sentences, and the problems described above can be addressed with such transformations. Several NLP systems have been developed to simplify texts for second language learners and for native speakers with low literacy levels or reading disabilities. Usually, the first step in the lexical simplification pipeline is identifying which words are considered complex by a given target population.
[ref](https://www.researchgate.net/publication/301404409_Reliable_Lexical_Simplification_for_Non-Native_Speakers/figures?lo=1) This task is known as complex word identification (CWI), and it has been attracting attention from the research community in the past few years.

![](https://i.imgur.com/KPWhRIc.png =350x)

# Background

Natural language processing (NLP) is a branch of linguistics, computer science, and artificial intelligence that studies how computers interact with human language, especially how to program computers to process and analyze large amounts of natural language data.

A feature is an observable property or attribute of a phenomenon under investigation. For a word, features can be the number of letters, n-grams, synonyms, etc.

A model is a mathematical operation that consists of several important parts: taking an input, processing it, and producing an output. Models differ in the operations they perform on the input. In NLP, the input is a set of features extracted from the raw data. Usually, models require learning; this phase of work is called training. Evaluation is usually performed on data that the model has never seen, i.e. data that was not used in the training phase.

Complex word identification (CWI) has been explored in various other studies, and the underlying approaches can be split into two main categories: monolingual and cross-lingual.

# Objectives

The objective of this literature review is to systematically review and analyze the current research on the best strategies for complex word identification. The objective can be restated as the following research questions:

* What is so complex in CWI?
* What are the methods for CWI?
* What combination of features and model is the best for CWI?

This paper describes different strategies that can be used for complex word identification. For each strategy, it presents the methodology and the ideas behind the particular solution.
Finally, it summarizes the best features that can be used for the task.

# Study selection

The Google Scholar search engine was used to find the required papers. Articles were rejected if it was determined from their content that the study failed to meet the inclusion criteria. Every processed paper was analyzed. The important information from each article (methods, results, possible shortcomings, further development) was added to the literature log. After that, papers that met the exclusion criteria were rejected.

## Search query

The search query used to find the articles was:

* "CWI" && ("mono-lingual" || "cross-lingual") && year>2015

The search returned 67 different articles, of which 30 were chosen for processing.

## Inclusion/Exclusion criteria

* The article should contain information about the features that were used to train the model
* The article should contain the results of model evaluation
* The article should contain a comparison with other approaches
* The article should not have been published before 2015

## Methods

The main instrument of the systematic literature review was the literature log, in which every article taken from the query results was described. The derived attributes were divided into two groups: non-content-based and content-based.

> it is better to make a table here, because a list is not representative enough

Non-content-based attributes are:

* Title
    * Main title of the article
* Author
    * All authors that are listed in the author section
* Year
    * When the article was published

Content-based attributes are:

* Key themes
    * Main themes of the article
* Model
    * Model that was used in the particular solution
* Features
    * Features that were extracted from the initial data
* Major findings
    * The most important information that was extracted from the reading
* Possible shortcomings
    * Points of criticism
* Possible developments
    * Points that should be addressed in future work

Every attribute was derived from the article and written into the table (Supporting information #1). After that, every metric mentioned in the articles was written into the table (Supporting information #2). Because every paper uses its own set of metrics, the top article for every metric was highlighted for the next part of the analysis. Also, if only one article used a given metric (example), it was analyzed individually in comparison with the other papers. After that, features and methods were extracted and written into the table of features (Supporting information #3) and the table of methods. This information gives us the set of features and the pool of methods that were used in the best-performing solutions.
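
The selection procedure described in this section (log every article, pick the top article for each metric, then collect the features of those top articles) can be sketched in Python. This is only an illustrative sketch: the `LogEntry` structure, the helper names, and the rows in `log` are hypothetical placeholders rather than the actual literature log, and it assumes that a higher score is better for every metric.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    """One row of the literature log (hypothetical structure)."""
    title: str
    year: int
    model: str
    features: list[str]
    # Metric name -> reported score; each paper may report different metrics.
    metrics: dict[str, float] = field(default_factory=dict)


def top_per_metric(log: list[LogEntry]) -> dict[str, LogEntry]:
    """For every metric, return the entry with the highest reported score.

    Assumption: higher is better for all metrics.
    """
    best: dict[str, LogEntry] = {}
    for entry in log:
        for metric, score in entry.metrics.items():
            if metric not in best or score > best[metric].metrics[metric]:
                best[metric] = entry
    return best


def feature_counts(entries) -> Counter:
    """Count how often each feature occurs among the given entries."""
    return Counter(f for e in entries for f in e.features)


# Placeholder rows for illustration only; not real papers or results.
log = [
    LogEntry("Paper A", 2016, "Random Forest",
             ["word length", "n-gram frequency"], {"F1": 0.70}),
    LogEntry("Paper B", 2018, "AdaBoost",
             ["n-gram frequency", "POS tag"], {"F1": 0.75, "accuracy": 0.80}),
    LogEntry("Paper C", 2018, "SVM",
             ["word embeddings"], {"accuracy": 0.78}),
]

best = top_per_metric(log)
# A paper can be the best for several metrics; count its features only once.
unique_best = list({e.title: e for e in best.values()}.values())

print({metric: entry.title for metric, entry in best.items()})
print(feature_counts(unique_best).most_common())
```

A metric reported by a single article is trivially "won" by that article in this sketch, which is why such articles were compared with the other papers individually.
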

# Results

This section is divided into two parts:

* Results for mono-lingual models
* Results for cross-lingual models

## Mono-lingual results

### Features

Common features that were used for CWI are:

* N-gram corpora frequency
* Language model probability
* Term frequency in corpora
* Frequency in different vocabularies

Other important features mentioned are:

* Target word length
* POS tag
* Whether the word is a lemma, or the lemma itself

### Classifiers

The most common methods used in mono-lingual CWI are:

* Voting Classifier
* AdaBoost
* Random Forest
* Combined threshold

## Cross-lingual results

### Features

In cross-lingual solutions, word embeddings were the most frequent feature. This is motivated by the fact that word embeddings can be treated as language-independent features from which information about a word can be derived in any language. Common mono-lingual features such as word length and n-gram frequencies are less important for cross-lingual solutions, because they are usually language dependent. Other features used are:

* cosine similarities with mono- and cross-lingual synonyms,
* cosine similarity with the target language translation,
* POS tag.

### Classifiers

Methods that were used for cross-lingual CWI are:

* SVM,
* CNN,
* Extra Trees.

# Discussion

One of the main problems of NLP tasks is the differences between languages. Because of that, it is necessary to create models for each language independently. Otherwise, models can become so big that the computations could be performed only on supercomputers. That is why, for the best results, NLP practitioners build the training data from the same language the model is required for. Otherwise, as in the cross-lingual approach to CWI, the strategy (features, models, dataset languages) should be chosen carefully to get a satisfactory result. That is why it is hard to create such NLP systems.

Also, it is important to mention that even the collection of a dataset, especially for complex word identification, has to be performed properly.
As we know, the "complexity" of a certain word is very subjective and depends on the people who were chosen to answer the question "Is this word complex?". The answer to this question will differ between native and non-native speakers, and between children and adults. This factor affects the dataset and, in turn, the quality of the solution.

# Conclusion

The CWI task is very important for different parts of language understanding, but it is hard to create such a system, because languages are very large and there is not enough data for a fully working solution. Because of this, all solutions are divided into two groups: mono- and cross-lingual systems. The best strategy for a mono-lingual system is to use language-dependent features such as N-gram corpora frequency, language model probability, and term frequency in corpora with methods like Voting Classifier, AdaBoost, or Random Forest. For cross-lingual solutions, the best strategy is to use language-independent features like word embeddings with SVM, CNN, or Extra Trees as the method.

# References

# Supporting information

## supporting information 1 Lit log

## supporting information 2 metrics table

## supporting information 3 features table
