# Proposal of Final Project
### Vietnamese Spell Correction - Nguyet Vo
## Objective
- With the rapid development of Machine Learning (ML) algorithms, and Deep Learning (DL) algorithms in particular, Natural Language Processing (NLP) has achieved great success. To gain hands-on experience with these new technologies and algorithms, this project implements text classification using a DL algorithm, namely the Long Short-Term Memory (LSTM) network.

## Approach
1. Pre-processing of documents
2. Word embedding
- Frequency-based embedding: BoW and TF-IDF
- Prediction-based embedding: Word2vec
3. RNN
4. LSTM
Details:
**1. Pre-processing of documents:**
**a. Clean text:**
- Remove special characters. Data retrieved from the Internet usually contains special characters such as "@#$%^", icons, and leftover HTML tags. These characters are noise that reduces the accuracy of the document classification step.
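As an illustration, a minimal cleaning pass can be written with regular expressions; the exact character set to strip is a project choice, not fixed by this proposal:

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags and special characters, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)               # leftover HTML tags
    text = re.sub(r"[@#$%^&*~`|<>{}\[\]]", " ", text)  # special characters
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("<p>Tin tức @#$ hôm nay</p>"))  # -> "Tin tức hôm nay"
```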
**b. Word separator:**
- In Vietnamese, whitespace cannot be used as a word delimiter; it only separates syllables. Therefore, word segmentation is one of the most important fundamental problems when processing Vietnamese. For example, the word "đất nước" (country) is made up of two syllables, "đất" (land) and "nước" (water). Each syllable has its own meaning when standing alone, but put together they take on a different meaning. Because of this feature, word segmentation is a prerequisite for other NLP applications such as text classification, text summarization, machine translation, etc.
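For example, a sketch using the underthesea toolkit (linked in the references); pyvi's ViTokenizer would work similarly:

```python
# Minimal segmentation sketch, assuming the underthesea library is installed.
from underthesea import word_tokenize

sentence = "đất nước Việt Nam"
print(word_tokenize(sentence))                 # e.g. ['đất nước', 'Việt Nam']
print(word_tokenize(sentence, format="text"))  # e.g. 'đất_nước Việt_Nam'
```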
**c. Standardize words:**
- The goal is to bring text from heterogeneous forms back to a single canonical form, which matters both for memory efficiency and for accuracy. This step typically normalizes the character encoding and converts uppercase characters to lowercase.
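A minimal sketch of this step in Python. Unicode normalization matters for Vietnamese because the same accented letter can be encoded either as one precomposed code point or as a base letter plus combining diacritics:

```python
import unicodedata

def standardize(text: str) -> str:
    """Compose combining diacritics into single code points, then lowercase."""
    return unicodedata.normalize("NFC", text).lower()

print(standardize("Đất Nước"))  # -> "đất nước"
```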
**d. Remove stopwords:**
- Stopwords are words that appear frequently in natural language but carry little meaning. In Vietnamese, stopwords are words like "để", "này", "kia"; in English, words like "is", "that", "this". There are two main ways to remove stopwords (see the sketch below):
+ Use a stopword dictionary (build the dictionary and remove every word in it).
+ Use frequency of occurrence (e.g. TF-IDF; the simple idea is that the more documents a word appears in, the more likely it is a stopword).
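A dictionary-based removal could look like the following; the stopword set here is an illustrative subset, not a complete Vietnamese stopword list:

```python
# Illustrative stopword subset; a real project would load a full list from file.
STOPWORDS = {"để", "này", "kia", "là", "và", "của"}

def remove_stopwords(tokens):
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords(["quyển", "sách", "này", "của", "tôi"]))
# -> ['quyển', 'sách', 'tôi']
```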
**e. Building dictionaries:**
- We need to turn every word in the text into a numerical representation. The simplest way to do this is to build a dictionary and then replace each word with its index in that dictionary.
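A minimal sketch of index-based encoding, assuming the text has already been segmented:

```python
from collections import Counter

def build_vocab(sentences, min_freq=1):
    """Map each word to an integer index; 0 is reserved for unknown words."""
    counts = Counter(word for sent in sentences for word in sent)
    vocab = {"<unk>": 0}
    for word, freq in counts.most_common():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

vocab = build_vocab([["tôi", "yêu", "đất_nước"], ["đất_nước", "đẹp"]])
print([vocab.get(w, 0) for w in ["tôi", "yêu", "biển"]])  # -> [2, 3, 0]
```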
**f. Vectorization of words and text:**
- This step vectorizes the words in each sentence. One option is word2vec, which represents each word as a vector; alternatively, doc2vec represents an entire document as a single vector. Gensim is a library commonly used for vectorizing words and texts.
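A minimal Gensim sketch of both options, assuming gensim >= 4 (where the size parameter is called `vector_size`):

```python
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [["tôi", "yêu", "đất_nước"], ["đất_nước", "việt_nam"]]

# word2vec: one vector per word (sg=1 selects skip-gram, sg=0 selects CBOW).
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
word_vector = w2v.wv["đất_nước"]

# doc2vec: one vector per document.
docs = [TaggedDocument(words=s, tags=[i]) for i, s in enumerate(sentences)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1)
doc_vector = d2v.dv[0]
```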
**2. Word embedding:**
Word embedding methods are mainly divided into two categories:
- Frequency-based embedding:
+ These rely on the frequency of words to create word vectors; the most popular are Bag of Words (BoW) and TF-IDF.
+ The idea of the Bag of Words algorithm is to analyze and group documents based on a "bag of words" built from the corpus: for a new document, we count how many times each word of the bag appears in it. Raw counts still have shortcomings (frequent but uninformative words dominate), so TF-IDF is a common remedy.
- Prediction-based embedding: Word2vec (sketched in code after this list)
+ Skip-gram model:
+ The skip-gram model predicts the surrounding words given a center word. For example, with text = "i love you so much" and a search window of size 3 we get {(i, you), love}, {(love, so), you}, {(you, much), so}: given a center word such as "love", the task is to predict the surrounding words "i" and "you".
+ Continuous Bag of Words (CBOW):
+ This model is the opposite of the skip-gram model: it predicts the current word from its surrounding words.
+ In practice, only one of the two models is chosen for training; CBOW is faster to train but its accuracy is usually lower than skip-gram's, and vice versa.
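The two families above can be made concrete. First, scikit-learn's `CountVectorizer` and `TfidfVectorizer` implement BoW and TF-IDF; second, a few lines of plain Python reproduce the skip-gram windowing from the "i love you so much" example:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["tôi yêu đất_nước", "đất_nước việt_nam tươi đẹp"]
bow = CountVectorizer().fit_transform(corpus)    # raw counts (Bag of Words)
tfidf = TfidfVectorizer().fit_transform(corpus)  # counts re-weighted by rarity

def skipgram_pairs(tokens, window=1):
    """Generate (center, context) training pairs for the skip-gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("i love you so much".split()))
# [('i', 'love'), ('love', 'i'), ('love', 'you'), ('you', 'love'), ...]
```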
**3. RNN:**
+ RNNs were born from the idea of using memory to carry information from previous processing steps forward, so that the prediction at the current step can be more accurate.
+ RNNs come in several configurations: one-to-one, one-to-many, many-to-one, and many-to-many. The recurrence that implements this memory is given below.
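For reference, the standard (Elman) recurrence is:

$$
h_t = \tanh(W x_t + U h_{t-1} + b), \qquad y_t = \mathrm{softmax}(V h_t)
$$

where $x_t$ is the input at step $t$, $h_{t-1}$ is the hidden state carried over from the previous step, and $W$, $U$, $V$, $b$ are learned parameters.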

**4. Long Short Term Memory networks (LSTM):**
+ Input Gate / Update Gate: determines how much new information is added to the current cell state.
+ Output Gate: decides which part of the current cell state is exposed as the output.
+ Forget Gate: determines which information from the previous state is discarded at the current timestep.
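In the standard formulation, with $\sigma$ the sigmoid function and $\odot$ elementwise multiplication, the three gates and the state updates are:

$$
\begin{aligned}
f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) && \text{(input/update gate)} \\
\tilde{C}_t &= \tanh(W_C[h_{t-1}, x_t] + b_C) && \text{(candidate state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state)} \\
o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state / output)}
\end{aligned}
$$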

## Models
**1. Data description:**
+ The data includes more than 80,000 articles divided into 10 topics: Life, World, Science, Sports, Business, Culture, Law, Health, Computers, and Politics & Society.
+ Articles belonging to the same topic are stored in the same folder.
+ The train set includes more than 50,000 articles; the test set includes more than 30,000 articles.
+ Preparing the dataset:
+ Datasets:
+ VNTC:
+ https://github.com/duyvuleo/VNTC
+ https://drive.google.com/drive/folders/16IjAeRrKJmLIZfKicbOcyhLaMT5RN1IT
+ Viwiki articles: https://dumps.wikimedia.org/viwiki/latest/
+ Add noise (a noise-injection sketch follows this list):
+ Vietnamese (Telex) typing errors (à -> af, ờ -> owf, đ -> dd, ...)
+ Abbreviation errors (tôi -> t, anh -> a, không -> ko, ...)
+ Syllable confusion errors (x-s, l-n, d-gi)
+ Teen code (ch -> ck, th -> tk, ph -> f)
+ Missing diacritics (accents omitted entirely)
+ Model training:
+ Using the Google Colab GPU
+ Process:
+ 10 topics
+ 27 topics for transfer learning
+ Viwiki articles (5 MB) to evaluate the model
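A minimal sketch of the noise injection, with hypothetical substitution tables built from the error types listed above:

```python
import random

# Hypothetical substitution tables drawn from the error types above.
TELEX_ERRORS  = {"à": "af", "ờ": "owf", "đ": "dd"}
ABBREVIATIONS = {"không": "ko", "tôi": "t", "anh": "a"}
TEEN_CODE     = {"ch": "ck", "th": "tk", "ph": "f"}

def add_noise(sentence: str, p: float = 0.3) -> str:
    """Corrupt each word with probability p using one random error table."""
    noisy = []
    for word in sentence.split():
        if random.random() < p:
            table = random.choice([TELEX_ERRORS, ABBREVIATIONS, TEEN_CODE])
            for src, dst in table.items():
                word = word.replace(src, dst)
        noisy.append(word)
    return " ".join(noisy)

print(add_noise("tôi không thích ăn phở"))  # e.g. "t ko thích ăn fở"
```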
**2. Steps:**
- Read the data and divide it into three parts: train, validation, and test. Each example consists of the article text and its label.
- Preprocessing:
+ Word segmentation
+ Remove stopwords and special characters
+ Build the word dictionary
+ TF-IDF
+ One-hot encode the labels
- Training (a model sketch follows this list):
+ Build the LSTM model
+ Tune the network structure and hyperparameters
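A minimal Keras sketch of the classifier, assuming TensorFlow 2.x on Colab; the vocabulary size, sequence length, and layer widths are placeholders to tune:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # placeholder: size of the word dictionary
NUM_TOPICS = 10

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    layers.LSTM(128),
    layers.Dropout(0.5),
    layers.Dense(NUM_TOPICS, activation="softmax"),  # one-hot labels
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)
```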

## Implementation
**Spell Correction Problem:**
***a. Overview:***
- Auto-correction is one of the most basic problems in natural language processing:
+ The feature is built into text editing, input, and recognition applications. When typing on a mobile phone it is easy to make mistakes, so automatic spelling correction is an indispensable component of any keyboard.
+ Auto-correction techniques are mature and work very well for many languages, especially English. But what about Vietnamese?
- Compared to other languages, Vietnamese diacritics are much more complicated:
+ More than 95% of Vietnamese words contain diacritics, compared to 15% in French and 35% in Romanian.
+ More than 80% of Vietnamese syllables become ambiguous (collide with another syllable) when diacritics are removed, versus 50% in French and 25% in Romanian.
+ The average number of accented variants per word is about 1.2 in French and Romanian, while in Vietnamese it is greater than 2.
- Some common errors:
+ Misconception errors (the writer believes an incorrect spelling is correct)
+ Typos or abbreviations
+ Slang
***b. Deep Learning for the Spell Correction problem:***
1. RNN:

2. LSTM:

3. Bidirectional LSTM: takes advantage of the context that follows a token as well as the context that precedes it, instead of focusing only on preceding information like a traditional (unidirectional) RNN.

4. Encoder-Decoder LSTM (seq2seq):


5. Beam search decoding:

6. Attention:

## References
1. https://www.slideshare.net/gmovnlab/ng-dng-nlp-vo-vic-xc-nh-mun-ngi-dng-intent-detection-v-sa-li-chnh-t-trong-ting-vit-spell-correction
2. https://sci-hub.tw/https://link.springer.com/chapter/10.1007/978-3-319-11680-8_49
3. https://viblo.asia/p/ban-ve-xu-ly-ngon-ngu-tieng-viet-924lJYdYZPM
4. https://github.com/undertheseanlp/NLP-Vietnamese-progress/blob/master/tasks/spelling_correction.md
5. https://github.com/undertheseanlp
6. https://ongxuanhong.wordpress.com/2016/02/06/gioi-thieu-cac-cong-cu-xu-ly-ngon-ngu-tu-nhien/
7. https://sci-hub.tw/10.1109/RIVF.2008.4586339
8. https://eprints.uet.vnu.edu.vn/eprints/id/eprint/3875/1/Pacling%202019.pdf
9. https://kipalog.com/posts/Gioi-thieu-tien-xu-ly-trong-xu-ly-ngon-ngu-tu-nhien