Project Description

<h1><center>

<h5>Capstone Project Presentation</h5>

<img src='https://i.imgur.com/Qu8tSFo.png' alt="Frontpage" title="Frontpage" width="960" height="540" />

</center></h1>

<h4>

- **Student**: Tran Thien Phu
- **Instructors**: Huy Le, Ha V. Nguyen, Tung H. Cao, Hung Ngo
- **Course**: Machine Learning/Deep Learning for AI, 2022

**LeHongPhong Highschool for the Gifted, HCMC, Vietnam**

</h4>

---

<h1><center>

<h5>Introduction</h5>

<img src='https://i.imgur.com/8mgCvjY.jpg' alt="Frontpage" title="Frontpage" width="1280" height="720" />

</center></h1>

<h3>

- **Project name**: Vietnamese Essay Categorizer - An educational tool that helps young students write essays

</h3>

---

<h1> <center>Project Description</center> </h1>

<h3 style="text-align:left"> Why I chose this project: </h3>

- At primary levels of education in Vietnam, students are introduced to 5 categories of essays:
    - Argumentative - Nghị luận
    - Expressive - Biểu cảm
    - Descriptive - Miêu tả
    - Narrative - Tự sự
    - Expository - Thuyết minh
- New students usually find it hard to write in the required category (e.g. they describe too much in an expressive essay).

---

<h1> <center>Project Description</center> </h1>

<h3 style="text-align:left"> How I solve this problem: </h3>

- Split the problem into 2 main tasks: Essay Classification & Spell Correcting
- This AI assistant includes a fine-tuned PhoBERT model for sequence classification, which predicts the category of an input essay, and a self-written SpellCorrect algorithm, which checks spelling and suggests alternatives.

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong>Task</strong> </h2>

<h3 style = "text-align:left">

- **Description**: Take a user's essay as input and return its category.
- **Input**: A user's essay
- **Output**: 1 of 5 categories

</h3>

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong> Experience </strong> </h2>

<h3 style = "text-align:left">

- Sample essays of each category from the internet
- An example for easy visualization:

</h3>

<img src='https://i.imgur.com/IWJhFoY.png' width="720" height="360" />

<h4>

- I created and annotated the dataset manually after hours of internet surfing. It contains 14119 labeled sentences. For this model, it was split into train, validation and test sets in an 8-1-1 ratio.

</h4>

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong> Function space: </strong> Finetuned PhoBERT </h2>

<h3> <center>PhoBERT</center> </h3>

<img src='https://i.imgur.com/IU8Dfu6.png' />

<h3> </h3>

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong> Function space: </strong> Finetuned PhoBERT </h2>

<h4> <center>Finetuned PhoBERT for SequenceClassification</center> </h4>

<img src='https://i.imgur.com/9dJC53a.png' width = 600 height = 720/>

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong> Function space: </strong> Finetuned PhoBERT </h2>

<h3> <center>Finetuned PhoBERT for SequenceClassification</center> </h3>

- Finetuned PhoBERT-base model for classification.
- Architecture: Pretrained PhoBERT-base followed by a softmax regression layer

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong> Performance metrics </strong> </h2>

<h3>

- Categorical cross-entropy loss: approximately 0.75
- Accuracy: 83.35% on the test set
- More details in the demonstration part :smile:

</h3>

---

<h1> <center>Essay Categorizing</center> </h1>

<h2 style = "text-align:left"> <strong> Algorithm </strong> </h2>

<h3 style = "text-align:left">

- Optimization algorithm: Adam

</h3>

---

<h1> <center>Spell Correcting</center> </h1>

- **Task**: Detect mistakes and provide alternative suggestions
    - Input: A text sequence
    - Output: Detected mistakes with suggestions
- **Experience**: A dictionary of 74k Vietnamese words
- **Function space**: Edit distances between words
- **Performance**: Works well on typos, but not on mistakes that can only be detected from context (e.g. sửa chữa & sữa chữa)
- **Algorithm**: Damerau-Levenshtein distance from the [textdistance](https://github.com/life4/textdistance) module

---

# Why PhoBERT

- Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese NLP tasks.
- Read more about PhoBERT on [huggingface](https://huggingface.co/vinai/phobert-base) or in the [EMNLP-2020 Findings paper](https://arxiv.org/abs/2003.00744)

---

# Why Damerau-Levenshtein

- Compared with other algorithms, Damerau-Levenshtein excels in time efficiency

<center><img src='https://i.imgur.com/qd29W1R.png' width="1280" height="700" /></center>

---

# Data

- For Essay Categorization:
    - Sample essays from various sources on the internet, e.g. vndoc and vietjack.
    - [Dataset link](https://drive.google.com/file/d/1sdCPxp0dSQ5Nx-KmnJBqcJZGWtCj4DB8/view?usp=sharing) (Google Drive)
- For Spell Check:
    - A small dictionary of 74k Vietnamese words.
    - [Dictionary link](https://github.com/PaulTran2734/vietnamese_essay_identifier/tree/main/TextCorrect) (GitHub)

---

# Demonstration

### My Streamlit app is available on Streamlit sharing

Check out my GitHub repository for a tutorial on running the Streamlit app locally, or click this link for [my Streamlit app](https://share.streamlit.io/paultran2734/vietnamese_essay_identifier/main/Final.py)

---

# Lessons

- **More knowledge about NLP - A step closer to my goals**
- Creating, annotating and processing datasets for NLP tasks.
- Debugging skills and **patience** :frowning:
- Time management and simple web-app development

---
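
The Spell Correcting idea from the earlier slide (look each word up in a dictionary, then rank dictionary words by edit distance) can be sketched roughly as below. This is only an illustration, not the project's SpellCorrect code: it uses a hand-written restricted Damerau-Levenshtein (optimal string alignment) distance instead of the textdistance module, and the tiny `DICTIONARY`, the helpers `suggest` and `check_text`, and the `max_distance` threshold are all hypothetical stand-ins.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to NFC so each accented Vietnamese letter is one code point."""
    return unicodedata.normalize("NFC", text)


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and swaps of two
    adjacent characters, each with cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "hc" -> "ch".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# Hypothetical stand-in for the 74k-word dictionary.
DICTIONARY = {nfc(w) for w in ["học", "sinh", "trường", "sửa", "chữa", "sữa"]}


def suggest(word: str, max_distance: int = 1) -> list:
    """Return [] for a known word, else nearby dictionary words, closest first."""
    word = nfc(word.lower())
    if word in DICTIONARY:
        return []
    scored = sorted((osa_distance(word, w), w) for w in DICTIONARY)
    return [w for dist, w in scored if dist <= max_distance]


def check_text(text: str) -> dict:
    """Map every out-of-dictionary word in the text to its suggestions."""
    words = [nfc(w) for w in text.lower().split()]
    return {w: suggest(w) for w in words if w not in DICTIONARY}


print(check_text("hcọ sinh"))   # the swapped letters in "hcọ" are flagged
print(check_text("sữa chữa"))   # nothing is flagged
```

The second call flags nothing because "sữa" is itself a dictionary word. A purely dictionary-based check cannot tell that "sửa chữa" was intended, which is the context limitation noted on the Spell Correcting slide.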