<h1><center>
<h5>Capstone Project Presentation</h5>
<img src='https://i.imgur.com/Qu8tSFo.png' alt="Frontpage" title="Frontpage" width="960" height="540" />
</center></h1>
<h4>
- **Student**: Tran Thien Phu
- **Instructors**: Huy Le, Ha V. Nguyen, Tung H. Cao, Hung Ngo
- **Course**: Machine Learning/Deep Learning for AI, 2022
**LeHongPhong Highschool for the Gifted, HCMC Vietnam**
</h4>
---
<h1><center>
<h5>Introduction</h5>
<img src='https://i.imgur.com/8mgCvjY.jpg' alt="Frontpage" title="Frontpage" width="1280" height="720" />
</center></h1>
<h3>
- **Project name**: Vietnamese Essay Categorizer
- An educational tool for young students that helps with writing essays
</h3>
---
<h1>
<center>Project Description</center>
</h1>
<h3 style="text-align:left">
Why I chose this project:
</h3>
- At primary levels of education in Vietnam, students are introduced to 5 categories of essays:
- Argumentative - Nghị luận
- Expressive - Biểu cảm
- Descriptive - Miêu tả
- Narrative - Tự sự
- Expository - Thuyết minh
- New students usually find it hard to stay within the required category (e.g. describing too much in an expressive essay).
---
<h1>
<center>Project Description</center>
</h1>
<h3 style="text-align:left">
How I solve this problem:
</h3>
- Split the problem into 2 main tasks: Essay Classification & Spell Correcting
- This AI assistant combines a PhoBERT model fine-tuned for sequence classification, which predicts the category of the input essay, with a self-written SpellCorrect algorithm that checks spelling and suggests alternatives.
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong>Task</strong>
</h2>
<h3 style = "text-align:left">
- **Description**: Take a user's essay as input and return its category.
- **Input**: Users' essays
- **Output**: 1 of 5 categories
</h3>
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong> Experience </strong>
</h2>
<h3 style = "text-align:left">
- Sample essays of each category on the internet
- An example for easy visualization:
</h3>
<img src='https://i.imgur.com/IWJhFoY.png' width="720" height="360" />
<h4>
- I created and annotated the dataset manually after hours of internet-surfing. It contains 14119 labeled sentences. For this model, it was split into train, validation and test sets with an 8-1-1 ratio.
</h4>
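
A minimal sketch of an 8-1-1 split, assuming a simple shuffled split. The function name `split_8_1_1` and the fixed seed are illustrative, and the slides do not say whether the actual split was shuffled or stratified by category.

```python
import random

def split_8_1_1(samples, seed=0):
    """Shuffle the samples and split them into train/validation/test at 8:1:1."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # fixed seed keeps the split reproducible
    n_train = len(items) * 8 // 10       # integer arithmetic avoids float rounding
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # the remainder goes to the test set
    return train, val, test

# Sample indices stand in for the 14119 labeled sentences.
train, val, test = split_8_1_1(range(14119))
print(len(train), len(val), len(test))
```

Because of integer rounding, the test set absorbs the few samples left over after the train and validation sets are cut.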
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong> Function space: </strong> Finetuned PhoBERT
</h2>
<h3>
<center>PhoBERT</center>
</h3>
<img src='https://i.imgur.com/IU8Dfu6.png' />
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong> Function space: </strong> Finetuned PhoBERT
</h2>
<h4>
<center>Finetuned PhoBERT for SequenceClassification</center>
</h4>
<img src='https://i.imgur.com/9dJC53a.png' width="600" height="720" />
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong> Function space: </strong> Finetuned PhoBERT
</h2>
<h3>
<center>Finetuned PhoBERT for SequenceClassification</center>
</h3>
- Finetuned PhoBERT-base model for classification.
- Architecture: Pretrained PhoBERT-base followed by a softmax regression layer
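
To make the softmax regression part concrete, here is a small plain-Python sketch of a classification head over the five categories. It is only an illustration: the 3-dimensional `features` vector stands in for the much larger sentence representation PhoBERT would produce, and the `weights` and `biases` are made-up values, not the trained parameters.

```python
import math

LABELS = ["Argumentative", "Expressive", "Descriptive", "Narrative", "Expository"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, biases):
    """Softmax regression: one linear score per category, then softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# Hypothetical toy values, one weight row per category.
features = [0.2, -1.0, 0.5]
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, -1.0, 1.0],
]
biases = [0.0] * 5

label, probs = classify(features, weights, biases)
print(label, [round(p, 3) for p in probs])
```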
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong> Performance metrics </strong>
</h2>
<h3>
- Categorical cross-entropy loss: Approximately 0.75
- Accuracy: 83.35% on the test set
- More details in the demonstration part :smile:
</h3>
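
For readers unfamiliar with these two metrics, the sketch below shows how they can be computed from predicted probabilities and integer labels. The three toy predictions are invented for illustration and are unrelated to the reported 0.75 loss and 83.35% accuracy.

```python
import math

def cross_entropy(probs, targets):
    """Mean categorical cross-entropy: average of -log(probability of the true class)."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def accuracy(probs, targets):
    """Fraction of samples whose most probable class equals the true class."""
    hits = 0
    for p, t in zip(probs, targets):
        predicted = max(range(len(p)), key=p.__getitem__)
        hits += predicted == t
    return hits / len(targets)

# Hypothetical predictions over the 5 categories for 3 samples.
probs = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.20, 0.50, 0.10, 0.10, 0.10],
    [0.30, 0.30, 0.20, 0.10, 0.10],
]
targets = [0, 1, 2]

print(cross_entropy(probs, targets), accuracy(probs, targets))
```

A confident correct prediction contributes little to the loss, while the third sample (true class given only 0.20) contributes the most.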
---
<h1>
<center>Essay Categorizing</center>
</h1>
<h2 style = "text-align:left">
<strong> Algorithm </strong>
</h2>
<h3 style = "text-align:left">
- Optimization algorithm: Adam
</h3>
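
As a rough illustration of what Adam does at each step, here is a plain-Python sketch of the update rule applied to a toy one-parameter problem. The hyperparameters are the commonly used defaults plus an illustrative learning rate; the slides do not state the values used for fine-tuning.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameters and moment estimates."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g          # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)               # bias correction (t starts at 1)
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Toy problem: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
params, m, v = [0.0], [0.0], [0.0]
for t in range(1, 501):
    grads = [2 * (params[0] - 3)]
    params, m, v = adam_step(params, grads, m, v, t, lr=0.1)
print(params[0])
```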
---
<h1>
<center>Spell Correcting</center>
</h1>
- **Task**: Detect mistakes and provide alternative suggestions
- Input: A text sequence
- Output: Detected mistakes with suggestions
- **Experience**: A dictionary of 74k Vietnamese words
- **Function space**: Edit-distances between words
- **Performance**: Works well on typos, but cannot catch valid words used in the wrong context (e.g. sửa chữa & sữa chữa)
- **Algorithm**: Damerau-Levenshtein distance from the [textdistance](https://github.com/life4/textdistance) module
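
The idea behind the spell checker can be sketched in plain Python as follows. This is not the project's SpellCorrect code and it does not use textdistance: `osa_distance` is a hand-written version of the restricted Damerau-Levenshtein (optimal string alignment) variant, and `suggest`, `max_distance` and the tiny word list are hypothetical names and values chosen for illustration.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance: insert, delete, substitute, swap neighbours."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                              # delete every character of a
    for j in range(cols):
        d[0][j] = j                              # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

def suggest(word, dictionary, max_distance=1):
    """Return dictionary words within max_distance edits, or [] if the word is known."""
    if word in dictionary:
        return []                                # known words are never flagged
    scored = sorted((osa_distance(word, entry), entry) for entry in dictionary)
    return [entry for dist, entry in scored if dist <= max_distance]

# Hypothetical four-word dictionary standing in for the 74k-word list.
words = ["sửa", "sữa", "chữa", "học"]
print(suggest("sủa", words))   # a typo: nearby dictionary words are suggested
print(suggest("sữa", words))   # a valid word: nothing is flagged, whatever the context
```

The second call shows the limitation mentioned above: because "sữa" is itself a dictionary word, a pure edit-distance check cannot tell that "sữa chữa" should have been "sửa chữa".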
---
# Why PhoBERT
- Pre-trained PhoBERT models are the state-of-the-art language models for Vietnamese NLP tasks.
- Read more about PhoBERT on [huggingface](https://huggingface.co/vinai/phobert-base) or [EMNLP-2020 Findings paper](https://arxiv.org/abs/2003.00744)
---
# Why Damerau-Levenshtein
- Compared with other algorithms, Damerau-Levenshtein excels in time efficiency
<center><img src='https://i.imgur.com/qd29W1R.png' width="1280" height="700" /></center>
---
# Data
- For Essay Categorization:
- Sample essays from various sources on the internet e.g. vndoc, vietjack, etc.
- [Dataset link](https://drive.google.com/file/d/1sdCPxp0dSQ5Nx-KmnJBqcJZGWtCj4DB8/view?usp=sharing) (Google Drive)
- For Spell Check:
- A small dictionary of 74k Vietnamese words.
- [Dictionary link](https://github.com/PaulTran2734/vietnamese_essay_identifier/tree/main/TextCorrect) (GitHub)
---
# Demonstration
### My Streamlit app is available on Streamlit sharing
Check out my GitHub repository for a tutorial on running the Streamlit app locally, or click this link for [my Streamlit app](https://share.streamlit.io/paultran2734/vietnamese_essay_identifier/main/Final.py)
---
# Lessons
- **More knowledge about NLP - A step closer to my goals**
- Creating, annotating and processing datasets for NLP tasks
- Debugging skills and **patience** :frowning:
- Time-management and simple web-app development