# National Sun Yat-sen University NLP Workshop
- Instructor: 江豪文 [Haowen Jiang](https://howard-haowen.github.io/)
- Venue: Online
- Dates: 2022-04-15 to 2022-06-10
- Web page: [click here](https://howard-haowen.github.io/NLP-demos/nsysu_workshop)
---
## References
- [spaCy notebooks](https://github.com/explosion/spacy-notebooks)
- [NLP Town notebooks](https://github.com/nlptown/nlp-notebooks)
![](https://cdn-icons-png.flaticon.com/512/2845/2845825.png =300x)
----
## Related books
<style>
.twocolumn {
display: grid;
grid-template-columns: 1fr 1fr;
grid-gap: 10px;
text-align: center;
}
</style>
<div class="twocolumn">
<div>
<img src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1630086235l/58870327._SX318_.jpg">
</div>
<div>
<img src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1591328063l/53832790._SX318_.jpg">
</div>
</div>
----
## Tools
- [nltk](https://www.nltk.org/)
- [spaCy](https://spacy.io/) Playground [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/howard-haowen/rise-env/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252Fhoward-haowen%252FNLP-demos%26urlpath%3Dtree%252FNLP-demos%252Fspacy_playground.ipynb%26branch%3Dmain)
- [stanza](https://stanfordnlp.github.io/stanza/)
- [gensim](https://radimrehurek.com/gensim/)
- [sklearn](https://scikit-learn.org/stable/)
----
## spaCy
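```python=
import spacy

# MODEL_NAME, TEXT and TEXTS are placeholders: the name of an installed
# pipeline (for example "en_core_web_sm"), a string, and an iterable of strings.
nlp = spacy.load(MODEL_NAME)
# When dealing with a single text in a pipeline
doc = nlp(TEXT)
# When simply tokenizing a single text
doc = nlp.make_doc(TEXT)
# When dealing with a batch of texts; nlp.pipe() yields Doc objects lazily
docs = nlp.pipe(TEXTS, batch_size=50)
```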
- Sentences: `doc.sents`
- Named entities: `doc.ents`
- Static vectors: `doc.vector`
- Dynamic (contextual) vectors: `doc._.trf_data`
- Text categories: `doc.cats`
----
## Datasets
- [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/nlp_datasets.ipynb)
```python=
# scikit-learn
from sklearn.datasets import fetch_20newsgroups
# gensim
import gensim.downloader as api
# tatoeba
from tatoebatools import tatoeba
# nltk
import nltk
# datasets from Hugging Face
# (list_datasets is deprecated in recent releases of `datasets`)
from datasets import load_dataset
```
---
## Week 1️⃣
- NLP applications
- Getting familiar with the Colab environment
- Using pretrained models
- [Tokenization](https://en.wikipedia.org/wiki/Lexical_analysis#Tokenization)
- [Parts of speech](https://en.wikipedia.org/wiki/Part_of_speech)
- [Named entity recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
- [Dependency parsing](https://en.wikipedia.org/wiki/Syntactic_parsing_(computational_linguistics)#Dependency_parsing)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W01-use-pretrained-models.ipynb)
----
## Week 2️⃣
- Obtaining datasets
- Data preprocessing
- Training topic models
- [Bag of Words(BOW)](https://en.wikipedia.org/wiki/Bag-of-words_model)
- [N-gram](https://en.wikipedia.org/wiki/N-gram)
- [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
- [HDP](https://en.wikipedia.org/wiki/Hierarchical_Dirichlet_process)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W02-topic-modelling.ipynb)
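
The bag-of-words and n-gram ideas listed above can be sketched with the standard library alone. This is a minimal illustration, not the code from the workshop notebook; the helper names `bag_of_words` and `ngrams` and the toy sentence are assumptions made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokens)


# Toy sentence, split on whitespace for illustration only.
tokens = "the cat sat on the mat".split()

bow = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)

print(bow["the"])    # 2
print(bigrams[0])    # ('the', 'cat')
print(len(bigrams))  # 5
```

Topic models such as LDA typically take counts like these as input, one bag of words per document.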
----
## Week 3️⃣
- Text vectorization 1
- Frequency-based representation
- [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
- [Document clustering](https://en.wikipedia.org/wiki/Document_clustering)
- [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W03-document-vectorization-and-clustering.ipynb)
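
To make the TF-IDF weighting concrete, here is a minimal sketch of the textbook variant (raw term count times `log(N / df)`). It is an assumption-laden illustration rather than the notebook's code: the function name `tf_idf` and the three toy documents are invented here, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF with L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute textbook TF-IDF weights for a list of tokenized documents.

    tf is the raw count of a term in a document; idf is log(N / df),
    where N is the number of documents and df is the number of
    documents that contain the term.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights


# Toy corpus, invented for illustration.
docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]
weights = tf_idf(docs)

# "the" occurs in every document, so its weight is log(3/3) = 0.
print(weights[0]["the"])  # 0.0
```

A term that appears in only one document ("sat") receives a higher weight than one shared by two documents ("cat"), which is the property clustering on TF-IDF vectors relies on.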
----
## Week 4️⃣
- Text vectorization 2
- Static word embeddings
- [Word2vec](https://en.wikipedia.org/wiki/Word2vec) by Google
- [fastText](https://en.wikipedia.org/wiki/FastText) by Facebook
- [GloVe](https://en.wikipedia.org/wiki/GloVe) (Global Vectors) by the Stanford NLP team
- [Document clustering](https://en.wikipedia.org/wiki/Document_clustering)
- [k-means clustering](https://en.wikipedia.org/wiki/K-means_clustering)
- Notebook: same as the Week 3️⃣ notebook
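
One common way to turn static word embeddings into a document vector for clustering is to average the vectors of the words in the document (spaCy's `doc.vector` defaults to such an average). The sketch below assumes a tiny hand-made embedding table; `EMBEDDINGS` and `document_vector` are hypothetical names, and the numbers are invented.

```python
# Hypothetical 3-dimensional embeddings, invented for illustration;
# real Word2vec, fastText or GloVe vectors have hundreds of dimensions.
EMBEDDINGS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.0, 1.0],
}


def document_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # none of the tokens is in the vocabulary
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# "unknown" has no embedding and is skipped.
doc_vec = document_vector(["cat", "dog", "unknown"], EMBEDDINGS)
print([round(x, 2) for x in doc_vec])
```

Because every word has a single fixed vector, the same word contributes identically in every context, which is the limitation that dynamic embeddings address.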
----
## Week 5️⃣
- Text vectorization 3
- Dynamic embeddings
- [USE](https://tfhub.dev/google/universal-sentence-encoder/4) (Universal Sentence Encoder)
- [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) (Bidirectional Encoder Representations from Transformers)
- [Text similarity](https://en.wikipedia.org/wiki/Semantic_similarity#In_natural_language_processing)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W05-transformer-and-document-similarity.ipynb)
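
Text similarity over embeddings is usually measured with cosine similarity. The following is a minimal sketch under stated assumptions: the vectors stand in for sentence embeddings that a model such as USE or BERT would produce, and the names `query` and `candidates` are made up for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical sentence embeddings, invented for illustration.
query = [1.0, 2.0, 0.0]
candidates = {
    "a": [2.0, 4.0, 0.0],  # same direction as the query
    "b": [0.0, 0.0, 3.0],  # orthogonal to the query
    "c": [1.0, 0.0, 0.0],  # partially aligned
}

# Rank candidates from most to least similar to the query.
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
print(ranked)  # ['a', 'c', 'b']
```

Cosine similarity ignores vector length, so "a" scores as a perfect match even though it is twice as long as the query.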
----
## Week 6️⃣
- [Text classification](https://en.wikipedia.org/wiki/Document_classification): traditional machine learning
- Classification algorithms
- [Naive Bayes classifiers](https://www.geeksforgeeks.org/naive-bayes-classifiers/?ref=leftbar-rightbar)
- [Support Vector Machines](https://www.geeksforgeeks.org/support-vector-machine-algorithm/?ref=gcse)
- [Logistic Regression](https://www.geeksforgeeks.org/understanding-logistic-regression/?ref=gcse)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W06-text-classification-with-scikit-learn.ipynb)
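
As an illustration of the first algorithm in the list, here is a dependency-free sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is not the notebook's scikit-learn code; the function names `train_naive_bayes` and `predict` and the four training reviews are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect the counts a multinomial Naive Bayes classifier needs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(tokens, model):
    """Return the class with the highest log posterior."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        total = sum(word_counts[label].values())
        score = math.log(count / n_docs)  # log prior
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen in training
            # add-one smoothing avoids log(0) for unseen class/word pairs
            score += math.log(
                (word_counts[label][tok] + 1) / (total + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Toy training set, invented for illustration.
train_docs = [
    "great fun film".split(),
    "wonderful great acting".split(),
    "boring dull film".split(),
    "dull awful plot".split(),
]
train_labels = ["pos", "pos", "neg", "neg"]

model = train_naive_bayes(train_docs, train_labels)
print(predict("great film".split(), model))  # pos
print(predict("dull plot".split(), model))   # neg
```

Working in log space keeps the product of many small probabilities from underflowing on longer documents.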
----
## Week 7️⃣
- [Text classification](https://en.wikipedia.org/wiki/Document_classification): neural networks
- Evaluation metrics
- Accuracy
- Recall
- Precision
- F1
- [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W07-text-classification-with-spacy.ipynb)
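
The first four metrics above all follow from the confusion-matrix counts. The sketch below computes them for a binary task; the function name `binary_metrics` and the label lists are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a universal rule.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical gold labels and predictions: 3 TP, 1 FP, 1 FN, 3 TN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.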
----
## Week 8️⃣
- [Named entities](https://en.wikipedia.org/wiki/Named-entity_recognition)
- Notebook [![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/howard-haowen/NLP-demos/blob/main/NSYSU/W08-extracting-named-entities-with-spacy.ipynb)