# Simple NLP Exercise
Development environment: PyTorch & Spyder
---
1. Introduction to Word Embedding
2. Coding Practice
3. References
---
## 1. Introduction to Word Embedding
Word embedding is a way of preprocessing text data into numeric form. Besides the most traditional approach, one-hot encoding, common methods include Word2Vec and Doc2Vec.
* ### One-hot Encoding:
A method for turning words into numeric form. The idea is simple: each word is marked by its index in the vocabulary. Suppose the world contained only five words, apple, banana, like, hello, and hi; their corresponding one-dimensional vectors would be (see the sketch after this list):
apple = [1, 0, 0, 0, 0]
banana = [0, 1, 0, 0, 0]
like = [0, 0, 1, 0, 0]
hello = [0, 0, 0, 1, 0]
hi = [0, 0, 0, 0, 1]
* ### Word2Vec:
A method for converting words into dense vectors of a chosen dimension. Unlike one-hot encoding, it does not need a huge sparse matrix for storage, and words with similar meanings end up with similar vectors (see the gensim sketch after this list).
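
To make the one-hot idea above concrete, here is a minimal sketch using PyTorch's `torch.nn.functional.one_hot` on the five-word toy vocabulary:
```python=
import torch
import torch.nn.functional as F

words = ['apple', 'banana', 'like', 'hello', 'hi']
indices = torch.arange(len(words))                   # tensor([0, 1, 2, 3, 4])
one_hot = F.one_hot(indices, num_classes=len(words))
print(one_hot[0])                                    # apple -> tensor([1, 0, 0, 0, 0])
```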
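
For Word2Vec itself the usual tool is the gensim library rather than PyTorch. A minimal sketch, assuming gensim >= 4.0 (where the dimension parameter is named `vector_size`):
```python=
from gensim.models import Word2Vec

sentences = [['i', 'like', 'apple'],                # toy corpus for illustration
             ['i', 'like', 'banana'],
             ['hello', 'hi']]
model = Word2Vec(sentences, vector_size=5, window=2, min_count=1)
print(model.wv['apple'])                            # a dense 5-dimensional vector
print(model.wv.most_similar('apple', topn=1))       # nearest word by cosine similarity
```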

## 2. Coding Practice
#### Import the required libraries:
```python=
import torch
from torch import nn, optim
import torch.nn.functional as F

# training text: a short Chinese news snippet used as the corpus
text = '''隨著台灣武漢肺炎疫情趨緩,
加上政府紓困措施奏效,
新竹科學園區減班休息人數驟降至63人,
短短不到1個月時間,
減少超過9成。'''

# build (context, target) trigrams: two consecutive characters predict the third
# (newlines and punctuation are treated as ordinary characters here)
trigram = [((text[i], text[i + 1]), text[i + 2])
           for i in range(len(text) - 2)]
```
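Each element of `trigram` pairs a two-character context with the character that follows it, for example:
```python=
print(trigram[0])   # (('隨', '著'), '台')
```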
---
#### Text preprocessing
```python=
vocab = set(text)                      # collect the distinct characters
print('vocab =', len(vocab))           # check the vocabulary size
# mappings between characters and their index values
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {word2idx[word]: word for word in word2idx}
#print('word_to_idx = ', word2idx)
#print('\nidx_to_word = ', idx2word)
```
The Chinese characters as the machine sees them:

The character '武' after being converted into a vector:
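The vector for '武' can be reproduced (up to random initialization) by pushing the character's index through an `nn.Embedding` layer; a minimal sketch, reusing `vocab` and `word2idx` from above:
```python=
embedding = nn.Embedding(len(vocab), 5)             # 5-dim embeddings, randomly initialized
wu_idx = torch.tensor([word2idx['武']], dtype=torch.long)
print(embedding(wu_idx))                            # an untrained 5-dim vector for '武'
```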

---
#### Build the model and set the training parameters
```python=
class NLP(nn.Module):
    def __init__(self, vocab_size, context_size, n_dim):
        super(NLP, self).__init__()
        self.n_word = vocab_size
        self.word_embedding = nn.Embedding(self.n_word, n_dim)
        self.linear1 = nn.Linear(context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)

    def forward(self, x):
        emb = self.word_embedding(x)    # (context_size, n_dim)
        emb = emb.view(1, -1)           # flatten to (1, context_size * n_dim)
        out = self.linear1(emb)
        out = F.relu(out)
        out = self.linear2(out)
        log_prob = F.log_softmax(out, dim=1)
        return log_prob

nlp = NLP(len(vocab), 2, 5)             # 2-character context, 5-dim embeddings
#print(nlp)
criterion = nn.NLLLoss()
optimizer = optim.SGD(nlp.parameters(), lr=1e-3)
```
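A quick sanity check of the shapes: the model takes a two-element index tensor and returns one row of log-probabilities over the whole vocabulary:
```python=
dummy = torch.tensor([0, 1], dtype=torch.long)   # any two character indices
print(nlp(dummy).shape)                          # torch.Size([1, len(vocab)])
```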
#### Start training
```python=
for epoch in range(250):
    print('epoch: {}'.format(epoch + 1))
    running_loss = 0
    for data in trigram:
        word, label = data
        word = torch.tensor([word2idx[i] for i in word], dtype=torch.long)
        label = torch.tensor([word2idx[label]], dtype=torch.long)
        out = nlp(word)
        loss = criterion(out, label)
        running_loss += loss.item()     # accumulate as a plain float
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(trigram)))
```
After roughly 250 epochs, the loss comes down to about 0.93.

#### Test the prediction
```python=
context, target = trigram[0]            # pick any (context, target) pair to test
word = torch.tensor([word2idx[i] for i in context], dtype=torch.long)
out = nlp(word)
_, predict_label = torch.max(out, 1)
predict_word = predict_label.item()
print(predict_word)                     # the predicted character's index
print(idx2word[predict_word])           # the predicted character itself
print(word2idx)                         # the full character-to-index mapping
```
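To inspect the vectors the model actually learned, read the weights of the embedding layer directly (`nn.Embedding` keeps them in its `weight` parameter):
```python=
learned = nlp.word_embedding.weight.data    # a (vocab_size, 5) matrix
print(learned[word2idx['武']])              # the trained 5-dim vector for '武'
```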

## 3. References
* https://medium.com/pyladies-taiwan/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86%E5%85%A5%E9%96%80-word2vec%E5%B0%8F%E5%AF%A6%E4%BD%9C-f8832d9677c8
* https://www.pytorchtutorial.com/10-minute-pytorch-7/
* https://clay-atlas.com/blog/2019/11/26/nlp-word-embedding-%E7%AD%86%E8%A8%98/
* https://www.itread01.com/content/1541454604.html