# Simple NLP Exercise

Development environment: PyTorch & Spyder

---

1. Introduction to Word Embedding
2. Coding Exercise
3. References

---

## 1. Introduction to Word Embedding:

Word embedding is a way of preprocessing text data. Besides the most traditional method, one-hot encoding, common approaches include Word2Vec and Doc2Vec.

* ### One-hot Encoding:
  A method for turning text into numeric form. The idea is very simple: each word is marked by its index in the vocabulary, i.e. its vector has a 1 at that index and a 0 everywhere else. Suppose the world contained only the five words apple, banana, like, hello and hi; their corresponding one-dimensional vectors would be:

        apple  = [1, 0, 0, 0, 0]
        banana = [0, 1, 0, 0, 0]
        like   = [0, 0, 1, 0, 0]
        hello  = [0, 0, 0, 1, 0]
        hi     = [0, 0, 0, 0, 1]

  (A minimal code sketch of this encoding appears in the appendix at the end of this note.)

* ### Word2Vec:
  A method that converts text into vectors of a chosen dimensionality. Unlike one-hot encoding, it does not need a huge sparse matrix for storage, and words that are used in similar ways end up with vectors that lie close to each other. (A small embedding sketch also appears in the appendix.)

![](https://i.imgur.com/iAq5SAa.png)

## 2. Coding Exercise:

#### Import the required libraries:
```python=
import torch
from torch import nn, optim
import torch.nn.functional as F

# Training text
text = '''隨著台灣武漢肺炎疫情趨緩,
加上政府紓困措施奏效,
新竹科學園區減班休息人數驟降至63人,
短短不到1個月時間,
減少超過9成。'''

# Build trigrams: two context characters plus the character to predict
# (range stops at len(text) - 2 so the final character is also used as a target)
trigram = [((text[i], text[i + 1]), text[i + 2])
           for i in range(len(text) - 2)]
```

---

#### Text preprocessing
```python=
vocab = set(text)              # the set of distinct characters
print('vocab =', len(vocab))   # vocabulary size

# Conversion between characters and indices
word2idx = {word: i for i, word in enumerate(vocab)}
idx2word = {word2idx[word]: word for word in word2idx}

# print('word_to_idx = ', word2idx)
# print('\nidx_to_word = ', idx2word)
```

The Chinese characters as the machine sees them:

![](https://i.imgur.com/G7eDVks.png)

The character '武' after conversion to a vector:

![](https://i.imgur.com/nN4gU21.png)

---

#### Build the model and set the training parameters
```python=
class NLP(nn.Module):
    def __init__(self, vocab_size, context_size, n_dim):
        super(NLP, self).__init__()
        self.n_word = vocab_size
        self.word_embedding = nn.Embedding(self.n_word, n_dim)
        self.linear1 = nn.Linear(context_size * n_dim, 128)
        self.linear2 = nn.Linear(128, self.n_word)

    def forward(self, x):
        emb = self.word_embedding(x)   # shape: (context_size, n_dim)
        emb = emb.view(1, -1)          # flatten into a single row
        out = self.linear1(emb)
        out = F.relu(out)
        out = self.linear2(out)
        log_prob = F.log_softmax(out, dim=1)
        return log_prob

nlp = NLP(len(vocab), 2, 5)
# print(nlp)

criterion = nn.NLLLoss()
optimizer = optim.SGD(nlp.parameters(), lr=1e-3)
```

#### Start training
```python=
for epoch in range(250):
    print('epoch: {}'.format(epoch + 1))
    running_loss = 0
    for data in trigram:
        word, label = data
        word = torch.LongTensor([word2idx[i] for i in word])
        label = torch.LongTensor([word2idx[label]])

        out = nlp(word)
        loss = criterion(out, label)
        running_loss += loss.item()   # accumulate the scalar loss value

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('Loss: {:.6f}'.format(running_loss / len(word2idx)))
```

Training for about 250 epochs brings the loss down to roughly 0.93:

![](https://i.imgur.com/DOHWkV1.png)

#### Now try an input and check the prediction
```python=
word, label = trigram[4]
print('word :', word, 'label :', label)

word = torch.LongTensor([word2idx[i] for i in word])
out = nlp(word)
_, predict_label = torch.max(out, 1)   # index of the most probable character
predict_word = predict_label.item()
print(predict_word)
print(idx2word[predict_word])
print(word2idx)
```

![](https://i.imgur.com/pgEJceS.png)

## 3. References

* https://medium.com/pyladies-taiwan/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86%E5%85%A5%E9%96%80-word2vec%E5%B0%8F%E5%AF%A6%E4%BD%9C-f8832d9677c8
* https://www.pytorchtutorial.com/10-minute-pytorch-7/
* https://clay-atlas.com/blog/2019/11/26/nlp-word-embedding-%E7%AD%86%E8%A8%98/
* https://www.itread01.com/content/1541454604.html
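---

#### Appendix: one-hot encoding sketch (section 1)

As referenced in section 1, here is a minimal sketch of one-hot encoding for the five-word toy vocabulary above. It is plain Python written for illustration; the `one_hot` helper is hypothetical, not part of any library.

```python=
# Minimal one-hot encoding sketch for the toy vocabulary in section 1.
words = ['apple', 'banana', 'like', 'hello', 'hi']
w2i = {w: i for i, w in enumerate(words)}   # word -> index

def one_hot(word):
    vec = [0] * len(words)    # a zero vector as long as the vocabulary
    vec[w2i[word]] = 1        # set a 1 at the word's index
    return vec

print(one_hot('like'))   # [0, 0, 1, 0, 0]
```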
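#### Appendix: embedding sketch (section 1)

And a small sketch of what an embedding layer gives you compared with one-hot vectors, using PyTorch's `nn.Embedding`. The weights here are randomly initialized, so the similarity score is meaningless until the layer is trained (e.g. as part of the model in section 2); the word indices 0 and 1 are purely illustrative.

```python=
import torch
from torch import nn
import torch.nn.functional as F

# A 5-word vocabulary embedded into 3 dimensions (weights are random here;
# in practice they are learned during training, as in the model above).
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3)

apple = embedding(torch.LongTensor([0]))    # dense 1x3 vector,
banana = embedding(torch.LongTensor([1]))   # instead of a sparse 1x5 one
print(apple)
print(F.cosine_similarity(apple, banana))   # how close the two vectors are
```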