2021-02-23
===
###### tags: `preprocessing` `한글`
- 한글 형태소 분석기
Komoran 추천추천 강추천
- 임의 텍스트를 sequence 벡터로 표현
```python
!pip install konlpy
from keras.preprocessing.text import text_to_word_sequence
from konlpy.tag import Okt
import json
import re
def preprocessing(review):
okt = Okt()
stop_words = set(['은', '는', '이', '가', '하', '아', '것', '들','의', '있', '되', '수', '보', '주', '등', '한'])
review_text = re.sub("[^가-힣ㄱ-ㅎㅏ-ㅣ\\s]", "", review)
word_review = okt.morphs(review_text, stem=True)
word_review = [token for token in word_review if not token in stop_words]
return word_review
def to_sequence(text, word_index):
tokens = preprocessing(text)
print(tokens)
return [word_index.get(w) for w in tokens if w in word_index]
with open('/content/drive/MyDrive/SKT/Machine Learning Learning/data_in_4.2.2/data_configs.json') as json_file:
json_data = json.load(json_file)
word_index = json_data['vocab']
print(json_data['vocab'])
sequence = to_sequence('영화를 겁나 재밌게 봤어요', word_index)
print(sequence)
```
- Sequence를 Model로 predict
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
input = '영화 겁나 재미없다'
print('input: ' + input)
sequence = to_sequence(input, word_index)
sequence = [sequence]
model.predict(pad_sequences(sequence, 8))
```
- Result
```
input: 영화 겁나 재미없다
['영화', '겁나다', '재미없다']
array([[0.00208169]], dtype=float32)
0.00208169 -> 0(부정)과 가까운 값
```
```
input: 영화 겁나 재밌다
['영화', '겁나다', '재밌다']
array([[0.99173295]], dtype=float32)
0.99173295 -> 1(긍정)과 가까운 값
```