2021-01-26 - HackMD

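2021-01-26
===

###### tags: `Sequential API` `KNN` `K-Means`

#### word vector
https://shuuki4.wordpress.com/2016/01/27/word2vec-%EA%B4%80%EB%A0%A8-%EC%9D%B4%EB%A1%A0-%EC%A0%95%EB%A6%AC/

#### batch size vs number of epochs
- batch size: the number of data samples fed to the model in each batch.
- number of epochs: the number of times the entire dataset has been trained on.
- iteration: one training step on a single batch.

So, training on 100 samples with batch_size 50 for 100 epochs gives 100 / 50 = 2 iterations per epoch, for 200 iterations in total.

#### Sequential API
`tf.keras.Sequential` builds a model with Keras by stacking layers in order.

``` python
import tensorflow as tf
from tensorflow.keras import layers

# vocab_size, emb_size, hidden_dimension and output_dimension are defined elsewhere.
model = tf.keras.Sequential([
    # Each sample is a sequence of 4 token indices.
    layers.Embedding(vocab_size, emb_size, input_length=4),
    # Average the 4 token embeddings into one vector per sample.
    layers.Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    layers.Dense(hidden_dimension, activation='relu'),
    layers.Dense(output_dimension, activation='sigmoid')])

model.compile(optimizer=tf.keras.optimizers.Adam(0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
```

* To inspect the learned embedding values, call `.get_weights()` on layer 0. It returns a list whose first element is the `(vocab_size, emb_size)` embedding matrix.
`model.layers[0].get_weights()`
* To look up the embedding of a specific word, index that matrix by the word's index (here, index 4).
`model.layers[0].get_weights()[0][4]`

:question: Questions
#### 1. Why is the Embedding layer's input dimension set larger than the vocabulary size?
* Why `vocab_size` is len + 1: a slot for the OoV token (a token not in the vocabulary)? The error below suggests another likely reason: the word indices run from 1 to len (index 0 is typically reserved for padding, as with the Keras `Tokenizer`), so the largest index only fits when the input dimension is len + 1.
* Without the + 1, an error is raised:
`InvalidArgumentError: indices[1,3] = 20 is not in [0, 20)`

#### Purpose of pooling
1. Reduces the input size: convolution layers are applied repeatedly, and there is no need to analyze every value, including the ones that matter little. It is more reasonable to extract only the salient features and learn from those.
2. Controls overfitting: a smaller input means fewer unnecessary parameters, which reduces overfitting, i.e. performing well only on the training data.
3. Extracts features well: after pooling, particular shapes can be recognized more reliably.

Source: https://supermemi.tistory.com/16 [SuperMemi's Study]

### KNN classifier (supervised learning)
- data: the training data already comes with a group value (label) for each point.
- train: the data points are placed in the coordinate space.
- predict: for a test point, find the K nearest training points and assign the label that occurs most often among them.

### K-Means clustering (unsupervised learning)
- data: the training data has no labels.
- train: the data is divided into K clusters, each with a centroid. Each pass over the training data assigns every point to its nearest centroid and then updates the centroids. Once all training data has been processed, the final centroids define the clusters.
- predict: a test input is assigned to the cluster whose centroid is nearest.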
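
The KNN and K-Means descriptions above can be sketched in plain Python to make the difference concrete. This is a minimal sketch, not a reference implementation: the function names (`knn_predict`, `kmeans_fit`, `nearest_centroid`), the toy data, the use of Euclidean distance, the fixed number of iterations, and the hand-picked initial centroids are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def nearest_centroid(centroids, point):
    """Return the index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda c: math.dist(centroids[c], point))


def kmeans_fit(points, init_centroids, n_iters=10):
    """K-Means sketch: alternate an assignment step and a centroid update."""
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(n_iters):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(centroids, p)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


# Toy 2-D data: two well-separated groups (made up for illustration).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels = ["a", "a", "a", "b", "b", "b"]

# KNN uses the labels: vote among the 3 nearest labeled points.
knn_label = knn_predict(points, labels, query=(8.0, 8.0), k=3)

# K-Means ignores the labels: it only learns centroids, then assigns
# a new point to the nearest one.
centroids = kmeans_fit(points, init_centroids=[(0.0, 0.0), (10.0, 10.0)])
cluster_id = nearest_centroid(centroids, (8.0, 8.0))
```

Note that `knn_predict` returns one of the original labels, while `nearest_centroid` returns only a cluster index, since K-Means never sees labels.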