# HW4 Programming Tutorial

- This tutorial uses Keras; if you want to use TF or PyTorch instead, you're on your own (`д´)

## Part0: Load Data

```python=
import pandas as pd
import numpy as np
import re

def load_data(data_path, label_path):
    with open(data_path, 'r', encoding='utf-8') as f:
        readin = f.readlines()
        # Skip the header line and use a regular expression
        # to strip the leading index from each sentence
        sentences = [re.sub('^[0-9]+,', '', s) for s in readin[1:]]
    labels = pd.read_csv(label_path)['label']
    labels = np.array(labels)
    return sentences, labels
```

## Part1: Word Segmentation

```python=
import jieba

jieba.set_dictionary(path_to_dict)  # Change dictionary (optional)
sentences = [list(jieba.cut(s, cut_all=False)) for s in sentences]
```

- If you want to use the traditional Chinese dictionary, the TAs will provide the path_to_dict path during grading, so do not hard-code it!

## Part2: Train Word2Vec Model

```python=
from gensim.models import Word2Vec

# Train the Word2Vec model
emb_model = Word2Vec(sentences, size=emb_dim)
num_words = len(emb_model.wv.vocab) + 1  # +1 for OOV words

# Create the embedding matrix (for Keras)
emb_matrix = np.zeros((num_words, emb_dim), dtype=float)
for i in range(num_words - 1):
    v = emb_model.wv[emb_model.wv.index2word[i]]
    emb_matrix[i + 1] = v  # shift by 1 to reserve index 0 for OOV words
```

- emb_dim is the embedding dimension; you can set it yourself.
- Word2Vec takes many parameters, such as iter and min_count; feel free to experiment with different settings.
- The testing data can also be used to train Word2Vec!
- You can save the trained Word2Vec model and upload it to GitHub together with your code.
- OOV handling: you can either reserve one index for OOV words, or simply drop OOV words from each sentence.

## Part3: Tokenizing and Padding

```python=
from keras.preprocessing.sequence import pad_sequences

# Convert words to indices; map OOV words to the reserved index 0
train_sequences = []
for s in sentences:
    toks = [emb_model.wv.vocab[w].index + 1 if w in emb_model.wv.vocab else 0
            for w in s]  # +1 reserves index 0 for OOV words
    train_sequences.append(toks)

# Pad every sequence to the same length
train_sequences = pad_sequences(train_sequences, maxlen=max_length)

# Split off validation data (#TODO)
(X_train, Y_train), (X_val, Y_val) = split_data(train_sequences, Y_data, SPLIT_RATIO)
```

- Tokenize: convert every word to its corresponding index.
- Padding: pad every comment to the same length; anything beyond max_length is truncated, and shorter comments are padded with 0.
- Write the validation-split function yourself! (One possible sketch is given after Part4.)

## Part4: Build Model

```python=
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.optimizers import Adam

model = Sequential()
model.add(Embedding(num_words,
                    emb_dim,
                    weights=[emb_matrix],
                    input_length=max_length,
                    trainable=False))

#########################################################
# Design your own model!
model.add(LSTM(256, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Example: Simple Baseline Model
# - model.add(LSTM(100, recurrent_dropout=0.5))
# - model.add(Dense(1, activation='sigmoid'))
# Train for 1 epoch, that's it!
#########################################################

# Set up the optimizer and compile the model
adam = Adam(lr=0.001, decay=1e-6, clipvalue=0.5)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
```

- Feel free to try your own model designs (a.k.a. stacking Lego blocks).
- You can also try different architectures such as bidirectional RNNs, GRUs, multi-layer RNNs, or Attention (advanced); see the sketch below.
- Hint: the model used for the strong baseline can be built entirely from components that already appear in the sample code above!
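As an illustration of the architectures mentioned above, here is a minimal sketch of a two-layer bidirectional variant built only from standard Keras layers. The layer sizes and dropout rates are arbitrary placeholder choices, not a tuned configuration:

```python=
from keras.models import Sequential
from keras.layers import Dense, LSTM, Bidirectional
from keras.layers.embeddings import Embedding

# A sketch: two stacked bidirectional LSTM layers.
# return_sequences=True makes the first LSTM emit its full output
# sequence so that the second LSTM has a sequence to consume.
model = Sequential()
model.add(Embedding(num_words, emb_dim, weights=[emb_matrix],
                    input_length=max_length, trainable=False))
model.add(Bidirectional(LSTM(128, return_sequences=True,
                             dropout=0.5, recurrent_dropout=0.5)))
model.add(Bidirectional(LSTM(128, dropout=0.5, recurrent_dropout=0.5)))
model.add(Dense(1, activation='sigmoid'))
```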
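For the split_data helper marked #TODO in Part3, one possible minimal sketch is below; it shuffles first so the validation set is not just the tail of the file. The signature and the meaning of SPLIT_RATIO (the fraction kept for training) are assumptions inferred from how Part3 calls it:

```python=
import numpy as np

def split_data(X, Y, split_ratio):
    # Shuffle so the validation split is not biased by file order
    indices = np.random.permutation(len(X))
    X, Y = X[indices], Y[indices]
    # Keep the first split_ratio fraction of the data for training
    n_train = int(len(X) * split_ratio)
    return (X[:n_train], Y[:n_train]), (X[n_train:], Y[n_train:])
```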
## Part5: Callback Functions and Training

```python=
from keras.callbacks import CSVLogger
from keras.callbacks import ModelCheckpoint
from keras.callbacks import EarlyStopping

# Set up the callback functions
csv_logger = CSVLogger('training.log')
checkpoint = ModelCheckpoint(filepath='models-keras/best.h5',
                             verbose=1,
                             save_best_only=True,
                             monitor='val_acc',
                             mode='max')
earlystopping = EarlyStopping(monitor='val_acc',
                              patience=6,
                              verbose=1,
                              mode='max')

# Train the model
model.fit(X_train, Y_train,
          validation_data=(X_val, Y_val),
          epochs=epochs,
          batch_size=batch_size,
          callbacks=[earlystopping, checkpoint, csv_logger])
```

- Callback functions are hooks that Keras invokes during training; all of the callbacks used here are optional, and the model trains fine without them.
- CSVLogger: logs the training process to a file.
- EarlyStopping: stops training early once val_acc stops improving (i.e., the model starts to overfit).
- ModelCheckpoint: saves the best model seen so far.

## Part6: Prediction

```python=
pred = model.predict(test_sequences)  # That's all
```

- Each prediction is a number between 0 and 1.
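Since the sigmoid output is a probability, you still need to threshold it to get hard 0/1 labels for submission. Below is a minimal sketch; the 0.5 threshold and the id,label CSV layout are assumptions, so double-check the assignment's required submission format:

```python=
import numpy as np
import pandas as pd

# Threshold the sigmoid outputs at 0.5 to obtain hard 0/1 labels
labels = (pred.flatten() > 0.5).astype(int)

# Write an id,label CSV (layout is an assumption; check the spec)
out = pd.DataFrame({'id': np.arange(len(labels)), 'label': labels})
out.to_csv('prediction.csv', index=False)
```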