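# KoBERT + Sphere board classification

### 1. Extracting data from Elastic Cloud

I could not find a way to export every document from Kibana on Elastic Cloud, so I used the Elasticsearch API to extract all of the board's posts in `.json` format.

![](https://i.imgur.com/SrRIvAF.png)

Following [this guide](https://apipheny.io/import-json-google-sheets/), I converted the JSON to Excel and saved it as a Google Sheet so it would be easy to work with in Colab.

### 2. Labeling the data

Labeling roughly 800 posts by category took quite a lot of time. Partway through I also had to revise the categories more than once, which delayed things further.

![](https://i.imgur.com/OfI6aVP.png)

I suspected there were too many label types for the amount of data, but for the first attempt I defined the labels as follows.

```
소통 (communication): community-style posts that need no reply from the operator
후기 (review): event reviews
공지 (notice): operator announcements
소음 (noise): noise-related inquiries
네트워크 (network): internet connection inquiries
시설 (facilities): facility-related VoC (e.g. the monitor does not work, the restroom door is broken, it is too hot)
예약 (reservation): meeting room and seat reservations (including wall pad, kiosk, mobile)
이벤트 (event): event-related inquiries
주차 (parking): parking lot inquiries
분실물 (lost and found): lost-item inquiries
기타건의 (other): other suggestions or inquiries
삭제 (delete): test posts (excluded from the training data)
```

### 3. Building the dataset

I converted the Google Sheet into a DataFrame, kept only the columns I needed, and removed the test posts.

```python
# Connect to Google Sheets
from google.colab import auth
import gspread
from google.auth import default

auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Convert a Google Sheet into a DataFrame
import pandas as pd

def gsheet_to_dataframe(gsheet_name):
    worksheet = gc.open(gsheet_name).sheet1
    rows = worksheet.get_all_values()
    # The first row of the sheet holds the column names
    df = pd.DataFrame(rows[1:], columns=rows[0])
    df = df.filter(items=["title", "content", "category"])
    # Drop the test posts labeled "삭제" (delete)
    df = df[df.category != "삭제"]
    return df
```

![](https://i.imgur.com/pxKgwFx.png)

I loaded KoBERT's vocab list and converted the data into a `BERTDataset`. The first 500 posts became the training data and the rest became the test data.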
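Taking the first 500 rows as the training set assumes the sheet is in no particular order. If the posts happen to be sorted by date or category, some labels could end up rare or missing in one of the two sets. Below is a minimal sketch of a shuffled, per-label split in plain Python. The helper `stratified_split`, its parameters, and the toy rows are hypothetical and are not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_idx, train_fraction, seed=42):
    """Shuffle rows and split them so each label keeps roughly the same train share."""
    rng = random.Random(seed)

    # Group the rows by their label value
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_idx]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        # Keep at least one row per label in the training set
        cut = max(1, round(len(label_rows) * train_fraction))
        train.extend(label_rows[:cut])
        test.extend(label_rows[cut:])

    # Mix the labels again so batches are not grouped by category
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy rows in the same (title, content, category, sentence, label) layout as the DataFrame
categories = ["주차", "소음"]
toy_rows = [("t", "c", categories[i % 2], f"sentence {i}", i % 2) for i in range(8)]

train_rows, test_rows = stratified_split(toy_rows, label_idx=4, train_fraction=0.75)
print(len(train_rows), len(test_rows))
```

With the real data, something like `stratified_split(df.values.tolist(), label_idx=4, train_fraction=...)` could stand in for the two slices, with a fraction of your choosing. The post itself keeps the simple slice, and the dataset wrapper is defined next.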
```python
import numpy as np
import gluonnlp as nlp
from torch.utils.data import Dataset

class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer, max_len, pad, pair):
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len, pad=pad, pair=pair)

        # Tokenize every sentence up front and keep the labels as int32
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        # (token_ids, valid_length, segment_ids, label)
        return (self.sentences[i] + (self.labels[i], ))

    def __len__(self):
        return (len(self.labels))
```

I concatenated `title` and `content` into a `sentence` field, and converted the `category` column to integers with `factorize` to define a `label` field.

```python
# Import paths as used in the KoBERT examples; they may differ between KoBERT versions
from kobert.pytorch_kobert import get_pytorch_kobert_model
from kobert.utils import get_tokenizer

df['sentence'] = df['title'] + ' ' + df['content']
df['label'] = pd.factorize(df['category'])[0]

dataset_train = df.values[:500]
dataset_test = df.values[500:]

# get_pytorch_kobert_model() returns the BERT model together with the vocab
bertmodel, vocab = get_pytorch_kobert_model()
tokenizer = get_tokenizer()
tok = nlp.data.BERTSPTokenizer(tokenizer, vocab, lower=False)

# Build the datasets with the BERTDataset class.
# Column index 3 holds the sentence and column index 4 holds the label.
# max_len must be defined beforehand; the post does not show its value.
data_train = BERTDataset(dataset_train, 3, 4, tok, max_len, True, False)
data_test = BERTDataset(dataset_test, 3, 4, tok, max_len, True, False)
```

![](https://i.imgur.com/a08u2ls.png)

### 4. Training the model

The BERT classifier that SKTBrain provides by default:

```python
import torch
from torch import nn

class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size=768,
                 num_classes=11,  # 시설, 소통, 공지, 소음, 분실물, 네트워크, 예약, 이벤트, 기타건의, 주차, 후기
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate

        self.classifier = nn.Linear(hidden_size, num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)

    def gen_attention_mask(self, token_ids, valid_length):
        # Mark the first valid_length tokens of each row as real tokens
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)

        _, pooler = self.bert(input_ids=token_ids,
                              token_type_ids=segment_ids.long(),
                              attention_mask=attention_mask.float().to(token_ids.device))
        # Use the pooled output directly when dropout is disabled;
        # otherwise `out` would be undefined for dr_rate=None
        out = self.dropout(pooler) if self.dr_rate else pooler
        return self.classifier(out)
```

```python
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = BERTClassifier(bertmodel, dr_rate=0.5).to(device)
```

Optimizer and scheduler:

```python
# AdamW imported as in the KoBERT examples; newer transformers releases
# may require torch.optim.AdamW instead
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup

# learning_rate, num_epochs, warmup_ratio and train_dataloader must be
# defined beforehand; the post does not show them.

# Do not apply weight decay to biases and LayerNorm weights
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)

def calc_accuracy(X, Y):
    # Fraction of rows whose highest-scoring class matches the label
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy() / max_indices.size()[0]
    return train_acc
```

When a GPU is available, append `.to(device)` to every tensor in the script below.
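Before the loop, it may help to see what the warmup-plus-cosine schedule configured above does to the learning rate. The sketch below is a plain-Python reimplementation of the multiplier as I understand `get_cosine_schedule_with_warmup` to apply it with its default half cycle. It is for illustration only and is not the library's code. The values of `num_epochs`, `warmup_ratio`, and `batches_per_epoch` are hypothetical placeholders, since the post does not state the settings it used.

```python
import math

# Hypothetical placeholder values; the post does not state its actual settings
num_epochs = 5
warmup_ratio = 0.1
batches_per_epoch = 8  # stands in for len(train_dataloader)

# Same arithmetic as in the optimizer/scheduler block
t_total = batches_per_epoch * num_epochs
warmup_step = int(t_total * warmup_ratio)

def lr_multiplier(step, warmup_steps, total_steps):
    """Linear warmup from 0 to 1, then half-cosine decay from 1 back to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# The effective learning rate at a step is learning_rate * lr_multiplier(step, ...)
for step in (0, warmup_step // 2, warmup_step, t_total // 2, t_total):
    print(step, round(lr_multiplier(step, warmup_step, t_total), 3))
```

The multiplier starts at zero, reaches one at the end of warmup, and decays to zero by the last step, which is why `scheduler.step()` has to be called once per batch rather than once per epoch. The training loop itself follows.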
```python
from tqdm import tqdm_notebook

# max_grad_norm and log_interval must be defined beforehand; the post does not show them.
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0

    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long()      # .to(device)
        segment_ids = segment_ids.long()  # .to(device)
        label = label.long()              # .to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))

    model.eval()
    with torch.no_grad():  # gradients are not needed during evaluation
        for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
            token_ids = token_ids.long()      # .to(device)
            segment_ids = segment_ids.long()  # .to(device)
            label = label.long()              # .to(device)
            out = model(token_ids, valid_length, segment_ids)
            test_acc += calc_accuracy(out, label)
    print("epoch {} validation acc {}".format(e+1, test_acc / (batch_id+1)))
```

I had used up my GPU quota on Colab, so I ran this on the CPU, and it takes forever.

![](https://i.imgur.com/yrBAw73.png)
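The loop reports a single accuracy figure per epoch. With 11 labels and only a few hundred posts, that figure can look acceptable while the rarer categories are seldom predicted correctly, which ties back to the earlier worry about having too many label types. Below is a minimal sketch of per-label accuracy in plain Python. The helper `per_label_accuracy` and the toy label ids are hypothetical. In the real loop the predicted ids would come from `torch.max(out, 1)` and the true ids from `label`, collected over all test batches.

```python
from collections import Counter

def per_label_accuracy(true_labels, predicted_labels):
    """Return {label: fraction of that label's posts predicted correctly}."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    # Counter returns 0 for labels that were never predicted correctly
    return {label: correct[label] / count for label, count in total.items()}

# Toy label ids standing in for the factorized category column
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]

scores = per_label_accuracy(y_true, y_pred)
for label, acc in sorted(scores.items()):
    print(label, round(acc, 2))
```

A label whose score stays near zero across epochs is a candidate for merging into a neighboring category or for collecting more examples.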