# AG News
* dataset: https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset
* goal: classification (four classes)

---

# **LSTM**
# implement
* ref: https://paperswithcode.com/paper/revisiting-lstm-networks-for-semi-supervised-1
* use pytorch
* clean data by `re.sub(r"[^a-z!@#$%&*':\"]", ' ', t)`
* remove the stopwords listed in `from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS`
* use a frequency-based vocab (a preprocessing sketch follows the model code below)
* optimizer is Adam with `betas=(0, 0.98)`
* loss function is CrossEntropyLoss
* truncate overly long sentences
* batch_size = 64
* num_epochs = 12
* hidden = 256
* vocab_size = 10000
* lr = 1e-4
```
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, num_labels, num_embeddings, hidden):
        super(Model, self).__init__()
        self.embed = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=hidden)
        # batch_first=True so the LSTM output is (batch, seq, 2*hidden),
        # matching the transpose before pooling in forward()
        self.lstm1 = nn.LSTM(input_size=hidden, hidden_size=hidden, num_layers=2,
                             dropout=0.4, bidirectional=True, batch_first=True)
        self.linear1 = nn.Linear(hidden * 2, 256)
        self.linear4 = nn.Linear(256, 64)
        self.linear6 = nn.Linear(64, num_labels)
        self.dropout = nn.Dropout(0.4)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(output_size=1)

    def forward(self, x):
        x = self.embed(x)                                    # (batch, seq, hidden)
        x, _ = self.lstm1(x)                                 # (batch, seq, 2*hidden)
        # max-pool over the sequence dimension
        x = self.pool(torch.transpose(x, 1, 2)).squeeze(-1)  # (batch, 2*hidden)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear4(x)
        x = self.dropout(x)
        x = self.linear6(x)
        return x
```
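The cleaning, stopword, and vocab bullets above leave a few details implicit. Here is a minimal preprocessing sketch under some assumptions: the text is lowercased before the regex (otherwise the regex would strip every uppercase letter), ids 0 and 1 are reserved for padding and unknown words, and `max_len` stands in for the unspecified truncation length. The helper names `clean`, `build_vocab`, and `encode` are mine, not from the original notes.
```
import re
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean(t):
    # lowercase (assumed), keep letters and a few symbols, then drop stopwords
    t = re.sub(r"[^a-z!@#$%&*':\"]", ' ', t.lower())
    return [w for w in t.split() if w not in ENGLISH_STOP_WORDS]

def build_vocab(texts, vocab_size=10000):
    # frequency-based vocab: keep the vocab_size most common words;
    # id 0 is reserved for padding, 1 for unknown words
    counter = Counter(w for t in texts for w in clean(t))
    return {w: i + 2 for i, (w, _) in enumerate(counter.most_common(vocab_size))}

def encode(t, vocab, max_len=64):
    # map words to ids, truncate long sentences, pad short ones with 0
    ids = [vocab.get(w, 1) for w in clean(t)][:max_len]
    return ids + [0] * (max_len - len(ids))
```
With these, a batch is just `torch.tensor([encode(t, vocab) for t in batch_texts])` fed to `Model`.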
# Result:
**model0:**
1. score:
accuracy: 91.5%
recall:
* label0: 90.3%
* label1: 97.3%
* label2: 88%
* label3: 90.5%
precision:
* label0: 93.4%
* label1: 96.1%
* label2: 88.6%
* label3: 88.1%
average softmax probability of the predicted class:
* when the prediction is correct: 94.2%
* when the prediction is wrong: 73.7%

Misclassification counts (correct predictions are omitted, so the diagonal is 0):

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 79 | 126 | 75 |
| 1 | 22 | 0 | 26 | 16 |
| 2 | 61 | 19 | 0 | 193 |
| 3 | 61 | 20 | 191 | 0 |

# Analyze:
**model0:**
1. **sequence length:**
* shows no relation to accuracy
![](https://hackmd.io/_uploads/SyxX6tGzT.png =80%x)
2. **vocab:**
* check the test data's words against the training data's vocab
* 9543 words of the training vocab appear in the test data
* the four classes have similar vocab sizes
![](https://hackmd.io/_uploads/BJw4jqfMT.png =80%x)
3. **word frequency:**
* on average, frequency shows no relation to accuracy, but higher-frequency words perform more stably
![](https://hackmd.io/_uploads/Hk01C9fG6.png)
label_0_words:
![](https://hackmd.io/_uploads/HkzyIsfM6.jpg =80%x)
label_1_words:
![](https://hackmd.io/_uploads/HkZeIizz6.jpg =80%x)
label_2_words:
![](https://hackmd.io/_uploads/SysxUoffp.jpg =80%x)
label_3_words:
![](https://hackmd.io/_uploads/r1ZZIoMMT.jpg =80%x)
4. **word variance:**
acc_variance:
* high-accuracy words usually have high variance
![](https://hackmd.io/_uploads/HJ1aGqDzT.jpg =80%x)
![](https://hackmd.io/_uploads/HJk6McDGp.jpg =80%x)
5. **word accuracy:**
* accuracy 100%: 5422 words
* accuracy > 91.5%: 6567 words
* accuracy > 50%: 9397 words
* accuracy > 0%: 9448 words
* accuracy 0%: 95 words
all_words:
![](https://hackmd.io/_uploads/H1K6kiGMa.png =80%x)

# modify:
**model0:**
* instead of selecting the top 10000 most frequent words, select the top 30000 most frequent words and keep the 10000 with the highest variance; accuracy increases to 92%

**model1:**
* tried pre-trained embeddings, fasttext (wiki-news-300d-1M) and glove (glove.6B.100d); neither increases accuracy
* with pre-trained embeddings, fine-tuning the embedding layer gives higher accuracy than freezing it

---

# **BERT**
# implement
* use pytorch
* clean data by `re.sub(r"[^a-z!@#$%&*':\"]", ' ', t)`
* optimizer is Adam (a training-step sketch follows the class definition below)
* loss function is CrossEntropyLoss
* sentences are truncated/padded to length 64
* batch_size = 64
* num_epochs = 5
* vocab_size = 30000
* lr = 2e-5
```
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BSC(BertPreTrainedModel):
    def __init__(self, config, num_labels):
        # BertPreTrainedModel's __init__ only takes the config
        super().__init__(config)
        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(0.1)
        self.linear1 = nn.Linear(config.hidden_size, 128)
        self.linear2 = nn.Linear(128, num_labels)
        self.relu = nn.ReLU()
        self.init_weights()

    def forward(self, x, m, t):
        # pooler_output: the [CLS] hidden state passed through a dense+tanh layer
        output = self.bert(input_ids=x, attention_mask=m, token_type_ids=t)["pooler_output"]
        logits = self.linear1(output)
        logits = self.relu(logits)
        logits = self.dropout(logits)
        logits = self.linear2(logits)
        return logits
```
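Wiring the classifier head above to the pretrained weights with the stated hyperparameters (Adam, lr = 2e-5, cross-entropy, length-64 sequences) could look like the sketch below. Two caveats: `num_labels` is passed positionally so that `from_pretrained` forwards it to `BSC.__init__`, and the tokenizer call shown here corresponds to the model1 fix described later (model0 split sentences on spaces instead). The tiny `texts`/`labels` batch is placeholder data.
```
import torch
from transformers import BertTokenizer

texts = ["wall st. bears claw back into the black", "oil prices soar to record highs"]
labels = torch.tensor([2, 2])  # placeholder labels (assuming class 2 = Business)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# extra positional args to from_pretrained are forwarded to __init__ after config
model = BSC.from_pretrained("bert-base-uncased", 4)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# one training step
batch = tokenizer(texts, padding="max_length", truncation=True,
                  max_length=64, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"], batch["token_type_ids"])
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```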
# Result:
**model0:**
1. score:
accuracy: 90.7%
recall:
* label0: 86.7%
* label1: 98.2%
* label2: 90%
* label3: 88.1%
precision:
* label0: 95.1%
* label1: 94.6%
* label2: 84.7%
* label3: 89.2%
average softmax probability of the predicted class:
* when the prediction is correct: 94.8%
* when the prediction is wrong: 77.7%

Misclassification counts (correct predictions are omitted, so the diagonal is 0):

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 81 | 128 | 71 |
| 1 | 21 | 0 | 31 | 12 |
| 2 | 62 | 20 | 0 | 191 |
| 3 | 58 | 20 | 194 | 0 |

# Analyze:
**model0:**
1. **sequence length:**
* sentences are already truncated to 64, so there is nothing to compare
2. **vocab:**
* bert-base-uncased's full vocabulary has 30522 tokens
* check the test data's words against the training data's vocab
* 17497 words of the training vocab are also in BERT's vocab
* 22504 words map to [UNK] in BERT's vocab
* 12016 test-data words appear in both the training vocab and BERT's vocab
* the analysis below only covers words in BERT's vocab
![](https://hackmd.io/_uploads/HkeuprcMa.png =80%x)
3. **word frequency:**
* on average, frequency shows no relation to accuracy, but higher-frequency words perform more stably
![](https://hackmd.io/_uploads/ByomEU9M6.png =80%x)
label_0_words:
![](https://hackmd.io/_uploads/rJQBSUcfT.png =80%x)
label_1_words:
![](https://hackmd.io/_uploads/BJqvSL9z6.png =80%x)
label_2_words:
![](https://hackmd.io/_uploads/SkDsBLqfp.png =80%x)
label_3_words:
![](https://hackmd.io/_uploads/S1F6H8qfT.png =80%x)
4. **word accuracy:**
* accuracy 100%: 7596 words
* accuracy > 90.7%: 8794 words
* accuracy > 50%: 11590 words
* accuracy > 0%: 11664 words
* accuracy 0%: 351 words
all_words:
![](https://hackmd.io/_uploads/S1LR6IcM6.png =80%x)

# modify:
**model0:**
* Q: why are so many words missing from BERT's vocab?
* A: BERT's tokenizer was not used; sentences were split on spaces

**model1:**
* use BERT's tokenizer to split sentences instead of splitting on spaces (" ")
* accuracy increases to 91.5% (from 90.7%)

---

# LSTM+BERT
* when the two models agree, accuracy is 94%
* when they disagree, always choosing the LSTM gives 52.6% accuracy; always choosing BERT gives 39.7%
* both models are wrong on 6.1% of the test data
* choosing, per test example, the prediction with the higher softmax probability gives 91.8% accuracy (see the sketch after the tables below)

same prediction but wrong:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 38 | 67 | 40 |
| 1 | 2 | 0 | 9 | 3 |
| 2 | 30 | 7 | 0 | 99 |
| 3 | 25 | 12 | 93 | 0 |

different predictions and both wrong:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 4 | 10 | 12 |
| 1 | 3 | 0 | 7 | 6 |
| 2 | 5 | 7 | 0 | 8 |
| 3 | 5 | 2 | 3 | 0 |

different predictions and LSTM correct:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 36 | 42 | 17 |
| 1 | 0 | 0 | 12 | 1 |
| 2 | 7 | 3 | 0 | 34 |
| 3 | 16 | 3 | 73 | 0 |

different predictions and BERT correct:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 4 | 13 | 10 |
| 1 | 18 | 0 | 4 | 7 |
| 2 | 22 | 6 | 0 | 55 |
| 3 | 16 | 4 | 25 | 0 |
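The confidence-based pick in the last bullet of the list above is a few lines of tensor logic. A minimal sketch, assuming `lstm_logits` and `bert_logits` are (N, 4) logit tensors over the same test set (the function and variable names are mine):
```
import torch
import torch.nn.functional as F

def pick_more_confident(lstm_logits, bert_logits):
    # softmax each model's logits, compare each model's top probability,
    # and keep the prediction of whichever model is more confident
    lstm_conf, lstm_pred = F.softmax(lstm_logits, dim=-1).max(dim=-1)
    bert_conf, bert_pred = F.softmax(bert_logits, dim=-1).max(dim=-1)
    return torch.where(lstm_conf >= bert_conf, lstm_pred, bert_pred)

# usage: accuracy of the combined prediction
# preds = pick_more_confident(lstm_logits, bert_logits)
# acc = (preds == labels).float().mean()  # reported above as 91.8%
```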