# AG News
* dataset: https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset
* goal: classification (four classes)
---
# **LSTM**
# implement
* ref: https://paperswithcode.com/paper/revisiting-lstm-networks-for-semi-supervised-1
* use pytorch
* clean data by `re.sub(r"[^a-z!@#$%&*':\"]", ' ', t)`
* remove stopwords in `from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS`
* use a word-frequency vocabulary (see the preprocessing sketch below)
* optimizer is Adam with betas=(0, 0.98)
* loss function is cross-entropy loss (`nn.CrossEntropyLoss`)
* long sentences are truncated
* batch_size = 64
* num_epochs = 12
* hidden = 256
* vocab_size = 10000
* lr = 1e-4
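A minimal preprocessing sketch following the points above. The regex, the sklearn stop-word list, and the frequency-based vocabulary come from the notes; lower-casing before the regex, the helper names, the max length, and the reserved pad/unk ids are my own assumptions.
```
import re
from collections import Counter

import torch
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

VOCAB_SIZE = 10000
MAX_LEN = 128      # assumption: the notes only say long sentences are truncated
PAD, UNK = 0, 1    # assumption: reserved ids for padding / out-of-vocabulary words

def clean(text):
    # lower-case, keep only a-z and a few special characters, drop stop words
    text = re.sub(r"[^a-z!@#$%&*':\"]", ' ', text.lower())
    return [w for w in text.split() if w not in ENGLISH_STOP_WORDS]

def build_vocab(train_texts):
    # word-frequency vocabulary: the VOCAB_SIZE most common training words
    freq = Counter(w for t in train_texts for w in clean(t))
    return {w: i + 2 for i, (w, _) in enumerate(freq.most_common(VOCAB_SIZE))}

def encode(text, vocab):
    ids = [vocab.get(w, UNK) for w in clean(text)][:MAX_LEN]  # truncate long sentences
    return torch.tensor(ids + [PAD] * (MAX_LEN - len(ids)))   # pad short ones
```
The model definition from the notes follows.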
```
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, num_labels, num_embeddings, hidden):
        super(Model, self).__init__()
        self.embed = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=hidden)
        # 2-layer bidirectional LSTM; batch_first=True so inputs are (batch, seq, feature)
        self.lstm1 = nn.LSTM(input_size=hidden, hidden_size=hidden, num_layers=2,
                             dropout=0.4, bidirectional=True, batch_first=True)
        self.linear1 = nn.Linear(hidden * 2, 256)  # *2 for the two LSTM directions
        self.linear4 = nn.Linear(256, 64)
        self.linear6 = nn.Linear(64, num_labels)
        self.dropout = nn.Dropout(0.4)
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveMaxPool1d(output_size=1)  # max-pool over the time dimension

    def forward(self, x):
        x = self.embed(x)                                    # (batch, seq, hidden)
        x, _ = self.lstm1(x)                                 # (batch, seq, hidden*2)
        x = self.pool(torch.transpose(x, 1, 2)).squeeze(-1)  # (batch, hidden*2)
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear4(x)
        x = self.dropout(x)
        x = self.linear6(x)
        return x                                             # class logits
```
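A minimal training-loop sketch wiring the model to the hyperparameters above. The dataset tensors here are random placeholders standing in for the encoded AG News data, and the `10000 + 2` embedding count assumes the reserved pad/unk ids from the preprocessing sketch.
```
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# placeholders: replace with the encoded training sentences and their labels (0..3)
train_x = torch.randint(0, 10002, (1024, 128))
train_y = torch.randint(0, 4, (1024,))
loader = DataLoader(TensorDataset(train_x, train_y), batch_size=64, shuffle=True)

model = Model(num_labels=4, num_embeddings=10002, hidden=256).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0, 0.98))
criterion = nn.CrossEntropyLoss()

for epoch in range(12):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```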
# Result:
**model0:**
1. score:
   * accuracy: 91.5%
   * recall:
     * label0: 90.3%
     * label1: 97.3%
     * label2: 88%
     * label3: 90.5%
   * precision:
     * label0: 93.4%
     * label1: 96.1%
     * label2: 88.6%
     * label3: 88.1%
   * average softmax probability of the predicted class:
     * when the prediction is correct: 94.2%
     * when the prediction is wrong: 73.7%

Misclassification counts (the diagonal of correct predictions is left out):

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 79 | 126 | 75 |
| 1 | 22 | 0 | 26 | 16 |
| 2 | 61 | 19 | 0 | 193 |
| 3 | 61 | 20 | 191 | 0 |
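The scores above can be computed from the test-set logits; a short sketch, assuming `logits` (N×4) and `y_true` (N,) are available from the trained model:
```
import torch

def report(logits, y_true, num_labels=4):
    probs = torch.softmax(logits, dim=1)
    conf, y_pred = probs.max(dim=1)          # softmax probability of the predicted class
    correct = y_pred == y_true
    print("accuracy:", correct.float().mean().item())
    for c in range(num_labels):
        tp = ((y_pred == c) & (y_true == c)).sum().item()
        print(f"label{c}: recall={tp / max((y_true == c).sum().item(), 1):.3f} "
              f"precision={tp / max((y_pred == c).sum().item(), 1):.3f}")
    print("mean probability when correct:", conf[correct].mean().item())
    print("mean probability when wrong:", conf[~correct].mean().item())
```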
# Analyze:
**model0:**
1. **sequence length:** it shows no clear relation to accuracy

2. **vocab:**
* check the test-set words against the training vocabulary
* 9543 words of the training vocabulary also appear in the test data
* the four classes have similar vocabulary sizes

3. **word frequency:**
* on average, word frequency shows no relation to accuracy, but higher-frequency words give more stable performance

(figures: label_0_words, label_1_words, label_2_words, label_3_words)

4. **word variance:**
* high-accuracy words usually also have high variance

(figure: acc_variance)


5. **word accuracy:** (see the sketch below)
* accuracy 100%: 5422 words
* accuracy > 91.5% (the overall accuracy): 6567 words
* accuracy > 50%: 9397 words
* accuracy > 0%: 9448 words
* accuracy 0%: 95 words
(figure: all_words)
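A sketch of how the per-word accuracy above can be tallied (assuming `clean` from the preprocessing sketch and per-example predictions): each word is credited with the outcome of every test sentence it occurs in.
```
from collections import defaultdict

def word_accuracy(test_texts, y_true, y_pred):
    hits, totals = defaultdict(int), defaultdict(int)
    for text, t, p in zip(test_texts, y_true, y_pred):
        for w in set(clean(text)):        # count each word once per sentence
            totals[w] += 1
            hits[w] += int(t == p)
    return {w: hits[w] / totals[w] for w in totals}

# e.g. the words the model never classifies correctly:
# never_right = [w for w, acc in word_accuracy(test_texts, y_true, y_pred).items() if acc == 0]
```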

# modify:
**model0:**
* vocabulary selection: instead of simply taking the top 10000 most frequent words, take the top 30000 most frequent words and keep the 10000 with the highest variance; accuracy increases to 92% (see the sketch after this list)
**model1:**
* use pre-trained embeddings, fastText (wiki-news-300d-1M) and GloVe (glove.6B.100d); neither of them improves accuracy
* when using pre-trained embeddings, leaving the embedding layer unfrozen gives higher accuracy than freezing it
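A sketch of the model0 vocabulary change, assuming "variance" means the variance of a word's relative frequency across the four classes (the notes do not define it), followed by the model1 option of loading pre-trained vectors without freezing them:
```
import numpy as np
from collections import Counter

def select_vocab(train_texts, train_labels, top_freq=30000, top_var=10000):
    # step 1: the 30000 most frequent words
    freq = Counter(w for t in train_texts for w in clean(t))
    candidates = [w for w, _ in freq.most_common(top_freq)]
    # step 2: keep the 10000 with the highest variance of per-class relative frequency
    class_counts = [Counter() for _ in range(4)]
    for text, label in zip(train_texts, train_labels):
        class_counts[label].update(clean(text))
    totals = [max(sum(c.values()), 1) for c in class_counts]
    variance = {w: np.var([class_counts[c][w] / totals[c] for c in range(4)])
                for w in candidates}
    keep = sorted(candidates, key=variance.get, reverse=True)[:top_var]
    return {w: i + 2 for i, w in enumerate(keep)}

# model1: pre-trained embeddings (glove.6B.100d / wiki-news-300d-1M), left unfrozen
# emb_matrix: FloatTensor [num_embeddings, embedding_dim] built from the vector file
# model.embed = nn.Embedding.from_pretrained(emb_matrix, freeze=False)
```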
---
# **BERT**
# implement
* use pytorch
* clean data by `re.sub(r"[^a-z!@#$%&*':\"]", ' ', t)`
* optimizer is Adam
* loss function is cross-entropy loss (`nn.CrossEntropyLoss`)
* sequences are truncated/padded to length 64
* batch_size = 64
* num_epochs = 5
* vocab_size = 30000
* lr = 2e-5
```
import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BSC(BertPreTrainedModel):
    def __init__(self, config, num_labels):
        super().__init__(config)
        self.num_labels = num_labels
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(0.1)
        self.linear1 = nn.Linear(config.hidden_size, 128)
        self.linear2 = nn.Linear(128, num_labels)
        self.relu = nn.ReLU()

    def forward(self, x, m, t):
        # x: input_ids, m: attention_mask, t: token_type_ids
        output = self.bert(input_ids=x, attention_mask=m, token_type_ids=t)["pooler_output"]
        logits = self.linear1(output)
        logits = self.relu(logits)
        logits = self.dropout(logits)
        logits = self.linear2(logits)
        return logits
```
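A minimal fine-tuning sketch for the classifier above under the listed hyperparameters. How the batches of input ids, attention masks, and token type ids are produced is assumed here (the modify section below shows the tokenizer-based encoding); loading the pre-trained encoder by replacing `model.bert` is one way to do it, not necessarily the author's.
```
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

device = "cuda" if torch.cuda.is_available() else "cpu"

config = BertConfig.from_pretrained("bert-base-uncased")
model = BSC(config, num_labels=4)
model.bert = BertModel.from_pretrained("bert-base-uncased")  # swap in the pre-trained encoder
model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    # loader assumed to yield (input_ids, attention_mask, token_type_ids, labels) batches of 64
    for input_ids, attention_mask, token_type_ids, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids.to(device), attention_mask.to(device), token_type_ids.to(device))
        loss = criterion(logits, labels.to(device))
        loss.backward()
        optimizer.step()
```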
# Result:
**model0:**
1. score:
   * accuracy: 90.7%
   * recall:
     * label0: 86.7%
     * label1: 98.2%
     * label2: 90%
     * label3: 88.1%
   * precision:
     * label0: 95.1%
     * label1: 94.6%
     * label2: 84.7%
     * label3: 89.2%
   * average softmax probability of the predicted class:
     * when the prediction is correct: 94.8%
     * when the prediction is wrong: 77.7%

Misclassification counts (the diagonal of correct predictions is left out):

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 81 | 128 | 71 |
| 1 | 21 | 0 | 31 | 12 |
| 2 | 62 | 20 | 0 | 191 |
| 3 | 58 | 20 | 194 | 0 |
# Analyze:
**model0:**
1. **sequence length:**
* already truncated to 64 tokens
2. **vocab:**
* bert-base-uncased's full vocabulary has 30522 entries
* check the test-set words against the training data and BERT's vocabulary
* 17497 words of the training-data vocabulary are also in BERT's vocabulary
* 22504 training-data words are not in BERT's vocabulary (they become [UNK])
* 12016 test-set words are in both the training-data vocabulary and BERT's vocabulary
* only words in BERT's vocabulary are considered below (see the sketch after this list)
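A sketch of the check above, assuming the regex-cleaned whitespace tokens are looked up directly in the WordPiece vocabulary of `bert-base-uncased` (variable names and `train_texts`/`test_texts` are mine):
```
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_vocab = set(tokenizer.vocab)                              # 30522 entries

train_words = {w for t in train_texts for w in clean(t)}
test_words = {w for t in test_texts for w in clean(t)}

known = train_words & bert_vocab        # training words covered by BERT's vocabulary
unknown = train_words - bert_vocab      # training words that would become [UNK]
shared = test_words & train_words & bert_vocab
print(len(known), len(unknown), len(shared))
```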

3. **word frequency:**
* on average, word frequency shows no relation to accuracy, but higher-frequency words give more stable performance

(figures: label_0_words, label_1_words, label_2_words, label_3_words)

4. **word accuracy:**
* accuracy 100%: 7596 words
* accuracy > 90.7% (the overall accuracy): 8794 words
* accuracy > 50%: 11590 words
* accuracy > 0%: 11664 words
* accuracy 0%: 351 words
(figure: all_words)

# modify:
**model0:**
* Q: why are so many words missing from BERT's vocabulary?
* A: I did not use BERT's tokenizer
**model1:**
* use BERT's tokenizer to split the sentences instead of splitting on spaces (" ") (see the sketch below)
* accuracy increases to 91.5% (from 90.7%)
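A sketch of the model1 change: encoding with BERT's WordPiece tokenizer instead of splitting on spaces, which also yields the attention mask and token type ids that `forward(x, m, t)` expects. The `sentences` list is assumed available.
```
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

enc = tokenizer(
    sentences,                 # list of raw (or regex-cleaned) strings
    padding="max_length",
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
logits = model(enc["input_ids"], enc["attention_mask"], enc["token_type_ids"])
```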
---
# LSTM+BERT
* when the two models make the same prediction, accuracy is 94%
* when they disagree, always taking the LSTM prediction gives 52.6% accuracy; always taking the BERT prediction gives 39.7%
* both models are wrong on 6.1% of the test data
* choosing, for each test example, the prediction with the higher softmax probability gives 91.8% accuracy (see the sketch below)
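A sketch of the confidence-based combination from the last bullet above: for each test example, keep whichever model assigns the higher softmax probability to its own prediction.
```
import torch

def combine(lstm_logits, bert_logits):
    conf_l, pred_l = torch.softmax(lstm_logits, dim=1).max(dim=1)
    conf_b, pred_b = torch.softmax(bert_logits, dim=1).max(dim=1)
    return torch.where(conf_l >= conf_b, pred_l, pred_b)

# accuracy of the combined prediction, given the test labels y_true:
# acc = (combine(lstm_logits, bert_logits) == y_true).float().mean()
```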
Same prediction, both wrong:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 38 | 67 | 40 |
| 1 | 2 | 0 | 9 | 3 |
| 2 | 30 | 7 | 0 | 99 |
| 3 | 25 | 12 | 93 | 0 |
Different predictions, both wrong:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 4 | 10 | 12 |
| 1 | 3 | 0 | 7 | 6 |
| 2 | 5 | 7 | 0 | 8 |
| 3 | 5 | 2 | 3 | 0 |
Different predictions, LSTM correct:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 36 | 42 | 17 |
| 1 | 0 | 0 | 12 | 1 |
| 2 | 7 | 3 | 0 | 34 |
| 3 | 16 | 3 | 73 | 0 |
Different predictions, BERT correct:

| true/pred | 0 | 1 | 2 | 3 |
| --------- | --- | --- | --- | --- |
| 0 | 0 | 4 | 13 | 10 |
| 1 | 18 | 0 | 4 | 7 |
| 2 | 22 | 6 | 0 | 55 |
| 3 | 16 | 4 | 25 | 0 |