# Toxicity Classification ## Preprocessing We try to convert all sentecne to lower case in LSTM model, but normal case gets highly score in LB than lower case. Preprocessing as follows: * all http(url) were substituted to url * all emoji substitute to ' ' * using flashtext to find mispell words and replace to true words * all emoji were substituted by ' ' * \n\t were substituted by ' ' * \s{2,} were substituted by ' ' ## Get statistics features We try to get some statistics feature and put in LSTM model training We get statistics feature as follows: * swear word * upper word * uniqen word * emoji * characters ## Embedding We used pretrainned word embedding as follows: * Fasttext * Glove In our works, fasttext is a little bit better than Glove. We didn't concatenate fasttext and Glove due to time consuming. (However, in the nearly end of the competition, everyone used BERT model haha.) ## Model ### LSTM model Our lstm model are different with public version, it comsisted of lstm cells without gru cells. * Attention didn't improve LB significantly. * Spatial Dropout had improvement in LB. * blending of three models, each lstm model got the LB 0.935x~0.938x. After blended three models, we got the LB 0.93963. ### BERT model We used pretrained bert model from: [pytorch-pretrained-BERT](https://github.com/huggingface/pytorch-pretrained-BERT) and `BertForSequenceClassification` for sequence classification. * The result with text preprocessing or without text preprocessing are similar. * The batch size from 16 to 32 improve the LB. I think the batch size significantly influnces acurracy * The learning rate we set is `2e-5` * We got the single model with LB 0.9415x~0.94220 * We ensemble five single BERT model and got LB 0.94294 ### GPT2 model * We only got LB around 0.938 in single GPT2. Therefore we focused on training BERT models. ### Ensemble model * We used ensemble of 3 LSTM models and ensemble of 5 BERT models and blended them with weights 0.3 and 0.7 respectively. * In the end, we got LB 0.9443