---
tags: AML
title: Advanced Machine Learning Homework 1
---

**Due Date:** 15.04.2021 09:10

**Submission Format:** GitHub repository link and report (PDF). [More about writing reports](https://hackmd.io/s/H1PMURS37). [Report example](https://www.dropbox.com/s/aykdpgxx8v4ql2s/HW_9_Report.pdf?dl=0).

**Data:** [Train and Test](https://github.com/Gci04/AML-DS-2021/tree/main/data)

**Acknowledgements:** Vitaly Romanov

## Classification Task Description

In this assignment, you are going to solve the task of name gender classification. The data has the following format:

```
Name,Gender
Terrone,M
Annaley,F
```

The goal is to classify the gender of a name using a character-level model. The data is provided in the form of [train](https://github.com/Gci04/AML-DS-2021/tree/main/data) and [test](https://github.com/Gci04/AML-DS-2021/tree/main/data) splits. Use accuracy to evaluate model performance.

:::info
You can also perform additional tests on the Russian names dataset. The Russian data is available in the same data folder.
:::

## Data Preprocessing

The data is stored in CSV format and easily fits into memory. We suggest reading the CSV file with pandas and then converting the DataFrame to a numpy array or a tensor.

The dataset is a collection of strings of variable length, and the labels for the training samples are also strings. This format is not very friendly for learning algorithms, so below we discuss how to preprocess the data before passing it to the training algorithm.

### Machine Readable

The simplest way to convert the string representation into a machine-readable format is to substitute each character with a unique integer identifier. This is easily achieved by creating a character vocabulary. Assume you have read the CSV and converted the data into numpy's `ndarray`:

```python
# data[:, 0] holds the names, data[:, 1] the gender labels
unique = list(set("".join(data[:, 0])))
unique.sort()
vocab = dict(zip(unique, range(1, len(unique) + 1)))
```

Here we start indexing from 1 so that 0 can be reserved for padding variable-length names.
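The `data` array assumed by the snippet above can be produced with pandas, for example as follows. A tiny in-memory sample stands in for the real `train.csv` here; in your code, pass the actual file path to `pd.read_csv` instead:

```python
import io

import pandas as pd

# A small in-memory sample standing in for the real train.csv file.
csv_text = "Name,Gender\nTerrone,M\nAnnaley,F\n"

# pd.read_csv accepts a file path just as well, e.g. pd.read_csv("data/train.csv")
data = pd.read_csv(io.StringIO(csv_text)).to_numpy()

names, labels = data[:, 0], data[:, 1]
print(names)   # ['Terrone' 'Annaley']
print(labels)  # ['M' 'F']
```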
Learning algorithms are generally bad at handling variable-length input. Even recurrent networks, in practice, receive inputs of a fixed size to optimize computation. To handle variable-length names, find the length of the longest name and use it as the maximum length. If by some chance you encounter a name longer than this maximum, crop it. Normalize every name to a format where the letters are represented by their identifiers and the excess positions are padded with zeros. For example, the name `Elizabeth` is converted to

```
[ 5 38 35 52 27 28 31 46 34  0  0  0  0  0  0]
```

where the maximum length of a name is 15.

### Character Embeddings

On their own, integers are not a very good representation of the data, especially for neural networks, which inherently assume the data belongs to a continuous space. One approach to convert the integer sequence representation into something meaningful is to create an embedding for every character. For our dataset of English names, we have 52 unique characters. The simplest type of embedding is the one-hot embedding.

If you decide to use trainable embeddings, the strategy is similar to the `Word2Vec` case: we create a variable that stores the embeddings and then retrieve embeddings by index. In PyTorch, `nn.Embedding` can be used to create a trainable embedding layer.

## Classification Models

### Baseline LSTM

One of the simplest classification models for this task is an LSTM model with one dense layer. To create the LSTM network, first you need to create an [LSTM layer](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html).

```python
# PyTorch
lstm_layer = nn.LSTM(input_size=input_size, hidden_size=dimension,
                     num_layers=1, batch_first=True, bidirectional=False)

# Tensorflow 2.x
lstm_layer = tf.keras.layers.LSTM(...)
```

<!-- Use sigmoid cross entropy loss for optimization. -->

This is the baseline LSTM model. Assume the following label mapping: `F->0` and `M->1`.
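Putting the pieces together, a minimal PyTorch sketch of such a baseline (embedding, LSTM, one dense layer) might look like the following. The class name and default sizes are illustrative; the required baseline hyper-parameters are specified below:

```python
import torch
import torch.nn as nn


class BaselineLSTM(nn.Module):
    """Trainable embeddings -> LSTM -> one dense layer."""

    def __init__(self, vocab_size, emb_size=5, hidden_size=5):
        super().__init__()
        # +1 because index 0 is reserved for padding; padding_idx keeps it at zero.
        self.embedding = nn.Embedding(vocab_size + 1, emb_size, padding_idx=0)
        self.lstm = nn.LSTM(input_size=emb_size, hidden_size=hidden_size,
                            num_layers=1, batch_first=True, bidirectional=False)
        self.fc = nn.Linear(hidden_size, 1)  # single logit: F -> 0, M -> 1

    def forward(self, x):                    # x: (batch, T) integer-encoded names
        emb = self.embedding(x)              # (batch, T, emb_size)
        _, (h_n, _) = self.lstm(emb)         # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1]).squeeze(-1)  # (batch,) logits
```

With the `F->0`/`M->1` mapping, such a model can be trained with `nn.BCEWithLogitsLoss()` and `torch.optim.Adam(model.parameters(), lr=0.001)`.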
For the baseline model, create trainable embeddings of size 5, create an LSTM layer with a hidden state size of 5, and train with the Adam optimizer for 100 epochs with a learning rate of 0.001.

**Tip:** sort names by length before creating mini-batches to help the model learn shorter sequences first.

### Classical Neural Network

Since we fixed the maximum length of a name, why don't we apply a regular neural network? A simple NN requires the input tensor to have rank two. The simplest way to make the data suitable for the NN input layer is to flatten the output of the embedding layer.

```python
# Tensorflow
flat = tf.reshape(embedded_names, shape=(-1, T * embedding_size))

# PyTorch
flat = embedded_names.view(-1, T * embedding_size)

# T: maximum name length, i.e. the maximum length of an input sequence
```

There are also other ways to flatten the input:

#### Maxpool

You can apply max-pooling (`tf.reduce_max`/`tensor.max`) along the `T` or `emb_size` axis to flatten the input.

#### Average

You can use `tf.reduce_mean`/`tensor.mean` along the `T` or `emb_size` axis to flatten the input.

#### Weighted Average

Instead of the plain mean, you can apply a weighted average along the `T` or `emb_size` axis to flatten the input.

## Hyperparameter Tuning

The performance of a model depends on a number of hyper-parameters. For a regular neural network, these include:

- learning rate
- layer sizes
- activation functions
- number of epochs
- etc.

In this assignment, you are going to implement several classification models and try to find the best architecture and parameters based on the model performance on the test set. During the parameter search, keep in mind that complex models are not always the best. Do a simple sanity check by comparing the performance on the test and training data: if your model performs very well only on the training data, it is probably overfitting.

When comparing different models, keep in mind that there are multiple dimensions of comparison.
The most obvious are:

- model performance
- number of trainable parameters
- training speed

An example of a training speed comparison could be the following:

![](https://i.imgur.com/z59qtet.png)

Try to evaluate different models along these dimensions and make a fair comparison. For example, if your fully connected model is much better than the LSTM model but uses five times more parameters, the comparison is probably unfair.

To get the number of trainable parameters, in PyTorch you can use [pytorch-summary](https://github.com/sksq96/pytorch-summary), and in Tensorflow/Keras you can run:

```python
# Keras
model.summary()

# PyTorch
from torchsummary import summary
summary(model, input_size=(x, y, z))
```

## Report & Source Code

Perform a comparison of the baseline LSTM model, your custom LSTM model, and a classical NN model (without recurrent connections). You have the freedom to add layers, increase the number of neurons, and change the input data format. The primary goal is to beat the baseline LSTM model's performance on the test set.

The implementation should be in Python using one of the deep learning frameworks used in the labs (Pytorch, Tensorflow or Keras) and should use Tensorboard to log accuracy, model weights, F-measure and loss. The implementation repository should be available on GitHub or GitLab. Your repository should contain:

- Train and test scripts
- Readme file (how to run the train & test scripts)
- Documentation (code documentation and Readme)

Proposed repository structure:

```
├── data               <- Data files directory
│   └── Data1          <- Dataset 1 directory
│
├── notebooks          <- Notebooks for analysis and testing
│   ├── eda            <- EDA notebooks directory
│   └── preprocessing  <- Notebooks for preprocessing
│
├── scripts            <- Standalone scripts
│   └── dataExtract.py <- Data extraction script
│
├── src                <- Code for use in this project.
│   ├── train.py       <- Train script
│   └── test.py        <- Model test script
│
├── tests              <- Test cases
│   └── dataLoad_tests.py
├── requirements.txt
└── README.md
```

Your report should contain:

- Motivation: an explanation of what a reader should expect from your report
- A brief task definition and data description
- If you use an alternative data input format, an explanation of it
- A comparison of LSTM models with different hyper-parameters. Describe which model is better based on the test and training set performance. Does the model overfit? Underfit?
- A comparison of fully-connected models with different hyper-parameters. If you try different strategies to flatten the NN input, describe the better strategy based on the test and training set performance. Does the model overfit? Underfit?
- Graphs and tables documenting the results of your experiments

The report should be submitted in PDF format.
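The Tensorboard logging requirement can be sketched with PyTorch's `torch.utils.tensorboard.SummaryWriter`. The tag names, log directory, and dummy metric values below are illustrative; in your training loop, log the real per-epoch values instead:

```python
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("runs/baseline_lstm")  # log directory is illustrative

# Dummy values standing in for real per-epoch metrics.
for epoch in range(3):
    writer.add_scalar("train/loss", 1.0 / (epoch + 1), epoch)
    writer.add_scalar("test/accuracy", 0.5 + 0.1 * epoch, epoch)
    writer.add_scalar("test/f_measure", 0.5 + 0.1 * epoch, epoch)
    # Model weights can be logged as histograms, e.g. a layer's weight tensor.
    writer.add_histogram("fc/weight", torch.randn(5), epoch)

writer.close()
```

Run `tensorboard --logdir runs` to inspect the logged curves.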