PTC PRACTICAL DEEP LEARNING 2022-02

![](https://i.imgur.com/FICBHvt.png) ![](https://i.imgur.com/evE3GlG.png)

# PTC PRACTICAL DEEP LEARNING 2022-02

**10.–11.2.2022**

:::danger
:bell: **Notes:**
- Lecture recordings have been published.
- The course certificates were sent out 14.2.2022.
:::

---

## :busts_in_silhouette: Lecturers:

Markus Koskela: markus.koskela@csc.fi
Mats Sjöberg: mats.sjoberg@csc.fi
Katja Mankinen: katja.mankinen@csc.fi

---

## :link: Links

- Course in PRACE training portal: https://events.prace-ri.eu/event/1311/
- Lecture slides: https://tinyurl.com/pdl-2022-02
- Exercises: https://github.com/csc-training/intro-to-dl/
    - Day 1: https://github.com/csc-training/intro-to-dl/tree/master/day1
    - Day 2: https://github.com/csc-training/intro-to-dl/tree/master/day2
- HackMD:
    - Program and lectures' discussions: https://hackmd.io/@pdl/rkT7St2KF
    - Day 1 exercises: https://hackmd.io/@pdl/rJgMGCMTY
    - Day 2 exercises: https://hackmd.io/@pdl/Bkbf7CMat
- CSC Notebooks: https://notebooks.csc.fi/

---

## :calendar: Program

All times EET (UTC+2)

### Day 1: Notebooks

| Time | Event |
| -------- | -------- |
| 9:00-10:30 | Course practicalities; **Lecture 1:** Introduction to deep learning (Markus) |
| 10:30-10:45 | *Break* |
| 10:45-11:00 | **Exercise 1:** Introduction to Notebooks, Keras fundamentals |
| | Jupyter notebook: 01-tf2-test-setup.ipynb |
| 11:00-11:30 | **Lecture 2:** Multi-layer perceptron networks (Mats) |
| 11:30-12:00 | **Exercise 2:** Classification with MLPs |
| | Jupyter notebook: 02-tf2-mnist-mlp.ipynb |
| | Optional: pytorch-mnist-mlp.ipynb, tf2-chd-mlp.ipynb |
| 12:00-13:00 | *Lunch* |
| 13:00-14:00 | **Lecture 3:** Image data, convolutional neural networks (Mats) |
| 14:00-14:30 | **Exercise 3:** Image classification with CNNs |
| | Jupyter notebook: 03-tf2-mnist-cnn.ipynb |
| 14:30-14:45 | *Break* |
| 14:45-15:30 | **Lecture 4:** Text data, embeddings, recurrent neural networks (Markus) |
| 15:30-16:00 | **Exercise 4:** Text sentiment classification with RNNs |
| | Jupyter notebook: 04-tf2-imdb-rnn.ipynb |
| | Optional: tf2-imdb-cnn.ipynb, tf2-mnist-rnn.ipynb |

### Day 2: Puhti

| Time | Event |
| -------- | -------- |
| 9:00-10:15 | **Lecture 5:** Deep learning frameworks, GPUs and batch jobs (Mats) |
| 10:15-10:30 | *Break* |
| 10:30-12:00 | **Exercise 5:** Image classification: dogs vs. cats; traffic signs |
| 12:00-13:00 | *Lunch* |
| 13:00-13:45 | **Lecture 6:** Attention models (Katja) |
| 13:45-14:30 | **Exercise 6:** Text categorization: 20 newsgroups |
| 14:30-14:45 | *Break* |
| 14:45-15:30 | **Lecture 7:** Using multiple GPUs (Markus) |
| 15:30-16:00 | **Exercise 7:** Using multiple GPUs |
| extracurricular | **Lecture X:** AutoML |
| extracurricular | **Exercise X:** Hyperparameter optimization |

---

## :interrobang: Questions and discussion

:::info
:pencil: Please add new questions and topics below all existing discussions.
:::

Q: Warm-up question: What do you think is the most exciting area in deep learning or artificial intelligence?
- GANs +1
- Physics-inspired Machine Learning / Deep Learning
- Explainable AI
- Image understanding
- Understanding molecular structures and interactions by deep learning
- Neuro-inspired AI
- Structure finding and classification
- Graph learning in molecular discovery and biological pathway analysis.
- Natural-language understanding, reading comprehension, GPT-4

Q: There are also inductive, deductive, transductive etc. learning. Can we say that these are also part of the main types of ML?
- Those are also other types/divisions of ML; we just go through the main types here.

Q: From a practical point of view, can we consider tensors as generic data structures that can be defined as containers of other objects (tensors)?
- At least the way we are using the term in machine learning, tensors are always numerical data, i.e., a set of real numbers in one or more dimensions.

Q: With only 1 additional point, wouldn't this gradient descent just be a random walk (no comparison of different possible directions / steps)?
- I'm not quite sure what you mean.
It's not a random walk, as it's based on the data points.
- I meant that, in this instance, $\theta$ includes only one data point. To compute a gradient, the original data point is included, but there is only one gradient (only one possible direction), and if $\alpha$ is a constant, the walk will always proceed in that direction, independent of $\theta$ (and hence random).
- I guess it's still not completely random, as it's based on another data point? Not all directions are equally probable; it depends on the data.
- I was specifically asking about $N=1$, which is only enough to define one derivative.
- Oh, you mean if we have $N=1$ data points in the whole dataset? :-)
- I meant the case where $\theta$ includes only one (random) data point. The walk will then always proceed towards that data point, independent of the gradient, and hence be random. But I think the question is starting to be moot...

Q: What do the stochasticity/randomness and mini-batches give you over normal gradient descent?
- The idea is that you use a small batch of data instead of a single data point (traditional stochastic gradient descent) or the whole dataset ("regular" gradient descent) when calculating the gradient. The whole dataset might be millions of items. In practice the randomness often helps with not getting stuck in local minima during the optimization, and using mini-batches instead of just a single data point makes the training a bit more stable and helps it reach a good (ideally global) minimum, which is the goal of learning. Apart from this, there are also modern optimizers such as RMSprop and Adam.

Q: Why do we use the square in the loss function?
- The problem with a loss function like $\sum |\text{pred} - \text{actual}|$ (taking the absolute value instead of squaring, in order to get positive values) is that it is not differentiable everywhere (the derivative does not exist at 0, for example). For higher-order loss functions, there can be other local minima as well, while we'd like to get to the global minimum.
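
To make the point about differentiability a bit more concrete, here is a minimal sketch in plain Python that compares the derivative of the squared error with that of the absolute error for a single prediction. The function names and the example values are made up for illustration; this is not code from the course exercises.

```python
def squared_error_grad(pred, actual):
    """Derivative of (pred - actual)**2 with respect to pred."""
    return 2.0 * (pred - actual)


def absolute_error_grad(pred, actual):
    """Derivative of |pred - actual| with respect to pred, or None where it is undefined."""
    diff = pred - actual
    if diff > 0:
        return 1.0
    if diff < 0:
        return -1.0
    return None  # the kink at pred == actual has no unique derivative


actual = 1.0
for pred in (3.0, 1.5, 1.0, 0.5):
    # Print the prediction followed by both derivatives
    print(pred, squared_error_grad(pred, actual), absolute_error_grad(pred, actual))
```

The squared-error derivative shrinks smoothly to zero as the prediction approaches the target, so gradient steps become smaller near the minimum. The absolute-error derivative keeps the same magnitude everywhere and is undefined exactly at the target.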
Q: When should one use other metrics in the loss function than the standard one discussed in the lecture (say, other than mean squared error for regression)? Is there a way / rule of thumb for this?
- It's hard to give any general answer to this; it depends so much on the problem you are trying to solve. Normally you can get pretty far with the standard ones. Often the problem is that the actual thing you want to optimize cannot be expressed as a differentiable loss function.
- There are many possible loss functions, but selecting one of them is always problem-dependent. MSE is perhaps the default. Keras's regression losses are listed here: https://keras.io/api/losses/regression_losses/

Q: Is there any way to find what the value of the learning rate should be?
- Usually the learning rate needs to be experimented with. If the learning seems slow, you can try a larger value, and if the learning seems erratic and unstable, the LR may be too large. It can also be changed during training. Often people start with a larger LR and then reduce it after some epochs.
- Modifying the learning rate during training is called scheduling. For Keras, see https://keras.io/api/callbacks/learning_rate_scheduler/

Q: I suppose this dropout is for training data, where we can drop some connections. Can we do the same for the test set?
- Dropout is not usually done for the test set, as we want to use the best model (which often means using all connections).

Q: Is there a general rule for the number of epochs? What factors affect the optimum choice?
- Not really, but you don't need to set the number of epochs beforehand. You can run it for some epochs, look at the current model and maybe run it for some epochs more. Or you can take a [snapshot](https://keras.io/api/callbacks/model_checkpoint/) of the model every Nth epoch. Also, ["early stopping"](https://keras.io/api/callbacks/early_stopping/) is a way to look at the training process and halt it if results seem to be degrading.
- But doesn't that risk overfitting, i.e.
adjusting fit parameters based on repeated attempts?

Q: If the output of a NN is a set of probabilities, how can we get other properties of the output, like variance? (In other words, if we slightly change the input, how does the output change?)
- Often this is not done, so by default it is not well supported. Of course you can evaluate slightly modified versions of the input and see whether the result changes. A principled way would be to use Bayesian neural networks, but that goes beyond the scope of this course.

Q: Data leakage from validation dataset.
- Basically, if you try tens or hundreds of models, pick some of them based on validation results, continue trying out more models that are similar to the best ones, and do this for a number of rounds, you start to optimize your model based on the validation data, and it may no longer be a good indicator of how your model would work on completely new data.

Q: Does the dropout layer introduce new weights, which then are dropped, or does it work on the previous (probably) "Dense" layer?
- It works on the previous layer and introduces no new weights. It may temporarily cut a connection (set it to zero), but it does not change the weights in any other way. So, on the next minibatch, if that connection is not cut again, it will have the previous weights. Dropout can also be applied to other layers, such as the convolutional layers that Mats will introduce in this lecture.

Q: But doesn't even this convolutional network detect edges and shapes that are based on reflections etc., not real objects?
- Yes, it is detecting local features which could be caused by reflections etc., so with a single convolutional layer you cannot know the difference between real objects and shadows etc. With multiple layers, detecting real objects might then be possible.

Q: Within the convolution, do you assume some kind of kernel weighting for distance from the center, or is it constant?
- ok, my question is just being answered (using $k$ different kernels)...
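
As a rough illustration of the kernel question above: each kernel is just a small array of weights, and nothing forces those weights to be constant or centre-weighted. The sketch below applies two hand-picked kernels to a toy image in plain Python. The helper name `conv2d_valid`, the image and the kernel values are all assumptions made for this example; in a trained network the kernel values would be learned, not chosen by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            # Weighted sum of the image patch under the kernel
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 4x4 grayscale image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Two hand-picked 3x3 kernels: one responds to vertical edges, one averages the patch
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
mean_blur = [[1 / 9] * 3 for _ in range(3)]

# One feature map per kernel, as with the k kernels of a convolutional layer
feature_maps = [conv2d_valid(image, k) for k in (vertical_edge, mean_blur)]
for fm in feature_maps:
    print(fm)
```

Each kernel produces its own feature map from the same input, which is why a layer with $k$ kernels outputs $k$ feature maps.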
Q: Are the color channels of the image always added together in the first feature map?
- With a standard convolution layer yes, but there are other more advanced things like depth-wise convolution that processes each channel separately (https://medium.com/@zurister/depth-wise-convolution-and-depth-wise-separable-convolution-37346565d4ec)

Q: Is there some kind of basic set of kernels that are always used (like basis functions)?
- In deep learning, no. The approach here is that everything that can be learned is learned, so fixed kernels (such as Gabor filters) are typically not used. Of course there is so much research happening in deep learning that somebody has most likely tried or uses fixed kernels.
- So you start with random kernels?
- Yes, they are all randomly initialized.

Q: Is there a way to see the convolutions in use in a given layer?
- Yes, for example in Keras you can access them with the `get_weights()` method. For example, for the first model in exercise 3 you can do `model.layers[1].get_weights()` to get the weights of the second layer (which is the conv. layer). You get a bunch of numpy arrays... it's left as an exercise to the reader how to visualise those ;-)

```
# This is a nice visualisation of the weights in the first conv. layer
# (assumes `model` is the trained first model from exercise 3):
import matplotlib.pyplot as plt

kernels = model.layers[1].get_weights()[0]  # (height, width, in_channels, n_kernels)
fig, axs = plt.subplots(4, 8, figsize=(6, 5), dpi=100)
for i, ax in enumerate(fig.axes):
    ax.imshow(kernels[:, :, 0, i], cmap='inferno')  # kernel i, first input channel
    ax.set_title(str(i+1))
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_aspect('equal')
```

![](https://i.imgur.com/X08n5XV.png)

- Very nice :+1:

Q: Is there a reason why the dimensions of the convolutions should be odd-numbered, e.g. $3 \times 3$, $5 \times 5$ (or even why they should be square)?
- Odd: maybe as they are then more easily centered around the location of the neuron we are considering? Square: I guess not really, it's just easier that way, and there is no benefit from being non-square?
- Square is the default but many others have been tried.
The receptive field does not even need to be fixed; you can learn the form of the kernel as well. This is however not addressed in this course :)

Q: Does tensorflow automatically use multiple CPUs on the current machine for training?
- In general it should, but I am not entirely sure about Notebooks.

Q: I obtain 0.9792 accuracy on the train set after all epochs and 0.9881 on the test set. How can the test accuracy be higher than during training?
- There is always some randomness involved. Probably it just so happens that your model gets more digits correctly predicted by chance in the test set. Although the difference is rather large :thinking_face: It can also happen because regularization such as dropout is active only during training, and because the training accuracy is averaged over the whole epoch while the weights are still changing.

Q: Is the word similarity rather in pronunciation or meaning? Well, I guess this may be both ways, depending on what we want to do, like classify text content or do voice recognition.
- Yes, in this case it refers to meaning. In practice we typically mean words used in a similar context (i.e., with the same words before or after). But of course if we are processing the audio signal, it could refer to other things as well...

Q: What is a bag of words?
- Representing text in a numerical form, where indices correspond to different words and the values to how many times they appear in the text (as an example). For example, the sentence "hello world! World is big" could be represented as a bag of words like [1, 2, 1, 1]: "hello" appears once, "world" twice, and "is" and "big" once each. One could also use something else as a value instead of the word counts. The word order doesn't matter here.

Q: Presumably, the dimensionality of the relationships between words (and the proximity of pairs) is much higher than two-dimensional. Is such a 2D mapping some kind of principal component projection?
- Yes, the example on the slides is just to visualize relationships, but in practice these are in much higher dimensions (can be hundreds). The projection can be done using PCA, t-SNE or UMAP, for example.
- You can play with such projections for example here: https://projector.tensorflow.org/

Q: If I want to use CNNs on time series, I could also produce embeddings of the time series $(x_t, x_{t-\tau}, ..., x_{t-n\tau})$, so I get embedded vectors similar to the word embeddings. Do people do something like that? Or should one rather account for the time correlation by using scalar inputs $x_t$ with RNNs/LSTMs?
- I don't quite understand the question. The embedding typically doesn't contain time series information, but is rather some kind of vector representation of the individual data point (e.g., the single word), and the CNN and RNN take care of correlating things over time.
- I mean, if I had a scalar time series instead of text, how can I do forecasting with a CNN? One idea would be to produce embedded vectors $(x_t, x_{t-\tau}, ..., x_{t-n\tau})$ and then apply the CNN, because simply applying a CNN to the scalar time series would not account for time correlations. I was just wondering whether that is done by anyone.
- I guess you could apply the CNN directly to the scalar time series? It's fine if your "embedding" is a single dimension. Not sure if it's the best method, but...

Interestingly, the example texts in the movie example both score negative, but that seems to be because they are too short (maybe short reviews, which will be padded, tend to be generally bad?). Duplicating the "positive" review text means the review is classed as positive. An empty string is negative, a string with 80 copies of "a" is positive.

Q: Wondering if there is any success using sequence learning models in the area of biological data (DNA sequences, for example).
- Yes, we have actually had a project about DNA sequence classification. We used CNNs for that, however. Our approach was pretty similar to this: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2182-6/figures/1

Q: Estimating the memory requirements is already a bit challenging with normal batch jobs.
This probably becomes even more difficult now?
- Yes, for GPU nodes in practice the memory and CPUs are in some sense secondary. GPUs are the expensive resource, so it is often a good idea to ask for 1/4 of the memory and CPUs when you use a single GPU (as there are 4 of them per node). So for example 64-96 GB and 10 CPUs in Puhti.

Q: What elements need to be tuned to prevent I/O from being a problem for deep learning applications? Are they elements of the system, the benchmark, or the application?
- In particular for shared file systems, like the one in the supercomputer, you shouldn't do lots of small reads, like reading individual image files. This can be fixed by using better data formats or using the local drive as discussed in the lecture. The other thing is that you need to have enough CPU resources to process the input so that the processing doesn't become a bottleneck keeping the GPUs waiting. This will be discussed a bit in Lecture 7 this afternoon.

Q: These examples on slides are hugely complicated. Are there public resources on how such networks are designed to match the application?
- I guess the main resource would be the scientific publication for each network. These have all been published in scientific venues. For example https://paperswithcode.com/ is a good site to find the best results for various benchmarks and their corresponding papers (and code!).

Q: In `tf2-dvc-cnn-evaluate.py` there are these lines:

```python
with keras.utils.custom_object_scope(custom_objects):
    model = keras.models.load_model(sys.argv[1])
```

Why do you need this custom scope and how does it work?
- It is a kind of hack to support the BigTransfer network with the same evaluation script. Basically BiT comes from Tensorflow Hub and uses a generic "KerasLayer" which Keras does not know about. With the simple CNN and VGG16 you don't need it.
Q: What is the best practice for estimating the resources needed (time, number of GPUs, etc.) for the main batch job from tests on the gputest partition?
- Often other things besides running time (as it is limited to 15 mins) can be tested first in gputest. 1 GPU is usually the starting point or the baseline. If you then want to try out using multiple GPUs, you should compare the performance to the 1-GPU runs to see whether you get any performance gains with >1 GPUs. That's not always the case. Also, if you run out of memory, that often happens during the first 15 mins, so you see it already in gputest runs.

Q: How to use tensorboard via SSH port forwarding?
- Load module tensorflow/2.4-hvd instead, so:

```
module purge
module load tensorflow/2.4-hvd
tensorboard --logdir=intro-to-dl/day2/logs --port=PORT
```

Q: I guess the bottom line from the dogs vs. cats and traffic signs tasks is that before using a pre-trained database we should check if it is suitable for the purpose. How can we access those databases and check the images there? Can you tell something about it?
- Yes, often using a pretrained network is a good idea. For example networks trained with ImageNet work surprisingly well for many tasks. Not always though, as we saw with traffic signs, which consist of rather simple geometric shapes (rectangles, circles) and thus can be recognized quite well with a simple CNN trained from scratch. The prepackaged networks in Keras are all, I think, trained with ImageNet (https://www.image-net.org/).

Q: About the Ray Mooney citation: isn't a sentence a vector? I mean, it's a list of items that can be represented as numbers in a certain order. So is it more of a problem in how neural networks "understand" vectors rather than the classification of the input data type? Relevance: we just need to develop the methodology ;)
- Yes, this is true. And in general, sentence embeddings are at the moment a very hot topic in deep NLP, and they represent sentences exactly as vectors :) (See e.g.
https://www.sbert.net/)

Q: This attention on different parts of a sentence is interesting, but another interesting thing is how the different parts are detected in the first place. There is obviously a lot "under the hood" that's way out of scope here.
- Yes, there are a lot of details indeed. One blog I would recommend is https://jalammar.github.io/illustrated-transformer/ and the original "Attention is all you need" paper is actually also quite easy to read and can be recommended.
- This is great, just a quick glimpse and it looks like very valuable additional reading.
- Yes! And understanding how those different parts are detected is a very active research topic in NLP. For example https://aclanthology.org/W19-4828/ (2019) + many more recent ones.
- By the way, one nice resource for visualizing attention weights is here: https://github.com/jessevig/bertviz

A cool application is also https://openai.com/blog/openai-codex/
- Yes it is, and https://copilot.github.com/ is an integration of Codex with a regular code editor such as VS Code.
- https://www.tabnine.com/

Q: Is there any pruning applied to the words for embedding?
- I guess words not available in the GloVe embeddings are simply dropped. So pruned in that sense. (If you refer to the exercise.)
- One pruning step is the `tokenizer = text.Tokenizer(num_words=MAX_NUM_WORDS)` line, as we only use the `MAX_NUM_WORDS = 10000` most frequent tokens.

Q: I got a significant difference in performance between Keras (CNN 95% acc) and PyTorch (CNN 70%) for both CNN and RNN. What is that about?
- There are plenty of small implementation differences between PyTorch and Keras, so one cannot expect to always get the same result. In this case we should spend some more time to see if we're doing something wrong in the PyTorch case, if the difference is that big.

Q: Can transformers be used on chemical data, for example on their SMILES representation?
- I am not an expert on chemical data, but apparently it has been done, see e.g.
https://arxiv.org/abs/1911.04738 (and probably other papers as well, such as https://par.nsf.gov/servlets/purl/10168888). In general, transformers have been tried in a great many different domains and applications. (But they may not always be "the optimal solution": it really depends on the problem, the field of science and all that.)

For the curious: in addition to models (https://huggingface.co/models) and datasets, HuggingFace also has "spaces" for various ML apps: https://huggingface.co/spaces