# Introduction to Data Science, ML and Deep Learning (deeplearning540)
Mar 1-3, 2021 @ 2nd Terascale School for Machine Learning, https://indico.desy.de/event/28296/
HackMD Document for the entire course.
**Important:** Please be constructive, inclusive and positive in your communication with your peers.
## Housekeeping
Some important links so you don't get lost:
- Code of conduct for the entire school:
  https://indico.desy.de/event/28296/attachments/64388/7910/Code_of_Conduct_Terascale.pdf
- Please let us (the organizers) know of any events that fall under the code of conduct.
### Video Conferencing
- the main zoom room is here: https://cern.zoom.us/j/66120916180?pwd=aWtSVWdUNFFXV1FFSFQ4MEFsK1RlQT09
- the team zoom rooms:
  - team 1: Erik Buhmann, David Brunner (replacing Erik on Tuesday): https://uni-hamburg.zoom.us/j/97668249577?pwd=NWZ5blVkdnJ0TEl1NEZFRzBNSzNFZz09
  - team 2: Sascha Diefenbacher: https://uni-hamburg.zoom.us/j/98544597848
  - team 3: Manuel Sommerhalder: https://uni-hamburg.zoom.us/j/96330719414?pwd=dTB6c0NZM2tOVlVXU2NuajEySUJSQT09
  - team 4: Mykyta Shchedrolosiev: https://cern.zoom.us/j/3594405953?pwd=TDRMMmZvczJYbG54dWdnaTBWbS8wZz09
  - team 5: William Korcari, David Brunner (replacing William on Monday): https://cern.zoom.us/j/63832089462?pwd=S3FrZE9udEJYaFhmUE51cUp2YjF3QT09
  - team 6: Tobias Lösche: https://uni-hamburg.zoom.us/j/96900076199?pwd=R0wzdiswTmZPTGx2UG8wQmlsTFBuUT09
  - team 7: Jonas Rübenach: https://cern.zoom.us/j/61741701178?pwd=djYyRTR3K2RQSEMxbE1HNG5yNTljUT09
  - team 8: Oleg Filatov: https://cern.zoom.us/j/6451659169?pwd=QU4wNHVmV1FQcmxZN2hwYmlwSFJQdz09
  - team 9: Lucas Wiens: https://cern.zoom.us/j/68506160870?pwd=Y3RoUXBhZnRob2VCTlNaTlBqdkt5UT09
  - team 10: Moritz Scham, new zoom room: https://cern.zoom.us/j/3407172253?pwd=RVNETnpnMkpkMGxReWZzdExQUTRtUT09

### Staying in Touch
- the main mattermost channel: https://mattermost.web.cern.ch/signup_user_complete/?id=j93uppzm6ff9zg5brdeuqobfgw
- team mattermost channels:
  - team 1: Erik Buhmann, David Brunner, https://mattermost.web.cern.ch/terascaleml/channels/group1
  - team 2: Sascha Diefenbacher, https://mattermost.web.cern.ch/terascaleml/channels/group2
  - team 3: Manuel Sommerhalder, https://mattermost.web.cern.ch/terascaleml/channels/group3
  - team 4: Mykyta Shchedrolosiev, https://mattermost.web.cern.ch/terascaleml/channels/group4
  - team 5: William Korcari, https://mattermost.web.cern.ch/terascaleml/channels/group5
  - team 6: Tobias Lösche, https://mattermost.web.cern.ch/terascaleml/channels/grup6
  - team 7: Jonas Rübenach, https://mattermost.web.cern.ch/terascaleml/channels/group7
  - team 8: Oleg Filatov, https://mattermost.web.cern.ch/terascaleml/channels/group8
  - team 9: Lucas Wiens, https://mattermost.web.cern.ch/terascaleml/channels/group9
  - team 10: Moritz Scham, https://mattermost.web.cern.ch/terascaleml/channels/group10
### Documenting Learning
Each team has an individual HackMD document where they can collect notes and structure their learning. Here is the list of HackMD pads:
- team 1: Erik Buhmann, David Brunner, https://hackmd.io/@2Of0NO2QOuLytwFvbLLew/SyJQVqtzO/edit
- team 2: Sascha Diefenbacher, https://hackmd.io/@dYF1ZNaQQK6xeJntCK6kUA/rkVjaKUfu/edit
- team 3: Manuel Sommerhalder, https://hackmd.io/qS2jfxpR9OoCXftYOUlOA?edit
- team 4: Mykyta Shchedrolosiev, https://hackmd.io/E6I1L96OQwSgTkyONCcMdw?edit
- team 5: William Korcari, https://hackmd.io/@Z3kIRVbRJuDU0MYfqzw/rJzE6f9fO/edit
- team 6: Tobias Lösche, https://hackmd.io/@loeschet/rkjPUMYGu/edit
- team 7: Jonas Rübenach, https://hackmd.io/@JKI5JOhTG2m3ODr0QvHfw/Bk0KQTUfu/edit
- team 8: Oleg Filatov, https://hackmd.io/@tPZTWWavRMW6mtFiIrMjjA/H1m4DGtMd/edit
- team 9: Lucas Wiens, https://hackmd.io/@Q580EphDQTeemvMMuNrpg/rkM1hEFz_
- team 10: Moritz Scham, https://hackmd.io/@lEufmOFkRq7vjxuKEsn_Q/B1Z_LOIzu/edit
## Learning
Each lesson always follows the same structure and is expected to last about one hour. The teams need to self-organize the flow of the lessons.
1. learners watch the video :cinema:
2. learners answer at least one check-your-learning question as a team (at best in a HackMD document) :heavy_check_mark:
3. learners dive into the exercise on their own if time permits :clock1:

:question: Instructors help with show stoppers like syntax errors where they can.
:computer: if you would like to work through the exercises, or code along during the videos, we suggest using [Google Colab](https://colab.research.google.com/). Note that you may need a Google account for this.

Each lesson has a Jupyter notebook that is half filled. The video lectures start from this notebook and provide content to fill in.
## Lessons
- Lesson 00: Preface
- Lesson 01: Diving into Regression [video](https://indico.desy.de/event/28296/contributions/99576/attachments/64395/79079/deeplearning540lesson0120210219_17.59.48.mkv), [learner notebook](https://github.com/deeplearning540/lesson01/blob/main/lesson.ipynb)
- Lesson 02: Enter Clustering [video](https://indico.desy.de/event/28296/contributions/97975/attachments/64396/79084/deeplearning540_lesson0220210222_23.30.44.mkv), [learner notebook](https://github.com/deeplearning540/lesson02/blob/main/lesson.ipynb)
- Lesson 03: From Clustering To Classification [video 1](https://indico.desy.de/event/28296/contributions/97976/attachments/64398/79089/deeplearning540_lesson0320210223_23.14.33_part1.mkv), [video 2](https://indico.desy.de/event/28296/contributions/97976/attachments/64398/79097/deeplearning540_lesson03_part220210226_22.37.04.mkv), [learner notebook](https://github.com/deeplearning540/lesson03/blob/main/lesson.ipynb)
- Lesson 04: Classification Performance ROCs [video](https://indico.desy.de/event/28296/contributions/97977/attachments/64400/79098/deeplearning540_lesson0420210224_18.09.02.mkv), [learner notebook](https://github.com/deeplearning540/lesson04/blob/main/lesson.ipynb)
- Lesson 05: Neural Networks as Code [video](https://indico.desy.de/event/28296/contributions/97977/attachments/64400/79101/deeplearning540_lesson0520210225_17.48.08.mkv), [learner notebook](https://github.com/deeplearning540/lesson05/blob/main/lesson.ipynb)
- Lesson 06: How did we train? `video`, no Jupyter notebook for this lesson
- Lesson 07: CNNs `video`, [learner notebook](https://github.com/deeplearning540/lesson06/blob/main/lesson.ipynb)
- Lesson 08: Deep Learning `video`, [learner notebook](https://github.com/deeplearning540/lesson07/blob/main/lesson.ipynb)

# Questions from the video lessons
## day 1
- lesson 1: Is the forward fill the correct method for preprocessing NaNs for linear regression? Shouldn't we use e.g. `.dropna()` to drop the rows containing NaNs?
  - general answer: it depends
  - in some situations, `dropna` is the way to go
  - `fillna` belongs to a family of methods commonly referred to as imputation
- lesson 1: You mentioned that a chi^2 ~ 0 is an indicator for a good fit. Isn't that more of an indication that we've overestimated the errors on the training data? Is there a way to correctly address this when the errors on the training data are unknown?
  - in our case, somewhat: the assumptions that must hold when using chi2 (the data is distributed normally around the prediction, i.e. mean = 0, variance = 1) are not guaranteed here
  - You can also use [R2](https://en.wikipedia.org/wiki/Coefficient_of_determination), which is typically done with linear regression, but in this situation I don't think it adds too much information.
  - as soon as you have a model of the process that generated your data, the standard procedures like MLE and LS fits apply, and so do the goodness-of-fit tests
- lesson 1: why did you divide by N-2 (in regards to the DOF) and not by N-1?
  - because we are doing linear regression, we fit `y = w*x + b`, ergo we have 2 free parameters `w, b`
  - so we calculate `N` (number of data points) minus `2` (parameters) to obtain the DOF
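The N-2 bookkeeping can be illustrated with a reduced chi-square computation for a straight-line fit in numpy (a minimal sketch with toy data and unit errors assumed; all names are my own):

```python
import numpy as np

# toy data following y = 2x + 1 plus unit Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

# linear regression y = w*x + b has 2 free parameters -> DOF = N - 2
w, b = np.polyfit(x, y, deg=1)
residuals = y - (w * x + b)
dof = x.size - 2
reduced_chi2 = np.sum(residuals**2) / dof  # ~1 if the error model is right
```

With unit errors and a correct linear model, the reduced chi-square fluctuates around 1; dividing by N instead of N-2 would bias it low, since the two fitted parameters already absorb part of the scatter.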
- lesson 2: if we have two unsupervised clustering algorithms and no truth labels, how can we determine which one is better (without much domain knowledge)?
  - use different metrics from your domain to assess correctness
- lesson 2: Are there also ML clustering algorithms that don't rely on an a-priori number of clusters?
  - yes, e.g. DBSCAN: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.dbscan.html#sklearn.cluster.dbscan
  - there are likely more
- lesson 2: naive question, what could be an example of reinforcement learning? (apologies if that was explained in the video)
  - controlling or driving self-driving cars
  - learning how to play a (video) game (Super Mario, StarCraft)
- lesson 1: Is there any reason to do a forward fill instead of a backward fill?
  - no, there is no reason; in this case you could have done a backward fill as well
  - I used forward fill to show that something like this exists
- lesson 1: In what sense did we do 'machine learning'? Such a linear regression can also be done by a human by hand.
  - excellent question; linear regression was used so that people get used to the `fit`/`predict` incantations that are typical for modern ML frameworks
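The `fit`/`predict` incantation mentioned above can be mimicked with a hand-rolled linear regression (a sketch; `TinyLinearRegression` is a made-up name, not a library class):

```python
import numpy as np

class TinyLinearRegression:
    """Hand-rolled least-squares fit of y = w*x + b, exposing the
    fit/predict interface that sklearn-style estimators follow."""

    def fit(self, X, y):
        x = np.asarray(X, dtype=float).ravel()
        # design matrix [x, 1] -> closed-form least squares for (w, b)
        A = np.stack([x, np.ones_like(x)], axis=1)
        self.w_, self.b_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)[0]
        return self

    def predict(self, X):
        return self.w_ * np.asarray(X, dtype=float).ravel() + self.b_

model = TinyLinearRegression().fit([0, 1, 2, 3], [1, 3, 5, 7])  # data on y = 2x + 1
pred = model.predict([4])
```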
- lesson 1: (misplaced in mattermost) Regarding the estimation of "ice size" in the different months from the official data, the procedure doesn't seem to make much sense. The minimizer logic is one of "least squares" between the linear model and the data. But the predicted values obviously lack any error estimation, which of course follows from the lack of errors on the input. From a physics/scientific point of view there is very little to be gained. Is this meant as a general exercise in "how would you do that in principle?" Peter mentioned a "chi-square-like" number to be computed, which is misleading. He also mentioned the assumption of normally distributed (probably homoscedastic) measurement values; but the actual size is obviously missing.
  - typically the main question here is: what do you do if you don't have a mechanistic model that describes your observation? Many people start with a linear model, hence I did so too.
  - in this case, I didn't mean that chi2 is misleading in itself, but mostly with respect to the expectation that it should be `1` (maybe I didn't understand your comment correctly). The MSE is the way to go (as it is part of the chi2). One needs to be aware of the shortcomings of chi2 in this situation though.
  - You can also use [R2](https://en.wikipedia.org/wiki/Coefficient_of_determination), which is typically done with linear regression, but in this situation I don't think it adds too much information.
- what are the usual Python methods to find NaNs in data, other than printouts?
  - the DataFrame API in pandas has built-in methods for this, e.g. `.isna()`:
    `df.isna().sum()`
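A minimal sketch of those pandas calls on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 1.0]})

na_per_column = df.isna().sum()            # NaN count per column
total_na = int(na_per_column.sum())        # total NaN count in the frame
rows_with_na = df[df.isna().any(axis=1)]   # inspect the offending rows
```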
- lesson 2 exercise: I tried to fit the iris data using more than just petal_length vs. petal_width, but found that not only did that not necessarily lead to better performance (which one might expect), it led to inversions of iris species altogether! This clearly has to do with correlations, but could you say/write a couple of words?
  - plots: https://imgur.com/a/OoQ0rtb
  - so by fitting you mean the `fit` method of the k-means clustering?
## day 2
general question: could you give the notebooks in Google Colab different names for each lesson, e.g. lesson1.ipynb, lesson2.ipynb, ..., instead of just lesson.ipynb? It might be a bit easier for people, because if you forget to rename a notebook yourself you might end up with several notebooks of the same name copied to your Google Drive; the same holds for the scripts. Thanks! :)
lesson 3 questions:
- for the prediction with kNN, and with n_neighbors=3, what is the predicted classification when you have one penguin of each of the three classes as the three nearest neighbours?
  - for a tie, the `sklearn` method will classify the point depending on the order of your data
  - so, it is more or less random
  - [here](https://stats.stackexchange.com/questions/144718/how-does-scikit-learn-resolve-ties-in-the-knn-classification) is a nice explanation for this
  - to break a tie, it is recommended to decrease k by 1 at a time until the tie is resolved (see also [this post](https://stats.stackexchange.com/questions/45580/dealing-with-ties-weights-and-voting-in-knn))
  - another option would be to take the distances into account when choosing the nearest neighbours
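The distance-weighting option above can be sketched with a hand-rolled kNN vote (an illustration only, not the sklearn implementation; `knn_predict` and the penguin coordinates are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k, weighted=False):
    """Majority vote over the k nearest neighbours of point x."""
    d = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x, dtype=float), axis=1)
    idx = np.argsort(d)[:k]
    if not weighted:
        # plain vote; on a tie the winner depends on the data ordering
        return Counter(np.asarray(y_train)[idx]).most_common(1)[0][0]
    # distance-weighted vote: closer neighbours count more, breaking ties
    votes = {}
    for i in idx:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + 1.0 / (d[i] + 1e-12)
    return max(votes, key=votes.get)

# one penguin of each class among the 3 nearest neighbours -> 3-way tie;
# distance weighting resolves it in favour of the closest one
X = [[0.0, 0.5], [1.0, 0.0], [2.0, 0.0]]
y = ["adelie", "gentoo", "chinstrap"]
label = knn_predict(X, y, [0.0, 0.0], k=3, weighted=True)
```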
- Is the decision threshold to be understood as the decision boundary? Or a threshold as in e.g. one-vs-rest logistic regression? (Is it a boundary in parameter space (x_1, x_2) or a single value?)
![](https://i.imgur.com/uLWzpYA.png)
  - I don't know if I got what you mean, but here the decision threshold is the threshold to decide between the classes, a combination of all relevant parameters
  - you do not have a threshold for each parameter on its own, or else you would need only one parameter for the decision
  - in the image you pasted, the decision threshold produces a decision boundary in the parameter space `x_1` vs `x_2`; the image maps this onto a very general concept of a variable; in a later video I map this onto the probability of predicting the positive class, which may be more intuitive
- what is the official reason for selecting an odd k, as in check-your-learning question 2? We first thought it might prevent ties in the majority vote, but given k=5 and three classes, a tie could look like 2:2:1
  - oh, official reasons are tricky to obtain ;)
  - first of all, the kNN implementation in sklearn has a mechanism built in that decides randomly when facing a tie, so a random class label is chosen between the "two" classes that have produced the tie
  - again, it depends on your dataset and how well the clusters are separated; it is a clear shortcoming of this rather simple method that ties can be produced and are hard to mitigate
  - however, by choosing k odd, you can at least prevent this from happening in binary classification; for multiple categories, the situation becomes more fuzzy and unpredictable
- Note: There is a little confusion in the true negatives definition in the confusion matrix section of the video
  - my apologies for this; I recall that I was irritated myself while recording it, and I did not have time to re-record it
- Lesson 3 question: Why should I (or is this even possible?) use the ROC curve for judging the quality of my kNN classification if it does not include the falsely negative classified points? Sorry if this question doesn't make sense.
  - thank you for asking; the ROC curve reports the performance of your classification in the True Positive Rate versus False Positive Rate plane
  - $TPR = \frac{TP}{TP+FN}$
  - $FPR = \frac{FP}{FP+TN}$
  - so it does take into account the false negatives `FN`, but perhaps not in a pronounced fashion
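As a tiny worked example of the two rates (the confusion-matrix counts below are made up):

```python
# hypothetical confusion-matrix counts
TP, FP, TN, FN = 40, 10, 45, 5

tpr = TP / (TP + FN)  # true positive rate: the false negatives enter here
fpr = FP / (FP + TN)  # false positive rate
```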
- Lesson 3 question: for the KNeighborsClassifier class, what is the reason to pass n_neighbors in the constructor, and not as an argument of the ```predict()``` method? I.e. does the ```.fit()``` method depend in any way on n_neighbors? It feels unnecessary to construct a whole new kNN object when you just want to try an alternative ```k```.
  - good question, I haven't looked up how they do it. In principle, `fit` should build a tree that memorizes the dataset in RAM together with the labels, so that looking up the nearest neighbours during `predict` is much faster.
  - Otherwise you would have to do a linear scan over all points, which would be much slower.
  - The scikit-learn authors are very focused on keeping the API clean, which could also be a reason.
- Lesson 4 question: Is it correct to think about the ROC curve as being parametrized by k? (k being the number of neighbours one looks at for building the probabilities, as before)
  - good question; I would first of all wonder what you mean by parametrized? The ROC curve is for sure affected very strongly by `k`
  - the ROC curve itself is calculated from the ground-truth information in the test set and the positive prediction scores, the latter being greatly affected by the choice of `k`
- Lesson 4 question: The `roc_curve` method, does it just scan over the threshold for the probability and calculate the FPR and TPR for each cut, or is there more to it? (And how are the used threshold points selected?)
  - a somewhat nice discussion of the positive predicted probabilities can be found here: https://stackabuse.com/understanding-roc-curves-with-python/ (you also find a link to this page at the end of the lesson script)
  - the procedure to obtain a ROC curve is described in great detail in https://www.math.ucdavis.edu/~saito/data/roc/fawcett-roc.pdf
  - `roc_curve` takes the true labels of the test set `y_test` and the positive class probabilities `y_prob = knn.predict_proba(X_test)[:, 1]` as input
  - from these two inputs you first sort the entries of `y_prob` by probability; given `y_test` you know that any cut, e.g. `>= .9`, fixes the predicted labels by filtering `y_prob` accordingly, i.e. all entries with `y_prob >= .9` are predicted as positive (and all the rest as negative); then you can calculate TPR and FPR from this information; you repeat this procedure by moving on to the next threshold `>= .85` and so on
  - how the thresholds are selected in `sklearn` is beyond my knowledge at this point; I can guesstimate that they start from tpr == fpr == 0 and move towards tpr == fpr == 1, recording a threshold only when/if the TPR or FPR changes with respect to the last value; this is how I'd do it, one would need to check the sklearn source
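The threshold-scanning procedure described above can be sketched in plain numpy (a hand-rolled illustration, not the actual sklearn implementation; `roc_points` and the toy scores are made up):

```python
import numpy as np

def roc_points(y_true, y_score):
    """Scan the sorted scores as thresholds and record (FPR, TPR) pairs."""
    y_true = np.asarray(y_true, dtype=bool)
    order = np.argsort(-np.asarray(y_score, dtype=float))  # descending scores
    y_true = y_true[order]
    tps = np.cumsum(y_true)        # true positives accumulated per threshold
    fps = np.cumsum(~y_true)       # false positives accumulated per threshold
    return fps / (~y_true).sum(), tps / y_true.sum()

y_true = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(y_true, y_score)

# trapezoidal area under the curve, prepending the (0, 0) starting point
f = np.concatenate([[0.0], fpr])
t = np.concatenate([[0.0], tpr])
auc = np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0)
```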
- Lesson 5 question: You mentioned overtraining and the number of parameters of our neural network being too large. Would using fewer neurons prevent this?
  - Yes, this would reduce the number of parameters and therefore mitigate the problem. (more in lesson 7)
- Lesson 5 question: Can you elaborate a bit on the "loss" (in the keras output table)? This was not so clear to me. (+1 on this question, did not understand what this means at all.)
  - The loss is discussed in lesson 6.
## day 3
Questions to Lesson 6:
- What about a varying learning rate that is batch- or epoch-dependent? Maybe it could be informed by the Hessian of the loss function somehow?
  - there are various schemes to adapt the learning rate: depending on the epoch you are in, depending on how the loss changed over the last n epochs, ...; see https://keras.io/api/optimizers/learning_rate_schedules/
  - I am not sure about your idea with the Hessian, as you'd have to invert a gigantic matrix for this, and there are hardly any guarantees on this matrix as it depends on your input data, doesn't it?
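As an illustration of such a schedule, here is a minimal hand-rolled exponential decay (a sketch of the idea, not the Keras API; all names are my own):

```python
def exponential_decay(initial_lr, decay_rate, decay_epochs):
    """Return a schedule mapping epoch -> learning rate."""
    def schedule(epoch):
        # multiply by decay_rate once per decay_epochs, smoothly interpolated
        return initial_lr * decay_rate ** (epoch / decay_epochs)
    return schedule

lr = exponential_decay(initial_lr=0.1, decay_rate=0.5, decay_epochs=10)
# lr(0) = 0.1, lr(10) = 0.05, lr(20) = 0.025, ...
```

Keras lets you pass such a callable per-epoch via its learning-rate schedule mechanisms linked above.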
- How are the weights initialized? The convergence to a specific minimum might depend strongly on this, as the dimension of the space is very large.
  - the weights are randomly initialized; there are several schemes, see https://keras.io/api/layers/initializers/
  - typically people try out 2-3 of them if their training gets stuck, to exclude this as a possible source of problems
  - to be honest though, I have not met any project where this plays a role and would be curious to learn of circumstances where it did
- Regarding "outputs are (0,1)": what do you refer to as outputs here? O_{1} and O_{2}, or P_{O_1} and P_{O_2}?
![](https://i.imgur.com/LsNvdIg.png)
  - I am not sure I understand the question (feel free to reformulate)
  - the note on the slide was referring to the fact that the softmax layer produces probabilities as the output
- Is it just a matter of taste to use, say, ReLU (all but the last layer) + softmax (last layer) as activation functions instead of sigmoids in all layers, including the last one?
  - mostly, I'd say yes; ReLU is (for me) a good starting point
  - there are always papers appearing that suggest different/new activation functions; however, as there is (yet) no rigorous theory of how the learning works in detail, it is a matter of trying things out
  - you should be cautious though to avoid the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) and other pathologies; this occurs if the derivative of your activation function only takes values in `[0,1]`, as then small weights/gradients can experience exponential decay when moving through the network
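A quick back-of-the-envelope illustration of that exponential decay: the sigmoid's derivative is at most 0.25, so a chain of such factors through many layers shrinks exponentially with depth (a sketch with a made-up depth):

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # maximal at z = 0, where it equals 0.25

# an upper bound on the gradient factor accumulated over `depth` sigmoid layers
depth = 20
bound = sigmoid_grad(0.0) ** depth  # 0.25**20, already ~1e-12
```

ReLU avoids this particular bound because its derivative is exactly 1 on the active side.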
- You explicitly draw the bias units. In other introductions to ML, I have seen these always set to "1", with no weights like $W_{00}^{l}$ from one bias unit to the next. Does it make any difference?
![](https://i.imgur.com/9GWvmR8.png)
  - I think it does make a difference and should be mentioned during teaching
  - as you say, many introductions set the bias terms to a constant or don't mention them; I believe this is a somewhat unfortunate choice, because when you want to calculate the number of parameters of a layer, having the bias terms or not is important (also for applying your knowledge in practice)
- The concept of divergence (as a method to measure the similarity of two PDFs) was/is new to me (and it fell from the sky in the lesson). If it is important, please take time to explain it. If it is not important in the context of an introduction, maybe leave it out :D
  - good point, I considered it essential as I experienced the same as you
  - this motivated me to mention it, as the ML literature today draws on this principle very often
- Maybe it is just me, but "categorical cross-entropy" might also be new to people taking an introductory ML course (and might need more explanation in the lecture!). (but I understand that time is limited and you can only explain a limited amount of content)
  - I am not sure I see a question here, but let me comment anyhow: you touch upon a very critical point, namely understanding what a loss function does. Effectively, what an ML practitioner has to do is understand what the expectation value of the loss is and what it means in light of the data
  - for classification, the loss is bounded below by `0`, which can be super important when debugging a network
- Related to the above input: can you give some conceptual (rather than "works well in practice"; edit: didn't want to imply you put it that way :) ) motivation of why the categorical cross-entropy has this shape? E.g. why does it use the KL divergence?
  - I am not sure what you mean: the cross-entropy has a clear and rigorous definition and derivation in its own right; if I referred to it as 'works well in practice' then I'd like to correct myself
  - by mentioning the KL divergence, I wanted to share a more global view of the mathematical landscape, so that learners get the feeling there is more!
  - there are tons of ways to compare PDFs, for continuous variables as well as discrete ones; you would be surprised how diverse the ML literature is and that it does NOT stick to one and only one loss
  - for classification, I suggest starting with the cross-entropy (always)
Questions to Lesson 7, part 1:
- Say I'd like to train a CNN to enhance certain features of an image, e.g. the boundaries of an object against its background. How does the CNN measure the "enhancement" of the boundaries in order to train for that? (maybe this will be explained in part 2 :) )
  - great question, and I can only say that this is for you to decide
  - choosing the loss function is about describing the task you want to fulfill in one single number with respect to the ground-truth labels (supervised learning)
  - for regression problems, the loss is typically the MSE (L2 norm), because you want your prediction as close as possible to the ground truth (other losses like the L1 norm are also used); but again, this depends on your task
  - coming back to your question: it depends how you represent the boundaries of the objects you are referring to; in the simplest case, you may express this by using a binary mask as the ground-truth image, with the object boundaries set to `1` and all other pixels set to `0`, and then predict whether each pixel is of class `1` or `0`
  - the go-to architecture IMHO is the [U-Net](https://en.wikipedia.org/wiki/U-Net) for such tasks
  - to "improve" resolution, aka for super-resolution, you may want to use a totally different model: https://youtu.be/wmQIcTOzH0k
Questions to Lesson 7, part 2:
- What does a training without overfitting/underfitting look like? Is the goal that both training and test data remain at a constant loss around the same epoch? I played around with the number of parameters and I can't find a parameter set with no overfitting without reducing the accuracy on the test data.
  - great question and comments
  - due to time constraints I couldn't dive into this aspect deeper
  - a good training behaves such that the losses of training and test always run in parallel (excluding fluctuations of the training loss)
  - once the test loss levels off and becomes constant while the training loss keeps decreasing, you know that you are done training
  - playing with the number of parameters is one technique to tackle this
  - another one is using [dropout](https://keras.io/api/layers/regularization_layers/dropout/) layers, which do the following:
```
The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged.
```
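The quoted scaling behaviour can be sketched with an inverted-dropout implementation in numpy (an illustration of the idea, not the Keras code; names are my own):

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale survivors."""
    mask = rng.random(x.shape) >= rate
    return np.where(mask, x / (1.0 - rate), 0.0)

rng = np.random.default_rng(42)
x = np.ones(100_000)
y = dropout(x, rate=0.3, rng=rng)
# survivors are scaled by 1/(1 - rate), so the mean stays ~1 in expectation
```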
more on lesson 7:
- Is convolution in 1D useful?
  - Yes, but it depends on your dataset. It is always helpful if you try to describe neighbour information (like in images). If the input has local structure it might be helpful, but if the signal is just a single spike it might not be useful. Genome information probably does not work well with it.
- Is Fourier transformation useful in image processing?
  - Yes.
- Can you train the training of neural networks? Instead of changing weights, you change parameters like the number of layers, nodes, etc.
  - Yes, it is called Neural Architecture Search (NAS). If you have a lot of computing power...
- Sometimes you specify "axis=1", why?
  - This is related to numpy and how it lets you choose along which axis a transformation is performed. It is confusing for beginners.
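A minimal sketch of what the `axis` argument selects on a 2D array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = a.sum(axis=0)  # collapse axis 0 (rows): one value per column
row_sums = a.sum(axis=1)  # collapse axis 1 (columns): one value per row
```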
- if I do `x_inputs = layers.Input(shape=(3,))` and then `x = dense1(x_inputs)`, what is the data type of the `x_inputs` and `x` variables? (For me the answer doesn't have to be very rigorous, just want to have an idea. ^^)
  - We are constructing computational graphs. The input coming from the data is handed to the dense layer, and with this you start a pearl chain of layers.
- (even more general question) a trained network (of the kind we used) can just be written down as an analytical function, right?
  - Right
- (Just chaining the linear and non-linear functions.) So it should be fairly easy to get the parameters of the trained network and calculate the model prediction 'by hand'?
  - ML is trying to optimize a function that can represent the data. However, the function you obtain is very complicated. (Link from astrophysics: https://youtu.be/wmQIcTOzH0k)
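To illustrate that such a network really is just an analytical function, here is a by-hand forward pass through a tiny 2-3-1 ReLU network (the weights are made up purely for illustration, not taken from any trained model):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# made-up weights of a tiny "trained" 2-3-1 network
W1 = np.array([[1.0, -1.0], [0.5, 0.5], [-0.3, 0.2]])
b1 = np.array([0.0, 0.1, 0.0])
W2 = np.array([[1.0, 2.0, -1.0]])
b2 = np.array([0.5])

def predict(x):
    h = relu(W1 @ x + b1)  # hidden layer: affine map + non-linearity
    return W2 @ h + b2     # linear output layer

out = predict(np.array([1.0, 2.0]))
```

Chaining a handful of such affine maps and non-linearities is all a dense network does; "complicated" just means millions of these terms.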
- is there a way to visualize what each layer of the CNN does with an image, i.e. show the image at every step?
  - Yes, several. Peter will look for some and update the page here. But they don't look so satisfactory.
- Does it make sense to use a CNN for a task where the inputs don't have a perceivable 2D pictorial representation (e.g. the input is an arbitrary number of properties of distinct nature)? Here the input would be arbitrary pp collision event properties, and the output the class of the event.
  - Not sure; for a CNN maybe not, but maybe for a GNN, depending on your output.

Questions/Comments/Results to the capstone project:
- ![](https://i.imgur.com/SIsCNcv.png)
- Why is the area under the ROC curve a good performance metric?
  - It is well defined, and you want to have the best classifier in the top right. But there are other methods, like the precision-recall curve, which might be more useful in some cases.
- Why does processing the features as images perform so much better than processing the features in tabulated format?
  - Squeezing the information into tabular format restricts the input. Dense layers have no notion of what the different variables (pt, ...) are; however, if you go to an image classification, the dataset is easier to interpret.
  - (EB) Comparison study of different architectures and data representations for the Top Tagging dataset: https://arxiv.org/abs/1902.09914
- how do I get the data file into the Colab notebook?
  - You can run wget from within Colab by starting the line with an exclamation mark, e.g.:
    `!wget https://desycloud.desy.de/index.php/s/xip274LKzzypRba/download -O data_train_40k.h5 -c`
  - I do not know a solution for direct reading. I could imagine that it is possible
- To add to the previous question: is there a possibility of using the dataset without downloading it (as we did with the csv files in the previous two days)?
  - I don't think so.
- if you managed to run models for the capstone project, please report your accuracies below
- for capstone project 2, the data file seems to be too large; my Colab notebook crashes due to not enough RAM at `store_train_full = pandas.HDFStore("/content/train_img.h5")`
  - was resolved, see the Mattermost Town Square: `df_train_full = store_train_full.select("table")` is actually the problem.
  - This was the case for me too; I had to download the data and run the Jupyter notebook on my laptop, which in the end worked, leading to areas under the ROC curve of 0.9055 for Model 1 and 0.9194 for Model 2 (the CNN) (I didn't do any modifications).
  - Some of the keys of the history object had to be edited slightly in my case: history['acc'] -> history['accuracy'] (same for 'val_acc')
  - by increasing the training data from 10k -> 100k and the validation data from 2k -> 4k, epochs 5 -> 6: Model 1: 0.9236
  - for Model 2 in addition: increasing filters 8 -> 64, adding Dense(100, activation='relu') after MaxPooling and Dense(50, activation='relu') after flatten, epochs 5 -> 4 (heavy overfitting otherwise) => ~2M parameters: 0.9660
==============================
# Thursday, 4th March
Questions to Jean-Roch's lecture:
- What exactly do you mean by convex and non-convex optimization? Could you give some examples? That was enough, thank you :D
- Slide 12, bottom left. How does a network without input and/or output work? How is it useful?
- Would you put more work into refining the network architecture, or intentionally put in too many parameters and regularize afterwards?
- How do you make your network aware of external properties, such as symmetries or QCD? How is this knowledge put into the network?
- How would you implement pooling for graph convolutional neural networks?
- Are graph neural networks efficient on regular 2D graphs? Stock market data, for example.
- Probably very example-dependent, but on slide 72, what could be the criteria for a "fitted network", and what is meant, algorithmically speaking, by breeding and mutations?
===============================
# Friday, 5th March
Questions to Jonas' lecture:
- Slide 14: is the distribution $p$ of the noise known?
- GAN (slide 17): is the discriminator/the adverse network trained via supervised learning? (The discriminator hence needs data that is known (labelled?) to be real paintings?)
  - Yes.
- Speaking of art forgery: have you ever heard of a case where discriminators were used to classify real-life art as to whether it is an original or not?
  - We would need labelled data (fake vs. real art). Maybe some clustering would work.
- So softmax should be the way to go for the generator output, right?
  - in the 1D case softmax is similar to the sigmoid, but correct
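The softmax/sigmoid remark can be checked numerically: a two-way softmax over the logits (z, 0) reduces exactly to the sigmoid of z (a small sketch with a made-up logit):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z):
    """Two-class softmax over the logits (z, 0)."""
    ez, e0 = math.exp(z), 1.0
    return ez / (ez + e0), e0 / (ez + e0)

z = 1.7
p_softmax, _ = softmax2(z)
p_sigmoid = sigmoid(z)  # algebraically identical: e^z/(e^z+1) == 1/(1+e^-z)
```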
- why is the BatchNormalization layer set to trainable?
  - In principle, there is a problem with the inference mode of BatchNormalization in the keras software
Checkerboard patterns in GANs: https://distill.pub/2016/deconv-checkerboard/
- Question part 2: If the Wasserstein measure only works for disjoint distributions, won't that become a problem once our predicted distribution starts overlapping with the real one?
- I think you said that for one layer in the critic it was obvious that we didn't need an activation function. Why don't we need one?
  - `x = layers.LeakyReLU()(x)`
Questions to David Shih's tutorial:
- The encoding dimension should be smaller than the input dimension, right? So how much smaller is reasonable?
  - it should be smaller than 4 (which we put in)
  - in real life we generally have more dimensions
- What exactly should the SIC curve show again?