# Welcome to the Book Club "Hands-On Machine Learning with R" - 1st Meeting

## Some house rules to make the meeting nice for everyone

- Please familiarize yourself with our [code of conduct](https://rladies.org/code-of-conduct/). In summary, please be nice to each other and help us make an **inclusive** meeting! :purple_heart:
- The meeting will NOT BE RECORDED but the slides will be shared!
- Please list your name in the registry.
- Make sure you're in edit mode (ctrl + alt + e) when trying to edit the file! You'll know you're in edit mode if the background is black :8ball:
- Please keep your mic off during the presentation. It is nice if you have your camera on and participate to make the meeting more interactive. Of course, you can have your camera off if you prefer.
- If you have questions, raise your hand or write your question in this document.

### Links :link:

Book/organisation:

- [Book: "Hands-On Machine Learning with R"](https://bradleyboehmke.github.io/HOML/)
- [GitHub Repository](https://github.com/rladiesnl/book_club_handsonML)
- [Meeting Link](https://us02web.zoom.us/j/89588323742#success)
- Meet-up pages:
  - [R-Ladies Utrecht](https://www.meetup.com/rladies-utrecht/)
  - [R-Ladies Den Bosch](https://www.meetup.com/rladies-den-bosch/)
- Twitter:
  - [@RLadiesUtrecht](https://twitter.com/RLadiesUtrecht)
  - [@RLadiesDenBosch](https://twitter.com/RLadiesDenBosch)

## Chapter 1-2: Introduction to Machine Learning + Modeling process

### Registry :clipboard:

Name / pronouns / R level / place where you join from

- Alejandra / she, her / Intermediate / Utrecht, NL
- Martine / she, her / Intermediate, Advanced-ish / 's-Hertogenbosch, NL
- Soly / she, her / Beginner-intermediate-ish / Helsinki, FIN
- Veerle / she, her / Den Bosch, NL
- Stefan / he, him / Moldova
- Elena / she, her / Intermediate-Advanced / Aarhus, Denmark
- Inka / she, her / Intermediate-Advanced / Leipzig, GER
- Shweta / she, her / Beginner-Intermediate / Delhi, India
- Marianna / she, her / Intermediate / Barcelona, Spain
- Iris / she, her / Beginner / Utrecht, NL
- Jaime / he, him / Intermediate / Aachen, Germany
- Oussama / he, him / Intermediate-Advanced / Dijon, France
- Ece / she, her / Intermediate / Rotterdam, NL

### Do you have any questions? :question:

You can write them down here, and if you have answers to posted questions, please go ahead: we are all learning together.

- Does the approach for the preferred model depend on the aim of the analysis? In causal inference, you choose for confounding minimization and select the model based on the research question/data type. In the prediction world, do you choose the model that gives the best prediction/least bias with acceptable variance? (from Elena)
  * Yes, the preferred modeling approach should be primarily driven by the aims of the analysis. For example, we often have to think about how a model will be deployed in production (i.e., where the model will be used to score new data for business/research purposes), so how fast a model can score is often an important factor for us. A GLMNET or XGBoost model will often score faster than a large random forest or stacked ensemble when applied to large data sets. If the signal-to-noise ratio is rather low, statistical models (e.g., GLMNET and other regression models) tend to be more stable; see this post by Frank Harrell: https://hbiostat.org/blog/post/stat-ml/. However, no algorithm is universally superior to the others in every situation, so it's important to consider trying several different models. I remember hearing Trevor Hastie once say he would typically start by fitting a GLMNET model (a statistical model with a more additive structure that's interpretable) and a random forest (a rich and more complex model that models lots of interaction effects) to see if the more complex model is finding more signal in the data. -- B. Greenwell
  * (A small glmnet vs. random forest code sketch illustrating this idea is included below this question list.)
- With K-fold cross-validation, the model from which fold is the final model? (question from Ale)
  * The final model is actually fit to all of the data after k-fold cross-validation (CV) is complete. For example, in 5-fold CV, you actually end up fitting 5 + 1 = 6 models. In 5 of the models, one of the folds is left out and treated as an independent test set, resulting in 5 independent performance assessments. The average of these assessments is the cross-validation score used for assessing the model's performance; the final model is then constructed from the full sample. -- B. Greenwell
- How do you know you are overfitting? Is it because of the number of predictors? How do you find the balance?
  * It can be challenging to know whether you're underfitting, overfitting, or doing just right. Learning curves are really helpful here. For example, look at how the generalization performance (e.g., as measured by cross-validation) of a polynomial regression model changes as you vary the degree of the polynomial. More complex models (like gradient boosted decision trees) have lots more parameters to potentially vary, so learning curves become more challenging to compute. Fortunately, many algorithms (like gradient boosting and GLMNET) have regularization parameters that can help to avoid overfitting. -- B. Greenwell
- Do the rules for choosing the proportions of the train/test data sets used in practice correspond to those in the book? 60%/40% or 80%/20%, for example?
  * I'm not aware of any rules for choosing the proportions (it's rather arbitrary, to be honest). 70/20/10 for train/test/holdout is common, but not for any particular reason. If you don't feel like you have enough data to partition with enough left over to fit an adequate model, then cross-validation is your friend. But if you're in a data-rich situation (e.g., you have access to thousands or millions of observations), then keep in mind that there's a trade-off between accuracy and compute time; more complex models have more tuning parameters, and it can take a long time to tune if you have lots of training data. And on tabular data sets (like many of the ones used throughout this book), there tends to be a point of diminishing returns where training on more rows will not lead to much in terms of increased performance (i.e., the algorithm has learned what it can). Personally, if I'm working with a good but large data set (say millions of rows), I'd be more inclined to use a 50/50 split, or something else. If I'm also doing feature selection, I might take a reasonable subsample for that as well, then remove those records from training/testing (since I've already looked at them). -- B. Greenwell
- Could you re-explain the figure of the first KNN model result (RMSE as a function of k)? Does it mean that the best predictive model is the one that looks at the 6 closest neighbours of the data point? (See the caret KNN sketch below this list for how such a figure is produced.)
- I understand what the intended way for a method to work is, but how do you decide which prediction method to use? You used k-nearest neighbours there, but in which cases should I use another one, and how many more are there? (Question from Jaime)
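As a small illustration of the "fit a simple, additive model and a more flexible model and compare" idea from the first answer above, here is a rough R sketch. It is not from the book; it assumes the glmnet and ranger packages are installed and uses the built-in mtcars data purely as a placeholder, comparing the cross-validated error of a penalized regression with the out-of-bag error of a random forest.

```r
# Hedged sketch (not from the book): compare a simple penalized linear model
# against a more flexible random forest to see if the flexible model finds
# more signal. mtcars is only a toy stand-in for a real data set.
library(glmnet)
library(ranger)

x <- model.matrix(mpg ~ . - 1, data = mtcars)  # predictor matrix for glmnet
y <- mtcars$mpg

set.seed(123)
cv_fit <- cv.glmnet(x, y)                  # cross-validated elastic net (additive, interpretable)
rf_fit <- ranger(mpg ~ ., data = mtcars)   # random forest (captures interactions)

min(cv_fit$cvm)          # lowest cross-validated MSE of the penalized regression
rf_fit$prediction.error  # out-of-bag MSE of the random forest
```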
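On the question about the k ~ RMSE figure for the first KNN model: the curve is produced by evaluating a grid of candidate k values with cross-validation and keeping the k with the lowest RMSE. Below is a hedged sketch of how such a curve can be made with caret; it is not the book's exact code, and it uses the Boston housing data from MASS as a stand-in for the book's data.

```r
# Hedged sketch: tune the number of neighbours k in a KNN regression with
# 10-fold cross-validation and inspect RMSE as a function of k.
library(caret)
library(MASS)  # Boston housing data, used here only as a stand-in

set.seed(123)
knn_fit <- train(
  medv ~ .,
  data = Boston,
  method = "knn",
  preProcess = c("center", "scale"),                    # KNN is distance-based, so scale the features
  trControl = trainControl(method = "cv", number = 10), # 10-fold cross-validation
  tuneGrid = data.frame(k = 2:25)                       # candidate numbers of neighbours
)

knn_fit$bestTune  # the k with the lowest cross-validated RMSE
plot(knn_fit)     # RMSE as a function of k, similar to the figure in the book
```

If the lowest point of that curve falls at k = 6, then the best model among those tried is the one that averages over the 6 nearest neighbours of each new data point.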
## About today's topic

### Any take-home messages you want to share?

Help others remember the main points you took from these chapters:

- **Supervised learners** predict. The outcome they predict can be continuous or categorical. Example: we can predict the price of a house based on the house features (which year it was built, city, neighborhood, etc.). In this example, the outcome is a continuous variable because the price is a number that can take different values that do not imply a 'category'.
- **Unsupervised learners** describe. They can work across rows (clustering) or across columns (dimension reduction).
- Building a machine learning model is an iterative process.
- In machine learning you usually **split** your data into training and testing data. A common approach is to use 70% of your data to train your model and the remaining 30% to test the model you developed.
- You usually do **k-fold cross-validation**. This means that you further subdivide the **training** data into k parts (usually 5 or 10). You use k-1 parts to train the model and try different parameter settings, and use the remaining fold to test that version of the model. You then redo this process using a different fold for testing each time. At the end, you can combine the results across folds and across the parameter settings you tried to choose the final version of the model. (A small resampling code sketch is included at the end of this document.)
- **Bootstrapping:** resampling the data with replacement.
- **Bias:** how far off your model's predictions are from the true values.
- **Variance:** the variability of a model's predictions (e.g., overfitting -> no error on the training data).

### Do you have any interesting links regarding the topic? :link:

If you have suggestions of books/blog posts/articles, etc. that could help people get further into the topic, write them here:

- Post by Frank Harrell: https://hbiostat.org/blog/post/stat-ml/

### Feedback :left_speech_bubble:

Please help us get better at this by giving us some feedback :sparkles: Things you liked or things that could improve! :smile:

- WRITE YOUR FIRST COMMENT HERE! :point_left:

## Sign-up for presenting a chapter!

- Chp 1-2 - Ger (12 Sept)
- Chp 3-4 - Ale (26 Sept)
- Chp 5-6 - Marianna Sebo (10 Oct)
- Chp 7-8 - Elena Dudukina (24 Oct)
- Chp 9-10 - Veerle (7 Nov)
- Chp 12 - Ece (21 Nov)
- Chp 11, 13-15 - TBD (TBD): Ece (only Chp 12 if possible)
- Chp 16 - Brandon, co-author of the book (TBD)
- Chp 17 - Shweta (would try to) (TBD)
- Chp 18-19 - TBD (TBD)
- Chp 20-21 (-22) - Martine (TBD)
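### Appendix: a small resampling sketch in R

To tie together the resampling vocabulary from the take-home messages above (train/test split, k-fold cross-validation, bootstrapping), here is a minimal, hedged R sketch. It assumes the rsample package is installed and uses the built-in mtcars data purely as a placeholder.

```r
# Hedged sketch: the basic resampling objects from the take-home messages.
library(rsample)

set.seed(123)

# 70/30 train/test split (the proportions are a convention, not a rule)
split      <- initial_split(mtcars, prop = 0.7)
train_data <- training(split)
test_data  <- testing(split)

# 5-fold cross-validation of the training data: each resample holds out one fold
folds <- vfold_cv(train_data, v = 5)

# 100 bootstrap resamples: sampling the training data with replacement
boots <- bootstraps(train_data, times = 100)

folds
```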