# Welcome to the Book Club "Hands-On Machine Learning with R" - 2nd Meeting
## Some house rules to make the meeting nice for everyone
- Please familiarize yourself with our [code of conduct](https://rladies.org/code-of-conduct/). In summary, please be nice to each other and help us make this an **inclusive** meeting! :purple_heart:
- The meeting will NOT BE RECORDED but the slides will be shared!
- Please list your name in the registry.
- Make sure you're in edit mode (Ctrl + Alt + E) when trying to edit the file! You'll know you're in edit mode if the background is black :8ball:
- Please keep your mic off during the presentation. Having your camera on and participating makes the meeting more interactive, but of course you can keep your camera off if you prefer.
- If you have questions, raise your hand or write your question in this document.
### Links :link:
Book/organisation:
- [Book: "Hands-On Machine Learning with R"](https://bradleyboehmke.github.io/HOML/)
- [GitHub Repository](https://github.com/rladiesnl/book_club_handsonML)
- HackMD notes:
  - [Chp 1, 2](https://hackmd.io/EhYe_gkWScuoaVCIH6QLAg?both)
- [Meeting Link](https://us02web.zoom.us/j/89588323742#success)
- Meet-up pages:
  - [R-Ladies Utrecht](https://www.meetup.com/rladies-utrecht/)
  - [R-Ladies Den Bosch](https://www.meetup.com/rladies-den-bosch/)
- Twitter:
  - [@RLadiesUtrecht](https://twitter.com/RLadiesUtrecht)
  - [@RLadiesDenBosch](https://twitter.com/RLadiesDenBosch)
## Chapters 3-4: Feature and Target Engineering, and Linear Regression
### Registry :clipboard:
Name / pronouns / R level / place where you join from
- Alejandra / she, her / Intermediate / Utrecht, NL
- Soly / she, her / Beginner / Helsinki, FIN
- Elena / she, her / upper intermediate/advanced / Aarhus, Denmark
- Oussama / He, Him / Advanced / Dijon, FR
- Veerle / she, her / advanced / Den Bosch, NL
- Martine / she, her / upper intermediate-advanced / 's-Hertogenbosch (is also Den Bosch) NL
- P. Jason Toppin / He, Him / Intermediate / Kingston, Jamaica
- Ece / she, her / Intermediate / Rotterdam, NL
### Do you have any questions? :question:
You can write them down here, and if you have an answer to a posted question, please go ahead and reply; we are all learning together.
- WRITE YOUR QUESTION HERE! :point_left:
- How ubiquitous is log transformation of data? Would you use it with any continuous variable that deviates slightly from normality (slight skewness/kurtosis present) on its "natural" scale? **It is used often in the biological sciences.**
- Is there any difference between log-scaling and normalization (like the Blom transformation)?
- In my field (epidemiology), missingness is classified as missing completely at random (by chance, e.g. an error in data recording), missing at random (the missingness mechanism can be taken into account using available variables), and missing not at random (the missingness mechanism is unknown and the measured data cannot account for it). In the Ames data, missing not at random would be something like no one from a low-income area reporting the number of garage spaces in their property.
- KNN: How do you know which features the prediction is based on? I mean, if you choose the 6 nearest neighbors... OK, thank you! (Sorry, I'm having some connection problems, so I can't use my mic or video.)
## About today's topic
### Any take-home messages you want to share?
Help others remember the main points you took away from these chapters:
- Missing data can have different causes. **Informative missingness** has a structural cause (e.g. bias in the data collection process), while **missingness at random** is independent of the data collection process.
- **Imputation** means substituting missing values with estimated values. This can be based on a descriptive statistic (mean, median, or mode), on similar groups, or on a model (e.g. K-nearest neighbors or tree-based models). The last two are often more accurate, but also more time-consuming.
- Feature filtering: sometimes we have too many features! This makes models hard to interpret and computationally costly, and many features are non-informative. A good first step is to remove features with **near-zero variance** (roughly, less than 10% of the values are unique and the most frequent value heavily outweighs the second most frequent).
- **Numerical feature engineering**: log transformation to reduce skewness, standardization (centering and scaling)
- **Categorical feature engineering**: lump infrequently used categories together; one-hot or dummy encoding
- **Dimension reduction**: represent the features with a smaller set of components that retain most of the information, for example using Principal Component Analysis (PCA)
- The [recipes](https://recipes.tidymodels.org/) package is useful for planning your preprocessing step by step! The order of the steps matters; see the sketch after this list.
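
To make the ordering concrete, here is a minimal sketch of a preprocessing blueprint with recipes on the Ames data, roughly following the step order the chapter recommends (filter, impute, normalize, standardize, reduce dimensions, encode). It is not taken verbatim from the book: the `neighbors` and `threshold` values are illustrative, and `make_ames()` actually ships without missing values, so the imputation step is included only for demonstration.

```r
library(recipes)
library(AmesHousing)

ames <- make_ames()

blueprint <- recipe(Sale_Price ~ ., data = ames) %>%
  step_nzv(all_predictors()) %>%                                 # 1. drop near-zero-variance features
  step_impute_knn(all_predictors(), neighbors = 6) %>%           # 2. model-based imputation (no-op here)
  step_log(Gr_Liv_Area, base = 10) %>%                           # 3. log-transform a skewed feature
  step_center(all_numeric(), -all_outcomes()) %>%                # 4. standardize numeric features:
  step_scale(all_numeric(), -all_outcomes()) %>%                 #    center, then scale
  step_pca(all_numeric(), -all_outcomes(), threshold = 0.95) %>% # 5. dimension reduction via PCA
  step_other(all_nominal(), threshold = 0.01) %>%                # 6. lump rare categories into "other",
  step_dummy(all_nominal(), one_hot = FALSE)                     #    then dummy-encode them

# prep() estimates the steps from the (training) data; bake() applies them
prepared <- prep(blueprint, training = ames)
ames_processed <- bake(prepared, new_data = ames)
```

In practice, run `prep()` on the training data only and `bake()` on both training and test sets, so that no information leaks from the test set into the preprocessing.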
### Do you have any interesting links regarding the topic? :link:
If you have suggestions for books, blog posts, articles, etc. that could help people get further into the topic, write them here:
- [Recipes R-package](https://recipes.tidymodels.org/)
### Feedback :left_speech_bubble:
Please help us get better at this by giving us some feedback :sparkles: Things you liked or things that could be improved! :smile:
- WRITE YOUR FIRST COMMENT HERE! :point_left:
## Sign-up for presenting a chapter!
- Chp 1-2 - Ger (12 Sept)
- Chp 3 - Ale (26 Sept)
- Chp 5-6 - Marianna Sebo (10 Oct)
- Chp 7-8 - Elena Dudukina (24 Oct)
- Chp 9-10 - Veerle (7 Nov)
- Chp 12 - Ece (21 Nov)
- Chp 11, 13-15 - TBD (TBD): Ece (only Chp 12 if possible)
- Chp 16 - Brandon, co-author of the book (TBD)
- Chp 17 - Shweta (would try to) (TBD)
- Chp 18-19 - TBD (TBD)
- Chp 20-21 (maybe 22) - Martine (TBD)