
<p style="text-align: center"><b><font size=5 color=blueyellow>Practical Machine Learning - Day 2</font></b></p>
:::success
**Practical Machine Learning — Schedule**: https://hackmd.io/@yonglei/practical-ml-2025-schedule
:::
## Schedule
| Time | Contents |
| :---------: | :------: |
| 09:00-09:05 | Soft start |
| 09:05-10:25 | Supervised Learning (I): Classification |
| 10:25-10:35 | Break |
| 10:35-11:55 | Supervised Learning (II): Regression |
| 11:55-12:00 | Wrap-up and Q&A |
---
## ENCCS lesson materials
:::info
- [**Practical Machine Learning**](https://enccs.github.io/practical-machine-learning/)
- [**Introduction to Deep Learning**](https://enccs.github.io/deep-learning-intro/)
- [**High Performance Data Analytics in Python**](https://enccs.github.io/hpda-python/)
- [**Julia for high-performance scientific computing**](https://enccs.github.io/julia-for-hpc/)
- [**GPU Programming: When, Why and How?**](https://enccs.github.io/gpu-programming/)
- [**ENCCS lesson materials**](https://enccs.se/lessons/)
:::
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
- Is this how to ask a question?
- Yes, and an answer will appear like so!
:::success
**Statistical characteristics of the data not properly taken into account: you always assumed "normal" distribution, but that's not always the case and concepts like std dev or IQR have to be revised**
- First, check the distribution of the data
- Histograms, QQ-plots
- Alternatives to "normality-based" scaling
- Non-parametric scaling
- Quantile transformation: maps data to a uniform or normal distribution
```
from sklearn.preprocessing import QuantileTransformer

# map each feature to a uniform distribution based on its empirical quantiles
qt = QuantileTransformer(output_distribution='uniform')
X_scaled = qt.fit_transform(X)
```
- Transform skewed data
```
from sklearn.preprocessing import PowerTransformer

# the Yeo-Johnson power transform makes skewed features more Gaussian-like
pt = PowerTransformer(method='yeo-johnson')
X_trans = pt.fit_transform(X)
```
:::
### [5. Supervised Learning (I): Classification](https://enccs.github.io/practical-machine-learning/05-supervised-ML-classification/)
- Why do you scale training and test datasets individually? Wouldn't it make more sense to scale the entire dataset first before splitting in training and test?
- It is important to keep the training and test datasets separate so they do not influence each other -- that is one of the reasons.
- Another reason is that the test dataset stands in for unseen, real-world data. If the scaler were fit on the full dataset before splitting, statistics from the test set would leak into training (data leakage) and the evaluation would look better than it really is. In practice, the scaler is fit on the training split only and then applied to the test split (see the sketch below).
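- A minimal sketch of that pattern, assuming `StandardScaler` and stand-in data from `make_classification` (in the lesson this would be the penguins data):
```
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=4, random_state=42)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training split only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on the test split
```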
- What is the meaning of "precision, F1, etc." scores?
- These are some of the evaluation metrics used for checking model performance.
- $\text{Precision} = \frac{\text{correctly classified actual positives}}{\text{everything classified as positive}} = \frac{TP}{TP+FP}$
- $\text{Recall (or TPR)} = \frac{\text{correctly classified actual positives}}{\text{all actual positives}} = \frac{TP}{TP+FN}$
- $\text{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} = \frac{2\,TP}{2\,TP+FP+FN}$
- Which metric to prioritize depends on the use case. [More on this here](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall). A short scikit-learn sketch follows this thread.
- Thx
- You're welcome :+1:
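- A hedged sketch of how these metrics can be computed with scikit-learn, using made-up labels and predictions:
```
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 2, 2]   # made-up true labels (3 classes)
y_pred = [0, 1, 1, 1, 2, 0]   # made-up predictions

print(precision_score(y_true, y_pred, average="macro"))   # averaged over classes
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))              # per-class precision/recall/F1
```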
- ==How does KNN start? Say you want 3 neighbours, how are the first 3 members of the dataset classified?==
- I will let @yonglei take this one.
- if we set `k=3`, we first calculate the distance between the new point (green) and all other data points in the dataset
- we then keep only the 3 data points with the shortest distances to the new point
- the mode (majority class) of those 3 neighbours is the prediction for the new point, i.e. which group it should belong to (a minimal code sketch follows this thread)
- Sure, I understand how you classify points when you already have centers of gravity. But how do you get them when you start from scratch? The first data point is the first center of gravity, but then the 2nd point? Does it belong to the same group as the 1st, or is it the nucleus for a second group? Same for point #3. Once we have the 3 groups it's easy to classify.
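- One clarification that may help: KNN (unlike k-means clustering) does not build centers of gravity at all. It stores all labelled training points, and a new point is classified by majority vote among its `k` nearest stored points. A minimal sketch, using the iris dataset as a stand-in for the penguins data:
```
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in labelled dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)        # "training" just stores the labelled points
print(knn.predict(X_test[:5]))   # each prediction = mode of the 3 nearest training labels
```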
- A confusion matrix can be calculated because we know the species for all the penguins in the dataset, but if we don't know that and we are trying to make predictions for this 20% of the penguins, then we cannot know whether we determined their species correctly or not?
- Both the training and test datasets should be sampled properly and should be representative of the classes we aim to train on. As long as they are "balanced", the model can classify the species. However, no ML model can determine species it has not been trained on. Does that make sense?
- Yes, thank you.
- YW: if we don't know the species, we cannot evaluate the predictions. Say for one penguin we don't know whether it is an Adelie or a Gentoo; even if we predict something, we don't know whether the prediction is right or wrong (see the sketch after this thread).
- So, if we don't have labels (targets), we should use unsupervised learning, which will be delivered tomorrow morning.
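- A minimal sketch of computing a confusion matrix, with made-up species labels, to illustrate that the true labels of the evaluated samples must be known:
```
from sklearn.metrics import confusion_matrix

y_true = ["Adelie", "Adelie", "Gentoo", "Chinstrap", "Gentoo"]   # known true species
y_pred = ["Adelie", "Gentoo", "Gentoo", "Chinstrap", "Gentoo"]   # model predictions
print(confusion_matrix(y_true, y_pred, labels=["Adelie", "Chinstrap", "Gentoo"]))
```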
- Is it necessary to verify the proper choice of hyperparameters? If so, how should one proceed?
- Hyperparameter tuning is usually the final optimization step at the end of training and typically gives a small but meaningful boost. If the model is not performing well enough, switching to **another method** often gives a much bigger boost than tweaking hyperparameters. A common way to verify hyperparameters is cross-validated grid search (sketched below).
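- A sketch of cross-validated grid search, assuming a KNN classifier on a stand-in dataset; the parameter grid is hypothetical:
```
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in data
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]},
    cv=5,   # 5-fold cross-validation
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # best hyperparameters and their CV score
```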
- What is the output when you predict categories with neural networks? 30% chance of being an Adelie and 70% chance of being a Chinstrap?
- In neural networks, we use some encoding (here one-hot encoding) for the output classes. For 3 classes it makes a vector of size 3
- An example for a perfect classification for Adelie, would look like [1.0, 0.0, 0.0]
- A neural network might give out something like [0.30, 0.69, 0.01], which suggests that, as you said there is 30% chance of being Adelie and 69% Chinstrap and 1% chance of being Gentoo.
- What it actually refers to is the **distance** of the output vector, in the 3D encoding space we built, from each of the output classes (which are also vectors).
- Thank you, I see. But say we end up with [0.4, 0.3, 0.3], how would we classify this? Could we run some of the other methods for classification on the results from the NN?
- That is one possible approach, to resort to another method. Another would be to flag such observations for the human-in-the-loop.
- the algorithm always picks the class with the highest score
- say you got [0.48, 0.49, 0.03]; the algorithm will pick the class with score 0.49 (see the one-line sketch below)
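- In code, that is just an argmax over the class scores (made-up numbers):
```
import numpy as np

probs = np.array([0.48, 0.49, 0.03])          # hypothetical class scores
classes = ["Adelie", "Chinstrap", "Gentoo"]
print(classes[np.argmax(probs)])              # -> "Chinstrap" (the 0.49 entry)
```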
- Is it considered "overfitting of the model" if the accuracy is 1.0?
- Overfitting is said to happen when the model performs well on the training data but worse on the test data. In such cases, one reason could be that the model over-adapted to the peculiarities of the training data.
- Thank you!
- Would it be possible to have some time to discuss with you our projects, to explain our dataset and ideas, and to hear your suggestions on how to start and what would be appropriate choice of algorithm?
- YW: how about after the workshop, say 12:30?
- YW: send me an email with zoom link and we can talk for ~ 30 minutes.
- PZS: That's perfect. Thank you. I will send you zoom link.
- During training the accuracy is 0.99 and the validation accuracy is 0.85, but on the unseen test set the accuracy is only 0.50. This is a sign of overfitting. How can it be solved?
- Hyperparameter tuning, better data preparation, or choosing another method can help.
- In deep learning, early stopping is used to avoid this: training is stopped when the validation loss stops improving. [This method is also available in scikit-learn](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html), as sketched below.
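- A sketch of early stopping in scikit-learn, following the linked gradient-boosting example but with stand-in data and hypothetical settings:
```
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=42)   # stand-in data
model = GradientBoostingClassifier(
    n_estimators=500,
    validation_fraction=0.1,   # hold out 10% of the training data for validation
    n_iter_no_change=10,       # stop when the validation score has not improved for 10 rounds
    random_state=42,
)
model.fit(X, y)
print(model.n_estimators_)     # boosting stages actually fitted (< 500 if stopped early)
```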
### [6. Supervised Learning (II): Regression](https://enccs.github.io/practical-machine-learning/06-supervised-ML-regression/#supervised-learning-ii-regression)
- ==What is a Q-Q plot?==
- It refers to a quantile-quantile plot.
- Sure, but what does it show? In my world quantiles are just a vector with 3 values.
- Quantiles generalize percentiles: the q-quantile is the value below which a fraction q of the data falls. They are used to describe probability distributions.
- I still don't get it. If I have a distribution I can compute quantiles; these are just 3 numbers for the entire distribution. But the plot shows many more points.
- You can compute a quantile for any fraction between 0.0 and 1.0, not just the three quartiles. A Q-Q plot computes many quantiles of your data and plots them against the corresponding quantiles of a reference distribution (e.g. the normal); if the data follows that distribution, the points fall on a straight line (see the sketch after this thread).
- Let's also ask @yonglei for more clarification.
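- A minimal sketch of a Q-Q plot against the normal distribution, using `scipy.stats.probplot` on made-up data:
```
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.random.default_rng(42).normal(size=300)   # made-up sample
stats.probplot(x, dist="norm", plot=plt)          # data quantiles vs normal quantiles
plt.show()   # points close to the straight line => data is approximately normal
```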
- Is there a way to find the optimal degree for the polynomial when doing polynomial fitting?
- You can use a grid search.
- A general rule of thumb is that higher degrees (typically above degree=3) tend to overfit.
- You can try different numbers and compare the scores on training and testing sets.
- if the prediction on the test set gives a (much) lower score than on the training set, it is overfitting
- you can plot the scores vs. degrees to see the critical point (see the sketch after this thread)
- Thanks. And how about the depth of the decision tree? Any rule of thumb?
- Similarly, higher depths tend to overfit, but there is no hard rule of thumb. Think of a decision tree as a big `if-else` tree with many nested conditions: the deeper the tree, the more specific the rules it can memorize from the training data.
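- A sketch of comparing train/test scores across polynomial degrees, with made-up noisy data; the degree where the test score starts to drop while the train score keeps rising marks the onset of overfitting:
```
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(200, 1))                     # made-up feature
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=200)   # noisy non-linear target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for degree in range(1, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))  # R² scores
```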
- A more general question: what is the difference between MLP and DNN? Both have multiple layers, right? So what distinguishes them?
- you can think of it this way: MLPs are a subset of DNNs
- an MLP typically has few hidden layers, say < 10
- a DNN has more hidden layers, sometimes > 100 (a small MLP sketch follows this thread)
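- A small MLP sketch in scikit-learn (hypothetical layer sizes); "deep" networks stack many more hidden layers, usually in dedicated frameworks such as TensorFlow or PyTorch:
```
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)   # stand-in data
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=42)  # 2 hidden layers
mlp.fit(X, y)
print(mlp.score(X, y))   # training accuracy, for illustration only
```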
- Is it possible to include SHAP analysis in this workshop?
- YW: Good suggestion, we will add relevant content (at least some code examples) in the next round of the workshop.
:::danger
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---