
<p style="text-align: center"><b><font size=5 color=blue>Practical Machine Learning - Day 1</font></b></p>
:::success
**Practical Machine Learning — Schedule**: https://hackmd.io/@yonglei/practical-ml-2025-schedule
:::
## Schedule
| Time | Contents |
| :---------: | :------: |
| 09:00-09:10 | Welcome |
| 09:10-09:25 | Introduction to Machine Learning |
| 09:25-09:50 | Fundamentals of Machine Learning |
| 09:50-10:00 | Break |
| 10:00-10:50 | Scientific Data for Machine Learning |
| 10:50-11:00 | Break |
| 11:00-11:50 | Data Preparation for Machine Learning |
| 11:50-12:00 | Wrap-up and Q&A |
---
## Exercises and Links
:::warning
- Exercises for [XXX]()
:::
## ENCCS lesson materials
:::info
- [**Practical Machine Learning**]()
- [**Introduction to Deep Learning**](https://enccs.github.io/deep-learning-intro/)
- [**High Performance Data Analytics in Python**](https://enccs.github.io/hpda-python/)
- [**Julia for high-performance scientific computing**](https://enccs.github.io/julia-for-hpc/)
- [**GPU Programming: When, Why and How?**](https://enccs.github.io/gpu-programming/)
- [**ENCCS lesson materials**](https://enccs.se/lessons/)
:::
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
### 2. Fundamentals of Machine Learning
- Is this how to ask a question?
- Yes, and an answer will appear like so!
- For this workshop, why do we install two identical packages, tensorflow and torch, and not just one of them?
- It is true that both are deep-learning packages, but they are not the same in terms of the API. We will look at them in some of the hands-on notebooks.
- Just a practical question: I am more used to Google Colab; can I use it instead of Jupyter? Thanks
- Yes, feel free to do so, as long as you know how to use it.
- Is it okay to ask about the environment setup?
- Sure, either paste your error info here or we can use breakout rooms.
- A breakout room would be really useful.
- DM me and I can assign you to the breakout room - Ashwin
```sh
conda create -n practical_machine_learning python scikit-learn jupyterlab
conda activate practical_machine_learning
conda install numpy scipy pandas matplotlib seaborn
jupyter-lab
```
- Things that are missing from the above:
- keras
- tensorflow
- pytorch
- umap-learn
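One way to add the missing packages to the environment created above (package names taken from the list in this thread; exact versions and channels may differ on your system):

```sh
conda activate practical_machine_learning
pip install tensorflow torch keras umap-learn
```

Installing `tensorflow` will typically pull in a compatible `keras` as a dependency, so listing it explicitly is mostly a safeguard.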
- On Linux numpy is v1.26, not v2... hopefully this won't be a problem.
- Should be OK.
- ok, thanks
- KW: here are the messages I get when testing the conda env:
```
Numpy version: 1.26.4
Pandas version: 2.3.2
Scipy version: 1.16.1
Matplotlib version: 3.10.6
Seaborn version: 0.13.2
Scikit-learn version: 1.7.1
2025-09-15 20:19:44.189239: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-09-15 20:19:44.252586: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1757960384.270109 1137084 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757960384.275663 1137084 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-09-15 20:19:44.340695: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Keras version: 3.11.2
Tensorflow version: 2.18.1
Pytorch version: 2.6.0
Umap-learn version: 0.5.3
Jupyter Notebook version: 7.4.5
```
- These log messages are often emitted by `tensorflow` during import. Safe to ignore.
- PZS: is pyplot a function from matplotlib or a subset of functions? why didn't we import the entire matplotlib?
- `pyplot` is a subpackage of matplotlib; historically it was designed to mimic MATLAB's plotting interface. Even today, `import matplotlib.pyplot as plt` remains the most common way of accessing most of matplotlib's features.
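A minimal sketch of how the two styles interact: `pyplot` creates the figure, but the objects it returns expose the full object-oriented API (the `Agg` backend and the file name are illustrative choices):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe on headless machines
import matplotlib.pyplot as plt

fig, ax = plt.subplots()       # pyplot call creates Figure and Axes objects...
ax.plot([0, 1, 2], [0, 1, 4])  # ...but plotting happens via Axes methods
ax.set_xlabel("x")
ax.set_ylabel("x squared")
fig.savefig("parabola.png")    # write the plot to disk
```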
### 3. Scientific Data for Machine Learning
:::info
How large is the data you are working with?
- ~ MB +2
- ~ GB +3
- ~ TB +1
- Streaming data +1

Are you experiencing performance bottlenecks when you try to analyse it? If yes, how would you address these issues?
- XX
:::
- Sorry, stupid question, where can I find these jupyter notebooks?
- No stupid questions here :smile:. If you have git, you can run
```sh
git clone https://github.com/ENCCS/practical-machine-learning
```
If not, go to https://github.com/ENCCS/practical-machine-learning and you should see a *Download zip* button.
- Then browse to `content/jupyter-notebooks`
- If I'm not wrong, in section 2.2 of the Tensor.ipynb, first cell, it should say oneD_tensor and not twoD_tensor.
- Yes, you are right :+1:
- In the repo I only see notebooks 3 to 6 and noticed more in Yonglei's computer.
- Yes, the 7th and 8th haven't been uploaded yet
- OK, thanks
- What is the difference between torch.ones() and torch.ones_like() and such?
- I guess `.ones()` takes the shape/dimensions of the desired tensor as input, while `.ones_like()` takes another tensor as input and uses its shape/dimensions as a reference for creating the new one.
- For `torch.ones(3, 3)`, you tell torch the dimensions of the tensor to generate.
- For `torch.ones_like(a_tensor)`, there is no need to give dimension info; torch generates a new tensor with the same shape as the existing tensor you pass in (which must already exist when you run the code).
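A minimal sketch of the difference (the shapes here are chosen only for illustration):

```python
import torch

a = torch.zeros(2, 4)   # an existing tensor with some shape
x = torch.ones(3, 3)    # shape given explicitly as arguments
y = torch.ones_like(a)  # shape (and dtype/device) copied from `a`

print(x.shape)          # torch.Size([3, 3])
print(y.shape)          # torch.Size([2, 4])
```

The `_like` constructors (`zeros_like`, `ones_like`, etc.) are handy when you need a new tensor matching another tensor's shape, dtype, and device without spelling those out.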
### 4. Data Preparation for Machine Learning
- (just curious) What happens if the data does not fit in memory - common problem with `pandas`, usually resolved with `dask` - how it is in general with ML methods?
- A feasible way is to split the data into multiple small subsets (in ML/DL these are called batches).
- During training, we send one small batch at a time to the model, so at each step the algorithm trains on only a small amount of data that fits in memory.
- Libraries like PyTorch and HuggingFace, often used for training ML/DL models, have their own "data loader" abstractions that handle batch processing of data. Dask DataFrames are also quite useful, as you noted.
- https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html
- https://huggingface.co/docs/datasets/index
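The batching idea above can be sketched with PyTorch's `DataLoader` (the synthetic data and sizes are purely illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic dataset: 10 samples, 3 features each
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y = torch.zeros(10)

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

# Only one batch is materialised at a time; with a map-style Dataset
# backed by files, samples are read lazily inside __getitem__, so the
# full dataset never has to fit in memory.
for xb, yb in loader:
    print(xb.shape)  # batches of 4, 4, then the remaining 2 samples
```

In a real training loop you would compute the loss and take an optimiser step per batch instead of printing shapes.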
:::warning
**Reflections and quick feedback:**
One thing that you liked or found useful for your projects?
- The general ideas on data for ML
- Data Preparation process
- Emphasizing the data preparation
- General ideas on ML algorithms
One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- Maybe a few practical examples of working with tensors
- Some statistical concepts in the penguin data processing were a bit out of my scope, but not your fault. Perhaps a bit of "whiteboard" depiction or explanation would have helped.
- Statistical characteristics of the data were not properly taken into account: a "normal" distribution was always assumed, but that is not always the case, and concepts like standard deviation or IQR have to be revisited
:::
:::danger
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---