
<p style="text-align: center"><b><font size=5 color=blue>Practical Machine Learning - Day 1</font></b></p>
:::success
**Practical Machine Learning — Schedule**: https://hackmd.io/@yonglei/practical-ml-2025-schedule
:::
## Schedule
| Time | Contents |
| :---------: | :------: |
| 09:00-09:10 | Welcome |
| 09:10-09:25 | Introduction to Machine Learning |
| 09:25-09:50 | Fundamentals of Machine Learning |
| 09:50-10:00 | Break |
| 10:00-10:50 | Scientific Data for Machine Learning |
| 10:50-11:00 | Break |
| 11:00-11:50 | Data Preparation for Machine Learning |
| 11:50-12:00 | Wrap-up and Q&A |
---
## Exercises and Links
:::warning
- Exercises for [XXX]()
:::
## ENCCS lesson materials
:::info
- [**Practical Machine Learning**]()
- [**Introduction to Deep Learning**](https://enccs.github.io/deep-learning-intro/)
- [**High Performance Data Analytics in Python**](https://enccs.github.io/hpda-python/)
- [**Julia for high-performance scientific computing**](https://enccs.github.io/julia-for-hpc/)
- [**GPU Programming: When, Why and How?**](https://enccs.github.io/gpu-programming/)
- [**ENCCS lesson materials**](https://enccs.se/lessons/)
:::
:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::
## Questions, answers and information
### 2. Fundamentals of Machine Learning
- Is this how to ask a question?
- Yes, and an answer will appear like so!
- For this workshop, why do we install two identical packages, tensorflow and torch, and not just one of them?
- It is true that both are deep-learning packages, but they are not the same in terms of the API. We will look at them in some of the hands-on notebooks.
- Just a practical question: I am more used to Google Colab; can I use it instead of Jupyter? Thanks
- Yes, feel free to do so, as long as you know how to use it.
- Is it okay to ask about the environment setup?
- Sure, either paste your error info here or we can use breakout rooms.
- A breakout room would be really useful.
- DM me and I can assign you to the breakout room - Ashwin
```sh
conda create -n practical_machine_learning python scikit-learn jupyterlab
conda activate practical_machine_learning
conda install numpy scipy pandas matplotlib seaborn
jupyter-lab
```
- Things that are missing from the above:
- keras
- tensorflow
- pytorch
- umap-learn
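One way to add the missing packages to the environment created above (package names taken from the list in this thread; exact versions and channels may differ on your system):

```sh
conda activate practical_machine_learning
pip install tensorflow torch keras umap-learn
```

Installing `tensorflow` will typically pull in a compatible `keras` as a dependency, so listing it explicitly is mostly a safeguard.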
- On Linux numpy is v1.26, not v2... hopefully this won't be a problem.
- Should be OK.
- ok, thanks
- KW: here are the messages I get when testing the conda env:
```
Numpy version: 1.26.4
Pandas version: 2.3.2
Scipy version: 1.16.1
Matplotlib version: 3.10.6
Seaborn version: 0.13.2
Scikit-learn version: 1.7.1
2025-09-15 20:19:44.189239: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-09-15 20:19:44.252586: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1757960384.270109 1137084 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1757960384.275663 1137084 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-09-15 20:19:44.340695: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Keras version: 3.11.2
Tensorflow version: 2.18.1
Pytorch version: 2.6.0
Umap-learn version: 0.5.3
Jupyter Notebook version: 7.4.5
```
- These log messages are often emitted by `tensorflow` during import. Safe to ignore.
- PZS: is pyplot a function from matplotlib or a subset of functions? why didn't we import the entire matplotlib?
- `pyplot` is a subpackage of matplotlib; historically it was designed to mimic MATLAB's plotting interface. Even today, `import matplotlib.pyplot as plt` remains the most common way of accessing most of matplotlib's features.
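A minimal sketch of how the two styles interact: `pyplot` creates the figure, but the objects it returns expose the full object-oriented API (the `Agg` backend and the file name are illustrative choices):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend, safe on headless machines
import matplotlib.pyplot as plt

fig, ax = plt.subplots()       # pyplot call creates Figure and Axes objects...
ax.plot([0, 1, 2], [0, 1, 4])  # ...but plotting happens via Axes methods
ax.set_xlabel("x")
ax.set_ylabel("x squared")
fig.savefig("parabola.png")    # write the plot to disk
```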
### 3. Scientific Data for Machine Learning
:::info
How large is the data you are working with?
- ~ MB +2
- ~ GB +3
- ~ TB +1
- Streaming data +1

Are you experiencing performance bottlenecks when you try to analyse it? If yes, how would you address these issues?
- XX
:::
- Sorry, stupid question, where can I find these jupyter notebooks?
- No stupid questions here :smile:. If you have git, you can run
```sh
git clone https://github.com/ENCCS/practical-machine-learning
```
If not, go to https://github.com/ENCCS/practical-machine-learning and you should see a *Download zip* button.
- Then browse to `content/jupyter-notebooks`
- If I'm not wrong, in section 2.2 of the Tensor.ipynb, first cell, it should say oneD_tensor and not twoD_tensor.
- Yes, you are right :+1:
- In the repo I only see notebooks 3 to 6 and noticed more in Yonglei's computer.
- Yes, the 7th and 8th haven't been uploaded yet
- OK, thanks
- What is the difference between torch.ones() and torch.ones_like() and such?
- I guess `.ones()` takes the shape/dimensions of the desired tensor as input, while `.ones_like()` takes another tensor as input and uses its shape/dimensions as a reference for creating the new one.
- For `torch.ones(3, 3)`, you tell torch the dimensions of the tensor to generate.
- For `torch.ones_like(a_tensor)`, there is no need to give dimension info; torch generates a new tensor with the same shape as the existing tensor you pass in (which must already exist when you run the code).
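A minimal sketch of the difference (the shapes here are chosen only for illustration):

```python
import torch

a = torch.zeros(2, 4)   # an existing tensor with some shape
x = torch.ones(3, 3)    # shape given explicitly as arguments
y = torch.ones_like(a)  # shape (and dtype/device) copied from `a`

print(x.shape)          # torch.Size([3, 3])
print(y.shape)          # torch.Size([2, 4])
```

The `_like` constructors (`zeros_like`, `ones_like`, etc.) are handy when you need a new tensor matching another tensor's shape, dtype, and device without spelling those out.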
### 4. Data Preparation for Machine Learning
- (just curious) What happens if the data does not fit in memory - common problem with `pandas`, usually resolved with `dask` - how it is in general with ML methods?
- A feasible way is to split the data into multiple small subsets (in ML/DL these are called batches).
- During training, we send one small batch at a time to the model, so at each step the algorithm trains on only a small amount of data that fits in memory.
- Libraries like PyTorch and HuggingFace, often used for training ML/DL models, have their own "data loader" abstractions that handle batch processing of data. Dask DataFrames are also quite useful, as you noted.
- https://docs.pytorch.org/tutorials/beginner/basics/data_tutorial.html
- https://huggingface.co/docs/datasets/index
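The batching idea above can be sketched with PyTorch's `DataLoader` (the synthetic data and sizes are purely illustrative):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic dataset: 10 samples, 3 features each
X = torch.arange(30, dtype=torch.float32).reshape(10, 3)
y = torch.zeros(10)

loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=False)

# Only one batch is materialised at a time; with a map-style Dataset
# backed by files, samples are read lazily inside __getitem__, so the
# full dataset never has to fit in memory.
for xb, yb in loader:
    print(xb.shape)  # batches of 4, 4, then the remaining 2 samples
```

In a real training loop you would compute the loss and take an optimiser step per batch instead of printing shapes.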
:::warning
**Reflections and quick feedback:**
One thing that you liked or found useful for your projects?
- The general ideas on data for ML
- Data Preparation process
- Emphasizing the data preparation
- General ideas on ML algorithms
One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- Maybe a few practical examples of working with tensors
- Some statistical concepts in the penguin data processing were a bit out of my scope, but not your fault. Perhaps a bit of "whiteboard" depiction or explanation would have helped.
- Statistical characteristics of the data were not properly taken into account: a "normal" distribution was always assumed, but that is not always the case, and concepts like standard deviation or IQR have to be revisited
:::
:::danger
*Always ask questions at the very bottom of this document, right **above** this.*
:::
---