![](https://media.enccs.se/2024/11/julia-hpda-25-1536x768.webp)

<p style="text-align: center"><b><font size=5 color=blueyellow>Julia for High-Performance Data Analysis - Day 3</font></b></p>

:::success
**Julia for High-Performance Data Analysis — Schedule**: https://hackmd.io/@yonglei/julia-hpda-2025-schedule
:::

## Schedule

| Time | Contents | Instructor |
| :---------: | :------: | :--------: |
| 09:10-10:00 | Working with data, Saving Current Setup | FF |
| 10:00-10:10 | Break | |
| 10:10-11:00 | Machine learning, Clustering and Classification, Deep learning (I) | AA |
| 11:00-11:10 | Break | |
| 11:10-11:55 | Machine learning, Clustering and Classification, Deep learning (II) | AA |
| 11:55-12:00 | Q/A | |

## Lesson materials and recorded videos

:::info
- **Introduction to programming in Julia**
  - lesson material: https://enccs.github.io/julia-intro/
  - recorded video: https://www.youtube.com/watch?v=EYNlE-zma7A&list=PL2GgjY1xUzfDlGVcvl757nEOxICgcGSWM&index=1
- **Julia for high-performance scientific computing**
  - lesson material: https://enccs.github.io/julia-for-hpc/
  - recorded videos: https://www.youtube.com/watch?v=laCl9cXGOk4&list=PL2GgjY1xUzfDlGVcvl757nEOxICgcGSWM&index=2
- **Julia for high-performance data analytics**
  - lesson material: https://enccs.github.io/julia-for-hpda/
:::

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

## Questions, answers and information

- Is this how to ask a question?
  - Yes, and an answer will appear like so!

### 5. [Working with data, Saving Current Setup](https://enccs.github.io/julia-for-hpda/data-science/)

- Which data format is the most widely used one for machine learning? CSV? JLD?
  - I'd say something like Parquet instead, mainly because big data services such as Spark use it. Julia has tooling to interface with Parquet and Feather/Arrow, so if you want to share data with the outside world, those are the better choices.
  - If you already know that you'll only be handling Julia, then go with JLD(2).
    - :+1:
- If we want to work on machine learning, do we have to organize our data into the right format (either CSV or JLD)?
  - It depends on whether you mean the training data or, e.g., the weights of neural networks. In general it is discouraged to use CSV for anything that is not text, because it is clunky to store numbers in plain ASCII. If you plan to train an LLM, then CSV or JSON are good choices for storing the textual training data.
    - Thanks. :+1:

:::info
#### Break until XX:50
:::

### 6. [Machine learning and deep learning](https://enccs.github.io/julia-for-hpda/data-science/#machine-learning)

- Is Flux the only package for ML in Julia? Are there any other choices?
  - Flux is one choice. We will later also see MLJ (Machine Learning in Julia), which is another option. There is also a newer package called Lux, suitable for larger neural networks. See also SimpleChains for quick implementations.
  - There are also ports of the scikit-learn Python package, which might be familiar.
  - In short, there are a lot of alternatives; here is a list of machine learning packages in Julia: https://juliapackages.com/c/machine-learning.
- I can see there is a lot of output with numbers and strings; is there a straightforward way to visualize the training results?
  - One can plot the values of the loss function (on training and test data) during training to see how training is progressing. See for example this graph in the training material: https://enccs.github.io/julia-for-hpda/regression/#id22. In the code just above that graph one can see how the plot is done.
- For the split of data into train and test datasets, is the data randomly distributed to the train and test sets? If yes, how can we ensure the reproducibility of the training?
  - Good question. Data is often split randomly into test and training sets, but not always.
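A minimal sketch of such a reproducible random train/test split, using only the `Random` standard library (the sample count and the 80/20 ratio are made up for illustration; StableRNGs.jl gives the same reproducibility guarantee across Julia versions):

```julia
using Random

# Reproducible 80/20 train/test split over 100 samples (sizes made up).
# Seeding the generator means every run -- and every collaborator --
# gets exactly the same split.
n = 100
rng = Xoshiro(1111)              # built-in seedable RNG (Julia >= 1.7)
idx = randperm(rng, n)           # reproducible permutation of 1:n

ntrain = round(Int, 0.8n)        # 80% for training
train_idx = idx[1:ntrain]
test_idx  = idx[ntrain+1:end]
```

Indexing the feature matrix and labels with `train_idx`/`test_idx` then yields the same split on every run with that seed.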
  - To have reproducible training, one can fix the seed of the random number generator so that others can run the code and get the same result. In some cases a completely random split is not suitable because of the risk of overfitting: for example, if you train a computer vision model on images of a number of people, it may make sense to hold out a few unseen people as test data rather than just randomly splitting all the data into test and training sets.
  - In Julia, one can create a seeded generator with `rng = StableRNG(1111)` and then draw reproducible random numbers from it, e.g. `rand(rng, 10)`.
- I am trying to run the copy-pasted code, but it gives me the following error message: "ERROR: can't mix implicit Params with explicit rule from Optimisers.jl. To use `Flux.params(m)` in `train!`, the 4th argument must be from the old `Flux.Optimise` sub-module. But better to use the new explicit style, in which `m` itself is the 2nd argument."
  - This has to do with the version of Flux used. Let me try to run it. In the newer version the model itself goes into the loss function; the code should be updated accordingly.
  - I have Flux v0.16.2 at the moment. Thank you very much, I will try to update accordingly.
  - I still did not manage to fully update the code (the new version of `train!` does not accept a callback function), but it should be documented here: https://fluxml.ai/Flux.jl/stable/reference/training/reference/#Flux.Train.train!
  - Yes, that is right about the callback. Sorry for the delay. I managed to change the code to make it run (not yet updated in the lesson material). It should work if you make the following changes. The loss takes the model as an argument, `loss(model, x, y) = Flux.crossentropy(model(x), y)`, and we need the state of the optimizer, `opt_state = Flux.setup(Adam(), model)`. The separate parameters θ and the old optimizer are not used anymore: `#opt = ADAM()` and `#θ = Flux.params(model)`.
  - Then the training can be done with:

    ```julia
    for epoch in 1:10
        Flux.train!((m, x, y) -> loss(m, x, y), model, [(xtrain, ytrain)], opt_state)
    end
    ```

:::info
#### Exercise until xx:00
:::

- If I run the same training in Julia and in Python, which package is faster? Or does the training speed depend on the packages (scikit-learn/PyTorch/TensorFlow vs. Flux)?
  - In general Julia is faster in my experience. But yes, I think it will depend on which packages you use for your specific problem.

:::info
#### Break until XX:10
:::

### 7. [Clustering and Classification](https://enccs.github.io/julia-for-hpda/data-science/#clustering-and-classification)

:::warning
**Reflections and quick feedback:**

One thing that you liked or found useful for your projects?
- I liked the very accessible and understandable introduction to machine learning in Julia, thank you!
- I appreciated the very quick reaction to participants with a "too new" version of the Flux package and the solution of the related issues.

One thing that was confusing/suboptimal, or something we should do to improve the learning experience?
- xx
:::

:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::
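Putting the pieces from the Flux discussion in section 6 together, here is a minimal self-contained sketch of the new explicit-style training (data shapes, layer sizes, and the epoch count are made up for illustration; assumes Flux ≥ 0.14, where `Flux.setup` and the explicit `train!` signature are available):

```julia
using Flux

# Toy data: 4 features, 2 classes, 100 one-hot-encoded samples (made up)
xtrain = rand(Float32, 4, 100)
ytrain = Flux.onehotbatch(rand(1:2, 100), 1:2)

model = Chain(Dense(4 => 8, relu), Dense(8 => 2), softmax)

# Explicit style: the model is the first argument of the loss...
loss(m, x, y) = Flux.crossentropy(m(x), y)

# ...and the optimiser state replaces the old implicit Flux.params
opt_state = Flux.setup(Adam(), model)

for epoch in 1:10
    Flux.train!(loss, model, [(xtrain, ytrain)], opt_state)
end
```

Note that `Flux.train!` accepts the loss function directly here, so the anonymous-function wrapper shown in the answer above is optional.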