# Julia for HPDA - lesson development

## Meeting Sep 18 2023

2. set up - David, Anastasiia
3. lesson parts – fix
4. test the lesson parts
5. motivation_hpda - Anastasiia
6. automatic tests ?? - everyone | include external code file – CI actions
7. https://sphinx-themes.org/sample-sites/sphinx-book-theme/kitchen-sink/admonitions/ – update some - AA
8. https://coderefinery.github.io/github-without-command-line/group-work/#optional-exercise-contributing-larger-changes
9. exercises - 15-20 mins | have optional ones

## Meeting May 8 2023

### Suggested selection of content:

1. General introduction to Julia, packages, syntax etc.
2. Some linear algebra and array operations.
3. Dataframes, manipulation/slicing, visualization. Use penguin data. Illustrate data reading/writing.
4. Clustering, classification, machine learning and deep learning. Use penguin data. Suggestion: make an exercise to implement k-means clustering (can use https://web.stanford.edu/~boyd/vmls/vmls-julia-companion.pdf).
5. Regression and time series prediction. Use climate data. More visualization.
6. Can perhaps use this weather/climate data: https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data
7. Topics in parallel computing? Dagger? https://github.com/JuliaParallel
8. MCMC, probabilistic programming (Gen.jl)
9. Can do optimization examples on Day 3.
10. Piping of functions (composition). Illustrate.
    - see also https://github.com/jkrumbiegel/Chain.jl

### Schedule, three half-days

#### Day 0 (prereq) - introduction to Julia.
- set up
- motivation
- intro to Julia syntax
- Special features of Julia
- Developing in Julia

#### Day 1, 3h
- motivation (Julia for data analysis) - 15 min
- dataframes, visualisation, various data formats, read/write data, missing data (?)
- 1.5 hours (Anastasiia)
- linear algebra, array, matrix and vector operations, performance comparison, random matrices, sparse matrices, eigenvalues/eigenvectors and PCA, Gram-Schmidt process - 1.5 hours (David)

#### Day 2, 3h
- Illustrate piping?
- Clustering, classification (Anastasiia), machine learning, deep learning (toy example). Use penguin data. - 3 hours

#### Day 3, 3h
- Regression, time series analysis/prediction. Use climate data. Visualisation. - 3 hours (David)

#### Day 3 older, 3h
- Dagger - Thor **on second thought, Dagger.jl fits better in the HPC lesson!**
- Parallelisation -> the HPC lesson should be where parallelisation is covered

### More sources

https://carpentries-incubator.github.io/deep-learning-intro/

### Libraries
- CSV.jl
- LinearAlgebra.jl
- DataFrames.jl
- MLJ.jl

---

## Meeting April 13 2023

### Notes
- dataframes: create new df, merge df as in SQL;
- use PCA as an example of how to do things from the ground up, combining linear algebra and data analysis. Explain how to compute eigenvalues and eigenvectors, do diagonalization and PCA. Do PCA directly using a package.
- https://jump.dev/JuMP.jl/stable/ ? consider whether this should be covered or is too wide
- https://github.com/JuliaParallel/Elemental.jl distributed linear algebra and optimization
- new material:
    - https://github.com/ENCCS/julia-for-hpda/
    - https://enccs.github.io/julia-for-hpda/
- corresponding material for Python: https://enccs.github.io/hpda-python/
- collaborating with CSC on https://enccs.github.io/julia-for-hpc/, possible synergies with this effort
- clustering/PCR: https://github.com/pabvald/julia-for-data-science

### what to include in lesson:
- classical data science: DataFrames.jl, visualisation, ...
- classical ML: https://alan-turing-institute.github.io/MLJ.jl/dev/
- deep learning: Flux.jl
- Dagger.jl for parallel tasks
- SciML like DifferentialEquations.jl etc? https://sciml.ai/
- automatic differentiation?
    - Zygote.jl etc.
- https://github.com/colonyos/Colonies.jl

1. Reading/writing data, CSV.jl, netCDF? etc. - different types of data: images, text, numerical data
2. DataFrames.jl: slicing, data preprocessing/wrangling, statistical analysis and visualization etc.
3. Dagger.jl maybe
4. Clustering, regression, classification (MLJ.jl)
5. Deep learning
6. Time series forecasting/analysis.
7. Section on linear algebra: solving equations, eigenvalues/eigenvectors, matrix multiplication and factorization, singular value decomposition, PCA, orthogonal matrices etc.
8. Fourier transform, digital signal processing, filters.
9. Curve fitting in Julia, other optimization algorithms.
10. May be convenient to have one dataset with which we can illustrate clustering, regression, classification and deep learning. Perhaps also time series prediction. Or at least try to reuse datasets across topics.
11. Perhaps penguin data for classification, clustering and deep learning, and climate data (https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data) for time series prediction and regression.

N-1. Automatic differentiation

N. PINNs, neural ODEs, CFD etc.

#### Inspiration for content and structure:

1. https://computationalthinking.mit.edu/Spring21/installation/
    - Module 1: Images, Transformations, Abstractions
    - Module 2: Social Science & Data Science
    - Module 3: Climate Science
    - see https://computationalthinking.mit.edu/Spring21/
2. https://juliaacademy.com/courses/julia-for-data-science/lectures/17339299
3. https://juliaacademy.com/courses/introduction-to-machine-learning/lectures/6003599

- name: High performance data analysis in Julia?
- scope: about 3 half-days
- subject area: a selection from ML, High-Performance Data Analytics (HPDA), deep learning, SciML (focus on HPDA?)
- participants' background:
    - people who are trying to get started with data analysis
    - people who already have some experience of ML etc. but not in Julia

## Relevant links:

1. 
   https://alan-turing-institute.github.io/MLJ.jl/dev/
2. https://docs.sciml.ai/Overview/stable/
3. https://fluxml.ai/Flux.jl/stable/
4. https://juliaml.github.io/
5. https://github.com/JuliaML
6. https://github.com/QuantumBFS/Yao.jl
7. https://juliaacademy.com/courses/julia-for-data-science/lectures/17339299
8. https://computationalthinking.mit.edu/Spring21/
9. https://github.com/colonyos/Colonies.jl
10. Many libraries: https://enccs.github.io/julia-for-hpda/scientific-computing/
11. https://www.gen.dev/docs/dev/
12. https://github.com/probcomp/gen-quickstart/blob/master/tutorials/Iterative%20inference%20in%20Gen.ipynb
13. https://jump.dev/
14. https://github.com/bkamins/Julia-DataFrames-Tutorial

### Anastasiia's comments:

1. Explain why Project.toml is written with a capital P;
2. Julia REPL (Read-Eval-Print Loop) is not written out;
3. *Developing in Julia*, *Creating a new environment*: "In preparation for the next section on data science techniques in Julia, create a new environment named datascience in a new directory, activate it and install the following packages:" – Create a new env or a new directory? Or is a new directory a new env?;
4. *navigate to the datascience directory* – write how;
5. How long was this workshop?
6. Add this: Once IJulia is installed, one can launch Jupyter Notebook by running in the Julia REPL:
   ```
   julia> using IJulia
   julia> notebook()
   ```
7. ```
   using DataFrames
   names = ["Ali", "Clara", "Jingfei", "Stefan"]
   age = ["25", "39", "64", "45"]
   df = DataFrame(; name=names, age=age)
   ```
   Do not need `using DataFrames` again.
8. Add `using DataFrames` to
   ```
   using PalmerPenguins
   table = PalmerPenguins.load()
   df = DataFrame(table)
   # the raw data can be loaded by
   #tableraw = PalmerPenguins.load(; raw = true)
   first(df, 5)
   ```
9. I ran this block ⬆️ and it did not work:
   ```
   ERROR: UndefVarError: df not defined
   Stacktrace:
    [1] top-level scope
      @ ~/ENCCS/julia/julia4hpda/datascience.jl:9
   ```
   When I ran it line by line, it worked.
10. 
    ```
    # slicing
    df[1, 1:3]
    ```
    Much better to have each line of code in a separate block, so it can be copied with the copy button.
11. *slicing and column name (can also use "island")* – what island?
12. *access column directly without copying* – what is the difference between copying and accessing directly? For example:

    When you access a column in a DataFrame directly, you are working with a reference to the original data in the DataFrame. Any changes you make to the column are therefore also reflected in the DataFrame. For example, if you access a column with the dot syntax `df.column_name` and modify its values, the values in the DataFrame are modified as well.

    When you copy a column from a DataFrame, on the other hand, you create a new object that contains the same data as the original column but is independent of the DataFrame. Changes made to the copied column do not affect the original data in the DataFrame.

    In summary, accessing a column directly allows you to modify the data in the DataFrame, while copying a column creates a new object that is independent of the DataFrame.
13. *First we install Plots.jl and StatsPlots backend:* – I think we should add `using Pkg`
14. Add this: In the code `gr()`, the `gr` function sets the plotting backend to GR. A plotting backend is a library that Plots.jl uses to actually create and render the plots. Plots.jl supports multiple plotting backends, including GR, PyPlot, and Plotly. Each backend has its own strengths and weaknesses, and you can choose the one that best fits your needs. The `gr()` function sets the current backend to GR, which is a high-performance plotting library. After running `gr()`, all subsequent plots are created using the GR backend. You can switch to a different backend at any time by calling the corresponding function, such as `pyplot()` or `plotly()`.
15. 
    `x = 1:10; y = rand(10, 2)` – add: In the code `y = rand(10, 2)`, the `rand` function generates a 10x2 matrix of random values. Each value in the matrix is drawn from a uniform distribution on the interval [0, 1).
16. *The core principle of SciML is differentiable programming - the ability to automatically differentiate any code and thus incorporate it into Flux models.* – explain Flux.

    Differentiable programming is a programming paradigm in which a numeric computer program can be differentiated throughout via automatic differentiation. This allows for gradient-based optimization of parameters in the program, often via gradient descent, as well as other learning approaches that are based on higher-order derivative information.

    Flux is a machine learning library for the Julia programming language. It comes with many useful tools built in and allows you to use the full power of the Julia language where you need it. Flux models are conceptually predictive functions and can be built by chaining together a series of Flux layers.

    In the context of SciML, differentiable programming allows for the automatic differentiation of any code and its incorporation into Flux models. This means that you can write code to represent a scientific model or simulation and then use automatic differentiation to optimize its parameters using gradient-based methods.

    OR/AND:

    Differentiable programming is like having a magic wand that can help you make your computer program better. Imagine you have a toy car and you want it to go faster. You can use the magic wand to find out which parts of the car you need to change to make it go faster. Then you can change those parts and the car will go faster! Flux is like a big box of building blocks that you can use to make your own toy car. You can choose which blocks to use and how to put them together to make the car you want. And if you use the magic wand, you can make your car even better!
17. 
    We can use AI to easily translate TF/PyTorch/other code into Julia versions.
18. *one-hot encoding* – explain, and maybe other ML vitals. One-hot encoding is a way to represent categorical data as numerical data that a machine learning model can understand.
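
For comment 12, the difference between direct access and copying could be illustrated with a small sketch like the one below. It assumes DataFrames.jl is installed. The table and its values are made up for illustration, and the ages are stored as integers here, unlike in the lesson snippet. `df.age` (equivalent to `df[!, :age]`) returns the column stored in the DataFrame, while `df[:, :age]` returns a copy.

```julia
using DataFrames

# Hypothetical example table (values invented for illustration)
df = DataFrame(name = ["Ali", "Clara", "Jingfei", "Stefan"], age = [25, 39, 64, 45])

# Direct access: df.age is the vector stored inside df, no copy is made
col = df.age
col[1] = 26              # this also changes df

# Copy: df[:, :age] allocates a new vector that is independent of df
col_copy = df[:, :age]
col_copy[2] = 0          # df is left unchanged

println(df.age)                  # the first entry is now 26, the second is still 39
println(col === df[!, :age])     # true: same object as the stored column
println(col_copy === df.age)     # false: a separate object
```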
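
For comment 16, the idea of differentiating ordinary code could be shown from the ground up before Flux is introduced. The sketch below is a hand-written forward-mode automatic differentiation with dual numbers, using only base Julia. It is an illustration of the principle, not how Flux or Zygote.jl work internally, and the names `Dual`, `derivative` and `f` are hypothetical.

```julia
# A dual number carries a value and the derivative of that value w.r.t. the input
struct Dual
    val::Float64
    der::Float64
end

# Sum rule, product rule and chain rule for the few operations used below
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(c::Real, a::Dual) = Dual(c * a.val, c * a.der)
Base.sin(a::Dual) = Dual(sin(a.val), cos(a.val) * a.der)

# Seed the input with derivative 1 and read off the derivative of the output
derivative(f, x) = f(Dual(x, 1.0)).der

# Ordinary Julia code, written without derivatives in mind
f(x) = 3 * x * x + sin(x)

println(derivative(f, 2.0))      # should agree with the analytic value below
println(6 * 2.0 + cos(2.0))      # f'(x) = 6x + cos(x) evaluated at x = 2
```

The same function `f` works unchanged on plain numbers and on `Dual` values, which is the point of the paragraph above: the derivative comes from running the program itself.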
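
For comment 18, a minimal sketch of one-hot encoding in base Julia might help. The label vector is a hypothetical example using species names from the penguin data. In the lesson itself a package such as MLJ.jl or Flux.jl would probably do this step.

```julia
# Hypothetical categorical column (species labels as in the penguin data)
species = ["Adelie", "Gentoo", "Chinstrap", "Adelie"]

# The distinct categories in a fixed (sorted) order: one output column per category
levels = sort(unique(species))

# Row i, column j is 1 if observation i belongs to category j, and 0 otherwise
onehot = [Int(s == l) for s in species, l in levels]

println(levels)    # ["Adelie", "Chinstrap", "Gentoo"]
println(onehot)    # 4x3 matrix with exactly one 1 in every row
```

Each row contains a single 1, so the categories are represented as numbers without implying any ordering between them.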
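
For the piping/composition item from the May 8 meeting, a possible illustration using only base Julia and the Statistics standard library is sketched below. The data and the helper `trim` are hypothetical. Chain.jl, linked in that item, offers a macro-based alternative that is not shown here.

```julia
using Statistics

data = [3.0, 1.0, 4.0, 1.0, 5.0]

# Hypothetical helper: drop the first and last element of a vector
trim(v) = v[2:end-1]

# Piping: the value on the left is passed as the argument to the function on the right
result_pipe = data |> sort |> trim |> mean

# Composition: build a new function first, then apply it (rightmost function runs first)
trimmed_mean = mean ∘ trim ∘ sort
result_comp = trimmed_mean(data)

println(result_pipe == result_comp)    # true: both compute mean(trim(sort(data)))
```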