![](https://media.enccs.se/2023/06/julia-hpda-enccs-rise.jpg)

# ENCCS - Julia for High Performance Data Analytics

### October 2 - 5, 2023

## General information

- This hackMD: https://hackmd.io/@enccs/julia-hpda-oct2023
- Archive hackMD: https://hackmd.io/@enccs/SJ5mNAOPh
- Training material:
  - intro: https://enccs.github.io/julia-intro/
  - HPDA: https://enccs.github.io/julia-for-hpda/
- Workshop feedback form: https://events.prace-ri.eu/event/1513/surveys/1096
- Workshop will be recorded but participant interactions edited out before publishing.

### Follow [ENCCS](https://enccs.se/)

- Upcoming events: https://enccs.se/events/
- Sign up for the newsletter: https://enccs.se/newsletter
- Follow us on [LinkedIn](https://www.linkedin.com/company/enccs), or [Twitter](https://twitter.com/EuroCC_Sweden)
- YouTube channel: https://www.youtube.com/@enccs

### Schedule

**October 2, 09:00-12:30 CEST**

|Time | Topic |
| --- | ----- |
|9:00-9:25 | Welcome and motivation|
|9:25-10:05 | Julia syntax|
|10:05-10:20 | Break|
|10:20-11:20 | Special Julia features|
|11:20-12:20 | Developing in Julia|
|12:20-12:30 | Package ecosystem|

**October 3, 9:00-12:00 CEST**

|Time | Topic |
| --- | ----- |
|9:00-9:10|Welcome|
|9:10-9:25 | Motivation (Julia for data analysis)|
|9:25-9:45 | Data formats and data frames|
|9:45-10:00 | Break|
|10:00-10:45 | Data formats and data frames|
|10:45-11:00 | Break|
|11:00-12:00 | Linear algebra|

**October 4, 9:00-12:00 CEST**

|Time | Topic |
| --- | ----- |
| 9:00-9:45 | Machine learning |
| 9:45-10:00 | Break |
| 10:00-10:45 | Clustering and Classification |
| 10:45-11:00 | Break |
| 11:00-12:00 | Deep learning |

**October 5, 9:00-12:00 CEST**

|Time | Topic |
| --- | ----- |
|9:00-9:10 | Welcome|
|9:00-9:45 | Linear regression|
|9:45-10:00 | Break|
|10:00-10:45 | Non-linear regression|
|10:45-11:00 | Break|
|11:00-11:45 | Non-linear regression and Fourier methods|
|11:45-12:00 | Conclusions and outlook|

---

### Instructors and helpers

Instructors and
helpers:

- Anastasiia Andriievska, ENCCS/RISE
- David Eklund, ENCCS/RISE
- Thor Wikfeldt, ENCCS/RISE
- Yonglei Wang, ENCCS/LiU

---

## Ice breaking question

What type of projects do you want to use Julia and HPDA methods on?

- first_name: answer
- Thor:
- Zhuojun(Alex): Psychology, seeking the distance metric of manifolds in LLM
- Christos: animal genetics
- Etsuko: Ecology. I want to learn an alternative to Matlab to do my modeling studies and run simulations on HPC clusters.
- Hannes: Develop powerful reinforcement learning methods using big data.
- Vladimir: Data analysis within Telco data.
- Peter: Data driven process monitoring and analytics for shop floor
- Nazib M Seidu: Data Analytics
- Josefin: animal behavior classification, proteomics data analysis
- David: I want to use Julia for hybrid modelling and physics informed machine learning.
- Ana: I want to use Julia for scientific computing and data analysis, complex networks analysis
- Bert Tijskens, University of Antwerp, Flemish Supercomputer center. level 3 support: code modernization/Parallel programming/...
- Andrés: Price forecasting and data handling that are computationally expensive
- Oleksandr: Data analysis, machine learning
- Ashenafi: Deep learning in drug discovery
- Darja: Data analytics, data science, computationally more intensive tasks that python is too slow for
- Luis: CFD turbulent flows and combustion processes.
- Kalle Prorok: Using Julia for Reinforcement Learning/AI and (medical) Data Analysis
- Sabine: for experimental data postprocessing
- Jona: image processing of red blood cell in blood vessels.
- Constanza: Improve performance of multi-species models and their analysis.

---

### Code of Conduct

We strive to follow the [Contributor Covenant Code of Conduct](https://www.contributor-covenant.org/version/2/1/code_of_conduct/) to foster an inclusive and welcoming environment for everyone.
[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.0-4baaaa.svg)](https://github.com/ENCCS/event-organisation/blob/main/CODE_OF_CONDUCT.md)

In short:

- Use welcoming and inclusive language
- Be respectful of different viewpoints and experiences
- Gracefully accept constructive criticism
- Focus on what is best for the community
- Show courtesy and respect towards other community members

Contact details to report CoC violations can be [found here](https://enccs.se/kjartan-thor-wikfeldt).

---

:::danger
You can ask questions about the workshop content at the bottom of this page. We use the Zoom chat only for reporting Zoom problems and such.
:::

---

## Questions, answers and information

### Day 1

- https://enccs.github.io/julia-intro/
- is this how to ask a question?
  - yes, and an answer will appear like so!
- I'm confused that I had to register an account on HackMD. Is there no option to edit without logging in? (It was a minute ago, but thanks for clarification)
- Is the installation meant for Windows OS? On Linux it couldn't resolve some dependencies (I switched to a Windows install and it had no problems)
  - I had some problems with dependencies also, but they were resolved after some dancing.
- I have issues using the Project.toml
- Matlab also has JIT, right? How is Julia better than Matlab with respect to JIT?
  - Yes, both Julia and MATLAB use Just-In-Time (JIT) compilation, but they differ in their performance. Julia's JIT compilation allows for impressive performance, with code that can run close to the speed of C. This gives Julia an edge when dealing with computationally intensive tasks. Julia was designed to be fast right out of the box, without the need for additional toolboxes or libraries. On the other hand, MATLAB, while reliable, may require additional toolboxes for optimized performance in specific tasks.
  - In a specific comparison of a simple evaluation task, Julia was found to be significantly faster than MATLAB.
When both the MATLAB and the Julia code were implemented as a simple for-loop (single-threaded), Julia was about 2x faster. Even when the MATLAB code was vectorized and run on an 8-core CPU, the Julia vectorized for-loop was still faster (about 1.25x). Finally, highly optimized Julia code was 2x faster than MATLAB's vectorized code. https://www.mathworks.com/matlabcentral/answers/1582569-matlab-is-significantly-slower-than-julia-on-simple-evaluation
- How does it know this is REPL?
  - https://stackoverflow.com/questions/59221798/julia-find-out-if-run-from-repl-or-command-line
  - `julia> isinteractive()` returns `true`
- After the long TTFP on the first run, if you edit the code, does it take another long TTFP?
  - It should recompile only what is necessary. Thor can comment further.
  - once packages that your code relies on are compiled (in the JIT fashion), everything runs super fast. But sometimes Julia needs to recompile a function - this is when you call it with new argument types (e.g. Ints instead of Floats), and Julia needs to compile, on the fly, a new specialization of the function for those types
- Is Julia at a point where it can be deployed as a production program in industry? Or is it still a research-focused language, or just for running things locally?
  - Yes, Julia has reached a point where it can be deployed as a production program in the industry. According to Bogumił Kamiński, a professor at SGH Warsaw School of Economics and a significant contributor to the Julia language and its ecosystem, Julia is finally production-ready. https://www.infoq.com/news/2020/08/julia-production-ready/
  - Julia is used by many companies today
- Are tuples similar to list items?
  - https://www.freecodecamp.org/news/python-tuple-vs-list-what-is-the-difference/
- How is this different from what Thor just showed?
  ```
  M = [a,b]
  2-element Vector{Vector{Int64}}:
   [1, 2, 3, 4, 5]
   [1, 4, 9, 16, 25]
  ```
- Can you show me how to do the convolution symbol again?
  - for mathematical symbols one uses LaTeX syntax.
In this case for example, type `\circ` and then hit the Tab key (this gives `∘`, the function composition operator)
- So despite M being a matrix, the eachindex() gives a one-dimensional indexing?
  - Yes, that's correct. In Julia, the `eachindex()` function provides a way to iterate over the indices of an array. When used with an ordinary `Matrix` (or any other multidimensional `Array`), `eachindex()` still produces a one-dimensional iterable of linear indices. (For some other array types it may return Cartesian indices instead.) For example, consider a 2x2 matrix `M`:
    ```julia
    M = [1 2; 3 4]
    ```
    If you use `eachindex(M)`, it will produce an iterable that goes from 1 to 4 (since `M` has 4 elements). Here's how you can use it:
    ```julia
    for i in eachindex(M)
        println(i)
    end
    ```
    This will print:
    ```
    1
    2
    3
    4
    ```
    So, even though `M` is a matrix, `eachindex(M)` gives a one-dimensional indexing. This can be useful in many situations where you need to iterate over all elements of a multidimensional array in storage order (column-major).
  - Thanks a lot!
- How can we clear variables and functions from the memory?
  - this is actually not possible as far as I know. One needs to quit and restart Julia
  - How can this be tolerated?! I always want to keep the workspace neat and clean when I use Matlab...
  - I understand that one gets used to this in Matlab and it becomes a natural part of one's workflow. I think there might be a technical limitation in Julia that prevents this, but I think Julia developers simply get used to it. Personally I don't consider it a big problem
- Column major
- After all these definitions of "f", how can I clear this name? (I.e. I want to forget the other definitions)
  - I think you simply have to quit and restart Julia
- I am just an ecologist and learned programming on my own exclusively in Matlab, so I learn as I go. Can someone recommend a good tutorial page where I can learn how to do exception handling/error handling and how to write testing code? How do you formally do it? Thanks.
  - In case of Julia, Thor will go through this a bit later today in https://enccs.github.io/julia-intro/development/.
  - What about in general, if such a thing exists?
    - [How to handle exception in Julia - Stack Overflow](https://stackoverflow.com/questions/48408145/how-to-handle-exception-in-julia.) provides a good introduction to exception handling in Julia.
    - [Exception handling in Julia - GeeksforGeeks](https://www.geeksforgeeks.org/exception-handling-in-julia/) is another resource that explains how to handle exceptions in Julia.
    - [Unit Testing - The Julia Language](https://docs.julialang.org/en/v1/stdlib/Test/) is the official documentation for unit testing in Julia.
- How can I see what the functions actually are after entering them? If I use methods(), I only see f(x), f(x,y), etc, not f(x) = x + y
  - one can get access to a "processed" version of your function with so-called code introspection macros, as we'll soon see. Try for example to do `@code_lowered f(4)`
  - there seems to be a workaround to see the actual definition of the function using the Debugger package: https://discourse.julialang.org/t/how-do-i-see-whats-inside-a-function/88599/3

### https://enccs.github.io/julia-intro/overview/

- I am sorry, but what is type Point? What kind of type is it?
  - the "Point" type is just a silly example to demonstrate what you can use Composite Types for
  - What are vectors and how are they used in programming? https://stackoverflow.com/questions/508374/what-are-vectors-and-how-are-they-used-in-programming
  - What is a Floating-point? - Computer Hope. https://www.computerhope.com/jargon/f/floapoin.htm
  - Point is a type we defined ourselves (Composite Type). It has two fields x and y that represent the coordinates of a point.
- Can we (accidentally, perhaps) create an array with elements of different types?
  - There is the type `Any` which allows this: `A = ["hello", 2]` gives a `2-element Vector{Any}`, where `typeof(A[2])` is `Int64` and `typeof(A[1])` is `String`
  - that's indeed possible but will result in slow code, I believe
- Can we write like this, return 0.0?
  - if this relates to the type-instability discussion: it's possible, but then you would get the same problem when passing in negative or positive *integers*
- Is this feature similar to the decorator in Python?
  - yes, macros in Julia are loosely similar to decorators in Python: both modify the behaviour of other code, although a macro transforms the code itself before it runs
- How to see plots like the ode-solver in VS Code? (I can paste the code in REPL but not useful if many source files)
  - The plot should appear in a separate tab. Did you get any error message? Try restarting VS Code.
  - if I copy & paste the code into the terminal window REPL I get this extra tab with the plot shown, but not by just Ctrl-F5 or the triangle icon. But Thor mentioned plots probably can be saved into files and examined later for long runs with many source files
  - You can use `savefig("figurename.png")` to save the figure.
  - Thank you, worked ok but very low resolution, added dpi for better: `sol = solve(prob, Tsit5(), reltol = 1e-6)`, `plot(sol.t, getindex.(sol.u, 2), label = "Numerical", dpi=1000)`, `savefig("sinerh.png")`
  - Ok, good! The file format may play a role. A bunch of them seem to be supported (eps, pdf,...) https://docs.juliaplots.org/latest/output/.
- I tried ; in the REPL on Windows. It opens a "shell" but does not seem to behave like Windows powershell or cmd.
- I had to install the Revise package in order to use includet.

### Exercise session until 11:20 CET

### Break until 11:35 CET

### [Developing in Julia](https://enccs.github.io/julia-intro/development/)

- While running includet("Points.jl"), the error below comes up?
  ```
  ERROR: UndefVarError: `includet` not defined
  Stacktrace:
   [1] top-level scope
  ```
  - Thor: my mistake, this most likely means you have to install Revise first! Go to your package manager (in the root environment), and do `add Revise`. Then maybe you need to restart VSCode so that the Julia extension becomes aware that Revise is available
- I have errors. How can I navigate in VSCode to the folder I saved my test_julia.jl and Points.jl?
  ```
  shell> cd Documents/
  ERROR: IOError: cd("Documents/"): permission denied (EACCES)
  Stacktrace:
   [1] uv_error
     @ ./libuv.jl:100 [inlined]
   [2] cd(dir::String)
     @ Base.Filesystem ./file.jl:91
   [3] repl_cmd(cmd::Cmd, out::Base.TTY)
     @ Base ./client.jl:54
   [4] top-level scope
     @ none:1
  ```
  - interesting... I don't know why it gives Permission denied but will try to find a solution
  - I think you need to install the Revise package. If you run `using Revise` what happens?
  - No, I think I simply want to navigate from the current directory to the folder where test_julia.jl and Points.jl are saved.
  - Ok, great!
  - it could be that you get permission denied on directories which are "above" the one you started in originally.
  - It worked for me after `add Revise` and then `using Revise` in the source code
  - Thanks. I closed VSCode and navigated to the directory before I started julia. I ran Points.jl and test_julia.jl, and got this error. Is it still about Revise? But the section was before the Revise section.
    ```
    ERROR: UndefVarError: `jl` not defined
    Stacktrace:
     [1] getproperty(x::Module, f::Symbol)
       @ Base ./Base.jl:31
     [2] top-level scope
       @ ~/Documents/GitHub/Julia_ESCCS/testing_julia.jl:2
    ```
  - Thor: I don't see how Revise could be involved in this particular problem
  - is this still a problem?? --> YES; I will look into it tonight!

## Quick feedback

What was good or worked well today or was interesting?

- The environment setting is awesome. +2
- HackMD & the Notes are nice and have a great structure +1
- Great way to be immediately productive for people who come from Python or other languages
- Installing packages
- Integration with Visual Studio +1
- Very comprehensive intro
- Great take on the "software developing in Julia", simple and clear.
- the structure of the course was good in terms of timing and content
- It was nice to learn about types. The lesson was clear and explains the concept well.
- Good pace, time was just enough for me to start the exercises.
But I think it's important to start them online, because once started they are easier to come back to after lunch -- so timing is just right.
- HackMD Q&A was greatly helpful. I was kind of overwhelmed with this format in previous SNIC workshops, but today it worked well for me. I don't know why.

What could be improved or was less interesting?

- The training material is very clear for me to follow.
- The exercises on writing functions were a bit detailed for beginners.
- For me, it was difficult to wrap my head around computer science jargon at the same speed... and I got stuck when I got errors. I am still stuck...
- Pace is a bit fast, but honestly within 3 hours is understandable
- I am slower on the exercises, but it is fine if I do it later +1
- The absence of the Revise package in the standard installation was an unpleasant surprise which held me stuck for a bit.
- Maybe add something about debugging (it was very slow due to interpreted code but there is now an experimental version with compiled code)
- So this is not feedback, but how do you get Julia to run in VSCode using conda environments? I can run scripts through the terminal integrated in VSCode, but it doesn't work to use the julia: Execute File in REPL
  - Thor: I'm not sure about that, will need to research. Do you need it for today?
  - Nah, if it is not mandatory to use VSCode, there's no big issue.

---

### Day 2

- This is a question?
  - Answer: yes.

https://enccs.github.io/julia-for-hpda/motivation/

- The sex summary in describe(df), is there a way to have it differently?
  ```
  julia> describe(df)
  7×7 DataFrame
   Row │ variable           mean     min     median  max        nmissing  eltype
       │ Symbol             Union…   Any     Union…  Any        Int64     DataType
  ─────┼───────────────────────────────────────────────────────────────────────────
     1 │ species                     Adelie          Gentoo            0  String15
     2 │ island                      Biscoe          Torgersen         0  String15
     3 │ bill_length_mm     43.9928  32.1    44.5    59.6              0  Float64
     4 │ bill_depth_mm      17.1649  13.1    17.3    21.5              0  Float64
     5 │ flipper_length_mm  200.967  172     197.0   231               0  Int64
     6 │ body_mass_g        4207.06  2700    4050.0  6300              0  Int64
     7 │ sex                         female          male              0  String7
  ```
- Keyword arguments can be added after `;`:
  ```julia
  function greet_dog(; greeting = "Hi", dog_name = "Fido")  # note the ;
      println("$greeting $dog_name")
  end

  greet_dog(dog_name = "Coco", greeting = "Go fetch")  # "Go fetch Coco"
  ```
- Why do I get this error?
  ```
  ...
  julia> Pkg.add("PalmerPenguins")
  ERROR: UndefVarError: `Pkg` not defined
  Stacktrace:
   [1] top-level scope
     @ REPL[1]:1
  ...
  ```
  - you first need to import Pkg: `using Pkg`
- What does this notation `` .= `` mean?
  - it is broadcast assignment: it assigns element-wise into an existing array, in place. For example, if `v` is a 3-element vector and I write `v .= 5`, then `v` will be `[5, 5, 5]`.
- I had this error?
  ```
  julia> plot(x,y, title = "Two Lines", label = ["Line 1" "Line 2"], lw = 3)
  Error showing value of type Plots.Plot{Plots.GRBackend}:
  ERROR: could not load library "libGR.dll"
  The specified module could not be found.
  ```
  - this seems to be a little tricky but hopefully possible to solve. Here are some solutions: https://github.com/JuliaPlots/Plots.jl/issues/1720
  - first try going to your package manager (hit `]`) and then build GR manually: `build GR`
  - Thanks, will look at the documentation as well.
When I tried `build GR` I got a similar error
  - if the problem with the GR plotting backend persists, one can choose a different backend: https://docs.juliaplots.org/latest/backends/
  - for example, you could try: `using Plots; pythonplot()`
  ```
  (@v1.9) pkg> build GR
      Building GR → `C:\Users\nazsei\.julia\scratchspaces\44cfe95a-1eb2-52ea-b672-e2afdf69b78f\1185d50c5c90ec7c0784af7f8d0d1a600750dc4d\build.log`
  ERROR: Error building `GR`:
  [ Info: Downloading pre-compiled GR 0.49.0 Windows binary
  ERROR: LoadError: IOError: could not spawn `'C:\Users\nazsei\AppData\Local\Programs\Julia-1.9.3\bin\..\libexec/7z' x downloads/gr-0.49.0-Windows-x86_64.tar.gz -y`: no such file or directory (ENOENT)
  ```
- While running "df_wide = unstack(df_long, :species, :variable, :value)", the error comes.
  ```
  ERROR: ArgumentError: Duplicate entries in unstack at row 2 for key (InlineStrings.String15("Adelie"),) and variable island. Pass `combine` keyword argument to specify how they should be handled.
  ```
  - seems you need to add the `combine` keyword, try `combine=only` or `combine=last`
- My computer is still installing Plots. Should I buy a new computer? It is MacBook Pro 4 core from 2017.
  - Plots is a massive library so it's normal for it to take time
- What are we supposed to do with the error with gr()?
  ```
  julia> gr()
  ERROR: UndefVarError: `gr` not defined
  Stacktrace:
   [1] top-level scope
     @ REPL[41]:1
  ```
  - Is Plots installed? --> YES
  - is Plots loaded? `using Plots` --> YES
- Will I get the plot popping out even though I am using Julia in Terminal? But anyway, I only get errors.
- This is a more general question: is it possible to get access to a Hadoop cluster with ENCCS, and how can we apply for it?
- When I run `@df df density()` I get "ArgumentError: quantiles are undefined in presence of NaNs or missing values"
  - I guess I have to remove NaNs... then it worked :)
  - How did it work?
  - How did you remove NaNs? I did the interpolation trick, but got the same error. Still unfixed.
  - via `dropmissing!(df)`

### Exercises until 10:45.

#### Use main room if you encounter issues and Hackmd.

in case of weird error messages with Plots and GR, try:

```julia=
using Pkg
Pkg.rm("GR")
Pkg.rm("Plots")
Pkg.add("Plots")
Pkg.add("GR")
```

and then restart your Julia session

- What does ``:`` mean in front of the names of columns? I.e. in the example where we create the ``create_plot`` function
  - In Julia, the colon `:` before a name is used to create a `Symbol`. A `Symbol` is an interned name; symbols are cheap to compare, which makes them a common choice for keys and for DataFrame column names.
  - When you're working with DataFrames in Julia, column names are often represented as `Symbols`. So, if you have a DataFrame `df` and you want to access the column named `column_name`, you would use `df[:, :column_name]`.
  - In the context of the `create_plot` function, if there's a line like `plot!(df[:, :column_name])`, it means that the function is adding a new series to the current plot using the data in the `column_name` column of DataFrame `df`.
  - Cool! Thank you very much!
- Judging from the syntax, do I understand correctly that ``Matrix`` is not just a vector of ``Vectors``, but a stand-alone type? What about multidimensional matrices? Are they some generalization of a ``Matrix`` or a ``Vector``?
  - as far as I know these are just names for different kinds of `Array`: a 1D Array is a `Vector`, a 2D Array is a `Matrix`, and higher dimensional ones are just "Arrays"
- Can someone explain what "adjoint" means here?
  ```
  julia> v = [1,2,3]
  3-element Vector{Int64}:
   1
   2
   3

  julia> v'
  1×3 adjoint(::Vector{Int64}) with eltype Int64:
   1  2  3
  ```
  - it's similar to a transpose. Will ask David for a more mathematically rigorous explanation!
  - For a real matrix it is the transpose of that matrix. As shown in the lecture, for a complex matrix `A = [1 1+im; 1+2im 1]` you will get that `A'` is `[1 1-2im; 1-im 1]`, i.e. `A` transposed and conjugated (imaginary parts flip sign). This is called the Hermitian transpose or conjugate transpose.
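
A minimal sketch of the difference between the adjoint and the plain transpose, reusing the complex matrix `A` from the answer above. The variable names `At` and `Ah` are only illustrative and do not come from the lesson.

```julia
A = [1 1+im; 1+2im 1]        # complex matrix from the answer above

At = transpose(A)            # plain transpose: rows and columns swapped
Ah = A'                      # adjoint: transposed AND complex-conjugated

println(At)
println(Ah)
println(Ah == conj(At))      # the adjoint equals the conjugated transpose

v = [1, 2, 3]
println(v' * v)              # row vector times column vector: a scalar (dot product)
```

For real arrays the two operations give the same values, which is why `v'` in the question simply looks like a row vector.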
  - From a more theoretical point of view, the adjoint of a linear map X -> Y is used in algebra and functional analysis, and it turns out that the conjugate transpose has better properties than the transpose (for complex matrices).
- Is a vector always a row vector in Julia?
  - by default vectors are column vectors in Julia
  - You can make a column vector like so: `v = [1;2;3;4]`
- More general question: in Julia, is it possible for the REPL to store a defined object (e.g. a data frame) without printing it, unless you want to see the output? For example, `Xp` should not be shown after creating the matrix
  - ending the line with a semicolon, as in `X = [1 2;3 4];`, will suppress the output written on screen.
  ```
  Xp = X*P[:,2:4]
  150×3 Matrix{Float64}:
   -0.0279148  0.319397  2.68413
  ```
- I did the "3D" scatter plots of Irises, is there a way of doing interactive plots to rotate the plot manually to view/understand it better? Noticed there is another interactive plot library?
  - Yes, you can create interactive 3D scatter plots in Julia using the PlotlyJS package.
  - Yes, thank you!
  - https://stackoverflow.com/questions/54429429/3d-scatter-plot-in-julia
  - https://plotly.com/julia/3d-scatter-plots/
  - nice, although I got "The WebIO Jupyter extension was not detected."
  - so some more installing (and uninstalling to avoid conflicts with existing identifiers)
  - useful
- Do we need to set the random seed before running the PCA or other algorithms? And how?
  - The only thing that is random is the data generation. Do you get the same data every time you run it?
  - Anyone knows how to set the random seed in Julia?
  - it depends on the package! Packages that rely on random seeds usually have a function to set the seed
  - have a look at the documentation of `rand()` (type `?rand` in your REPL).
There you can see how to choose the random number generator
  - to set the base random seed:
    ```julia
    using Random
    Random.seed!(3)
    ```

https://enccs.github.io/julia-for-hpda/data-science/

### Day 3

- Had this error when I saved the Penguins data file to JLD format and tried loading it back?
  ```
  julia> save("penguins.jld", "df", df)

  julia> df = load("penguins.jld", "df")
  ┌ Warning: type PooledArrays.PooledArray{InlineStrings.String15,Core.UInt32,1,Core.Array{Core.UInt32,1}} not present in workspace; reconstructing
  └ @ JLD C:\Users\nazsei\.julia\packages\JLD\S6t6A\src\jld_types.jl:697
  ┌ Warning: type InlineStrings.String15 not present in workspace; reconstructing
  └ @ JLD C:\Users\nazsei\.julia\packages\JLD\S6t6A\src\jld_types.jl:697
  ┌ Warning: type JLD.AssociativeWrapper{InlineStrings.String15,Core.UInt32,Base.Dict{InlineStrings.String15,Core.UInt32}} not present in workspace; reconstructing
  └ @ JLD C:\Users\nazsei\.julia\packages\JLD\S6t6A\src\jld_types.jl:697
  Error encountered while load FileIO.File{FileIO.DataFormat{:JLD}, String}("penguins.jld").

  Fatal error:
  ERROR: MethodError: Cannot `convert` an object of type JLD.var"##PooledArrays.PooledArray{InlineStrings.String15,Core.UInt32,1,Core.Array{Core.UInt32,1}}#292" to an object of type AbstractVector

  Closest candidates are:
    convert(::Type{T}, ::LinearAlgebra.Factorization) where T<:AbstractArray
     @ LinearAlgebra C:\Users\nazsei\AppData\Local\Programs\Julia-1.9.3\share\julia\stdlib\v1.9\LinearAlgebra\src\factorization.jl:59
    convert(::Type{T}, ::T) where T<:AbstractArray
     @ Base abstractarray.jl:16
    convert(::Type{T}, ::T) where T
     @ Base Base.jl:64
  ```
  - Thor: I get the same error and don't immediately know what's wrong. Let's try to figure this out...
  - David: I get this as well. We figure it out today.
- The size of the `csv` is much smaller than that of the `jld` - which is no surprise for this dataset. Is there some kind of heuristic when `jld` becomes beneficial?
  ```bash=
  shell> ls -alrth
  -rw-rw-r-- 1 x x  14K Okt  4 09:07 penguins.csv
  -rw-rw-r-- 1 x x 470K Okt  4 09:07 penguins.jld
  ```
  - While csv files are commonly used, reading and writing them may become bottlenecks for large datasets. The HDF5 format (https://github.com/JuliaIO/HDF5.jl) is about 500x faster, but the file is not human readable, so you need a tool to edit it.
  - So it is more about read/write speed than size?
- I get this error in julia
  ```
  julia> jupyter-lab
  ERROR: UndefVarError: `jupyter` not defined
  Stacktrace:
   [1] top-level scope
     @ REPL[24]:1
  ```
  - I believe Anastasiia is showing how to work with a jupyter notebook **inside VSCode**
  - To use Jupyter Lab with Julia, you need to use the `IJulia` package. Here's how you can do it:
    ```julia
    using Pkg
    Pkg.add("IJulia")
    ```
- how do you make the code block (gray box)?
  - you use three backticks
  - If you want to add a new block you can do this in VSCode by clicking the `+ Code` on the top left, and similar in Jupyter or JupyterLab
- I got an error when @vlplot( mark={ :g.. : TypeError: Cannot read properties of undefined (reading 'cb_2015_california_county_20m') at topojson (C:\Users\kalle\.julia\artifacts\793fcbdad1beb02a41d65f455c195c99cc2f3c1d\node_ Javascript Error: Cannot read properties of undefined (reading 'cb_2015_california_county_20m')
  - when downloading with "https://raw.githubusercontent.com/ENCCS/julia-for-hpda/main/content/data/california-counties.json" instead of https://github.com/ENCCS/julia-for-hpda/blob/main/content/data/california-counties.json it now works :)
- Can someone briefly explain or give a pointer to some tutorials on how "environment" works and relates to the "source activate" function and how VSCode, jupyter notebook, Julia, and Python interact or do not interact? I got error messages in VSCode yesterday and do not know what is wrong. Because I don't know what is interacting with what and in what ways, I have no clue how to understand my problems.
  - https://enccs.github.io/julia-intro/development/#environments
  - https://pkgdocs.julialang.org/v1/environments/
  - in principle there should be no overlap or interaction between python environments and julia environments
- What was the solution to this error in the end?:
  ```bash=
  TypeError: Cannot read properties of undefined (reading 'cb_2015_california_county_20m')
  ```
  - I solved it by reading in raw from github via: `download("https://raw.githubusercontent.com/ENCCS/julia-for-hpda/main/content/data/california-counties.json", "california-counties.json")`
- When I tried to download the clustering codes, I got this message?
  ```
  julia-for-hpda/notebooks
  /Clustering.ipynb
  Sorry, this is too big to display.
  ```
  - that's okay! It simply means that GitHub doesn't want to render large notebooks, but you can still download the notebook with the download button in the top right corner
- I get this error when trying to open the notebook in Jupyter.
  ```
  Notebook could not be converted from version 1 to version 2 because it's missing a key: cells"
  ```
  Is there a clear solution to this? (OK, I had to download the file by clicking the download button on GitHub and then the file is correct. ``wget`` doesn't work with this link unfortunately.)
  - Good to know!
- In relation to K-Means this could be of value, e.g. allowing to run K-Means++ https://juliastats.org/Clustering.jl/dev/init.html#Seeding
  - Yes, thanks!
- And why does the findaccuracy result differ in Classification?
  - Do you mean that it differs when the model is changed? Or differs in some other situation? Different models can have very different accuracy on a given problem. Some more info would help me here to answer the question.

---

### Day 4

Material: https://enccs.github.io/julia-for-hpda/regression/

- questions continue here
- What does ``@`` sign mean in the ``@formula(...)``? Is it a key word, or a variable or smth else?
  - everything with `@` is a *macro*.
There are many inbuilt macros in Julia (like `@time`, `@show`, etc., see https://enccs.github.io/julia-intro/overview/), but Julia packages sometimes define new macros
  - macros are generally similar to functions, but they can modify the behaviour of other code like a decorator in Python
  - Oh, I see. Thanks! So ``lm`` requires us to supply our model wrapped in this ``@formula`` format, right?
- In the polynomial model, we don't write it the way we did in the linear model, by first writing `fit(LinearModel, ...)`?
  - We used the alternative function `lm(@formula(), df)`. This is the same as `fit(LinearModel, @formula(), df)`. It is just a wrapper/shorthand.
- Also, does the order of the variables matter, if we decide to start from the lower, i.e., 1 + cX + .... + cX^5?
  ```
  lm3 = lm(@formula(cy ~ cX^5 + cX^4 + cX^3 + cX^2 + cX + 1), df)
  ```
  - In the context of regression models, the order of the variables in your formula does not affect the model's predictions. The regression model estimates the coefficients based on the data, not on the order of the variables in the formula. However, it's important to note that the order can affect the interpretation of the coefficients in certain cases, such as hierarchical regression models. But in your case, where you're fitting a polynomial regression model, changing the order of terms won't affect the results. So, whether you start with `1 + cX + .... + cX^5` or `cX^5 + cX^4 + cX^3 + cX^2 + cX + 1`, it won't make a difference to the model's predictions. The estimated coefficients should be the same in both cases. Remember that each term `cX^n` represents a different feature, and the position of a feature in the formula carries no meaning. The `lm` function will estimate a separate coefficient for each feature based on its relationship with the target variable `cy`, regardless of where it appears in the formula.
- Does it matter that we took x^5?
  - The degree of the polynomial matters. Yes this is very important.
If you start too low you will get a bad fit. If too high, the fit is more computationally expensive and can overfit.
- Question that was asked: can one do multiple comparison in GLM in Julia?
  - I actually don't think GLM has this implemented. But the documentation is not the most extensive so it might be there but a bit hard to find.
- I like this presentation on the neural networks a lot, can we download the slides?
  - Thanks! We will make these available somewhere. Will come back on where.
  - Cool! Thanks a lot!
  - I managed to make a pdf of the notes and put it in the chat. Will still upload somewhere for the course.
  - I have uploaded the slides to the github repository for the course at https://github.com/ENCCS/julia-for-hpda/tree/main/content/slides You should be able to clone the repository or download the pdf-slides only.
  - I have also added a link in the lecture notes where you can download the slides: https://enccs.github.io/julia-for-hpda/regression/#climate-data
- In ``MLJ.fit!`` do you need to specify the number of epochs?
  - In general, epochs are relevant for neural networks and similar models but not for decision trees, random forests and other models. In the case of neural networks you should be able to specify the number of epochs, but I don't know off the top of my head how the syntax goes.
  - I see, thanks!
  - Sure!
- From Zoom chat: fyi: when trying the trigonometric example with a polynomial fit you see nicely how the degree influences the fit (higher is not always better)
  - Does it depend on the highest order being even or odd? Because, since it is cosines, only even powers of x should contribute
  - The higher order terms dominate the interpolation (lower order become 0) and make a very nice fit at the start and the end but in the middle it is bad.
  - only even is not helping
  - David: Thank you for the nice question and remarks!
  - Indeed the odd power coefficients get small when fitting. Cosine is an even function (graph symmetric around x=0) as you write.
The Taylor polynomial around 0 will have only even powers.
  - Numerical instability seems to set in at higher degrees (10 or something) and the fit is not very good. You can do the following to try something that behaves better: make it easier and take the interval [-3,3] instead of [-6,6]. Now, a quartic polynomial (degree 4) is an ok fit and a sextic (degree 6) quite good. An octic (degree 8) is a very tight fit. The results are similar for, say, the interval [0,6], while the interval [0,12] is tough, like [-6,6].
  - This I guess illustrates a more realistic usage of polynomial interpolation: approximate your data locally with a polynomial. So split the domain in pieces and use a different polynomial on each piece. Then glue them together to make sure they match on the boundaries where the pieces meet (to have the same values and derivatives for example). This is similar to spline curves and higher dimensional analogs (if we had more than one input variable).

:::info
*Always ask questions at the very bottom of this document, right **above** this.*
:::

----
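
A minimal sketch of the cosine experiment from the last Day 4 thread, assuming a plain least-squares fit with Base Julia (a Vandermonde matrix solved with `\`) instead of the GLM-based code used in the lesson. The helper names `polyfit` and `max_error` are hypothetical, and the grid of 200 points is an arbitrary choice.

```julia
# Fit a polynomial of the given degree by least squares.
# Returns coefficients c with p(x) = c[1] + c[2]*x + ... + c[end]*x^degree.
function polyfit(x, y, degree)
    V = [xi^k for xi in x, k in 0:degree]   # Vandermonde matrix, one column per power
    return V \ y                            # least-squares solution for a tall matrix
end

# Largest absolute deviation between the fitted polynomial and the data.
max_error(c, x, y) = maximum(abs.(evalpoly.(x, Ref(c)) .- y))

x = range(-3, 3; length = 200)   # the narrower interval suggested in the thread
y = cos.(x)

for degree in (2, 4, 6, 8)
    c = polyfit(x, y, degree)
    println("degree ", degree, ": max error = ", max_error(c, x, y))
    println("  odd-power coefficients: ", c[2:2:end])
end
```

On `[-3, 3]` the error should shrink as the degree grows, and the odd-power coefficients should come out close to zero, in line with the remarks in the thread. Changing the range to `range(-6, 6; length = 200)` is one way to see how the fit gets harder on the wider interval.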